 CS 3723/3721
 Programming Languages
 Spring 2005

 Recitation 3
 Lexical Analysis or Scanning
    Week 3: Jan 31-Feb 4
 Due (on time): 2005-02-07  23:59:59
 Due (late):        2005-02-11  23:59:59

Recitation 3 must be submitted following the submissions directions, with these deadlines:
  • 2005-02-07  23:59:59 (that's Monday, 7 February 2005, 11:59:59 pm) for full credit.
  • 2005-02-11  23:59:59 (that's Friday, 11 February 2005, 11:59:59 pm) for 75% credit.


Overview: The first stage of a compiler uses a scanner or lexical analyzer to break the source program into a sequence of basic units called tokens. The scanner reads one character at a time. The scanner also recognizes and discards comments and keeps track of line numbers. The different types of tokens recognized are:


Writing a recognizer for individual tokens: There are automated software tools to produce a lexical analyzer or scanner, such as lex in Unix, but you are not to use any such tool here. The StringTokenizer class that most of you saw in Java is another tool, but it is much too simple -- you should not use it since it would be completely inadequate for this task. Its use is now discouraged in Java in favor of the package java.util.regex, with classes Pattern (a compiled representation of a regular expression) and Matcher (an engine that matches a character string by interpreting a Pattern). This package yields the full power of regular expressions. There is yet another class in Java, StreamTokenizer, that could also theoretically be used for this recitation. You should not use any of these tools. We are studying the ideas behind their implementation.


Writing a Scanner, or Lexical Analyzer: The terms "scanner" and "lexical analyzer" mean the same thing: a single program that will return the next input token each time it is called on to do so. As needed, the program will input individual characters. Such a scanner just tacks together recognizers for each legal type of token. The scanner also discards comments and keeps track of line numbers. You should have a "driver" function that repeatedly calls the scanner (until eof or a "$" sign as discussed below) and prints the returned token.

Stated another way, a scanner converts a simple input stream of characters into a simple output stream of tokens.
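
The driver loop described above can be sketched as follows. This is only a minimal sketch under stated assumptions: the class names CharScannerSketch and DriverSketch, the method nextToken(), and the '\0' end-of-input sentinel are all hypothetical, and the stand-in scanner returns nothing but 1-character tokens (it skips whitespace, but it does not yet handle identifiers, constants, or comments).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in scanner: skips whitespace and returns every
// other character as a 1-character token.
class CharScannerSketch {
    static final char EOF = '\0';   // assumed end-of-input sentinel
    private final String src;
    private int pos = 0;

    CharScannerSketch(String src) { this.src = src; }

    // Returns the next non-whitespace character, or EOF at end of input.
    char nextToken() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) {
            pos++;
        }
        return pos < src.length() ? src.charAt(pos++) : EOF;
    }
}

class DriverSketch {
    // Calls the scanner repeatedly until end of input or a '$' sign,
    // collecting the returned tokens in order.
    static List<Character> run(String source) {
        CharScannerSketch scanner = new CharScannerSketch(source);
        List<Character> tokens = new ArrayList<>();
        char t = scanner.nextToken();
        while (t != CharScannerSketch.EOF && t != '$') {
            tokens.add(t);
            t = scanner.nextToken();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints one token per line; input after '$' is never scanned.
        for (char t : run("a = b ;\n$ ignored")) {
            System.out.println(t);
        }
    }
}
```

A real driver would print a full token record rather than a bare character, but the loop structure would likely stay the same.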


Recitation work: Writing a full scanner for a language like C or Java is not too hard (it is especially easy conceptually), but it involves an enormous number of details. Instead, in this recitation you are to write a simple scanner that nevertheless includes the main ideas. Your scanner should be written in Java or C++ and it should handle the following tokens:

  1. Identifiers: These should be a letter followed by any number (including zero) of letters or digits. These could later include reserved words. (You can allow an underscore if you wish.)
  2. Floating point constants, including integers: These should include integer constants as a simple special case. They should not include any initial plus or minus sign, which will be a separate token.
  3. Operators and special symbols: All other characters, except for whitespace characters and the characters in a comment, will be returned as 1-character tokens. (For technical reasons explained below, your source should not include the characters @ or #, and you may have a single $ at the end if you wish.)
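
As an illustration of token type 1, a recognizer for identifiers might look like the sketch below. The class name IdentifierSketch and the method scanIdentifier are hypothetical, the input is assumed to be held in a String, and the optional underscore is left out.

```java
// Hypothetical recognizer for identifiers: a letter followed by
// zero or more letters or digits.
class IdentifierSketch {
    // Returns the index just past the identifier that begins at 'start',
    // or 'start' itself if no identifier begins there.
    static int scanIdentifier(String s, int start) {
        int i = start;
        if (i < s.length() && isLetter(s.charAt(i))) {
            i++;
            while (i < s.length()
                    && (isLetter(s.charAt(i)) || isDigit(s.charAt(i)))) {
                i++;
            }
        }
        return i;
    }

    static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }
}
```

For example, scanIdentifier("time0 = 2", 0) stops at index 5, just past time0, while a string that starts with a digit yields no identifier at all.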

Because of the third "catch-all" type of token above, your scanner will not need any error messages. It should accept any sequence of tokens at all, including ones that would be insane in any kind of a program. (The software will be catching syntax errors later on.) Here are additional requirements for the scanner:

  1. Lexical conventions: Any two tokens of types 1 and 2 above must be separated by at least one comment or one whitespace character, or any combination of more of them. The scanner must skip over whitespace and comments while looking for a token. The only allowed comments are C-style comments: /* */
  2. Line numbers: Your scanner should keep track of the input line number by counting the newlines that it sees (using Unix-style line separators).
  3. Identifiers: For an identifier, the scanner should return a class with a field holding a @ character (see below), along with another field giving the characters of the identifier (the identifier as a string).
  4. Floating point and integer constants: For floating point or integer constants, the software should convert the string of characters that represent the constant into an actual internal double. Then the scanner should return a class with a field holding a # character (see below), along with another field giving the value of the constant as a double.
  5. What the scanner returns: The scanner returns a class with three fields:

    Notice that single letters will look like identifiers, unless they are the e in a floating point constant, and single digits will look like integer constants, unless they are part of an identifier. For identifiers, your scanner should return a @ character, which will then not be allowed on input. For integer and floating point constants, your scanner should return a # character, which again will then not be allowed on input.
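
Requirements 1 and 2 (skipping whitespace and comments while counting lines) could be handled together along the following lines. This is a sketch, not a required design: the class name SkipperSketch and its methods are hypothetical, the source is assumed to be held in a String, and an unterminated comment is assumed simply to run to the end of the input.

```java
// Hypothetical helper: skips whitespace and /* ... */ comments,
// counting Unix-style newlines along the way.
class SkipperSketch {
    private final String src;
    private int pos = 0;    // index of the next unread character
    private int line = 1;   // current input line number

    SkipperSketch(String src) { this.src = src; }

    int line() { return line; }
    int position() { return pos; }

    // Advances to the first character that could start a token.
    void skipBlanksAndComments() {
        while (pos < src.length()) {
            char c = src.charAt(pos);
            if (c == '\n') {
                line++;
                pos++;
            } else if (Character.isWhitespace(c)) {
                pos++;
            } else if (c == '/' && pos + 1 < src.length()
                    && src.charAt(pos + 1) == '*') {
                pos += 2;   // step past the opening "/*"
                // Consume characters until the closing "*/" or end of input.
                while (pos < src.length()
                        && !(src.charAt(pos) == '*' && pos + 1 < src.length()
                             && src.charAt(pos + 1) == '/')) {
                    if (src.charAt(pos) == '\n') line++;
                    pos++;
                }
                pos = Math.min(pos + 2, src.length());  // step past "*/"
            } else {
                return;     // start of a token (a lone '/' is an operator)
            }
        }
    }
}
```

Because the opening /* is consumed before the search for */ begins, the text /*/ does not count as a complete comment, while /***/ does.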


What you should submit: Refer to the submissions directions and to deadlines at the top of this page. The text file that you submit should first have Your Name, the Course Number, and the Recitation Number. The rest of the file should have the following in it, in the order below, and clearly labeled, including at the beginning the appropriate item letters: a, b, and c. (In case you do c you may omit b.)

 Contents of submission for Recitation 3:

Last Name, First Name; Course Number; Recitation Number (3).

a. The Java or C++ source files for your scanner. Everything should be run together into one file, with reasonable separators between components (the separate source files). The code should be reasonably organized and written, with special emphasis on header comments. (Not much emphasis on inline comments.)

b. You should give the results of a run using the following simple source file for input. This might be a goal for an initial version of your scanner.

    
    rate = 14;
    time = 2;
    distance = rate * time;
    cost = 6*distance;
    $
    

c. You should give the results of a run using the following more complex source file for the input. (If you do this part, you can skip part b above.)

    
    rate = 14.3; /* initial rate is 14.3 */
    time0 = 2.;  /* initial time is 2.0 */
    distance = rate * time0; /* distance is rate times time */
    cost2005 =/***/.6e-2*distance/* */+/**/2e2; /* cost for 2005 */
    $ /* optional, in case you need an eof sentinel */
    

Output format: In c above, your output might look like the following. (Note that I generated the output below by hand, so it might have errors in it.)

    
    Token   idVal      doubleVal     Line Number
    
    @       rate       -             1
    =       -          -             1
    #       -          14.3          1
    ;       -          -             1
    @       time0      -             2
    =       -          -             2
    #       -          2.0           2
    ;       -          -             2
    @       distance   -             3
    =       -          -             3
    @       rate       -             3
    *       -          -             3
    @       time0      -             3
    ;       -          -             3
    @       cost2005   -             4
    =       -          -             4
    #       -          0.0060        4
    *       -          -             4
    @       distance   -             4
    +       -          -             4
    #       -          200.0         4
    ;       -          -             4
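
One possible way to produce rows in this layout is sketched below. The class TokenSketch and its field names are assumptions modeled on the column headings above, not a required design. Note also that Double.toString prints 0.006 where the hand-made sample shows 0.0060.

```java
// Hypothetical token record modeled on the columns of the sample output.
class TokenSketch {
    final char type;        // '@', '#', or the operator character itself
    final String idVal;     // used only when type == '@'
    final double doubleVal; // used only when type == '#'
    final int line;         // input line on which the token was found

    TokenSketch(char type, String idVal, double doubleVal, int line) {
        this.type = type;
        this.idVal = idVal;
        this.doubleVal = doubleVal;
        this.line = line;
    }

    // Formats one output row; unused fields are shown as "-".
    String format() {
        String id = (type == '@') ? idVal : "-";
        String num = (type == '#') ? Double.toString(doubleVal) : "-";
        // Column widths 8, 11, and 14 match the sample header line.
        return String.format("%-8s%-11s%-14s%d", type, id, num, line);
    }

    public static void main(String[] args) {
        System.out.println(new TokenSketch('@', "rate", 0.0, 1).format());
        System.out.println(new TokenSketch('=', null, 0.0, 1).format());
        System.out.println(new TokenSketch('#', null, 14.3, 1).format());
    }
}
```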
    


Key ideas: The tokens that are recognized by the lexical level of a programming language can be described by FSMs. These FSMs can be used to write the code for an actual recognizer (a program that recognizes and returns a legal sequence of tokens, without worrying about whether the tokens make any sense).
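
To make the connection between an FSM and code concrete, here is a sketch of a recognizer for floating point constants written directly as a state machine. The state names, the class NumberFsmSketch, and the exact form of constant accepted (digits with an optional decimal point and an optional exponent, as in 14, 2., .6e-2, and 2e2) are assumptions based on the sample input, not an official specification. The conversion step uses the library call Double.parseDouble as a shortcut; you may be expected to do that conversion yourself.

```java
// Hypothetical FSM for constants such as 14, 14.3, 2., .6e-2, and 2e2.
class NumberFsmSketch {
    enum State { START, INT, DOT, FRAC, EXP_MARK, EXP_SIGN, EXP, DEAD }

    // Transition function: the next state for state s on character c.
    static State step(State s, char c) {
        boolean digit = c >= '0' && c <= '9';
        switch (s) {
            case START:    return digit ? State.INT : c == '.' ? State.DOT : State.DEAD;
            case INT:      return digit ? State.INT : c == '.' ? State.FRAC
                                  : isE(c) ? State.EXP_MARK : State.DEAD;
            case DOT:      return digit ? State.FRAC : State.DEAD;
            case FRAC:     return digit ? State.FRAC : isE(c) ? State.EXP_MARK : State.DEAD;
            case EXP_MARK: return digit ? State.EXP
                                  : (c == '+' || c == '-') ? State.EXP_SIGN : State.DEAD;
            case EXP_SIGN: return digit ? State.EXP : State.DEAD;
            case EXP:      return digit ? State.EXP : State.DEAD;
            default:       return State.DEAD;
        }
    }

    static boolean isE(char c) { return c == 'e' || c == 'E'; }

    static boolean accepting(State s) {
        return s == State.INT || s == State.FRAC || s == State.EXP;
    }

    // Returns the index just past the longest constant beginning at
    // 'start', or 'start' itself if no constant begins there.
    static int scanNumber(String s, int start) {
        State state = State.START;
        int lastAccept = start;
        for (int i = start; i < s.length(); i++) {
            state = step(state, s.charAt(i));
            if (state == State.DEAD) break;
            if (accepting(state)) lastAccept = i + 1;
        }
        return lastAccept;
    }

    // Converts the recognized characters into a double (library shortcut).
    static double value(String s, int start, int end) {
        return Double.parseDouble(s.substring(start, end));
    }
}
```

Remembering the last accepting position means that a lone "." matches nothing here, so it is left to be returned as an ordinary 1-character token.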


Revision date: 2005-01-30. (Please use ISO 8601, the International Standard.)