 CS 3723/3721
 Programming Languages
 Fall 2004

 Recitation 3
 Lexical Analysis or Scanners
    Week 3: Sep 8-10
 Due (on time): 2004-09-13  23:59:59
 Due (late):        2004-09-17  23:59:59

Recitation 3 must be submitted following directions at: submissions with deadlines
  • 2004-09-13  23:59:59 (that's Monday, 13 September 2004, 11:59:59 pm) for full credit.
  • 2004-09-17  23:59:59 (that's Friday, 17 September 2004, 11:59:59 pm) for 75% credit.


Overview: The first stage of a compiler uses a scanner or lexical analyzer to break the source program into a sequence of basic units called tokens. The scanner reads one character at a time. The scanner also recognizes and discards comments and keeps track of line numbers. The different types of tokens recognized are:


Writing a recognizer for individual tokens: There are automated software tools to produce a lexical analyzer or scanner, such as lex in Unix, but you are not to use any such tool here. The StringTokenizer class that most of you saw in Java is another tool, but it is much too simple -- you should not use it since it would be completely inadequate for this task. (There is another class in Java, StreamTokenizer, that is more powerful and could theoretically be used for this recitation, but you should not use it either. We are studying the ideas behind the implementation of this class.)


Writing a Scanner, or Lexical Analyzer: The terms "scanner" and "lexical analyzer" mean the same thing: a single program that will return the next input token each time it is called on to do so. As needed, the program will input individual characters. Such a scanner just tacks together recognizers for each legal type of token. The scanner also discards comments and keeps track of line numbers. You should have a "driver" function that repeatedly calls the scanner (until eof or a "$" sign as discussed below) and prints the returned token.

Stated another way, a scanner converts a simple input stream of characters into a simple output stream of tokens.
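
As a rough illustration of the scanner/driver split described above, here is a minimal sketch in Java. It is not a solution to the recitation: it only returns 1-character tokens, and the names MiniScanner, MiniDriver, next, and run are made up for this example. It also assumes the whole input is already in a String, which a real scanner reading from a file would not do.

```java
// Hypothetical sketch: a scanner that returns only 1-character tokens,
// skipping whitespace and C-style comments and counting newlines.
class MiniScanner {
    private final String src;   // assumption: whole input held in memory
    private int pos = 0;        // index of the next unread character
    int line = 1;               // current input line number

    MiniScanner(String src) { this.src = src; }

    // Reads one character, or returns -1 at end of input.
    private int read() {
        if (pos >= src.length()) return -1;
        char c = src.charAt(pos++);
        if (c == '\n') line++;
        return c;
    }

    // Looks at the next character without consuming it.
    private int peek() {
        return pos < src.length() ? src.charAt(pos) : -1;
    }

    // Returns the next token character, or -1 at end of input.
    int next() {
        while (true) {
            int c = read();
            if (c == -1) return -1;
            if (Character.isWhitespace((char) c)) continue;
            if (c == '/' && peek() == '*') {
                read();                       // consume the '*'
                int prev = 0;
                while ((c = read()) != -1 && !(prev == '*' && c == '/')) prev = c;
                continue;                     // comment discarded
            }
            return c;
        }
    }
}

class MiniDriver {
    // Collects "token line" pairs until end of input or a '$' sentinel.
    static String run(String src) {
        MiniScanner sc = new MiniScanner(src);
        StringBuilder out = new StringBuilder();
        int t;
        while ((t = sc.next()) != -1 && t != '$') {
            out.append((char) t).append(' ').append(sc.line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(run("= /* a\ncomment */ ;\n+ $"));
    }
}
```

The driver knows nothing about characters, whitespace, or comments; it only sees tokens and the line counter. A full scanner would add recognizers for identifiers and constants in next().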


Recitation work: Writing a full scanner for a language like C or Java is not too hard (it is especially easy conceptually), but it involves an enormous number of details. Instead, in this recitation you are to write a simple scanner that nevertheless includes the main ideas. Your scanner should be written in Java or C++ and it should handle the following tokens:

  1. Identifiers: These should be a letter followed by any number (including zero) of letters or digits. These could later include reserved words. (You can allow an underscore if you wish.)
  2. Floating point constants, including integers: These should include integer constants as a simple special case. They should not include any initial plus or minus sign, which will be a separate token.
  3. Operators and special symbols: All other characters, except for whitespace characters and the characters in a comment, will be returned as 1-character tokens. (For technical reasons explained below, your source should not include the characters "@" or "#", and you may have a single "$" at the end if you wish.)
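
The identifier rule in item 1 can be sketched as a small recognizer. This is only one possible shape, assuming the input is available as a String and using made-up names (IdentifierRecognizer, scanIdentifier); it sticks to ASCII letters and digits and does not allow the optional underscore.

```java
// Hypothetical sketch of a recognizer for: letter (letter | digit)*
class IdentifierRecognizer {
    // Returns the index just past the identifier starting at 'start',
    // or 'start' itself if no identifier begins there.
    static int scanIdentifier(String s, int start) {
        if (start >= s.length() || !isLetter(s.charAt(start))) return start;
        int i = start + 1;
        while (i < s.length() && (isLetter(s.charAt(i)) || isDigit(s.charAt(i)))) i++;
        return i;
    }

    static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    public static void main(String[] args) {
        String s = "cost2004 = 6";
        int end = scanIdentifier(s, 0);
        System.out.println(s.substring(0, end));
    }
}
```

Returning the end index rather than the text lets the caller decide what to do with the characters, for example enter them into the symbol table.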

Because of the third "catch-all" type of token above, your scanner will not need any error messages. It should accept any sequence of tokens at all, including ones that would be insane in any kind of program. (The software will be catching syntax errors later on.) Here are additional requirements for the scanner:

  1. Lexical conventions: Any two tokens of types 1 and 2 above must be separated by at least one comment or one whitespace character, or by any longer combination of them. The scanner must skip over whitespace and comments while looking for a token. The only allowed comments are C-style comments: /* */
  2. Line numbers: Your scanner should keep track of the input line number by counting the newlines that it sees.
  3. Symbol table: Each identifier that your program encounters must be entered into a table of strings, called the symbol table. This symbol table can just be a simple array-based unordered list of strings. In Java or C++ you should not use one of the container classes, but must write this from scratch. I recommend a very simple array-based table, with a fixed maximum size, written as simply as possible. The code for the symbol table must be a separate class in Java or in C++. You need to be able to enter a new identifier and to look an identifier up in the table. There is no need for deletions. For an identifier, the scanner should return a "@" character (see below), and set a global variable equal to an index or pointer giving its location in the symbol table.
  4. Floating point and integer constants: For floating point or integer constants, the software should convert the string of characters that represent the constant into an actual internal double. Then the scanner should return a "#" character (see below) and set a global variable equal to the value of the constant as a double.
  5. What the scanner returns: For single-character tokens that are neither identifiers nor constants, the scanner should just return the character. Notice that single letters will look like identifiers, unless they are the "e" in a floating point constant, and single digits will look like integer constants, unless they are part of an identifier. For identifiers, your scanner should return a "@" character, which will then not be allowed on input. For integer and floating point constants, your scanner should return a "#" character, which again will then not be allowed on input.
  6. Handling returns correctly: The approach above returns just a single character and passes the details of constants and identifiers through global variables. This is not how one should program; instead the scanner should return an object of a class representing a token. This class would include as one field the character above (the special character token, or an "@" for an identifier or a "#" for a constant). In the simple case here there would be another field for the possible index or pointer into the symbol table (in case of an identifier) and a third field for a double as the value (in case of a constant).
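
To make items 3 and 6 concrete, here is a minimal sketch of an array-based symbol table and a token class with the three fields described. The class and method names (SymbolTable, enter, lookup, Token) and the capacity of 100 are assumptions for illustration, not part of the assignment. It also assumes that entering an identifier already in the table returns its existing index, which matches the sample output later on this page.

```java
// Hypothetical sketch: fixed-size, unordered, array-based table of strings.
class SymbolTable {
    private static final int MAX = 100;            // assumed capacity
    private final String[] names = new String[MAX];
    private int count = 0;

    // Returns the index of name, or -1 if it is not in the table.
    int lookup(String name) {
        for (int i = 0; i < count; i++) {
            if (names[i].equals(name)) return i;
        }
        return -1;
    }

    // Returns the index of name, adding it first if it is new.
    int enter(String name) {
        int i = lookup(name);
        if (i >= 0) return i;
        if (count == MAX) throw new IllegalStateException("symbol table full");
        names[count] = name;
        return count++;
    }

    int size() { return count; }

    String nameAt(int i) { return names[i]; }
}

// Hypothetical token class with the three fields described in item 6.
class Token {
    final char kind;      // '@' identifier, '#' constant, else the character itself
    final int index;      // symbol table index, or -1 if not an identifier
    final double value;   // value of a constant, or 0.0 otherwise

    private Token(char kind, int index, double value) {
        this.kind = kind;
        this.index = index;
        this.value = value;
    }

    static Token identifier(int index) { return new Token('@', index, 0.0); }
    static Token constant(double value) { return new Token('#', -1, value); }
    static Token symbol(char c) { return new Token(c, -1, 0.0); }
}
```

A linear search is slow for large tables, but it keeps the code short, which is what the recommendation above asks for.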


What you should submit: Refer to the submissions directions and to deadlines at the top of this page. The text file that you submit should first have Your Name, the Course Number, and the Recitation Number. The rest of the file should have the following in it, in the order below, and clearly labeled, including at the beginning the appropriate item letters: a, b, and c. (In case you do c you may omit b.)

 Contents of submission for Recitation 3:

Last Name, First Name; Course Number; Recitation Number (3).

a. The Java or C++ source files for your scanner. Everything should be run together into one file, with reasonable separators between components (the separate source files). The code should be reasonably organized and written, with special emphasis on header comments. (Not much emphasis on inline comments.)

b. You should give the results of a run using the following simple source file for input. This might be a goal for an initial version of your scanner.

    
    rate = 14;
    time = 2;
    distance = rate * time;
    cost = 6*distance;
    $
    

c. You should give the results of a run using the following more complex source file for the input. (If you do this part, you can skip part b above.)

    
    rate = 14.3; /* initial rate */
    time0 = 2.;  /* initial time */
    distance = rate * time0; /* distance is rate times time */
    cost2004 =/***/.6e-2*distance/* */+/**/2e2; /* cost for 2004 */
    $ /* optional, in case you need an eof sentinel */
    

Output format: In b and c above, your output should list each scanned token: the token character, then the symbol table location in the case of an identifier or the value as a double in the case of a constant, and finally the line number. At the end you must also print the symbol table. Thus the output for c above might look like the following. (Below, I put in extra spaces for clarity, but you don't need them. Note that I generated the output below by hand, so it might have errors in it.)

    
    Token Value     Line Number
    @     0         1
    =     -         1
    #     14.3      1
    ;     -         1
    @     1         2
    =     -         2
    #     2.0       2
    ;     -         2
    @     2         3
    =     -         3
    @     0         3
    *     -         3
    @     1         3
    ;     -         3
    @     3         4
    =     -         4
    #     0.0060    4
    *     -         4
    @     2         4
    +     -         4
    #     200.0     4
    ;     -         4
    
    Symbol Table
    0   rate
    1   time0
    2   distance
    3   cost2004
    


Key ideas: The tokens that are recognized by the lexical level of a programming language can be described by finite state machines (FSMs). These FSMs can be used to write the code for an actual recognizer (a program that recognizes and returns a legal sequence of tokens, without worrying about whether the tokens make any sense).
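
As one illustration of turning an FSM into code, the sketch below encodes a possible machine for unsigned floating point constants such as 14.3, 2., .6e-2, and 2e2. The state names, the class NumberFsm, and the exact set of accepted forms are assumptions inferred from the sample input on this page, not a specification. For brevity it hands the matched characters to Double.parseDouble, a library shortcut; a scanner written for this recitation might instead be expected to build the double itself.

```java
// Hypothetical sketch: an FSM for unsigned floating point constants.
class NumberFsm {
    // States of the machine; INT, FRAC and EXP are accepting.
    static final int START = 0, INT = 1, DOT = 2, FRAC = 3,
                     E = 4, SIGN = 5, EXP = 6, DEAD = 7;

    // Transition function: next state for the current state and input character.
    static int step(int state, char c) {
        boolean d = c >= '0' && c <= '9';
        switch (state) {
            case START: return d ? INT : c == '.' ? DOT : DEAD;
            case INT:   return d ? INT : c == '.' ? FRAC : isE(c) ? E : DEAD;
            case DOT:   return d ? FRAC : DEAD;           // "." alone is not a number
            case FRAC:  return d ? FRAC : isE(c) ? E : DEAD;
            case E:     return d ? EXP : (c == '+' || c == '-') ? SIGN : DEAD;
            case SIGN:  return d ? EXP : DEAD;
            case EXP:   return d ? EXP : DEAD;
            default:    return DEAD;
        }
    }

    static boolean isE(char c) { return c == 'e' || c == 'E'; }

    static boolean accepting(int s) { return s == INT || s == FRAC || s == EXP; }

    // Returns the index just past the longest constant starting at 'start',
    // or 'start' itself if no constant begins there.
    static int scanNumber(String s, int start) {
        int state = START, end = start;
        for (int i = start; i < s.length(); i++) {
            state = step(state, s.charAt(i));
            if (state == DEAD) break;
            if (accepting(state)) end = i + 1;   // remember last accepting position
        }
        return end;
    }

    // Converts the matched characters to a double (library shortcut).
    static double valueOf(String s, int start, int end) {
        return Double.parseDouble(s.substring(start, end));
    }

    public static void main(String[] args) {
        String s = ".6e-2*distance";
        int end = scanNumber(s, 0);
        System.out.println(s.substring(0, end) + " -> " + valueOf(s, 0, end));
    }
}
```

Remembering the last accepting position gives longest-match behavior: on input such as 2e; the machine backs up to the constant 2 and leaves the e for the next token.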


Revision date: 2004-06-29. (Please use ISO 8601, the International Standard.)