CS 5363 Programming Languages and Compilers
Scanner


Overview: The first stage of a compiler uses a scanner, or lexical analyzer, to break the source program into a sequence of basic units called tokens. The scanner also recognizes and discards comments and keeps track of line numbers. The different types of tokens recognized are:

Writing a recognizer for individual tokens: The most common way to write such a recognizer is to first write a finite state machine (FSM) that describes the token. The FSM is then converted to a program, either by hand or with an automated software tool (such as lex in Unix).

C-style comments: In C, the material between an initial /* and a terminal */ forms a comment. A recognizer for these constructs is illustrated here: C-style comments.
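As a rough illustration of how such an FSM might be converted to a program by hand, here is a minimal sketch in C. The function name skip_comment, the state names, and the convention of returning -1 for an unterminated comment are assumptions made for this example, not part of the assignment.

```c
#include <stddef.h>

/* States of a small FSM for the body of a comment (names are illustrative). */
enum comment_state { IN_BODY, SAW_STAR };

/*
 * Given text s and an index pos where s[pos] == '/' and s[pos + 1] == '*',
 * return the index just past the closing star-slash pair,
 * or -1 if the text ends before the comment is closed.
 */
long skip_comment(const char *s, size_t pos)
{
    enum comment_state state = IN_BODY;
    size_t i = pos + 2;                 /* step over the opening slash-star */

    while (s[i] != '\0') {
        char c = s[i++];
        switch (state) {
        case IN_BODY:
            if (c == '*') state = SAW_STAR;
            break;
        case SAW_STAR:
            if (c == '/') return (long)i;   /* comment closed */
            if (c != '*') state = IN_BODY;  /* a run of stars stays in SAW_STAR */
            break;
        }
    }
    return -1;                          /* end of text inside the comment */
}
```

The SAW_STAR state is what lets a comment such as /***/ close correctly: each extra star keeps the machine in that state until a slash arrives.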

Tokens for this assignment:

The tokens listed above were defined earlier, except for floating point constants. A floating point constant consists of any number (including 0) of digits, followed by an optional decimal point (.), followed by any number (including 0) of digits, followed by an optional exponent part. The exponent part is an e or E, followed by an optional sign (+ or -), followed by 1 or more digits. If there is a decimal point, there must be at least one digit before or after it (or both); if there is no decimal point, there must be at least one digit. There must be either a decimal point or an exponent part (or both). For example, 3., .5, 1e10, and 2.5E-3 are legal under these rules, while 123, a lone decimal point, and 1e are not.

Actual floating point constants in Java can also have an optional trailing "f" or "F" for "float" constants and an optional trailing "d" or "D" for "double" constants. (With no trailing letter the constant is double by default.) You should ignore these possibilities. You should also ignore the limitation on the size of the exponent, so the last illegal constant in the program below would be accepted by your FSM. Finally, treat an initial sign (+ or -) as a separate operator rather than as part of the constant.

The Program for this assignment: For this assignment you are to write a program in C, C++, or Java that reads source containing the above tokens and returns each token in turn. The source will have the four types of tokens intermingled, along with comments, blanks, and other whitespace which should be ignored.

You can create an FSM that describes the various tokens above. Only the floating point constant, as described above, is complicated. This token must be handled with some care to cover the various possibilities mentioned above. You can either write a "pure" FSM (fairly complicated, with quite a few states), or an FSM with additional constraints written in, such as "at least one of this or that". (This is easier, but less formal.)
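The second, less formal approach might look something like the following sketch in C, where digit counts and two flags take the place of the extra states a "pure" FSM would need. The name match_float and its return convention (length of the match, or 0 for no match) are assumptions for illustration. The sketch also makes one choice the assignment does not spell out: an e or E that is not followed by a valid exponent is left unconsumed rather than treated as an error.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Return the length of the floating point constant that starts at s,
 * or 0 if no such constant starts there.
 */
size_t match_float(const char *s)
{
    size_t i = 0;
    size_t int_digits = 0, frac_digits = 0;
    int has_point = 0, has_exp = 0;

    /* Digits before the decimal point (possibly none). */
    while (isdigit((unsigned char)s[i])) { i++; int_digits++; }

    /* Optional decimal point and the digits after it (possibly none). */
    if (s[i] == '.') {
        has_point = 1;
        i++;
        while (isdigit((unsigned char)s[i])) { i++; frac_digits++; }
    }

    /* Constraint: at least one digit before or after the point. */
    if (int_digits + frac_digits == 0) return 0;

    /* Optional exponent: e or E, optional sign, one or more digits. */
    if (s[i] == 'e' || s[i] == 'E') {
        size_t j = i + 1;
        if (s[j] == '+' || s[j] == '-') j++;
        if (isdigit((unsigned char)s[j])) {
            while (isdigit((unsigned char)s[j])) j++;
            has_exp = 1;
            i = j;
        }
        /* Otherwise the letter is not part of this constant; i is unchanged. */
    }

    /* Constraint: a decimal point or an exponent part (or both). */
    if (!has_point && !has_exp) return 0;
    return i;
}
```

For instance, match_float("1.5e-3x") consumes the six characters of 1.5e-3 and stops at the x, while match_float("123") returns 0 because the text has neither a decimal point nor an exponent.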

As part of this assignment, there must be "semantic" code along with the code that recognizes a floating point constant, so that your program will produce the value (as a double) of the string of characters describing a floating point constant. This value must be calculated "from scratch" without using any fancy C, C++, or Java library functions.

Similarly, the part that handles identifiers must have extra semantic code that looks up the identifier in a symbol table and inserts it if it is not there. (You should keep your symbol table simple: just an array of strings, and sequential search will be fine.)
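A symbol table of this kind could be sketched in C as follows. The capacity, the maximum name length, and the name lookup_or_insert are arbitrary choices made for this example; names longer than the limit are simply truncated here.

```c
#include <string.h>

#define MAX_SYMBOLS 100    /* capacity chosen arbitrarily for this sketch */
#define MAX_NAME_LEN 32    /* longer names are truncated in this sketch */

static char symbols[MAX_SYMBOLS][MAX_NAME_LEN];
static int symbol_count = 0;

/*
 * Return the index of name in the table, inserting it first if it is
 * not already there. Returns -1 if the table is full.
 */
int lookup_or_insert(const char *name)
{
    int i;

    /* Sequential search over the entries stored so far. */
    for (i = 0; i < symbol_count; i++) {
        if (strncmp(symbols[i], name, MAX_NAME_LEN - 1) == 0)
            return i;
    }

    if (symbol_count == MAX_SYMBOLS)
        return -1;

    /* Not found: copy the name into the next free slot. */
    strncpy(symbols[symbol_count], name, MAX_NAME_LEN - 1);
    symbols[symbol_count][MAX_NAME_LEN - 1] = '\0';
    return symbol_count++;
}
```

Calling lookup_or_insert twice with the same identifier returns the same index both times, so the table never holds duplicates.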

If the actual code for your scanner is called scan, then you will need an extra "driver" program that just repeatedly calls scan and prints the returned token, until end-of-file. (If you have trouble with end-of-file, just use a '#' character to mark the end.)

In order to return the token from the scanner, you could use a pointer to a struct (in C), a pointer to a class (in C++), or a class (in Java). In all three cases, the struct or class could contain a character, a double, and a string. In case a double is to be returned, set the char to 'd' and the double to the return value. In case of an identifier, set the char to 'i' and the string to the characters of the identifier. All other cases consist of a single-character token, and in this case the value of the double or string does not matter.
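One possible C layout for such a token is sketched below. The struct and function names, the identifier length limit, and the output format are all assumptions for illustration. For brevity the helpers return the struct by value; a scanner that returns a pointer to a struct, as suggested above, could fill in the same fields.

```c
#include <stdio.h>
#include <string.h>

#define MAX_IDENT 32    /* assumed limit on identifier length */

struct token {
    char kind;              /* 'd' = double, 'i' = identifier, else the character itself */
    double value;           /* meaningful only when kind is 'd' */
    char text[MAX_IDENT];   /* meaningful only when kind is 'i' */
};

struct token make_double(double v)
{
    struct token t = {0};
    t.kind = 'd';
    t.value = v;
    return t;
}

struct token make_ident(const char *name)
{
    struct token t = {0};
    t.kind = 'i';
    strncpy(t.text, name, MAX_IDENT - 1);   /* t.text stays null-terminated */
    return t;
}

struct token make_char(char c)
{
    struct token t = {0};
    t.kind = c;
    return t;
}

/* Write a one-line description of the token into buf (at most n bytes). */
void format_token(const struct token *t, char *buf, size_t n)
{
    if (t->kind == 'd')
        snprintf(buf, n, "double %g", t->value);
    else if (t->kind == 'i')
        snprintf(buf, n, "identifier %s", t->text);
    else
        snprintf(buf, n, "char %c", t->kind);
}
```

A driver could call format_token on each token returned by the scanner and print the resulting line.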

What to turn in: You should turn in the source code and a run of your program using the sample input source here, or as shown below under "Test Input". The run should print each token in turn. It would be nice to also print your symbol table.

Hints on calculating the value of a scanned double: In C it would be possible to use a library function to convert a string of characters representing a floating point number to an actual double. You could use sscanf for example. In ordinary code it is usually a good idea to use library functions rather than rewrite them from scratch. For this assignment, however, we are studying some of the low-level language mechanisms, and here you are not to use the library function. Instead, you should realize that given the code:

the variable i takes on the integer value 4. At each stage of reading the initial digits of the constant, you can multiply by 10 and add in the next integer value. Then you have to take the "." and the exponent part appropriately into account.
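The multiply-by-10 idea can be sketched in C roughly as follows. This is only one way to organize the calculation, under some assumptions: the digit value is obtained by subtracting '0' from the digit character, the input has already been recognized as a well-formed constant, and the exponent is small enough to fit in an int. The names float_value and is_digit are invented for this example.

```c
/* Digit test written by hand, so that no library routine is needed. */
static int is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/*
 * Compute the value of a constant such as "12.5e-3".
 * Assumes s has already been accepted by the recognizer.
 */
double float_value(const char *s)
{
    double mantissa = 0.0;
    int frac_digits = 0;    /* digits seen after the decimal point */
    int exponent = 0;
    int exp_sign = 1;
    double result;
    int i;

    /* Digits before the point: multiply by 10 and add the next digit. */
    while (is_digit(*s))
        mantissa = mantissa * 10.0 + (*s++ - '0');

    /* Digits after the point: same accumulation, but count them. */
    if (*s == '.') {
        s++;
        while (is_digit(*s)) {
            mantissa = mantissa * 10.0 + (*s++ - '0');
            frac_digits++;
        }
    }

    /* Exponent part: optional sign, then an integer built the same way. */
    if (*s == 'e' || *s == 'E') {
        s++;
        if (*s == '+') {
            s++;
        } else if (*s == '-') {
            exp_sign = -1;
            s++;
        }
        while (is_digit(*s))
            exponent = exponent * 10 + (*s++ - '0');
    }

    /* Net power of ten: the written exponent minus the fractional digits. */
    exponent = exp_sign * exponent - frac_digits;

    result = mantissa;
    for (i = 0; i < exponent; i++) result *= 10.0;
    for (i = 0; i > exponent; i--) result /= 10.0;
    return result;
}
```

Treating 12.5 as the integer 125 scaled by one power of ten keeps the digit loop identical on both sides of the decimal point. Repeated multiplication or division by 10 can accumulate small rounding errors for long constants, so the result may differ in the last bits from what a library routine would produce.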

Java also has library functions (such as Double.parseDouble) that should not be used. The same trick also works in Java.

Test Input:


Revision date: 2002-08-27. (Use ISO 8601, an International Standard.)