CS 5363: The TL05 Language
Project Overview
In this class, you will write an optimizing compiler for a
toy programming language, which we will call TL05, targeting the
SPIM MIPS simulator.
This project is tentatively divided into six parts. The first three involve
work on the front end (scanning and parsing, type checking, and IR design
and generation), and the last three involve work on optimization and
code generation.
The TL05 programming language is based on the P2K language used by
Michael Franz in his Advanced Compiler Construction class at the University
of California, Irvine. The P2K language was a simplified subset of
Pascal, and TL05 has been simplified even further, so that you can
experience writing an optimizing compiler without getting burdened by
all of the complexities and details of a complete, standard programming
language.
Language Features
Lexical Features
The TL05 language is lexically simple. All lexical items are
to be separated by whitespace (space, tab, or return). All identifiers
start with a lowercase letter and may contain only digits and lowercase
letters. All keywords are made up of capital letters. In
addition, the symbols "[", "]", "(", ")", ":=", and ";" are used.
Data Types
TL05 supports 32-bit integers ("INT"), booleans ("BOOL"), arrays
of these, and, in the original design, arrays of arrays of these, etc.
(the current grammar allows only single-dimensional arrays; see
Errata/Clarifications). The syntax for an array type is
"ARRAY num OF <type>", where num is the size of the
array and <type> is the type of the elements of that array. Variables
are always declared to be of a particular type. There is also
a formal set of type rules.
Operators
TL05 has several infix binary operators that work on integer
operands. The multiplication "MUL", division "DIV", modulus "MOD",
addition "PLUS", and subtraction "MINUS" produce integer results.
The comparison operators (i.e., equals "EQ", not equal "NE", less than
"LT", less-than or equal-to "LTE", greater than "GT", and greater-than
or equal-to "GTE") all produce boolean results.
Control Structures
TL05 is a structured programming language. The only control structures
supported are IF and WHILE statements. Both take a boolean expression
that guards the body of the control structure. In the case of an IF
statement, the statements after the THEN are executed if the expression
is true, and the statements after the ELSE (if there is one) are executed
if the expression is false. In the case of the WHILE statement,
the loop is exited if the expression is false; otherwise, the body
is executed, and then the expression is re-evaluated.
Assignment
Assignments are a kind of statement rather than a kind of operator.
The ":=" symbol separates the left-hand side (which is
the variable or array element being assigned to) from the right-hand
side, which is an expression that must be of the same type as the
left-hand side.
Built-in Procedures
TL05 does not support user-defined functions or procedures, but it
does support two built-in procedures, WRITEINT and WRITELN, that
output an integer or a newline to the console (respectively), and
one built-in function, READINT, that reads an integer
from the console. The syntax for these is hard-coded into TL05's
BNF grammar.
Lexical Elements
Note: If the definition of a lexical element is in quotes, then it
is meant to match the contained string exactly. Otherwise, it is
a regular expression. Square brackets in regular expressions are
used as an abbreviation for matching ranges of letters. For example,
[0-9] matches any digit, and [a-zA-Z] matches any English letter
in capital or lower case.
Numbers, Literals, and Identifiers:
- num = 0|([1-9][0-9]*)
- lit = 0|([1-9][0-9]*)|(-[1-9][0-9]*)|(FALSE)|(TRUE)
- ident = [a-z][a-z0-9]*
Symbols:
- LB = "["
- RB = "]"
- LP = "("
- RP = ")"
- ASGN = ":="
- SC = ";"
Operators:
- OP2 = "MUL" | "DIV" | "MOD"
- OP3 = "PLUS" | "MINUS"
- OP4 = "EQ" | "NE" | "LT" | "GT" | "LTE" | "GTE"
Keywords:
- IF = "IF"
- THEN = "THEN"
- ELSE = "ELSE"
- BEGIN = "BEGIN"
- END = "END"
- WHILE = "WHILE"
- DO = "DO"
- PROGRAM = "PROGRAM"
- VAR = "VAR"
- AS = "AS"
- ARRAY = "ARRAY"
- OF = "OF"
- INT = "INT"
- BOOL = "BOOL"
Built-in Procedures:
- WRITEINT = "WRITEINT"
- WRITELN = "WRITELN"
- READINT = "READINT"
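The definitions above can be turned into a whitespace-based scanner fairly directly. The following is a minimal sketch in Python, not a required design: the names `classify` and `scan` are hypothetical, and the choice to report non-negative integers as "num" (leaving it to the parser to accept a num wherever a lit is allowed) is an assumption made here because every num is also a valid lit.

```python
import re

# Spellings copied from the lexical element lists above.
SYMBOLS = {"[": "LB", "]": "RB", "(": "LP", ")": "RP", ":=": "ASGN", ";": "SC"}
OPERATORS = {"MUL": "OP2", "DIV": "OP2", "MOD": "OP2",
             "PLUS": "OP3", "MINUS": "OP3",
             "EQ": "OP4", "NE": "OP4", "LT": "OP4",
             "GT": "OP4", "LTE": "OP4", "GTE": "OP4"}
KEYWORDS = {"IF", "THEN", "ELSE", "BEGIN", "END", "WHILE", "DO", "PROGRAM",
            "VAR", "AS", "ARRAY", "OF", "INT", "BOOL",
            "WRITEINT", "WRITELN", "READINT"}

NUM = re.compile(r"0|[1-9][0-9]*")
LIT = re.compile(r"0|[1-9][0-9]*|-[1-9][0-9]*|FALSE|TRUE")
IDENT = re.compile(r"[a-z][a-z0-9]*")


def classify(word):
    """Return the token kind for one whitespace-delimited word."""
    if word in SYMBOLS:
        return SYMBOLS[word]
    if word in OPERATORS:
        return OPERATORS[word]
    if word in KEYWORDS:
        return word  # each keyword is its own token kind
    if NUM.fullmatch(word):
        return "num"  # assumption: the parser also accepts num as a lit
    if LIT.fullmatch(word):
        return "lit"
    if IDENT.fullmatch(word):
        return "ident"
    raise ValueError(f"illegal token: {word!r}")


def scan(source):
    """Split on whitespace and pair each word with its token kind."""
    return [(classify(word), word) for word in source.split()]
```

For example, `scan("x := 10 PLUS -3 ;")` yields the kinds ident, ASGN, num, OP3, lit, SC, while a word such as `01` or `Foo` is rejected with a `ValueError`.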
BNF Grammar
<program> ::= PROGRAM ident <declarations> BEGIN <statementSequence> END
<declarations> ::= VAR ident AS <type> SC <declarations>
| ε
<type> ::= ARRAY num OF INT
| ARRAY num OF BOOL
| INT
| BOOL
<statementSequence> ::= <statement> SC <statementSequence>
| ε
<statement> ::= <assignment>
| <ifStatement>
| <whileStatement>
| <writeInt>
| <writeLn>
<assignment> ::= <memCell> ASGN <expression>
| <memCell> ASGN READINT
<memCell> ::= ident
| ident LB <expression> RB
<ifStatement> ::= IF <expression> THEN <statementSequence> <elseClause> END
<elseClause> ::= ELSE <statementSequence>
| ε
<whileStatement> ::= WHILE <expression> DO <statementSequence> END
<writeInt> ::= WRITEINT <expression>
<writeLn> ::= WRITELN
<expression> ::= <simpleExpression>
| <simpleExpression> OP4 <simpleExpression>
<simpleExpression> ::= <term> OP3 <term>
| <term>
<term> ::= <factor> OP2 <factor>
| <factor>
<factor> ::= <memCell>
| lit
| LP <expression> RP
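One way to read the expression productions is as a recursive-descent recognizer, with one method per nonterminal. The sketch below is illustrative only: the class `ExprParser` and the helper `is_expression` are hypothetical names, the input is assumed to be a list of token kinds (such as "ident", "OP3", "LB"), and a "num" token is assumed to be acceptable wherever the grammar asks for lit. It only accepts or rejects; it builds no tree.

```python
class ParseError(Exception):
    pass


class ExprParser:
    """Recognizer for <expression>; one method per nonterminal."""

    def __init__(self, kinds):
        self.kinds = list(kinds)
        self.pos = 0

    def peek(self):
        return self.kinds[self.pos] if self.pos < len(self.kinds) else None

    def expect(self, kind):
        if self.peek() != kind:
            raise ParseError(f"expected {kind} at token {self.pos}")
        self.pos += 1

    def expression(self):
        # <simpleExpression> [ OP4 <simpleExpression> ]
        self.simple_expression()
        if self.peek() == "OP4":
            self.pos += 1
            self.simple_expression()

    def simple_expression(self):
        # <term> [ OP3 <term> ]
        self.term()
        if self.peek() == "OP3":
            self.pos += 1
            self.term()

    def term(self):
        # <factor> [ OP2 <factor> ]
        self.factor()
        if self.peek() == "OP2":
            self.pos += 1
            self.factor()

    def factor(self):
        kind = self.peek()
        if kind in ("lit", "num"):  # assumption: num counts as a lit here
            self.pos += 1
        elif kind == "ident":  # <memCell>: ident or ident LB <expression> RB
            self.pos += 1
            if self.peek() == "LB":
                self.pos += 1
                self.expression()
                self.expect("RB")
        elif kind == "LP":
            self.pos += 1
            self.expression()
            self.expect("RP")
        else:
            raise ParseError(f"unexpected {kind} at token {self.pos}")


def is_expression(kinds):
    """True if the whole token-kind sequence derives from <expression>."""
    parser = ExprParser(kinds)
    try:
        parser.expression()
    except ParseError:
        return False
    return parser.pos == len(parser.kinds)
```

Note that each production allows at most one operator per level, so a chain such as `a PLUS b PLUS c` is rejected by this grammar unless it is parenthesized, for example `( a PLUS b ) PLUS c`.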
Informal Semantics
- Only those variables which have been declared can be assigned
to or used.
- All INT variables and array elements are considered to have
initial values of "0".
- All BOOL variables and array elements are considered to have
initial values of "false".
- All binary operators operate on signed integer operands:
- "x MUL y" results in the product of x and y.
- "x DIV y" results in the integer quotient of x divided by y.
- "x MOD y" results in the remainder of x divided by y
(such that if r = x MOD y and q = x DIV y, then x = y * q + r,
|r| < |y|, and the sign of r is the sign of x).
- "x PLUS y" results in the sum of x and y.
- "x MINUS y" is the difference of y subtracted from x.
- "x EQ y" is true if x and y are the same, otherwise it is false.
- "x NE y" is false if x and y are the same, otherwise it is true.
- "x LT y" is true if x is less than y, otherwise it is false.
- "x GT y" is true if x is greater than y, otherwise it is false.
- "x LTE y" is true if x is less than or equal to y, otherwise it is false.
- "x GTE y" is true if x is greater than or equal to y, otherwise it is false.
- An assignment updates the current value of the variable or array element
denoted by its left-hand side to be the value resulting from evaluating
the right-hand side.
- IF statements evaluate their expression. If the expression is true,
the "then-statements" are executed; if it is false, the
"else-statements" (if any) are executed.
- A WHILE statement first evaluates its expression. If it is false,
execution continues after the end of the WHILE statement. If
it is true, the statements in the body of the WHILE loop are
executed. After they finish executing, the expression is
re-evaluated. As long as the expression is true, the process
repeats itself, alternately evaluating the expression and
executing the statements in the body. Once the expression is false,
execution continues after the end of the WHILE loop.
- WRITEINT evaluates its expression and outputs the result to the console.
- WRITELN causes the console to move its cursor to the
beginning of the next line.
- READINT reads an integer from the console and updates an integer
variable to hold that value.
- Valid array indices are 0-based and less than the declared bounds.
If, at runtime, an index expression applied to a variable does not
evaluate to a non-negative number less than the bound declared for that
variable, the program should abort with an error message.
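Two of the rules above are easy to get wrong in a reference interpreter or constant folder: the DIV/MOD rule (the remainder takes the sign of x, so the quotient truncates toward zero) and the array bounds check. The following Python sketch illustrates both. The function names are hypothetical, and it makes two assumptions the rules above do not settle: division by zero is not handled, and 32-bit overflow is ignored. Python's own `//` and `%` round toward negative infinity, so they cannot be used directly.

```python
def tl05_div(x, y):
    """Integer quotient of x by y, truncated toward zero (assumes y != 0)."""
    q = abs(x) // abs(y)
    return q if (x < 0) == (y < 0) else -q


def tl05_mod(x, y):
    """Remainder r with x == y * q + r, |r| < |y|, and the sign of x."""
    return x - y * tl05_div(x, y)


def checked_index(index, bound):
    """Return index if 0 <= index < bound; otherwise abort with an error."""
    if not 0 <= index < bound:
        raise IndexError(f"array index {index} out of bounds 0..{bound - 1}")
    return index
```

For instance, `tl05_div(-7, 2)` is -3 and `tl05_mod(-7, 2)` is -1, whereas Python's `-7 // 2` is -4 and `-7 % 2` is 1. In a compiler targeting SPIM, the same bounds test would be emitted as a compare-and-branch before each array access rather than raised as an exception.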
Errata/Clarifications
- The ε's (epsilons) indicating empty alternatives
did not print out on the handouts distributed in class.
- The lexical category of "num" is a subset of "lit". This prevents
the scanner from being able to disambiguate these two (at least, not
without help from the parser).
- Normally, tokens/words/terminals should be separated by whitespace,
but you may elect to recognize the symbols LB "[", RB "]", LP "(",
RP ")", ASGN ":=", and SC ";" as if they were surrounded by
whitespace even when they are adjacent to other tokens. If you choose
to do this, you should note this in the email accompanying part #1 of
your project.
- I changed the <type> production so that only single-dimensional arrays are possible.
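If you elect to recognize the six symbols without surrounding whitespace, one simple approach is a pre-pass that pads each symbol with spaces before splitting on whitespace. This is a minimal sketch of that idea in Python; the name `split_words` is hypothetical, and other designs (such as a character-by-character scanner) work equally well.

```python
import re

# ":=" is listed first so it is matched as a single two-character symbol.
SYMBOL = re.compile(r"(:=|[\[\]();])")


def split_words(source):
    """Split source into words, treating each symbol as its own word."""
    padded = SYMBOL.sub(r" \1 ", source)
    return padded.split()
```

With this pre-pass, `split_words("a[i]:=(b PLUS 1);")` produces the words `a`, `[`, `i`, `]`, `:=`, `(`, `b`, `PLUS`, `1`, `)`, `;`, the same sequence a fully whitespace-separated program would give.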