Reducing Field Table Size
in the Icon Translator
Richard Hatch and Clinton L. Jeffery
Technical Report
CS-95-14
December 16, 1995
Abstract
Division of Computer Science
The University of Texas at San
Antonio
San Antonio, TX 78255
Implementors of a programming language face many complex tasks and must prioritize their work. As an implementation matures, areas of needed improvement come up during use. Also, when a language is used for larger-scale projects than it was designed for, naive implementations may cause problems. This paper explores solutions to a problem with the implementation of records in the Icon programming language that only came up when the language was used in a large (25k lines) program. Icon is a very high level language developed at the University of Arizona[2].
The current version of the Icon translator uses a table to identify which fields are in which records when an Icon program uses records. The structure of this table, called the field table, is an N by M array when there are M records with a total of N field names. This table contains the field offset, an index into an array of slots in the record, for each field in each record type. For example, in the field table below, in record type a the field named m is declared first. So its field offset within the record is listed as 0. The field named n is declared next so the field offset for that field is 1, and so on. If a record does not contain a field, an offset value of -1 is used to indicate this. In this way, it is easy to see if record X has a field named Y by looking at the Yth by Xth entry of the field table. If the value there is non-negative, then the field exists within the record. If the value is -1, then the field does not exist within the record.
For example, the following set of Icon record declarations
record a (m, n, o, p) record b (o, p, q, r) record c (s, t, u, m)result in this field table, which we will use in later examples:
| a | b | c | |
|---|---|---|---|
| m | 0 | -1 | 3 |
| n | 1 | -1 | -1 |
| o | 2 | 0 | -1 |
| p | 3 | 1 | -1 |
| q | -1 | 2 | -1 |
| r | -1 | 3 | -1 |
| s | -1 | -1 | 0 |
| t | -1 | -1 | 1 |
| u | -1 | -1 | 2 |
Each row represents a field name found in one or more record types, represented by the columns. From this table, it can be shown that the field named n occurs only in record type a. It can also be shown that the field names o and p occur in both record types a and b. Finally, we know that record type c does not contain a field named o because of the value of -1 in the table at the intersection of row o and column c.
The Icon translator generates this table and outputs it as part of the code used by the interpreter, called icode. For programs with a large number of records, the field table grows very large. Since it is highly unlikely that several record types will share common field names, the non-negative values sparsely populate the field table. For example, for a program with 3 record types using a total of 9 field names (as in the example given above), the field table has 3(9)=27 entries. Of these, only 12 identify the existence of a field (because there are 12 field names in the record declarations). It is worth pointing out that if none of the field names had been reused, there would have been additional rows in the field table, but the same number of entries would have been non-negative, creating an even more sparsely populated table.
This study examines possible methods of reducing the amount of space required for the field table and using that space more efficiently.
One method that can be used to reduce the amount of space required by the field table will be referred to in this study as the squish algorithm. This algorithm uses the same method of identifying whether a field exists within a given record type as the original field table, that is, non-negative values still represent the absence of a field in a record. However, in this algorithm, the translator concatenates, or squishes, all the field rows together after stripping the initial and trailing negative values from each row of the original field table, leaving both the middle negative and non-negative values unaltered.
To show the effect this has on the field table, we consider its effect on the example given in the introduction. The new field table, which has been transformed into a 1-D array with auxiliary bookkeeping info, is shown here. Rows from the original field table have had their leading and trailing negative values removed and then the rows are laid end-to-end. The field names are printed under the first element of the array that applies to that field.
| m | n | o | p | q | r | s | t | u | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| field array | 0 | -1 | 3 | 1 | 2 | 0 | 3 | 1 | 2 | 3 | 0 | 1 | 2 |
This algorithm reduces the size of the field table by eliminating some of the negative values from the table. The only negative values remaining in the table are those included in a row between two other non-negative values, as in the first field of the example. However, the example also shows that when this type of compression is done, some additional information is required to make the new array useful. This additional information can be contained within three new arrays.
| m | n | o | p | q | r | s | t | u | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| field offset | 0 | 3 | 4 | 6 | 8 | 9 | 10 | 11 | 12 | ||||
| record offset | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 2 | 2 | ||||
| field length | 3 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | ||||
The first of these additional arrays will contain the field offset for each field name. That is, at what index in the new field array does the information for each field start? If each row of the original field table is stripped of its beginning and ending negative values and concatenated end to end, we can no longer simply look at the row and column intersection as if it were a conventional table. Therefore, the field offset array will contain the information needed to find the same "row" of information in the new field table.
Also, since the negative values are being removed from the beginning of each
field row, we keep track of the number of elements removed so that we can
calculate the correct record number for each entry. This is called the record
offset. For example, if we were to determine whether record type c
contains a field named u (it does) the field offset value gives us a
starting index of 12 in the field array and type c's index = 12 + 2 - 2 =
12, from the equation fo[field] + record - ro[field]. Note that the test
ro[field] <= record must be true or the field reference is
invalid because the record number references one of the -1 we stripped off the
front of the field table.
Finally, since the trailing -1 entries in each row have been eliminated, these entries must be accounted for. This can be done by simply noting how many of the array elements are valid for a given field name. This will be called the field length. This value is used only to check the validity of a value obtained by the other two arrays. If the record number is greater than record offset plus the field length, the value retrieved was not for the correct record, and the value is invalid and the field does not exist in the record.
To determine how well the squish algorithm performs, the size of the icode generated by the Icon translator is shown in the table below for several Icon programs which use records extensively.
| Program | Total icode | Field table | # records | # fields | Density % |
|---|---|---|---|---|---|
| ged | 239013 | 3256 | 11 | 74 | 12.04 |
| vib | 443145 | 14036 | 29 | 121 | 9.86 |
| idol | 103883 | 5600 | 25 | 56 | 11.50 |
| Freedom | 1731684 | 464784 | 138 | 842 | 1.20 |
| Program | Total icode | Field table | Total field info | Density (%) |
|---|---|---|---|---|
| ged | 237125 | 476 | 1368 | 82.35 |
| vib | 432829 | 2264 | 3720 | 61.13 |
| idol | 100531 | 1572 | 2248 | 40.97 |
| Freedom | 1296268 | 19260 | 29368 | 28.85 |
Both ged and vib are in the standard ipl program library. They are GUI programs that use the Icon widgets. Idol is an object oriented extension to the Icon language developed by the second author. Freedom in the Galaxy is a program that was developed by a Software Engineering class at the University of Texas at San Antonio. It too is a GUI program and uses more records than any other Icon program tested.
Using the squish algorithm eliminates many unused (-1) field table entries, increasing densities on our example programs from 1-12% to 29-82%. The next logical step in reducing space is to compress the field rows together. This can be done by treating each row as a piece of a one-dimensional jigsaw puzzle, a technique developed for selector table indexing in object-oriented systems[1]. Going back to the field table example first presented in the introduction, notice how "holes" of negative values can be present in the field table, as in the first field row of the example (field m in records a and c). Given a larger number of record types, these holes become large. Notice also, that the hole has not been eliminated in the description of the squish algorithm since only the leading and trailing negative values are dealt with.
The jigsaw algorithm not only strips the leading and trailing negative values from each field row, but it also attempts to find a location among the field rows that have already been placed in which the non-negative values of the current "piece," or field row, will overlap and fit into the negative values in the field array. For this to work properly, a field offset array must still be present so that it can be determined which values belong to which records. Also, since the field rows are now overlapping, there must be some way to tell if a value really belongs to a field and record combination. This is done by having a bit map for each field row, each containing the appropriate number of bits to account for all record types. The jigsaw representation of our ongoing example is given below. First, the original field table is shown, but each row has been shifted to the right until its non-negative values "fit" into the holes caused by the negative values in all the rows above it. Under each field row is the corresponding bit map for that field row. Then, all the rows are collapsed into the jigsaw field array with the field name listed under the element it belongs to and the field offset that is used to find the correct row of information for the field. Below each offset is the name of the element that uses that offset. For example, field rows r and s both have the same offset, but since their bit maps differ no information is lost.
| a | b | c | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| m | 0 | -1 | 3 | |||||||||
| bitmap[m] | 1 | 0 | 1 | |||||||||
| n | 1 | -1 | -1 | |||||||||
| bitmap[n] | 1 | 0 | 0 | |||||||||
| o | 2 | 0 | -1 | |||||||||
| bitmap[o] | 1 | 1 | 0 | |||||||||
| p | 3 | 1 | -1 | |||||||||
| bitmap[p] | 1 | 1 | 0 | |||||||||
| q | -1 | 2 | -1 | |||||||||
| bitmap[q] | 0 | 1 | 0 | |||||||||
| r | -1 | 3 | -1 | |||||||||
| bitmap[r] | 0 | 1 | 0 | |||||||||
| s | -1 | -1 | 0 | |||||||||
| bitmap[s] | 0 | 0 | 1 | |||||||||
| t | -1 | -1 | 1 | |||||||||
| bitmap[t] | 0 | 0 | 1 | |||||||||
| u | -1 | -1 | 2 | |||||||||
| bitmap[u] | 0 | 0 | 1 | |||||||||
| 0 | 1 | 3 | 2 | 0 | 3 | 1 | 2 | 3 | 0 | 1 | 2 | |
| owner | m | n | m | o | o | p | p | q | r | s | t | u |
| offset | 0 | 1 | 3 | 5 | 6 | 7 | 8 | 9 | ||||
| offset header | m | n | o | p | q | r, s | t | u |
We can now determine if a field is part of a given record by first looking at the field offset array to determine where in the field array to look for its record information. From there, a value is retrieved by going to the recordth value from the field offset. Then the bit map is checked to see if the value retrieved belongs to this field row. For example, we can see that record c has a field u because the cth element from the u offset is non-negative. Furthermore, we know that this value is valid because there is a 1 in u's bit map for that location. Similarly, we can see that there is no u field in the b record type, even though we retrieve a non-negative value from the field array, because of the 0 in u's bit map in the bth position.
The size of the tables generated by the jigsaw algorithm are listed in the table below for the same programs used to evaluate the squish algorithm. The field table is almost full, and the auxiliary arrays used by the jigsaw algorithm become the dominating source of space consumption; the space complexity is still N by M due to the bitmap but units are in bits instead of words.
| Program | Total icode | Field table | Total field info | Density (%) |
|---|---|---|---|---|
| ged | 236601 | 392 | 844 | 100.00 |
| vib | 431369 | 1384 | 2360 | 100.00 |
| idol | 99403 | 664 | 1120 | 99.07 |
| Freedom | 1290988 | 5556 | 24088 | 100.00 |
This study was successful in reducing the amount of icode generated by the
Icon translator by reducing the size of the field table while having almost no
effect on execution time. Below is a table showing average execution times for
ten runs of the Idol compiler compiling itself with Icon 9.0 and the Squish and
Jigsaw algorithms. Idol uses a very large number of field table accesses and
represents a worst-case measure. Typical programs are unaffected. Timings were
obtained by the UNIX time(1) command.
| Algorithm | User time | System time | Total time |
|---|---|---|---|
| Original | 2.07 | 0.80 | 2.87 |
| Squish | 2.14 | 0.81 | 2.95 |
| Jigsaw | 2.09 | 0.83 | 2.92 |
This table suggests slight increases in execution time over the original algorithm when using the squish algorithm (2.8%) or the jigsaw algorithm (1.7%). This small amount of extra execution time is due to using the extra offset arrays required by the compression methods described in this paper. In the original algorithm, the Icon interpreter only had to retrieve the required value from the field table. The jigsaw algorithm has to not only retrieve this information, but find the array offset within the field table where the information is located, and then check the bit map array to check its validity. The squish algorithm also has to locate the information needed in the field table by looking up the field offset, but then it must calculate the position of the required information from that position using the record offset array. Lastly, the squish algorithm must validate the information retrieved.
It should be noted that the amount of time spent in this part of the Icon interpreter will depend on how heavily the record types declared in an Icon program are accessed. If the record types are not heavily used, then the portion of the Icon interpreter that was modified in this study is not executed very often and does not have much of an effect on execution time. The Idol translator used in the table above is an example of a program that uses field accesses heavily.
To demonstrate how this method performs, we calculate the effect it has on Freedom in the Galaxy. 8-bit integers can be used for the field table entries, since the largest record contains 51 fields, and 16-bit integers for the field offset array because there are 1389 entries in the field table after compressing it. This saves a total of 5851 bytes in addition to the 440696 bytes already eliminated by compressing the field table.
An effective compression of the Icon translator's field tables was implemented. It has little effect on overall execution time. For Icon programs with a moderate number of records, such as ged, the size of the field table, and therefore the overall reduction in icode size, is small. In ged's case, only 1.0%. For programs with a larger number of records, such as vib and idol, the savings are a bit more substantial, 2.7% and 4.3% respectively. Field table compression is primarily targeted at programs with a very large number of record types, where the savings are enormous. For example, in the case of Freedom in the Galaxy, a 25.4% reduction in total icode size (over 400 kilobytes) was obtained.
Additional savings can be obtained from the field table, but not much. Transposing the rows and columns of the tables would make the field offset array proportional to the number of records instead of the number of fields. For example, in Freedom in the Galaxy, 138 16-bit entries could replace 842, resulting in 1408 more bytes saved, but it would not be worth the effort required. The bitmap being the dominating space consumer in the jigsaw algorithm, it would be interesting to try an idea that comes from parser generators such as the UNIXyacc(1) program: identical rows, in
this case in the bitmap array, could be merged. But with only a percent or two
overall icode savings possible even in the worst case, is is probably not worth
the effort.
[1] Karel Driesen, "Selector Table Indexing & Sparse Arrays," Conference Proceedings of the OOPSLA 1993 Eighth Annual Conference on Object-Oriented Programming Systems, Languages, and Applications 26 September - 1 October 1993, Washington, DC, USA,ACM SIGPLAN Notices, vol. 28, number 10, pp 259-270, 1993.
[2] Ralph E. Griswold and Madge T. Griswold, The Icon Programming Language, Prentice Hall, 1990.
Reducing Field Table Size in the Icon Translator / UTSA / rhatch@ringer.cs.utsa.edu