Five years ago, marketing research indicated that the time was right for an
MS-DOS BASIC compatible compiler for UNIX.[1]
Basmark mobilized to satisfy this market. The world was much different then:
UNIX itself was a small market but due for explosive growth. UNIX was still
largely academic, where BASIC was looked upon with disdain, even contempt. But
the growth of UNIX would be commercial, into MS-DOS territory, where BASIC was a
staple. The expectation was that customers would like to upgrade from MS-DOS to
UNIX and take their BASIC programs with them. The UNIX hardware market was also
different: fragmented. Many vendors sold UNIX boxes, but none had an especially
large market share. The original Basmark BASIC compiler was designed in this
environment.
Since then many things have changed: the heralded growth of UNIX has not materialized although the predictions persist. An alternative theory might be that the growth has in fact occurred but that the initial market was so small that it is only now large enough to be the basis for an “explosion.” Certainly no system of comparable power has usurped UNIX. The UNIX hardware market has become skewed, ironically, toward MS-DOS compatible machines. Competing products have also emerged although they have tended to either converge on the MS-DOS compatible market or diverge from the de facto standard language: MS-DOS BASIC. And the word on the streets regarding the quality of these products is not favorable. This leaves the original market largely intact and available. Finally BASIC on MS-DOS itself has been replaced by QuickBASIC.
Basmark BASIC has grown and evolved with the real world and will continue to be responsive to the customer.
The market environment being well understood, the following is a description of the major decisions in the design and implementation of the original compiler and its successors.
Where is MS-DOS BASIC defined? This seemed to be a good place to start; unfortunately, when discrepancies appear between the documentation and the implementation, the manual and the compiler, one tends to favor the manual.[2] It was soon discovered that, paradoxically, the first question was: Is MS-DOS BASIC defined? Or, which came first: the language or the definition? In fact, BASIC is a “living language” much like English. Most languages are defined somewhere on paper and their compilers are more or less faithful implementations. BASIC, however, is a language which a compiler accepts, and its manual “is a tale told by an idiot, full of sound and fury, signifying nothing.”[3]
Once the meaning of “BASIC” was identified, MS-DOS compatibility became a problem.[4] Constructs like addition and conditional statements are abstract language features, and the means to implement them were well known. However, BASIC, as the front-line language on the microcomputer, is the press agent for the hardware. The language contains many functions whose sole purpose is to interface with whatever unique hardware the machine has to offer. In fact, customers rarely use these bells and whistles, probably because they translate so poorly to other systems. So necessity and practicality dictated that the Basmark compiler be compatible only in terms of universal language features.
While hardware-specific features do not translate at all and abstract constructs translate easily, a gray area, composed especially of the operating system* interface, required some modification to translate and was difficult to implement. In fact, the difficulty of each feature was assessed at the outset and the simplest were implemented first.
The final Basmark compiler contained all plausibly implementable features and even a few gleaned from extensions of the MS-DOS BASIC language implemented in other environments.
Of course UNIX specific extensions could have been designed and included but this would have contradicted and confused the product’s purpose although customization for the aggrandizement of a manufacturer’s machine was foreseen and facilitated.
* The computer science sense of “operating system” is used here, as opposed to the colloquial sense, which includes the user interface.
Given the state of compiler theory and the availability of compiler construction tools under UNIX, the use of LEX and especially YACC as the focal point of development became clear.[5] A prototypical analyzer (front end) was outlined with emphasis on the production rules of the parser. A driver program was envisioned to manage files and invoke this analyzer and the components of a back end.
What is the product (target) of compilation? Executable code is the final product, but a compiler need not exist in isolation under UNIX. Assembly or object code is sufficient; the BASIC driver program can, as standard UNIX language drivers do, invoke the available UNIX assembler and loader.[6] Object code (properly formatted) can be linked with UNIX libraries. Executable and object code formats, unfortunately, differ from machine to machine and are cryptic and documentation is often unavailable. Except between machines with different processors, assembly code formats do not differ much and documentation is readily available from the processor manufacturer.[7] At worst, the wily programmer can glean the format from assembly code generated by the properly coaxed native C compiler. Therefore, the target language of the compiler was determined to be assembly code although the driver program would provide for the usual assembling and loading.
Besides producing a reasonably well documented target language, this approach creates a boon of compatibility. By using the appropriate calling sequence, routines from other languages can be linked in. The advantages are three. Code becomes available from existing UNIX libraries which contain a UNIX operating system interface and many low level and machine dependent routines. Complex run-time support routines can be coded in C (rather than machine dependent assembler). Finally the calling sequence can be made available at the source level so the BASIC programmer can call routines written in other languages. Thus a code generator was proposed to produce simple assembly language: in-line code to implement simple operations such as arithmetic and subroutine calling code for non-trivial functions. In fact, MS-DOS BASIC is a very high level language in that the preponderance of its statements call for complex operations. A large body of code would be needed to implement these operations; this was dubbed the run-time support library.
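To illustrate the idea, the following is a minimal C sketch of a run-time support routine reachable through a shared calling sequence; the name bas_mid and the string descriptor layout are invented for this example and are not the actual Basmark interface.

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical string descriptor; an empty string has a zero length. */
    struct bstring {
        int   len;
        char *text;
    };

    /* A run-time support routine callable from compiler-generated assembly,
       from C, or from any other language honoring the same calling sequence.
       This sketch behaves like BASIC's MID$(s, start, count). */
    void bas_mid(struct bstring *r, struct bstring *s, int start, int count)
    {
        if (start < 1)
            start = 1;
        if (start > s->len + 1)
            start = s->len + 1;            /* at most one past the end */
        if (count > s->len - start + 1)
            count = s->len - start + 1;
        if (count < 0)
            count = 0;

        r->len = count;
        r->text = malloc(count ? count : 1);
        memcpy(r->text, s->text + start - 1, count);
    }

A BASIC expression such as MID$(A$, 3, 2) might then compile into a few instructions of subroutine calling code rather than in-line code.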
Construction of these components commenced in haste. Soon, the size of the analyzer (front end) approached the limits of a 16-bit address machine—an important marketability constraint—primarily due to the size of the tables generated by LEX and YACC. So the front end was divided into two passes, a lexical analyzer and a parser (with semantics).
Although the interfaces between the various components was reasonably well understood, the widest gulf appeared between the front and back ends, the parser and the code generator. The two are very different conceptually and, not surprisingly, a variety of interfaces have been proposed although the advantages of any one to a particular application can be quite subtle. Rather than fret with choices that would probably require empirical study, the most general method, quadruples, was chosen for the intermediate language and the simplest, hash table based, symbol table was chosen. Design meetings consisted primarily of negotiations over this interface.
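A rough sketch, in C, of what that interface might look like; the field names, hash function and table size are invented for illustration and are not the Basmark definitions.

    #include <string.h>

    /* One quadruple: an operation and up to three operands, each an index
       into the symbol table. */
    struct quad {
        int op;        /* operation code, e.g. add, assign, jump */
        int arg1;      /* first operand */
        int arg2;      /* second operand */
        int result;    /* where the result goes */
    };

    /* A symbol table entry on a hash chain. */
    struct symbol {
        char          *name;
        int            type;
        struct symbol *next;   /* next entry hashing to the same slot */
    };

    #define HASHSIZE 211
    static struct symbol *hashtab[HASHSIZE];

    static unsigned hash(const char *s)
    {
        unsigned h = 0;
        while (*s)
            h = h * 31 + (unsigned char)*s++;
        return h % HASHSIZE;
    }

    struct symbol *lookup(const char *name)
    {
        struct symbol *p;

        for (p = hashtab[hash(name)]; p != NULL; p = p->next)
            if (strcmp(p->name, name) == 0)
                return p;
        return NULL;
    }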
The initial organization, then, consisted of a lexical analyzer, parser, code generator and run-time support library interfaced with a token stream, symbol table, intermediate language and subroutine calling sequence. The entire process would be controlled by a driver program. Although the entire compiler would be compiled in C, the driver program would be a mere UNIX shell procedure until programmer time became less critical.[8]
Having made LEX the focal point of the lexical analyzer, concerns regarding the efficiency of this process vanished and initial coding was reduced to merely cataloging the tokens.[9] The resulting “code” read more like a grocery list than a program. Unfortunately, this elegance would be muddied due to BASIC’s origin as an interpreter rather than a compiler.
The volume of reserved words had exponential consequences in the deterministic finite automaton constructed by LEX and so these were delegated to the symbol table. As an aid to the parser, function names were distinguished from other reserved words although a few words are used in both contexts. Context is the domain of the parser but the lexical analyzer needed to examine context in these few cases. A string literal or remark (comment) may appear in any language and can be difficult to express in terms of a regular expression because it has an internal “grammar,” but well known kludges exist for these exceptions. Unfortunately, a BASIC remark can contain meta-commands. Furthermore, the language also contains default type and DATA statements and line oriented constructs each with its own “grammar.” These necessitated special handling in the lexical analyzer and the parser and a more complex token stream.
Finally, a compilation option was invented to turn off case sensitivity. MS-DOS BASIC is not case sensitive although code is often stored in upper case only; UNIX languages, by convention, expect keywords in lower case only. The option was implemented to convert upper case to lower case while reading characters from the BASIC source. Unfortunately, LEX only reluctantly allows modifications of the input stream. Conversion of literal text was avoided by special handling discussed above.
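A minimal sketch of the kind of input filter that option implies, assuming a wrapper around the source file’s input routine; the names are hypothetical.

    #include <stdio.h>
    #include <ctype.h>

    int nocase = 1;   /* set by the hypothetical compilation option */

    /* Read one character of BASIC source, folding upper case to lower
       case when the option is in effect.  String literals and remarks
       are assumed to be handled separately, as described in the text. */
    int source_getc(FILE *fp)
    {
        int c = getc(fp);

        if (nocase && c != EOF && isupper(c))
            c = tolower(c);
        return c;
    }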
As the hub of the lexical analyzer was LEX, the hub of the parser became YACC.[10] This compiler development tool generates an LALR(1) parser from BNF production rules. An action is coded in C augmented with facilities to manipulate the parser stack. Not only is coding made easy and clear, YACC also solves the first problem in the Herculean task of constructing a parser with semantics, namely where to begin. One can start with a subset of the full grammar; get that working reasonably well and add features piecemeal. As alluded to earlier, BASIC has a lot of statements that are more features than language constructs. That these statements were grammatically self-contained and would not have an impact on the rest of the grammar was also understood, so their implementation was postponed.
But YACC is best for a highly structured grammar such as that of PASCAL.[11] When the myriad simple statements were finally implemented, a high price had to be paid in terms of the size of the parse table. Yet while the core of the grammar fit the YACC model reasonably well, this core was discovered to consist of little more than the expressions.
The most structured BASIC statement, the conditional, seemed to be easy to describe with YACC. However, it suffers the “dangling-else” problem and the peculiarity that its clauses can be line numbers and/or compound statements.
In fact, the combination of a line number and a compound statement produces unreachable code and complicates the grammar. Not surprisingly, this construct was overlooked until it appeared in field testing. Field testing also revealed that the conditional is the one and only structured BASIC statement. The assumption that the looping statements are structured was shown to be false. As a result looping statements became grammatically simple statements. The remaining structure (mainly pairing of the beginning and end of the loop) was then enforced by semantic actions and a special purpose stack.
Despite these difficulties, YACC earned its keep simply organizing the statements and did an excellent job with expressions, creating an efficient parser from code that could be written clearly.
The intended purpose of the semantic actions was the straightforward generation of quadruples. The operations defined in this intermediate language included all the traditional operations augmented by a set invented to implement some features peculiar to BASIC.
As the semantic actions took shape, however, the generation of quadruples was seen to comprise a surprisingly small portion of the code. Instead, type checking and conversion and the creation of temporary variables for this purpose overwhelmed the semantic actions, although this was in part an artifact of the structured coding techniques employed. Of course type conversion could have been delegated to the code generator; however, the code generator had no other reason to create variables. Rather its purpose was to deal with machine architectures. Type conversion by the code generator was rejected as poor cohesion.
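As an illustration of how such an action might create a temporary and emit a conversion quadruple before the operation proper, consider the following C sketch; newtemp, emit and the operation codes are assumed helpers, not the actual Basmark routines.

    /* Assumed helpers provided elsewhere in the analyzer:
       newtemp(type)   - install a fresh temporary, returning its index
       emit(op,a,b,r)  - append a quadruple to the intermediate code   */
    int newtemp(int type);
    void emit(int op, int arg1, int arg2, int result);

    #define T_DOUBLE 2
    #define Q_CVTID  10   /* convert integer to double (illustrative) */

    /* Ensure an operand is of double type before a floating operation,
       converting through a temporary when necessary. */
    int coerce_to_double(int operand, int operand_type)
    {
        int temp;

        if (operand_type == T_DOUBLE)
            return operand;
        temp = newtemp(T_DOUBLE);
        emit(Q_CVTID, operand, 0, temp);
        return temp;
    }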
The construction of the first code generator was trivial. It merely generated data space allocation directives based on the symbol table and Motorola 68000 instructions based on the quadruples.[7] Marketing demanded that a functional compiler be demonstrable as soon as possible. Optimization was out of the question; correct code was needed.
Arrays had a surprisingly strong impact on the code generator. Although BASIC has no pointers, array lookup necessitated their existence so they were designed into the symbol table. Just as a sequence of “param” quadruples followed by a “call” is a common intermediate language representation of a subroutine call, so a sequence of “subscript” quadruples followed by an “array” was devised to represent array lookup. A general system of pointer arithmetic would have been overkill. Nevertheless, the result of array lookup would be a pointer. This creates the implication that the pointer may appear in the intermediate language anywhere an ordinary variable may. Virtually every type of quadruple is affected although the assembly language implementation is merely an extra level of indirection.
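A sketch of how the semantic actions might emit this sequence; as in the earlier sketch, the helper names and operation codes are invented for illustration.

    /* Assumed helpers, as before. */
    int newtemp(int type);
    void emit(int op, int arg1, int arg2, int result);

    #define T_POINTER   3
    #define Q_SUBSCRIPT 20   /* one quadruple per subscript expression */
    #define Q_ARRAY     21   /* compute the element address            */

    /* Emit intermediate code for A(i, j, ...): one "subscript" quadruple
       per index, then an "array" quadruple whose result is a pointer
       temporary that later quadruples may use like an ordinary variable. */
    int emit_array_lookup(int array_sym, int *index_syms, int nindices)
    {
        int i, ptr;

        for (i = 0; i < nindices; i++)
            emit(Q_SUBSCRIPT, index_syms[i], 0, 0);
        ptr = newtemp(T_POINTER);
        emit(Q_ARRAY, array_sym, nindices, ptr);
        return ptr;
    }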
Although the first code generator had been trivial, the development computer was soon upgraded to a VAX. The code generator was rewritten in only two days but now there was twice as much code to maintain. Later the compiler was ported to an Intel 80286 based system—and now there were three. The result was not unexpected but soon the once trivial code generator had become the largest compiler component of all. Fortunately, subsequent ports were to systems with similar processors and assembly language syntaxes. The minor differences were maintained with the conditional compilation constructs of the C preprocessor.
The run-time support library is the repository of the high level run-time behavior of BASIC. These features are complicated, independent and pedestrian compared to compiler design. The exceptions are I/O features which are highly interconnected and obviously crucial, so they were implemented first.
A special purpose file description structure was devised so that random and sequential access methods could be handled by the same low level routines. An underlying system was developed similar to that of the C standard I/O library in which single characters have primacy. Upon this foundation were built routines analogous to the BASIC I/O statements. Later during development the UNIX terminal driver was found to be at odds with some of the details of BASIC I/O. The driver needed to be turned off at run-time and, of course, turned back on upon termination. Unfortunately, no simple switch exists; instead the state of the driver must be preserved during execution. The necessary space was allocated in the file description structure and the code was upgraded to handle terminals as well as files. This effectively created a generalization of MS-DOS BASIC where the console and disk files are not interchangeable as they are under UNIX. To complete this generalization, the grammar was upgraded.
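A sketch of what such a file description structure might contain, with a modern termios structure standing in for the terminal driver interface of the day; all field names are invented for this example.

    #include <termios.h>

    /* Hypothetical per-channel file description structure.  One structure
       serves disk files and terminals alike; the saved driver state is
       meaningful only when the channel is a terminal. */
    struct basfile {
        int            fd;       /* underlying UNIX file descriptor        */
        int            mode;     /* sequential input, output or random     */
        long           reclen;   /* record length for random access        */
        char          *buf;      /* buffer; single characters have primacy */
        int            cnt;      /* characters remaining in the buffer     */
        int            istty;    /* nonzero if the channel is a terminal   */
        struct termios saved;    /* driver state to restore on termination */
    };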
Besides I/O, the most important run-time behavior is error recovery although the volume of code required is relatively small. Error recovery, therefore, was placed high on the list of priorities. In fact, all the needed functions were separated into seven categories including a miscellany. This organization has been maintained ever since and is consulted wherever a new function is added to the language.
Besides I/O, the most voluminous category is the miscellany composed primarily of program chaining, file directory listing and terminal capabilities database accessing routines. While this code would be very simple under MS-DOS, it is extremely complex under UNIX. The presence of these routines in the Basmark compiler is the best illustration of its completeness.
The original compiler created a new temporary variable every time a new value was created. Theoretically, limited scope and the application of data flow analysis allow temporaries to be reused. In practice, however, BASIC variables have unlimited scope and data flow analysis is mortally wounded by the error recovery system. Consequently, no attempt was made to reuse temporaries and non-trivial BASIC programs were found to overwhelm the symbol table with temporaries. The code was revised to reuse a temporary as soon as possible after its value was used for its intended purpose. Although the possibility of common subexpression optimizations was sacrificed, the number of temporaries was reduced by several orders of magnitude.
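One plausible scheme, sketched in C, keeps a free list of released temporaries per type; the names and table sizes are hypothetical and the real bookkeeping surely differed in detail.

    #define NTYPES   8
    #define MAXTEMPS 1000

    static int freetemp[NTYPES];      /* head of free list per type; 0 = empty */
    static int templink[MAXTEMPS];    /* next free temporary, indexed by ID;
                                         ID 0 is never used */

    int newtemp_install(int type);    /* assumed: really installs a new one */

    /* Reuse a released temporary of the given type if one is available. */
    int gettemp(int type)
    {
        int t = freetemp[type];

        if (t != 0) {
            freetemp[type] = templink[t];
            return t;
        }
        return newtemp_install(type);
    }

    /* Release a temporary as soon as its value has served its purpose. */
    void releasetemp(int type, int t)
    {
        templink[t] = freetemp[type];
        freetemp[type] = t;
    }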
The empirical study of hash functions, crucial to any compiler, was carried out yielding a significant decrease in the number of accesses needed to locate a symbol table entry.[12] The hash table itself was lengthened resulting in a linear improvement.
Several lexical peculiarities were discovered during field testing. Although these could have been implemented in the LEX based lexical analyzer, the LEX code would require further obfuscation and the already large data table produced by LEX would grow still larger. It was observed that a low-tech lexical analyzer, in which decisions are embedded in instructions rather than data, would possess the flexibility to deal with irregularities in a straightforward manner. The cost of a dumb lexical analyzer in terms of speed and clarity could largely be recovered by using the first character of an incoming lexeme as an index into a table of functions each designed to recognize the remainder of the lexeme. That tokens can be distinguished largely by their first character is probably not simply fortuitous but the intent of the language designers.
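A sketch of such a dispatch table in C; the scanner names and character classes are illustrative only.

    /* Hypothetical token scanners, each consuming the rest of one lexeme
       and returning a token code. */
    int scan_name(void), scan_number(void), scan_string(void), scan_other(void);

    /* Dispatch table indexed by the first character of an incoming lexeme. */
    int (*scanner[256])(void);

    void init_scanner_table(void)
    {
        int c;

        for (c = 0; c < 256; c++)
            scanner[c] = scan_other;
        for (c = 'a'; c <= 'z'; c++)
            scanner[c] = scan_name;
        for (c = '0'; c <= '9'; c++)
            scanner[c] = scan_number;
        scanner['"'] = scan_string;
        scanner['.'] = scan_number;     /* e.g. .5 */
    }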
The final nail in LEX’s coffin is LEX’s large data table for which the dumb lexical analyzer has no counterpart. As previously discussed, this table necessitated a separate lexical analysis pass. While the dumb approach requires more code, the code size of the parser pass was never an issue. In fact, by merging the two passes, the code needed to write and read the token stream is eliminated along with the redundant context sensitivity code the lexical analyzer needed because it could not query the parser. Finally, by eliminating a pass over the user program, any residual concerns regarding efficiency are eliminated as well as the token stream itself. Such a lexical analyzer was implemented, its status reduced to that of a mere module within the parser, or more appropriately, the analyzer.
The original compiler also, in zealous devotion to MS-DOS BASIC compatibility, faithfully implemented the external user defined function feature—the only means by which to return a function value from a routine written in another language. Unfortunately, this feature requires an absolute machine address and is virtually useless in the UNIX environment. A blatant extension to the language was implemented in which symbolic names, rather than indirect references to addresses, could be used to identify external routines. Reports from the field were unanimously favorable.
The components of the compiler responded very differently to porting. Having little to do with the operating system or machine architecture, the analyzer proved perfectly portable with the curious exception of a call to a trivial, allegedly portable UNIX library routine whose name was changed on some versions of UNIX. Rather than cope with such foolishness, perfect portability was achieved by simply coding the onomatophilic routine.
The library proved to be a problem because a BASIC statement exists for virtually everything possible in the UNIX operating system (and a few more that cannot be implemented). Some of these possibilities lie in murky areas of the operating system “where angels fear to tread.”[13] These pitfalls are best avoided whenever possible by using UNIX library routines known to be implemented consistently across all systems. In practice the designer must study the UNIX manual for Version 7, then search for similar text in the verbose manuals for System V, Xenix and the various incantations of Berkeley UNIX.[14] Of course the results must be verified experimentally. Still, some features, most notably the terminal driver and the terminal capabilities database, are necessary yet implemented inconsistently. To cope with these differences, the affected portions of the library were coded as many ways as necessary and bracketed with the conditional compilation constructs of the C preprocessor.
The absence of portability had been assumed to be the major liability of the code generator, and thus of the compiler as a whole. And yet an interesting perhaps predictable stratification appeared in which the high level routines of the code generator remained completely intact while the low level, assembly language generating routines seemed to change with each port. These low level routines had another annoying but intriguing characteristic: buried and decomposed within the parameters of output function calls lay simple assembly language statements. Of course the macabre condition of these statements was due to the code generator’s need to output them. Still, experience showed that porting consisted largely of selecting the code generator most similar to that needed for the target machine, exhuming the assembly language statements, recoding them and reburying them—not a pretty sight. And yet, viewed as a code generator, as the C compiler obviously must, the program appeared to be a reasonable, yet ultimately unimportant, sequence of output function calls. Clearly, to aid the human porting machine, a preprocessor was needed to accept straightforward assembly language statements and produce cryptic output function calls. The preprocessor would bury the assembly language when the code generator was compiled, only to be resurrected when it was run. Such a preprocessor was implemented and named YAGG, given that its importance to the code generator is similar to that of YACC to the analyzer and that neither is the first of its kind.
In fact, YAGG has been extensively revised since its initial appearance and is nevertheless trivial compared to the state of the art. Yet it is successful because it solves the portability problem. Ports take two days and ten pages for minicomputer architectures including those of Digital Equipment, Motorola, Intel and National Semiconductor corporations and AT&T Bell Laboratories. Two weeks and fifteen pages are required for ports to mainframe architectures including the IBM 370 and the so-called RISC architectures.[7] The difference is largely due to popularity and orthogonality.
The entire compiler is portable except for a YAGG input file and a C include file maintained for each port. The latter controls the conditional compilation of the library and includes assembler and loader command line templates, which differ from system to system. These are compiled into the driver program (now written in C) which needs to invoke the native assembler and loader.
The importance of good documentation is difficult to overstate, even from the point of view of the capitalistic software developer. Issues not covered in the documentation become the questions on the lips of every customer, each of which must be answered individually by the developer’s technical support personnel. Substandard documentation creates the impression that the entire product is below par. A disgruntled customer who finds the software serviceable may even be tempted to redistribute the electronic medium sans hard documentation.
But few scholars earn degrees in both computer science and English and then resign themselves to telling stories about computer systems others have produced. Besides, had any such person wandered into a Basmark office, the need for a correctly functioning product would have dictated that his/her programming talents be exploited exclusively. Actually, neglecting documentation in favor of correct code proved to be a wise strategy. Most customers already possessed a familiarity with MS-DOS BASIC and/or its manuals. By focusing on compatibility, it became feasible to recommend the MS-DOS manuals as documentation for Basmark BASIC. Nevertheless, documentation has been developed, at low priority, in parallel with software development by the same staff—with apologies to those who appreciate the English language.
Rather than reinvent the wheel and flirt with copyright infringement, Basmark patterned its manual not after the MS-DOS BASIC manual but after the UNIX Programmer’s Manual.[14] The three main sections of Basmark’s manual are intended to be analogous to sections one and three of volume I and volume II of the UNIX manual. Two separate permuted indices were included for the benefit of two different audiences. A unique document which defines the language was placed at the beginning of the manual. Also included was the traditional manual page intended for the customer’s UNIX manual to document the augmented system.[15]
The austere style of the UNIX manual was also borrowed for the bulk of the manual. Although the style has been judged unreadable and therefore unacceptable, it has proven excellent for quick reference by providing unequivocal facts. In order to provide some instruction and illustration, examples were included in special subsections. Detailed instruction was deemed unnecessary given the large number of books on BASIC already available. Nevertheless, a brief tutorial was added at the end of the manual when the relatively new features of Version II became available.
Ongoing analysis of the compiler’s performance with real world BASIC programs revealed a variety of opportunities for improvement. But these changes would not radically increase performance and, more importantly, they would not result in code that was any more correct. Everyone clamors about getting there faster but in reality they would prefer to get there in one piece. Or, in the programmers’ vernacular, if it ain’t broke, don’t fix it. In time, the code began to stabilize; the frequency and magnitude of revisions tapered off. A downward spiral ensued in which the perception of the stability itself tended to squelch propositions of meritorious change. Morale went down; a sort of postnatal depression set in. Had the product’s patriarchy been left idle at this point, their exodus would have followed leaving the newborn product unmaintainable not to mention unimprovable. Happily, a new product was conceived.
MS-DOS BASIC was not a static concept; indeed, it had evolved from BASICA to QuickBASIC 2.0 by the time Basmark BASIC was completed.[2] While customers continued to demand BASICA compatibility, they increasingly wanted more: the more powerful features of QuickBASIC. And, inevitably, new BASICA compilers, even BASIC to C translators, began to appear on the market. Marketing legions were being mobilized to vie for BASICA territory. Rather than do battle in the land of technological obsolescence, however, Basmark chose to maintain its leadership with a bold campaign to claim QuickBASIC for UNIX.
Thus inspired but with the temperance of years of experience, the design of Basmark BASIC Version II commenced, guided by a new strategy.
While compiler design theory tends to place its emphasis on the processes of translation, this approach to software analysis had effectively been disputed earlier by the work of D. E. Knuth, which placed its emphasis on data organization.[16] Programming languages emphasize process over data in that their purpose is to transform data. The programmer can be led to believe that motion is more important than organization despite the fact that well organized data needs less motion. Yet the purpose of a program is to transform the input to the output via an internal representation. So the important, but apparently unproductive, first step is to determine the organization of the internal representation by cataloging the ways it needs to be accessed in order to accomplish the transformation efficiently. While it is regrettable that the internal representation, per se, has no code associated with it and thus has no permanent record, at least the means to access it can be written and this is the second step. Finally the remainder of the program is written almost as an afterthought.
Unfortunately, the wisdom of this methodology was lost on the first Basmark compiler due to its sheer complexity and the difficulty in pinpointing the internal representation. Given that the interface between the analyzer and code generator is the central point of transformation and that the compiler is ultimately a simple program, the internal representation may be viewed as the intermediate code and the symbol table. But the aforementioned methodology may be applied to the BASIC program as well: the intermediate code represents the mere transformation of the data represented by the symbol table. Therefore, the design of Version II was focused on the symbol table.
The symbol table needs to be accessed in many ways. Most importantly, the lexical analyzer needs to find symbols by name prioritized by their scope. Thereafter, identification numbers are adequate. The subprograms of Version II require a new wrinkle: at the end of a subprogram, local symbols must become inactive. Between the analyzer and the code generator, the symbol table needs to be accessed as a unit for temporary disk storage, which implies that the identification numbers can not be memory addresses. Also, the code generator must access symbols by class (variables, integer constants, etc.) in order to generate data space allocation directives. The final, and often overlooked, access method is the manner in which the memory is allocated. While dynamic allocation is important because the size of the table can not be predicted, a dynamic management system, in which space may be repeatedly allocated and deallocated, is not worthwhile since the opportunities to interleave the two are few and the paltry savings would be squandered by the system itself.
The symbol table implemented to fulfill these requirements included a hash table and a link associated with each symbol to facilitate access by name. As with the original compiler, a hashing function was implemented to determine the appropriate hash table element and thus the singly linked list where the symbol could be found by searching. Because the search would be sequential, it could be exploited to implement scope priority by imposing order on the list, which simply required that new symbols be added to the head of the list since new symbols always have highest priority (and always need to be added). This became the installation process, used only by the lexical analyzer. Its result was an ID number used by the remainder of the compiler. All BASIC identifiers would be processed in this manner, including line numbers because of their similarity to line labels.
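A sketch of the installation process in C, with the helper routines assumed; head insertion gives the most recently installed symbol, and thus the highest scope priority, first place in the search. The entry fields follow the description given later in the text, but their exact widths and encodings are invented here.

    #define SYMHASH 256

    struct syment {
        int  link;    /* ID of next entry on the same hash chain        */
        int  flags;   /* type, scope, class and miscellaneous bit flags */
        int  name;    /* ID of the symbol's name text                   */
        int  info;    /* value, ID of a list of values, or unused       */
    };

    static int hashhead[SYMHASH];        /* chain heads; 0 = empty chain */

    struct syment *entry(int id);        /* assumed: map an ID to its entry  */
    int alloc_entry(void);               /* assumed: allocate, return new ID */
    int hashname(const char *name);      /* assumed: yields 0..SYMHASH-1     */

    /* Install a new symbol at the head of its hash chain and return its ID. */
    int install(const char *name)
    {
        int h = hashname(name);
        int id = alloc_entry();

        entry(id)->link = hashhead[h];
        hashhead[h] = id;
        return id;
    }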
On the other hand, constants are best cataloged by type and held on separate lists to avoid creating duplicates. Internal variables, which have no names at all, are also held on separate lists. Among these are temporaries, some of which may be assigned registers by the code generator. While the analyzer controls the allocation of temporaries, it does not know how many registers are available. Yet maximal register usage was achieved easily by maintaining temporaries on lists ordered by frequency of use. An index associated with each temporary was used to inform the code generator of relative frequency and thus register priority. Of course all these lists required that a link be associated with each symbol.
Local symbols needed to be deactivated at the end of a subprogram, preferably by removing these symbols from the hashing structure in order to improve its performance. Since the code generator would need to access these symbols according to class, a process, known as flushing, was implemented to move them from the hashing structure to linked lists based on class. At the end of the program, in fact, all symbols are destined for these linked lists needed by the code generator. This process, known as restructuring, empties the hashing structure completely. But because these processes are infrequent and yet can put each symbol’s link to good use, they are not directly accommodated in the structure of the symbol table; rather, they are accomplished by exhaustive search of the hashing structure.
Once the hashing structure is emptied, the hash table, per se, need not be stored on disk for transfer to the code generator. In fact, besides some miscellaneous data including a few linked list headers, only the symbol table proper need be stored on disk. The miscellaneous data are easily implemented in contiguous storage to facilitate copying. This property is also desirable for the symbol table, not only for storage, but also for access by ID number. Yet the size of the table can not be predicted. This conflict was resolved by implementing table space allocation in large fixed sized chunks. The fixed size facilitates both storage and ID number access and is large enough to accommodate even a screen-long string literal. A special array, of modest length, records the memory locations of each chunk. Storage consists of writing each chunk in order. The code generator reads and allocates storage for each chunk and creates its own array. Because the chunk size is a power of two, an ID number is a simple binary composite of a chunk number (an index into the array) and a grain number (an index into the chunk). This makes for fast access by ID number yet makes the symbol table easy and fast to store while providing for dynamic growth.
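The arithmetic can be sketched in C as follows; the grain size, chunk size and array bound are assumptions chosen only to make the example concrete.

    #include <stdlib.h>

    /* Hypothetical chunked symbol table storage.  Each chunk is a fixed
       power-of-two size, so an ID number is a binary composite of a chunk
       number and a grain number within the chunk. */
    #define GRAINSIZE   16                     /* bytes per grain           */
    #define GRAINBITS   10                     /* grains per chunk = 1024   */
    #define CHUNKSIZE   (GRAINSIZE << GRAINBITS)
    #define MAXCHUNKS   64

    static char *chunk[MAXCHUNKS];             /* memory location of each chunk */
    static int   nchunks;

    /* Translate an ID number into a memory address. */
    void *id_to_addr(unsigned id)
    {
        unsigned c = id >> GRAINBITS;                 /* chunk number */
        unsigned g = id & ((1u << GRAINBITS) - 1);    /* grain number */

        return chunk[c] + g * GRAINSIZE;
    }

    /* Grow the table by one chunk as needed; return the new chunk number. */
    int add_chunk(void)
    {
        chunk[nchunks] = malloc(CHUNKSIZE);
        return nchunks++;
    }

Storage to disk then consists of writing each chunk in order, and the code generator rebuilds its own chunk array as it reads them back.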
The grain number, however, is not a byte number—it is much larger. While byte addressability would maximize memory usage, it would require wider ID numbers which would require more memory. And although a wide variety of data needs to be stored in the symbol table, none are very brief so the penalty in terms of wasted space tends to be small. A reasonable grain size was implemented so that narrow ID numbers could be used to address items in a huge symbol table.
In addition to facilitating compilation, the symbol table, of course, must provide the power and flexibility to maintain everything there is to know about the BASIC program’s data. Besides the link field, each symbol table entry has three other fields. The flags field is the most important of these. It contains short sequences of bits which encode type, scope and class. The flags field also contains a variety of bit flags, the meanings of which depend largely upon the class. The name field generally contains an ID number for the text of the symbol’s name. Finally, the info field may be unused or contain a single value (such as the value of an integer constant) or an ID number for a list of values (such as the dimensions of an array). The definition seems increasingly vague due to the wide variety of items that need to be represented. Every effort was made to keep the contents of the symbol table entries as uniform as possible without wasting lots of memory creating well defined but rarely used fields.
Finally, the original compiler contained a variety of well defined but rarely used data structures for maintaining internal lists especially for use with list building productions in YACC. Rather than maintain code and manage memory allocation for each type of list, the compiler was revised to use one structure composed of two links and a single value. It was named the ylist for its utility with YACC and was generally used as a doubly linked list with a distinct header item the value of which was exploited to contain the list length. A single set of access routines was devised in which deallocation consisted only of preserving items for later use.
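A sketch of the ylist and two access routines; the routine names are invented, and the actual set of routines was no doubt richer.

    /* Hypothetical ylist item: two links and a single value.  Used as a
       doubly linked list with a distinct header whose value holds the
       list length. */
    struct ylist {
        struct ylist *next;
        struct ylist *prev;
        long          value;
    };

    /* Make an empty list: the header points to itself, length zero. */
    void yinit(struct ylist *head)
    {
        head->next = head->prev = head;
        head->value = 0;
    }

    /* Append an item, keeping the length in the header's value. */
    void yappend(struct ylist *head, struct ylist *item)
    {
        item->next = head;
        item->prev = head->prev;
        head->prev->next = item;
        head->prev = item;
        head->value++;
    }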
A nagging problem with the compiler was context sensitivity in which the lexical analyzer needs information from the parser. Compiler design theory insists that this data path ought not exist, indeed that lexical analysis may be completed prior to the start of parsing.[5] As previously discussed, the original compiler was implemented in this way. Unfortunately, the BASIC language necessitated context sensitivity and it was accomplished by a lexical analyzer enlightened with the needed contextual knowledge normally and more conveniently bestowed upon the parser. However, for reasons already discussed, the lexical analyzer and parser were integrated into an analyzer in the revised compiler. Then the two processes ran concurrently, creating the opportunity for feedback from the parser to the lexical analyzer. Yet this opportunity was not fully exploited, partially because the problem had been solved earlier in favor of independence but also because of the difficulty in coaxing information from the parser in a timely fashion.
But with Version II, more context sensitivity would be needed for line labels, common block names and some array references that look like simple variables (in addition to the pre-existing problems of line numbers that look like constants and external function names that look like simple variables). To make matters worse, the contexts themselves are more complex which would have exponential consequences in the lexical analyzer. But if these cases could be resolved in a straightforward way, the method could be applied to another new problem: the scope of variables. Prior to Version II, variables had no scope and so the lexical analyzer could present the parser with complete variables. But scope requires knowledge of the grammar; without this knowledge, a lexical analyzer is only capable of yielding an unruly mass of information. Clearly, the time had come to involve the parser.
The difficulty lay in YACC’s LALR(1) parser.[10] The “1” refers to one look-ahead token used to resolve ambiguities. Since single token ambiguities are common but tedious to consider, YACC uses the look-ahead token whenever necessary but does not report it. Under normal circumstances, this is wonderful but context sensitivity is another matter. To establish context in the parser, a semantic action for this purpose must be associated with a production. The naive designer believes that any token that may follow the production will be read by the lexical analyzer under the established context. Unfortunately, if the production can not be reduced without a look-ahead, the semantic action will not be executed when the lexical analyzer needs it.
The simplest and clearest way to establish context with the parser was devised as a series of epsilon productions whose semantic actions set state variables accessible to the lexical analyzer. The grammar symbols of these productions were easily inserted elsewhere in the grammar to change context. Single tokens in unusual contexts were expressed clearly by preceding them with such grammar symbols. The default context was restored by the lexical analyzer. In those cases where these single tokens were optional, the context establishing productions were not. If the parser were confronted with a state in which it expected both an epsilon production and a token or neither, it would have to read the token as the look-ahead in order to reduce the production. Therefore the case of the missing token nevertheless included the production to guarantee that it was reduced before the token was read.
Quadruples, or three address instructions, had been the intermediate language of the compiler. As a result, the original code generator implemented binary operations as three address instructions on the VAX. But most machines have only two address instructions, so these code generators usually implemented binary operations as three instructions to load, operate and store, each referencing one address and a temporary register. Even unary operations were aggravated by limited addressing modes often requiring additional instructions and registers in order to facilitate the pointers that resulted from array lookups. And even though the VAX could do all these things in a single instruction, the instruction was not much shorter or faster than a series of simple ones.
Three lessons had been learned since the original design: Firstly, sophisticated instructions did not eliminate complexity; they merely hid it. Secondly, the benefits of data flow analysis and the utility of quadruples are virtually eliminated by the error recovery system. And finally, a correct portable code generator is better than a clever unreliable one. Therefore, a single accumulator centered model was devised through which all data would pass. Despite this radical new approach, the implementation of binary operations was essentially unchanged, still consisting of a load, operate and store instruction. And since only one address was considered per instruction, only one address register needed to be available and the code was simplified. Admittedly smaller, faster code might have been possible, but correct code was even better.
On the most popular processors, in practice, the single accumulator model is often the only model because a lack of orthogonality requires that many operations be performed in a particular register and because the subroutine calling conventions preserve few registers. In fact much of the benefit of live variable analysis can be obtained by reordering the operands of commutative binary operations when the second operand is the accumulator and by eliminating redundant store and load pairs. Both these optimizations were implemented in the analyzer; the latter just prior to intermediate code output. The intermediate language itself was changed to doubles consisting of an op-code and one operand improving code density and reducing compiler execution time.
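The store/load elimination can be sketched in C as a simple peephole over the doubles just before output; the structure and operation names are assumptions.

    /* Hypothetical "double": an op-code and a single operand, emitted for
       the single-accumulator model. */
    struct icdouble {
        int op;       /* e.g. D_LOAD, D_STORE, D_ADD */
        int operand;  /* symbol table ID */
    };

    #define D_LOAD  1
    #define D_STORE 2

    /* Remove a store of the accumulator immediately followed by a load of
       the same operand, keeping the store; returns the new length. */
    int squash(struct icdouble *code, int n)
    {
        int in, out;

        for (in = out = 0; in < n; in++) {
            if (in + 1 < n
                && code[in].op == D_STORE
                && code[in + 1].op == D_LOAD
                && code[in].operand == code[in + 1].operand) {
                code[out++] = code[in];   /* keep the store */
                in++;                     /* drop the load  */
                continue;
            }
            code[out++] = code[in];
        }
        return out;
    }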
The implementation of dynamically allocated arrays and multiple modules in Version II had an intriguingly profound impact on the layout of data at run-time. Three regions are available for data, each with its own unique capability: The data segment can record non-zero constants. The bss segment can accommodate COMMON blocks.* Finally the heap consists of data allocated at run time.
BASIC data were assigned to these regions based on their capabilities: All constants, including strings, were assigned to the data segment. BASIC strings consist of a header, intentionally devised to be zero for the null string, and the actual text. Non-empty, non-constant string text was assigned to the heap. BASIC arrays consist of a header and the actual elements. Non-empty, or dimensioned, dynamic array elements were assigned to the heap while static array headers were assigned to the data segment. All other data were assigned to the bss segment: simple variables including string headers, dynamic array headers and static array elements. A data structure need not lie entirely in one region. The most interesting of these is the static string array with its array header in the data segment, array elements (string headers) in the bss segment and string text in the heap. If such an array were also COMMON, each module would have a private copy of the header pointing to common elements.
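The static string array case might be sketched in C as follows, with the compiler’s data space allocation mimicked by ordinary C definitions; the descriptor layouts are invented for illustration.

    /* Hypothetical BASIC string: a header, zero for the null string,
       with any non-empty, non-constant text living on the heap. */
    struct strdesc {
        int   len;
        char *text;
    };

    /* Hypothetical BASIC array header. */
    struct arrdesc {
        int             ndims;
        int             dims[3];
        struct strdesc *elem;
    };

    /* A static string array A$(10): uninitialized elements land in the
       bss segment, the initialized header lands in the data segment,
       and the string text is allocated from the heap at run time. */
    struct strdesc a_elems[10];
    struct arrdesc a_hdr = { 1, { 10, 0, 0 }, a_elems };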
Version II also required that a number of features be added to the run-time support library none of which was especially profound, with the possible exception of pipes. Unlike disk files and ttys, pipes require separate channels for input and output. The implementation, therefore, required yet another generalization of the I/O system in which one or two low level, channel oriented structures were governed by a high level structure. Finally, in order to accommodate large logical unit numbers without monopolizing a lot of memory, the table of high level structures was lengthened; however, the low level structures for them would be allocated dynamically only when needed.
* The term “COMMON blocks” comes from FORTRAN and has the same meaning in BASIC. In this context, however, it refers only to contiguous blocks of memory shared by modules; the layout of items within the blocks is assumed to have been resolved by other means.
In the words of the eminent teacher, philosopher and computer practitioner, Prof. F. Way, III, Case Western Reserve University, “Know the answer before you start.” Perhaps the only thing that elevates software designers above the Keystone Kops is that the comedy of errors grows ever more sophisticated with time and experience.
And while the top-down, bottom-up, left to right debate drones on, deadlines dictate design philosophy. This is not to imply that the compiler was designed ad hoc. On the contrary, each philosophy was exploited for its merits with an intolerance only for ideological zeal.
But the impatience of the marketplace was often at odds with the contemplativeness of design. Fortunately the designers were learning from the marketeers and vice versa. The result was a Darwinian philosophy in which the compiler adapted to the needs of the customer rather than predicting or imposing them. This is not simply a philosophy that has served well in fulfilling static needs; it is a dynamic which continues to produce the leading compilers for the evolving needs of the customer.
A QuickBASIC 4.0 and BASIC 6.0 compatible compiler is currently being implemented, including a facility to report diagnostics in languages foreign to Americans. Windowing and graphics capabilities are being implemented via interfaces with the de facto standard, device independent systems, X-Windows and IBM’s VDI.
Modification of the code generator is being considered to accommodate the addressing-mode-poor RISC architectures. The merits of a machine-independent intermediate code optimizer are being considered. Construction of applications libraries to support business and commercial applications is planned.
Basmark and YAGG are trademarks of Basmark Corp. IBM 370 is a trademark of International Business Machines Corp. Intel 80286 is a trademark of Intel Corp. Motorola 68000 is a trademark of Motorola Corp. MS-DOS and Xenix are trademarks of Microsoft Corp. UNIX, Version 7 and System V are trademarks of AT&T Bell Laboratories. VAX is a trademark of Digital Equipment Corp.
We are indebted to the following authors, and to the programmers who made these documents possible:
BASIC Compiler, Rev. Ed., IBM Personal Computer Hardware Reference Library, Boca Raton, FL: International Business Machines Corp., 1982.
Microsoft BASIC Interpreter (for the Xenix 286 Operating System), Bellevue, WA: Microsoft Corp., 1982.
Microsoft QuickBASIC Compiler, Bellevue, WA: Microsoft Corp., 1982.
Microsoft QuickBASIC Compiler, Ver. 2, Bellevue, WA: Microsoft Corp., 1985.
Microsoft QuickBASIC, Ver. 4, BASIC Language Reference, Bellevue, WA: Microsoft Corp., 1987.
A. V. Aho, R. Sethi and J. D. Ullman, Compilers, Principles, Tools and Techniques, Reading, MA: Addison-Wesley, 1986.
VAX Architecture Handbook, Maynard, MA: Digital Equipment Corp., 1981.
MC68000 16-Bit Microprocessor User’s Manual, 3rd Ed., Englewood Cliffs, NJ: Prentice-Hall, 1982.
IBM System/370 Principles of Operation, 10th Ed., Armonk, NY: International Business Machines Corp., 1983.
Series 32000 Instruction Set Reference Manual, Santa Clara, CA: National Semiconductor Corp., 1984.
iAPX 286 Programmer’s Reference Manual, Santa Clara, CA: Intel Corp., 1984.
MC68020 32-Bit Microprocessor User’s Manual, 2nd Ed., Englewood Cliffs, NJ: Prentice-Hall, 1985.
MC68881 Floating-Point Coprocessor User’s Manual, Phoenix: Motorola Inc., 1985.
Assembler Language Reference (IBM RT PC), Armonk, NY: International Business Machines Corp., 1985.
WE32100 Microprocessor Information Manual, Morristown, NJ: AT&T Technologies, Inc., 1985.
80386 Programmer’s Reference Manual, Santa Clara, CA: Intel Corp., 1986.
80387 Programmer’s Reference Manual, Santa Clara, CA: Intel Corp., 1987.
The SPARC Architecture Manual, Mountain View, CA: Sun Microsystems, Inc., 1987.
UNIX Programmer’s Manual, 7th Ed., VAX Ver., Berkeley, CA: Univ. of California, 1981.
The Xenix System V Operating System, Santa Cruz, CA: The Santa Cruz Operation, Inc., 1983.
UNIX System V Programmer’s Reference Manual, Englewood Cliffs, NJ: Prentice-Hall, 1986.
from The Basmark QuickBASIC Programmer’s Manual by Lawrence Leinweber