

A compiler is a computer program that translates a series of statements written in one computer language (the source language; the statements themselves are the source code) into equivalent output in another computer language (often called the object or target language).
Most compilers translate a source code text file written in a high-level language into object code or machine language, for example an executable .EXE or .COM file that can run on a computer or a virtual machine. Translation from a low-level language to a high-level one is also possible; a program that does this is normally known as a decompiler if it reconstructs a high-level language program that could have generated the low-level language program. Compilers also exist that translate from one high-level language to another, or to an intermediate language that still needs further processing; these are known as transcompilers (or sometimes as cascaders).
Typical compilers output object files, which contain machine code augmented by information about the names and locations of entry points and of external calls (calls to functions not contained in the object). A set of object files, which need not all come from a single compiler provided that the compilers used share a common output format, may then be linked together to create the final executable, which a user can run directly. When this process is complex, a build utility is often used.
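The linking step described above can be illustrated with a minimal sketch. The `ObjectFile` class, its fields, and the `link` function are hypothetical simplifications made up for this illustration; real object formats carry far more information (relocations, sections, debug data).

```python
from dataclasses import dataclass, field


@dataclass
class ObjectFile:
    """Hypothetical, heavily simplified object file."""
    name: str
    code_size: int                                      # machine code size in bytes
    entry_points: dict = field(default_factory=dict)    # symbol -> offset in this object
    external_calls: list = field(default_factory=list)  # symbols defined elsewhere


def link(objects):
    """Place objects one after another and resolve every external call."""
    # Assign each object a base address in the final executable.
    base = {}
    address = 0
    for obj in objects:
        base[obj.name] = address
        address += obj.code_size

    # Build a global symbol table from all entry points.
    symbols = {}
    for obj in objects:
        for sym, offset in obj.entry_points.items():
            if sym in symbols:
                raise ValueError(f"duplicate symbol: {sym}")
            symbols[sym] = base[obj.name] + offset

    # Every external call must refer to a symbol some object defines.
    for obj in objects:
        for sym in obj.external_calls:
            if sym not in symbols:
                raise ValueError(f"undefined symbol: {sym} (needed by {obj.name})")
    return symbols


main_o = ObjectFile("main.o", 64, {"main": 0}, ["square"])
math_o = ObjectFile("math.o", 32, {"square": 8})
symbol_table = link([main_o, math_o])
print(symbol_table)  # {'main': 0, 'square': 72}
```

Linking `main.o` on its own would fail with an "undefined symbol" error, because nothing in that object defines `square`.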
Several experimental compilers were developed in the 1950s (see, for example, the seminal work by Grace Hopper on the A-0 language), but the FORTRAN team led by John Backus at IBM is generally credited as having introduced the first complete compiler, in 1957. COBOL was an early language to be compiled on multiple architectures, in 1960. [1]
The idea of compilation quickly caught on, and most of the principles of compiler design were developed during the 1960s.
A compiler is itself a computer program written in some implementation language. Early compilers were written in assembly language. The first self-hosting compiler, capable of compiling its own source code in a high-level language, was created for Lisp by Hart and Levin at MIT in 1962 [2]. The use of high-level languages for writing compilers gained added impetus in the early 1970s, when Pascal and C compilers were written in their own languages. Building a self-hosting compiler is a bootstrapping problem: the first such compiler for a language must either be compiled by a compiler written in a different language or (as with Hart and Levin's Lisp compiler) be compiled by running the compiler in an interpreter.
Most compilers are classified as either native or cross-compilers.
A compiler may produce binary output intended to run on the same type of computer and operating system ("platform") as the compiler itself runs on. This is sometimes called a native-code compiler. Alternatively, it might produce binary output designed to run on a different platform. This is known as a cross-compiler. Cross-compilers are very useful when bringing up a new hardware platform for the first time (see bootstrapping). They are necessary when developing software for microcontroller systems that have barely enough storage for the final machine code, much less for a compiler. A compiler capable of producing both native and foreign binary output may be called either a cross-compiler or a native compiler depending on the specific use, although it is more correct to classify it as a cross-compiler.
Interpreters are never classified as native or cross-compilers, because they don't output a binary representation of their input code.
Virtual machine compilers are typically not classified as either native or cross-compilers. However, if need be, they can be classified as one or the other, especially in the less usual cases where a compiler is running inside the same VM (making it a native compiler), or where a compiler is capable of producing an output for several different platforms, including a VM (making it a cross-compiler).
All compilers are either one-pass or multi-pass.
While the typical multi-pass compiler outputs machine code from its final pass, there are several other types.
Many people divide higher-level programming languages into compiled languages and interpreted languages. However, there is rarely anything about a language that requires it to be compiled or interpreted. Compilers and interpreters are implementations of languages, not languages themselves. The categorization usually reflects the most popular or widespread implementations of a language; for instance, BASIC is thought of as an interpreted language and C as a compiled one, despite the existence of BASIC compilers and C interpreters.
There are exceptions: some language specifications assume the use of a compiler (as with C) or spell out that implementations must include a compilation facility (as with Common Lisp). Some languages have features that are very easy to implement in an interpreter but make writing a compiler much harder. For example, SNOBOL4 and many scripting languages can construct arbitrary source code at runtime with regular string operations and then execute that code by passing it to a special evaluation function.
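As a small illustration of the last point, the sketch below builds source code as an ordinary string at runtime and hands it to Python's built-in `eval`. The helper name `make_adder_source` is made up for this example.

```python
def make_adder_source(n):
    """Build the source text of a function using plain string formatting."""
    return f"lambda x: x + {n}"


# The program text does not exist until this line runs.
source = make_adder_source(5)

# eval() compiles and evaluates the string, yielding a callable.
add_five = eval(source)
result = add_five(10)

print(source)  # lambda x: x + 5
print(result)  # 15
```

Because the string can depend on arbitrary runtime data, an ahead-of-time compiler cannot know in advance what code will be executed; an implementation has to keep a compiler or interpreter available at runtime to support this feature.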
In the past, compilers were divided into many passes[1] to save space. A pass in this context is a run of the compiler through the source code of the program to be compiled, resulting in the building up of the internal data of the compiler (such as the evolving symbol table and other assisting data). When each pass is finished, the compiler can free the internal data space needed during that pass. This 'multipass' method of compiling was useful in the early days of computing due to the small main memories of host computers relative to the source code and data.
Many modern compilers share a common 'two stage' design. The first stage is the front end, which translates the source language into an intermediate representation. The second stage is the back end, which works with the intermediate representation to produce code in the output language. The front end and back end may operate as separate passes, or the front end may call the back end as a subroutine, passing it the intermediate representation.
This approach mitigates complexity by separating the concerns of the front end, which typically revolve around language semantics, error checking, and the like, from the concerns of the back end, which concentrates on producing output that is both efficient and correct. It also allows a single back end to be used for multiple source languages, and different back ends to be used for different targets.
Often, optimizers and error checkers can be shared across front ends and back ends if they are designed to operate on the intermediate language that a front end passes to a back end. This lets many compilers (combinations of front and back ends) reuse the large amounts of work that often go into code analyzers and optimizers.
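The front end / back end split can be sketched in a few lines. Everything here is an assumption made for illustration: the toy expression language (integers, names, `+`, `*`, parentheses), the nested-tuple intermediate representation, and the two back ends (`stack_back_end` for an imaginary stack machine, `python_back_end` emitting a Python expression).

```python
import re


def front_end(source):
    """Parse a toy expression into a nested-tuple intermediate representation."""
    # Simplification: characters outside the toy language are silently skipped.
    tokens = re.findall(r"\d+|[A-Za-z_]\w*|[+*()]", source)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_sum():
        node = parse_product()
        while peek() == "+":
            advance()
            node = ("add", node, parse_product())
        return node

    def parse_product():
        node = parse_atom()
        while peek() == "*":
            advance()
            node = ("mul", node, parse_atom())
        return node

    def parse_atom():
        tok = peek()
        if tok is None or tok in ("+", "*", ")"):
            raise SyntaxError(f"unexpected token {tok!r}")
        advance()
        if tok == "(":
            node = parse_sum()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if tok.isdigit():
            return ("const", int(tok))
        return ("var", tok)

    tree = parse_sum()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


def stack_back_end(ir):
    """Back end 1: emit instructions for a hypothetical stack machine."""
    kind = ir[0]
    if kind == "const":
        return [f"PUSH {ir[1]}"]
    if kind == "var":
        return [f"LOAD {ir[1]}"]
    op = {"add": "ADD", "mul": "MUL"}[kind]
    return stack_back_end(ir[1]) + stack_back_end(ir[2]) + [op]


def python_back_end(ir):
    """Back end 2: emit a fully parenthesized Python expression."""
    kind = ir[0]
    if kind in ("const", "var"):
        return str(ir[1])
    op = {"add": "+", "mul": "*"}[kind]
    return f"({python_back_end(ir[1])} {op} {python_back_end(ir[2])})"


ir = front_end("x + 2 * 3")
stack_code = stack_back_end(ir)
python_code = python_back_end(ir)
print(stack_code)   # ['LOAD x', 'PUSH 2', 'PUSH 3', 'MUL', 'ADD']
print(python_code)  # (x + (2 * 3))
```

Neither back end knows anything about the source syntax, and the front end knows nothing about either target; the tuple representation is the only thing they share.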
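A shared optimizer of this kind might look like the following sketch. It performs constant folding, chosen here only as a simple example of an optimization, on an assumed nested-tuple intermediate representation; the node kinds `const`, `var`, `add`, and `mul` are invented for the illustration.

```python
def fold_constants(ir):
    """Replace operations on two constants with their computed result."""
    kind = ir[0]
    if kind in ("const", "var"):
        return ir  # leaves are already as simple as they can be

    # Optimize the operands first, then try to fold this node.
    left = fold_constants(ir[1])
    right = fold_constants(ir[2])
    if left[0] == "const" and right[0] == "const":
        if kind == "add":
            return ("const", left[1] + right[1])
        if kind == "mul":
            return ("const", left[1] * right[1])
    return (kind, left, right)


# Intermediate representation of: x + 2 * 3
ir = ("add", ("var", "x"), ("mul", ("const", 2), ("const", 3)))
optimized = fold_constants(ir)
print(optimized)  # ('add', ('var', 'x'), ('const', 6))
```

Because the function reads and writes only the intermediate representation, it would work unchanged for any front end that produces these nodes and any back end that consumes them.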
Certain languages can be compiled in a single pass because their design requires variables and other objects to be declared, and executable procedures to be predeclared, before they are referenced or used. The Pascal programming language is well known for this capability, and in fact many Pascal compilers are themselves written in Pascal because of the rigid specification of the language and the ability to compile Pascal programs in a single pass.
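The reason a declare-before-use rule makes one pass sufficient can be shown with a minimal sketch. The statement format (pairs of `"declare"` or `"use"` and a name) and the function `single_pass_check` are hypothetical; they stand in for the symbol-table handling of a real compiler.

```python
def single_pass_check(statements):
    """Walk the program once, resolving each name against what was already seen."""
    declared = set()
    errors = []
    for line_no, (kind, name) in enumerate(statements, start=1):
        if kind == "declare":
            declared.add(name)
        elif kind == "use" and name not in declared:
            # With declare-before-use, this can be reported immediately;
            # there is no need to look ahead or make a second pass.
            errors.append(f"line {line_no}: '{name}' used before declaration")
    return errors


program = [
    ("declare", "total"),
    ("use", "total"),
    ("use", "helper"),      # helper is declared only on the next line
    ("declare", "helper"),
]
errors = single_pass_check(program)
print(errors)  # ["line 3: 'helper' used before declaration"]
```

A language that allowed `helper` to be used before its declaration would force the compiler either to read the whole program first or to revisit line 3 later, which is what additional passes are for.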
The compiler front end consists of multiple phases itself, each informed by formal language theory:
While there are applications where only the compiler front end is necessary, such as static language verification tools, a real compiler hands the intermediate representation generated by the front end to the back end, which produces a functionally equivalent program in the output language. This is done in multiple steps:
Compiler analysis is a prerequisite for any compiler optimization, and the two work closely together. For example, dependence analysis is crucial for loop transformation.
In addition, the scope of compiler analysis and optimization varies greatly, from as small as a basic block to the procedure/function level, or even the whole program (interprocedural optimization). A compiler can potentially do a better job with a broader view, but that broad view is not free: large-scope analysis and optimization are very costly in compilation time and memory space. This is especially true for interprocedural analysis and optimization.
Interprocedural analysis and optimization are common in modern commercial compilers from SGI, Intel, Microsoft, and Sun Microsystems. The open source GCC was criticized for a long time for lacking powerful interprocedural optimizations, but it is changing in this respect. Another open source compiler with a full analysis and optimization infrastructure is Open64, which is used by many organizations for commercial purposes.
Because of the extra time and space needed for compiler analysis and optimization, most compilers skip them by default. Users must pass compilation options to tell the compiler explicitly which optimizations to enable.
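To make the dependence-analysis example concrete, the sketch below checks a very restricted loop shape: `for i in range(trip_count): A[i + w] = f(A[i + r1], A[i + r2], ...)`. The function name and the restriction to constant offsets on a single array are assumptions made for this illustration; real dependence analysis handles far more general subscripts.

```python
def has_loop_carried_dependence(write_offset, read_offsets, trip_count):
    """Report whether some iteration touches an element written in a different one."""
    for r in read_offsets:
        # Iteration i writes A[i + write_offset]; iteration j reads A[j + r].
        # They touch the same element when j - i == write_offset - r.
        distance = write_offset - r
        # distance == 0 means the read and write occur in the same iteration.
        # |distance| >= trip_count means the other iteration does not exist.
        if distance != 0 and abs(distance) < trip_count:
            return True
    return False


# A[i] = A[i - 1] + 1  : each iteration needs the previous one's result.
prefix_like = has_loop_carried_dependence(0, [-1], trip_count=100)

# A[i] = A[i] * 2      : iterations are independent of one another.
elementwise = has_loop_carried_dependence(0, [0], trip_count=100)

print(prefix_like)  # True
print(elementwise)  # False
```

Only the second loop could safely have its iterations reordered or run in parallel; an optimizer needs the result of an analysis like this before it applies such a transformation.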