Saturday, February 28, 2009

An ideal compiler architecture

I found the compiler architecture description here, http://lambda.uta.edu/cse5317/notes/node5.html to be the most appealing from all the definitions I have read upto now.

Particularly I like the following idea:

"Suppose that you want to build compilers for n programming languages (eg, FORTRAN, C, C++, Java, BASIC, etc) and you want these compilers to run on m different architectures (eg, MIPS, SPARC, Intel, alpha, etc). If you do that naively, you need to write n*m compilers, one for each language-architecture combination.

The holly grail of portability in compilers is to do the same thing by writing n + m programs only. How? You use a universal Intermediate Representation (IR) and you make the compiler a two-phase compiler. An IR is typically a tree-like data structure that captures the basic features of most computer architectures. The first phase of this compilation scheme, called the front-end, maps the source code into IR, and the second phase, called the back-end, maps IR into machine code. That way, for each programming language you want to compile, you write one front-end only, and for each computer architecture, you write one back-end. So, totally you have n + m components."


Though the above idea of compiler construction is quite ideal, the author cautions with these statements:

"But the above ideal separation of compilation into two phases does not work very well for real programming languages and architectures. Ideally, you must encode all knowledge about the source programming language in the front end, you must handle all machine architecture features in the back end, and you must design your IRs in such a way that all language and machine features are captured properly."

I guess, this is something interesting to think about. How good it will be, if all major programming languages translate to the same intermediate representation.

No comments: