Notes on the design of the Melbourne Mercury compiler

This file contains an overview of the design of the Melbourne Mercury compiler.

For an overview of how the compiler fits into the rest of the Mercury system (runtime, standard library, mdbcomp library, and so on), see overall_design.html.

Outline

The main job of the compiler is to translate Mercury into a target language, which can be C, Java or C#. Once upon a time, it could also translate a (smaller or larger) subset of Mercury to Aditi, to Erlang, to .NET's IL, and to bytecode for a once-planned but never implemented bytecode interpreter.

The top-level of the compiler is in the file mercury_compile.m. This forwards all of the work to the file mercury_compiler_main.m, which is a submodule of the top_level.m package. The basic design is that compilation is broken into the following stages:

Note that HLDS stands for ``high-level data structure'', which is the compiler's main internal representation of Mercury code.

Note that in reality the separation is not quite as simple as that. Although parsing is listed as step 1 and semantic analysis is listed as step 2, the last stage of parsing actually includes some semantic checks. And although optimization is listed as steps 3 and 5, it also occurs in steps 2, 4, and 6. For example, elimination of assignments to dead variables is done in mode analysis; middle-recursion optimization is done during code generation; and a few low-level optimizations are done in llds_out_instr.m as we are spitting out the C code.

In addition, the compiler is actually a multi-targeted compiler with several different back-ends.

mercury_compile_main.m is mostly concerned with option and argument processing, which is complicated by the fact that options can come from Mercury.options files, as well as from files named by other options. mercury_compile_main.m itself oversees the reading in and parsing of source files, but it then subcontracts the supervision of the later steps to other modules.

  • Step 1, the conversion of parse trees to the initial version of the HLDS, is managed by mercury_compile_make_hlds.m.
  • Step 2, semantic analysis of the HLDS, is managed by mercury_compile_front_end.m.
  • Step 3, high level transformations on the HLDS, is managed by done mercury_compile_middle_passes.m.
  • Steps 4, 5 and 6, code generation, optimization and output, are all managed by

    The LLDS backend is so named because its main data structure is the ``low-level data structure'', which is the compiler's representation of imperative code at a level that is only slightly above assembly code (pseudo)instructions, containing references to virtual machine registers (that may correspond to real machine registers), stack slots, labels, and jumps. The LLDS backend was the first backend, and it generates low-level C, preferably with GNU C extensions.

    The MLDS backend is so named because its main data structure is the ``medium-level data structure'', which is the compiler's representation of imperative code at a level that corresponds to handwritten C or Java code, containing e.g. assignment, if-then-else, switch and while statements inside functions and/or class methods. The MLDS backend was the second backend, and can generate (standard) C, Java or C#.

    The modules in the compiler are structured by being grouped into "packages". A "package" is just a meta-module, i.e. a module that contains other modules as submodules. Every submodule is stored in a separate file. (In most cases, the name of this file is just the final component in their full module name, though in a few cases, it includes some of the earlier components as well.) We have a package for the top-level, a package for each main pass, and finally there are also some packages for library modules that are used by more than one pass.

    Taking all this into account, the package structure looks like this:

    In addition to the packages mentioned above, there are also packages for the build system:

    Option handling

    Option handling is part of the libs.m package.

    The command-line options are defined in the module options.m. mercury_compile.m calls library/getopt.m, passing the predicates defined in options.m as arguments, to parse them. It then invokes handle_options.m (and indirectly, op_mode.m and compute_grade.m) to postprocess the option set. The results are represented using the type globals, defined in globals.m. (Part of the globals structure is defined in optimization_options.m, which is generated automatically from a higher-level specification.) The globals structure is available in the HLDS representation, but it is passed around as a separate argument both before the HLDS is built and after it is no longer needed.

    After we have processed the options, the rest of the action is co-ordinated either from mercury_compile.m or from make.m, depending on whether the `--make' option is in effect.

    The rest of this document contains a brief introduction each module of the compiler. If you want more information about either the purpose or the design of a particular module, see the documentation at the start of that module's source code.

    Build system

    Support for `--make' is in the make.m package, which contains the following modules:

    The build process also invokes routines in compile_target_code.m, which is part of the backend_libs.m package (see below).

    Front end

    1. Parsing

    The parse_tree.m package

    The first part of parsing is in the parse_tree.m package, which contains the modules listed below (except for the library/*.m modules, which are in the standard library). This part produces the parse_tree.m data structure, which is intended to match up as closely as possible with the source code, so that it is suitable for tasks such as pretty-printing.

    We transform the results of lexical analysis (done by library/mercury_term_lexer.m) into the Mercury parse tree in two stages:

    (There is also a parsing stage 3, which transforms the parse tree into the HLDS, but that is not part of the parse_tree.m package itself.)

    The parse tree data structure definition is in prog_data.m, prog_data_event.m, prog_data_foreign.m, prog_data_pragma.m, prog_data_used_modules.m, prog_item.m and file_kind.m, while the code to create it is in the following modules.

    There are several modules whose collective job it is to print (parts of) the parse tree.

    There are several modules that provide utility predicates that operate on (parts of) the parse tree.

    The second main use of the parse tree, besides transformation into HLDS, is the generation and consumption of interface files. This is done by the following modules.

    We do most, though not all, forms of module qualification on programs when they are represented as parse trees. This means that we add module qualifiers to all types, insts and modes, check that every referred-to type, inst and mode actually exists, and that there is only possible match. This is done in the parse_tree package because it must be done before the `.int' and `.int2' interface files are written. This code also checks whether imports are really needed in the interface.

    This work is done by the modules of the module_qual.m package.

    As to what is module qualified when:

    The parse tree module also contains equiv_type.m, which does expansion of equivalence types, and of `with_type` and `with_inst` annotations on predicate and function type and mode declarations.

    Expansion of equivalence types is really part of type-checking, but is done on the parse tree rather than on the HLDS because it turned out to be much easier to implement that way. (Though later we had to add a pass the expand equivalence types in the HLDS anyway.)

    The hlds.m package

    Once the stages listed above are complete, we then convert from the parse_tree data structure to another data structure which no longer attempts to maintain a one-to-one correspondence with the source code. This data structure is called the High Level Data Structure (HLDS), and it is defined in the hlds.m package.

    The last stage of parsing is this conversion to HLDS, which is done mostly by the following submodules of the make_hlds subpackage in the hlds package.

    Fact table pragmas are handled by fact_table.m (which is part of the ll_backend.m package). That module also reads the facts from the declared file and compiles them into a separate C file used by the foreign_proc body of the relevant predicate.

    The HLDS data structure itself is spread over the following modules:

    Some modules implement a pass that decides the representation of every type used by the module being compiled:

    The submodules of the hlds_out.m package contain predicates to dump the HLDS to a file. These predicates print all the information the compiler has about each part of the HLDS. The module hlds_desc.m, by contrast contains predicates that describe some parts of the HLDS (e.g. goals) with brief strings, suitable for use in progress messages used for debugging.

    The module hlds_defns.m contains code to print the set of definitions in the HLDS to a file. When dividing a module into two or more submodules, one can use the information thus generated to check whether the new modules include every type, inst, mode, predicate and function definition in the original module exactly once. (The other sorts of definitions, e.g. typeclass definitions, are typically so few in number that one can keep track of them in one's head.)

    The module hlds_call_tree.m contains code to print out the call tree of the local predicates and functions in the HLDS, ordered by a depth-first left-to-right traversal. This can suggest to programmers an order that the predicates and functions of the module can be put into in the source file, or at least provide a starting point for such an order.

    The hlds.m package also contains some utility modules that contain various library routines which are used by other modules that manipulate the HLDS:

    2. Semantic analysis and error checking

    This is the check_hlds.m package, with support from the mode_robdd.m package for constraint based mode analysis.

    Any pass which can report errors or warnings must be part of this stage, so that the compiler does the right thing for options such as `--halt-at-warn' (which turns warnings into errors) and `--error-check-only' (which makes the compiler exit at the end of semantic analysis).

    Middle end

    3. High-level transformations

    This is the transform_hlds.m package.

    The results of these program analyses are written out to both `.opt' and `.trans_opt' files by intermod_analysis.m. `.trans_opt' files consist of nothing but analysis results, while `.opt' files also contain other information written out by intermod.m. This other information falls into two categories.

    The module intermod_order_pred_info.m contains utilities needed by both intermod.m and intermod_analysis.m.

    The code of intermod.m uses intermod_info.m, intermod_decide.m and intermod_status.m as helpers.

    The intermod_mark_exported.m module updates the statuses of entities that intermod.m would export. It is used by compiler invocations whose task is something other than the creation of `.opt' files.

    Most of the remaining HLDS-to-HLDS transformations are optimizations:

    The module transform.m contains stuff that is supposed to be useful for high-level optimizations (but which is not yet used).

    The last three HLDS-to-HLDS transformations implement term size profiling (size_prof.m and complexity.m) and deep profiling (deep_profiling.m, in the ll_backend.m package). Both passes insert into procedure bodies, among other things, calls to procedures (some of which are impure) that record profiling information.

    3-6 The LLDS and MLDS backends

    a. LLDS backend

    This is the ll_backend.m package.

    3a. LLDS-specific HLDS -> HLDS transformations

    Before LLDS code generation, there are a few more passes which annotate the HLDS with information used for LLDS code generation, or perform LLDS-specific transformations on the HLDS:

    4a. LLDS code generation

    Code generation converts HLDS into LLDS. For the LLDS back-end, this is also the point at which we insert code to handle debugging and trailing, and to do heap reclamation on failure.

    The code generator also calls middle_rec.m to do middle recursion optimization, which is implemented during code generation.

    The code generation modules above make use of many service modules.

    The purpose of export.m is to allow non-Mercury-generated C code to call Mercury-generated C code. For each `pragma export' declaration, it generates C code fragments which declare and define the C functions which are the interface stubs for procedures exported to C.

    The generation of constants for RTTI data structures could also be considered a part of code generation, but for the LLDS back-end this is currently done as part of the output phase (see below).

    The result of code generation is the Low Level Data Structure (llds.m), which may also contain some data structures whose types are defined in rtti.m. The code for each procedure is generated as a cord of code fragments which is then flattened.

    5a. LLDS transformations

    Most of the various LLDS-to-LLDS optimizations are invoked from optimize.m. They are: In addition, stdlabel.m performs standardization of labels. This is not an optimization itself, but it allows other optimizations to be evaluated more easily.

    The module opt_debug.m contains utility routines used for debugging these LLDS-to-LLDS optimizations.

    Several of these optimizations (frameopt and use_local_vars) also use livemap.m, a module that finds the set of locations live at each label.

    Some of the low-level optimization passes use basic_block.m, which defines predicates for converting sequences of instructions to basic block format and back, as well as opt_util.m, which contains miscellaneous predicates for LLDS-to-LLDS optimization.

    Use_local_vars numbering also introduces references to temporary variables in extended basic blocks in the LLDS representation of the C code. The transformation to insert the block scopes and declare the temporary variables is performed by wrap_blocks.m.

    Depending on which optimization flags are enabled, optimize.m may invoke many of these passes multiple times.

    6a. LLDS output

    b. MLDS BACK-END

    This is the ml_backend.m package.

    The original LLDS code generator generates very low-level code, since the LLDS was designed to map easily to RISC architectures. The MLDS backend generates much higher-level code, suitable for generating not just C, but also Java and C# (and once upon a time, the intermediate language or IL of the .NET platform). This back-end uses the Medium Level Data Structure (mlds.m) as its intermediate representation. While the LLDS operates on the level of assembly languages (a second generation language, with the first generation being machine code), the MLDS operates on the level of a standard third generation imperative programming language.

    3b. MLDS-specific HLDS -> HLDS transformations

    Before code generation, there is a pass which annotates the HLDS with information used for code generation:

    When creating the MLDS back-end, we have tried to put into practice the lessons we have learned from writing the LLDS backend. One of these is that we prefer to do things as HLDS to HLDS transformations where possible, since this is much easier to debug initially and to modify later than the alternative approach of doing things inside the HLDS to MLDS code generator. Thus we have these two passes that use HLDS transformations to achieve effects for the MLDS backend that the LLDS backend achieves using code fragments that are scattered all over the LLDS code generator.

    4b. MLDS code generation

    The MLDS code generator also uses several auxiliary modules that do not themselves generate code.

    5b. MLDS transformations

    6b. MLDS output

    There are currently three target languages in which we can output MLDS code: C, Java, and C#. (They were added in that order.) Generated C code is intended to be given to a C compiler to turn the .c file into a .o or .pic_o file. Generated Java code is intended to be given to a Java compiler (normally javac) to turn the .java file into a .class file containing Java bytecodes. Generated C# code is intended to be given to a C# compiler to turn the .cs file into a .dll or .exe.

    In more detail:

    The mlds_to_target_util.m module contains types, functions and predicates that are needed by more than one of these MLDS backends.

    Smart recompilation

    The Mercury compiler can record program dependency information to avoid unnecessary recompilations when an imported module's interface changes in a way which does not invalidate previously compiled code.

    This functionality is implemented by the recompilation.m package.

    Miscellaneous

    The functionality of the above modules requires support from a significant number of utility predicates. Many of these are in modules that have been listed above, but some of them are not. Most of these are either in the backend_libs.m package (utility predicates used in more than one backend) or in the libs.m package (utility predicates used in many parts of the compiler, not just the various backends).

    Currently undocumented