Friday 10 April 2020

Compilation Stages

Introduction:

Embedded engineers program embedded systems in a high-level language such as C, C++ or Python. All the functionality that the embedded system needs to perform is written in these high-level languages, which are human readable. But these languages are not machine readable: machines only understand binary, i.e. 1s and 0s. So we need a system that translates the high-level language into a machine-readable low-level language. For example, embedded C code has to be converted to hex files or executable files (*.exe). This procedure is called compilation.

This compilation doesn't happen in a single step but in multiple stages. Together, these stages are called the compilation stages.

Definition of Compilation Stages: The various stages involved in converting a human-readable high-level language (like C, C++, Java, Python etc.) into a machine-readable low-level form (binaries like hex, s37, exe etc.) that can be processed by the underlying microcontroller are called compilation stages.



The system that implements the compilation stages is usually called the build system, and the process is called building. All the input files containing the high-level language functions and logic together are called the project, and building this project (running the compilation stages) generates the final executable file.

Fig 1  Purpose of Compilation stages 



In this post, we shall try to understand what the various compilation stages are and what happens in each of them.


Before jumping into the compilation stages, let us see what the possible input and output file formats are. Since this is the embedded C series, we shall consider only C as the high-level language.

Input File Formats: The possible input files are source files (*.c), where all the functions and logic are written, and header files (*.h), where the function prototypes, macros etc. are defined and which are included (#include) in the source files.

Output File Formats: The possible output file formats are executable files (*.exe), the hex format defined by Intel and also used by Infineon (*.hex), the S-Record formats by Motorola (*.s37, *.s28, *.s19), or an Executable and Linkable Format file (*.elf). Depending on the flashing tool and the underlying hardware, a particular format is chosen.

Note: All the file formats (input, intermediate and output) are listed and explained in a table at the end of this post. Scroll down to navigate to the table.


What are the stages:

In an embedded system, compilation consists mainly of 6 stages. They are as follows:
  1. Preprocessing
  2. Compilation
  3. Assembling
  4. Linking
  5. Locating
  6. Loading
Fig 2  Compilation Stages


Let us understand each stage one by one

Preprocessor

    In embedded C, there are a lot of preprocessor directives. For example, to include header files we have #include, and to define a macro we have #define. We have conditional directives like #if, #elif and #endif etc. For compiler-specific directives we have #pragma. All these preprocessor directives start with the hash symbol (#). Preprocessor directives are very widely used in embedded C. However, why we use them and their advantages are not really in the scope of this article, so I'll skip that. But know that the preprocessor stage processes all these directives, removes the comments from the C files and generates what is called an intermediate file.
    The number of intermediate files is the same as the number of C files. That is, if there are say 20 C files and some 46 header files included in these C files, then the preprocessor copies the content of these header files into the C files wherever they are included, replaces macros, processes the other preprocessing directives, removes comments and generates 20 intermediate files.
    So the preprocessor takes C files and header files as input and generates intermediate files as output. These intermediate files are then sent to the compiler stage.
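
    As a small illustration of what this stage works on, here is a minimal sketch with an object-like macro, a function-like macro and a conditional directive. The names BUFFER_SIZE, SQUARE, USE_FAST_PATH, scaled_buffer_size and print_buffer_info are made up for this example. With gcc, the preprocessed output can be inspected by stopping after this stage with `gcc -E file.c`.

```c
#include <stdio.h>

/* Object-like macro: every use is replaced by 16 before compilation. */
#define BUFFER_SIZE 16

/* Function-like macro: arguments are parenthesised to avoid precedence bugs. */
#define SQUARE(x) ((x) * (x))

/* Feature switch evaluated by the preprocessor, not at run time. */
#define USE_FAST_PATH 1

int scaled_buffer_size(void)
{
#if USE_FAST_PATH
    /* Only this branch is passed on to the compiler. */
    return SQUARE(BUFFER_SIZE) / 2;
#else
    return BUFFER_SIZE;
#endif
}

void print_buffer_info(void)
{
    printf("buffer size: %d, scaled: %d\n", BUFFER_SIZE, scaled_buffer_size());
}
```

    After preprocessing, the body of scaled_buffer_size() contains only `return ((16) * (16)) / 2;`. The #else branch and the comments are gone, and the content of stdio.h has been pasted in at the top of the intermediate file.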


Compiler

    The compiler takes the intermediate files, checks and translates the embedded C syntax in them, and replaces it with the corresponding assembly code. Assembly code and its mnemonics are specific to the underlying processor, so different controllers need different compilers. For example, for Intel x86 we have the gcc compiler: gcc processes the C syntax and replaces it with assembly code using the mnemonics of the x86 processor.
    If you use a different compiler, such as a GNU compiler for an Infineon processor, it will use the mnemonics specific to the IFX controller. If such a compiler runs on an x86 machine but generates code for a different target, this is called cross compilation. We shall explain cross compilation in a separate article. But note that the dependency on the underlying hardware starts at the compiler level.
    So the compiler processes all the intermediate files given by the preprocessor and creates the same number of assembly files (*.asm). These assembly files are forwarded to the assembler.
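
    To make the hardware dependency a little more concrete, the sketch below reports which architecture a file was compiled for, using macros that GCC and Clang predefine for the target (the _M_ names are the MSVC equivalents). The function name target_architecture is made up for this example, and the list of architectures is far from complete. With gcc, the generated assembly itself can be inspected with `gcc -S file.c`.

```c
/* Returns the name of the architecture this file was compiled FOR,
 * which for a cross compiler differs from the machine it runs ON. */
const char *target_architecture(void)
{
#if defined(__x86_64__) || defined(_M_X64)
    return "x86-64";
#elif defined(__i386__) || defined(_M_IX86)
    return "x86 (32-bit)";
#elif defined(__aarch64__)
    return "AArch64";
#elif defined(__arm__)
    return "ARM (32-bit)";
#elif defined(__riscv)
    return "RISC-V";
#else
    return "unknown";
#endif
}
```

    The same source file gives a different answer, and completely different assembly, depending on which compiler (native or cross) processes it.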


Assembler

    The assembler translates the assembly files into the corresponding binary code and creates relocatable object files. In these object files, it creates separate code and data sections and uses symbols instead of hard-coded values for addresses. The number of object files equals the number of assembly files given to it.
    These object files are binary files and not human readable. They are called relocatable object files because their code and data are not yet tied to fixed address locations: symbols (tags) are used instead, together with a symbol table, so the contents can be placed (relocated) at suitable addresses in a later stage.
    When modules are shared between two entities, say two companies, the company sharing its module usually provides these relocatable object files, or an object library, which is a collection of object files in a single file (*.a). It does not want the receiving company to understand the logic, as that may be its intellectual property.
    These relocatable object files and library files are sent to the linker. If libraries and relocatable object files from other entities are present along with our own source and header files, they are taken directly as input to the linker along with the relocatable object files generated by the assembler.
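
    As a rough sketch of how C objects end up in code and data sections, consider the file below. The section names in the comments (.text, .data, .bss, .rodata) are the ones commonly used by GCC-style toolchains and may differ on other compilers. All identifiers are made up for this example. With GNU binutils, the symbols of the resulting object file can be listed with `nm file.o`.

```c
/* Initialised global: typically placed in the data section (.data). */
int sensor_offset = 5;

/* Uninitialised static: typically placed in .bss and zeroed at start-up. */
static int call_count;

/* Constant table: typically placed in read-only data (.rodata). */
const int gain_table[3] = {1, 2, 4};

/* Function body: machine code placed in the code section (.text). */
int apply_gain(int raw, int index)
{
    call_count++;
    return (raw + sensor_offset) * gain_table[index];
}

int get_call_count(void)
{
    return call_count;
}
```

    In the object file, names such as sensor_offset and apply_gain are kept as symbols. None of them has a final address yet.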


Linker

    As the name indicates, the linker links multiple relocatable object files into a single relocatable object file. This process of resolving the references between multiple object files is called linking.
    The linker phase combines relocatable object files (.o files, generated by the assembler) and libraries into a single relocatable linker object file (.out). The linker can simultaneously link all programs for all cores available on a target board. Fig 3 shows the functionality of a linker.

Fig 3  Functionality of Linker

So after linking the multiple object files into a single relocatable object file (*.out), the linker forwards it to the locator.
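
The kind of reference a linker resolves can be sketched as follows. For the sake of a single listing, the content of three hypothetical files (sensor.h, app.c and sensor.c) is shown together. In a real project each .c file would be compiled to its own object file. All names are made up for this example.

```c
/* ---- would normally live in sensor.h ---- */
extern int sensor_bias;      /* declared here, defined elsewhere */
int read_sensor(int raw);    /* prototype only, no body */

/* ---- would normally live in app.c ---- */
/* In app.o, read_sensor is an unresolved (external) symbol. */
int process_sample(int raw)
{
    return read_sensor(raw) * 2;
}

/* ---- would normally live in sensor.c ---- */
/* sensor.o provides the definitions the linker matches against. */
int sensor_bias = 3;

int read_sensor(int raw)
{
    return raw + sensor_bias;
}
```

If sensor.o were left out of the link, the linker would stop with an "undefined reference" style error for read_sensor, because no object file provides that symbol.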

Locator

    The locator phase assigns absolute addresses to the linker object file and creates an absolute object file which you can load into a target processor. As discussed, the relocatable object file contains symbols for addresses; the locator replaces all those symbols with hard-coded address values. These are the logical addresses used by the CPU of the underlying microcontroller. The resulting object file is called an absolute object file.
    So the locator takes the relocatable object file from the linker, generates the absolute object file and forwards it to the loader.
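
    The idea of assigning absolute addresses can be sketched with a toy model. This is not how any particular toolchain implements its locator. It is only a simplified illustration that places sections one after another inside a memory region while respecting alignment. The type section_t, the function locate_sections and the example addresses are all assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* One section of the linked image, as seen by this toy locator. */
typedef struct {
    const char *name;     /* e.g. ".text" */
    uint32_t    size;     /* size in bytes */
    uint32_t    align;    /* required alignment, a power of two (>= 1) */
    uint32_t    address;  /* filled in by locate_sections() */
} section_t;

/*
 * Place the sections one after another, starting at region_base.
 * Returns 1 if every section fits into the region, 0 otherwise.
 */
int locate_sections(section_t *sections, size_t count,
                    uint32_t region_base, uint32_t region_size)
{
    /* 64-bit arithmetic avoids wrap-around near the top of the address space. */
    uint64_t end = (uint64_t)region_base + region_size;
    uint64_t cursor = region_base;

    for (size_t i = 0; i < count; i++) {
        uint64_t align = sections[i].align;

        /* Round the cursor up to the next multiple of the alignment. */
        cursor = (cursor + align - 1u) & ~(align - 1u);

        if (cursor + sections[i].size > end) {
            return 0; /* this region is full */
        }
        sections[i].address = (uint32_t)cursor;
        cursor += sections[i].size;
    }
    return 1;
}
```

    For example, with a region starting at 0x08000000, a 0x105-byte .text section with 4-byte alignment is placed at 0x08000000, and a following .rodata section with 16-byte alignment is placed at 0x08000110. A real locator additionally honours placement requests from the project configuration and patches every symbol reference with the final address.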

Loader

    The loader converts the absolute object file into the final executable binary files. The absolute object file contains logical addresses, which are the addresses used by the CPU. But these may not be the actual physical memory locations in the microcontroller where the instruction or data will be stored; those are called physical addresses. Usually there is some offset between the physical and logical addresses. The loader updates the addresses in the absolute object file to physical addresses and creates the final binary files, which can be flashed into the embedded system.
    Some of the final binary formats are explained below:
  • Hex (*.hex): Intel came up with this format. It is mainly used for little-endian controllers. Even Infineon uses this format. Click HERE to learn more about this format.
  • S-RECORD (*.s37, *.s28, *.s19): Motorola created its own format, mainly used for big-endian controllers. S37 uses 32-bit addresses, S28 uses 24-bit addresses and S19 uses 16-bit addresses. Click HERE to learn more about this format.
  • ELF (*.elf): This is another binary format, mainly used for debugging purposes. Click HERE to learn more about this format.
  • Executable (*.exe): This format is usually generated for normal C programs that run on a PC. Click HERE to know more about this format.
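
To give a feel for what one of these formats looks like, here is a minimal sketch that builds a single Intel HEX data record. A record has the form `:LLAAAATT<data>CC`, where LL is the byte count, AAAA the 16-bit address, TT the record type (00 for data) and CC the checksum, which is the two's complement of the sum of all preceding bytes. The function name ihex_data_record is made up for this example. Other record types (end of file, extended addresses) are not handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format one Intel HEX data record (record type 00) into out.
 * out must hold at least 11 + 2 * len characters plus the terminating NUL.
 * Returns the number of characters written, or -1 on bad arguments.
 */
int ihex_data_record(char *out, size_t out_size,
                     uint16_t address, const uint8_t *data, uint8_t len)
{
    size_t needed = 11u + 2u * (size_t)len + 1u;
    if (out == NULL || data == NULL || out_size < needed) {
        return -1;
    }

    /* Start the checksum with byte count, both address bytes and type 00. */
    uint8_t sum = (uint8_t)(len + (address >> 8) + (address & 0xFFu));
    int pos = snprintf(out, out_size, ":%02X%04X00",
                       (unsigned)len, (unsigned)address);

    for (uint8_t i = 0; i < len; i++) {
        sum = (uint8_t)(sum + data[i]);
        pos += snprintf(out + pos, out_size - (size_t)pos, "%02X",
                        (unsigned)data[i]);
    }

    /* Checksum: two's complement of the low byte of the sum. */
    uint8_t checksum = (uint8_t)(0u - sum);
    pos += snprintf(out + pos, out_size - (size_t)pos, "%02X",
                    (unsigned)checksum);
    return pos;
}
```

For the three data bytes 02 33 7A at address 0x0030, this produces the record `:0300300002337A1E`. A flashing tool reading the file can recompute the checksum of each record to detect corruption.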

Once the final binary files are available, a debugger or bootloader can be used to flash them into the memory (ROM) of the embedded system, since the microcontroller can understand this binary language.



File | Format | Description
Source file | *.c | Contains the logic and functions in the high-level language. We write our programs in these files.
Header file | *.h | Helper files for your C program that hold the declarations (prototypes) of various functions and their associated variables, which are imported into your C program with the preprocessor #include statement.
Intermediate file | *.i | Same as the source file after all preprocessor directives have been processed. There is one intermediate file per C file, with all macros replaced, all comments removed and all header files included.
Assembly file | *.asm | Contains assembly language mnemonics specific to the underlying processor. It is a low-level language.
Relocatable object file | *.obj or *.o (toolchain dependent) | An object file containing symbols (tags) instead of hard-coded address values. It contains code and data sections.
Absolute object file | *.o | An object file containing the exact logical addresses instead of symbols. It also contains code and data sections.
Library file | *.a | A group of object files is called a library file. Usually, when people share modules with others but don't want to reveal the logic, they share them in the form of a library file, which is not human readable.
Executable | *.exe, *.hex, S-RECORD, *.elf | All of these are binary executable files. They contain binary values in different formats that can be understood by the underlying microcontroller. Intel defined the hex file format, which is little endian, and Motorola defined the S-record formats (s37, s28, s19), which are big endian. ELF is also a binary file, which is used for debugging as well.

My YouTube Video explaining Compilation stages




4 comments:

  1. Nicely explained. Clear and crisp. Thank you

  2. Hello Shyam,
    It was a very informative and clean video.
    I have a question, can you please try to answer it:
    How does the locator fill in the (virtual) addresses, I mean on what basis?
    As per my understanding, it is based on the area where the variable or code will sit, for example the RAM area.
    How does the locator know it?

    Replies
    1. Dear Ansuman,
      I am glad that you liked my article and video and are proactively asking questions. It's the right way to learn, so thanks for that. Now let me try to answer your question as per my knowledge.
      The address-related inputs can be given at various levels. They can be taken from the memmap file generated during compilation, from #pragma code and data sections, by exclusively reserving some memory area in the lsl file for the linker (in the case of the TASKING toolset), or through the compiler-specific directive "__at()". All these methods make the locator place particular code and data at their own locations. But if nothing is specified, the locator itself will allocate some available memory address for such code or data.

  3. Hi shyam,

    Thanks for sharing the insight.
    It was useful
