# From C to exec

As the idea of having a technical blog grew in my mind and came closer to reality, it became obvious that it needed to host some kind of useful content apart from fluff and opinions. For this, I asked a few of my many technical friends:

> What can I write about for you?

The suggestion that I liked the most was to write about the process that turns a C source file into an executable. The topic seems quite opaque, and indeed I don't recall seeing many ways to get into it without already knowing something about linking, translation units, and symbols.

The goal of this post is to establish this background for a reader who has a passing familiarity with C.

The seasoned reader should be warned that this is not a rigorous summary of the C standard, but rather a gentle introduction to the practical world of linking ELF files on GNU/Linux with GCC. Simplicity prevails, so while the C standard gives some room for different results, we'll explore only the case of unoptimized code and GCC version 9.

## Act 1: Hello World

If you've dealt with C before, you're probably familiar with the `hello.c` file.

```
#include <stdio.h>
int main(void) {
  puts("Hello World!");
  return 0;
}
```

The process of turning code into an executable usually consists of at least two main steps: *compilation* and *static linking*. Depending on how you've usually done your programming, you may be familiar with different ways of performing those two steps. Let's take a look at the two most popular ones.

### Scene 1: Executable

This unsuspecting source file is about to undergo transformations that push it through the entire pipeline, changing it beyond recognition into a dynamic executable. Prepare your terminals:

```
# gcc hello.c -o hello
# ./hello
Hello World!
# file hello
hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=c1b39511769ab152611cd8ad83983dc724333b7a, not stripped
```

Congratulations, we've turned code into an executable!

### Scene 2: Linking

But wait, wasn't this supposed to be 2 steps? Indeed, gcc does them both at the same time when not passed the `-c` flag. Let's try again:

```
# gcc -c hello.c -o hello.o
# ./hello.o
bash: ./hello.o: Permission denied
# file hello.o
hello.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
```

That's not the same kind of file as before! What we got is an object file instead: a compiled version of the source code, without any linking applied. Notice the "relocatable" instead of "executable".

Let's link it now:

```
# gcc hello.o -o hello
# ./hello
Hello World!
```

That's an executable again!

You may wonder: what happened? How is the intermediate file different from the executable? Why is it useful? Let's move on to the second act...

## Act 2: Symbols

Here we discover why linking is useful and necessary.

Our project is growing more complicated. It's composed of two source files now.

`hello.c` morphs:

```
#include "print.h"
int main(void) {
  print_hello();
  return 0;
}
```

`printer.c` contains:

```
#include <stdio.h>
void print_hello(void) {
  puts("Hello World!");
}
```

Finally, they are joined by `print.h`:

```
void print_hello(void);
```

### Scene 1: Object files

This is a more complicated project. A function from one file is referenced in another! And there's a new header file as well. How do we turn this into an executable? Of course, we could just use `gcc` without the `-c` flag, and hope that it figures out everything on its own. But that's not why we're here, so let's go step by step:

This compiles the `printer.c` file into an object file `printer.o`:

```
gcc -c printer.c -o printer.o
```

So far so good. This produces the `hello.o` file, which is compiled `hello.c`:

```
gcc -c hello.c -o hello.o
```

Wait a minute… `hello.c` calls a function in `printer.c`, but neither the command nor `hello.c` references `printer.c` anywhere! All they have is the name `print_hello`, but not the body of the function. How does the compiler know what call to insert?

Turns out that the compiler doesn't actually need to resolve all names to function bodies immediately. What it does instead is use *symbols* to step around the problem. When a source file containing a function (or a global variable) named `print_hello` gets compiled, the resulting object file (or a library) gets a symbol called `print_hello`. On the other hand, any object file *using* that symbol creates a reference to it (in the *relocation table*), to be filled later: at linking time.

Let's see it for ourselves:

```
# nm printer.o
0000000000000000 T print_hello
                 U puts
```

According to `man nm`, `T` means the symbol is defined and in the *Text* section. What about the other object file?

```
# nm hello.o
0000000000000000 T main
                 U print_hello
```

As expected, `print_hello` is referenced, but marked `U` for "Undefined". It's simply not in this file.
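The same letters show up for other kinds of declarations. Here is a minimal sketch to experiment with; the file name `symbols.c` and everything in it (`counter`, `twice`, `bump`) are made up for illustration and are not part of the project above. The letters in the comments are what I would expect from `nm` after `gcc -c symbols.c` with unoptimized code, not a guarantee for every toolchain.

```c
/* symbols.c: hypothetical file for poking at with nm */
#include <stdio.h>

/* Defined, initialized global variable: expect "D" (data section). */
int counter = 1;

/* static limits the name to this file: expect a lowercase "t",
   which nm uses for local symbols. */
static int twice(int x) {
  return 2 * x;
}

/* Defined global function: expect "T", just like print_hello. */
int bump(void) {
  counter = twice(counter);
  puts("bumped"); /* only declared in stdio.h, not defined here: expect "U" */
  return counter;
}
```

Note that there is no `main` here, so this file only goes as far as an object file. The undefined `puts` stays a vacancy in exactly the same way `print_hello` does in `hello.o`.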

To fill in the vacancy, let's do the final step, and link both files together:

```
# gcc hello.o printer.o -o hello
# nm hello
[...]
0000000000401126 T main
0000000000401136 T print_hello
[...]
```

There are many more symbols, but the one we're interested in is there: `print_hello` is now present. Notice that `main` is present as well, meaning that we have the bodies of *both* of our source files in the same compiled file now.

The linker has performed *relocation*. If you remember from before, the type of the *object file* reported by the `file` command was `relocatable`. Our binary is now *executable*, which means we can no longer repeat the same operation to add symbols we forgot about. But the linker will not let you forget anything anyway:

```
# gcc hello.o -o hello
/usr/bin/ld: hello.o: in function `main':
hello.c:(.text+0x5): undefined reference to `print_hello'
collect2: error: ld returned 1 exit status
```

### Scene 2: Header files

The other anomaly introduced in this example is the header file. It contains only one line:

```
void print_hello(void);
```

What is it for? Suppose we didn't have it. The only information about `print_hello` would come from its call, which looks like this:

```
  print_hello();
```

Can you guess the type of the function? Well, sort of: it takes no arguments, but it may return anything. And if we made a mistake, like this:

```
  print_hello("world");
```

a compiler guessing the type of arguments would happily go on trusting what we wrote, instead of warning us of the extra argument. The line in the header file, called a *prototype*, informs the compiler about which functions are available, and what type they have.
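To try this out without juggling three files, here is a rough single-file sketch. It assumes the prototype is pasted in by hand, which is effectively what `#include "print.h"` does; the `greet` function is a made-up caller standing in for `main`.

```c
#include <stdio.h>

/* The prototype, exactly as it appears in print.h. */
void print_hello(void);

/* Hypothetical caller; uses print_hello before its body is seen. */
int greet(void) {
  print_hello(); /* matches the prototype */
  /* print_hello("world"); */
  /* Uncommenting the line above should make GCC reject the file,
     because the prototype says the function takes no arguments. */
  return 0;
}

/* The body, as in printer.c. The compiler also checks that this
   definition agrees with the prototype above. */
void print_hello(void) {
  puts("Hello World!");
}
```

Compiling this with `gcc -c` is enough to see the check in action: the mistake is caught at compilation, long before the linker gets involved.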

As an aside, you might notice that this was already partially covered by the linker: when function `print_hello` was unavailable in our example, we received an "undefined reference" message from the linker. Couldn't we get rid of headers and let the linker catch those errors? Indeed, the C++ compiler (`g++`) stores some additional information about the function type in the symbol name:

```
# cp printer.c printer.cpp
# g++ -c printer.cpp -o printerpp.o
# nm printerpp.o 
                 U puts
0000000000000000 T _Z11print_hellov
# c++filt _Z11print_hellov
print_hello()
```

This is called *symbol mangling*, and C compilers don't generally use it, leaving us with obligatory prototypes, and, in practice, header files.

Despite this, C++ still requires prototypes inside the header files, for reasons that are not clear to me. I suspect compatibility with C, or even IDE suggestions.

### Scene 3: Static libraries

Static libraries are used when additional functionality provided by a third party is needed. They also provide symbols, and may depend on other libraries. If this sounds familiar, it's because it is! That's the same core functionality as object files.

In fact, if you look at a static library on Linux (file name ending with `.a`), it's just an *ar* archive containing multiple object files.

```
# ar -t /usr/lib64/libm-2.29.a
s_lib_version.o
s_matherr.o
s_signgam.o
fclrexcpt.o
fgetexcptflg.o
fraiseexcpt.o
[...]
```

Let's turn our printer into a static library and see for ourselves:

```
# gcc -c -o printer.o printer.c
# ar rcs libprinter.a printer.o
# nm libprinter.a

printer.o:
0000000000000000 T print_hello
                 U puts
# gcc hello.o libprinter.a -o hello
# ./hello
Hello World!
```

## Intermission

Those steps are sufficient if you need to create a binary for the bare metal, for example for an Arduino with an AVR microcontroller, or if you're creating a basic operating system yourself. The executable needs to be provided in the final form, taking minimal advantage of libraries present on the system. Anything with an operating system permitting *dynamic (shared) libraries* (like Windows or Linux) is likely to have a 3rd step, which happens immediately before execution: *dynamic linking*. We will cover that in a future chapter, together with various gotchas related to how C emits symbols.

## Glossary

*compilation*: turning a *source* representation of a program into *machine code*
