I’ve been on a quest over the last year or so to understand fully how a program ends up going from your brain into code, from code into an executable and from an executable into an executing program on your processor. I like the point I’ve got to in this pursuit, so I’m going to brain dump here :)
Prerequisite Knowledge: Some knowledge of assembler will help. Some knowledge of processors will also help. I wouldn’t call either of these necessary, though; I’ll try my best to explain what needs explaining. What you will need, though, is a toolchain. If you’re on Ubuntu, hopefully this article will help. If you’re on another system, Google for “[your os] build essentials”, e.g. “arch linux build essentials”.
You have an idea for a program. It’s the best program idea you’ve ever had so you quickly prototype something in C:
#include <stdio.h>

int main(int argc, char* argv[]) {
    printf("Hello, world!\n");
    return 0;
}
A work of genius. You quickly compile and run it to make sure all is good:
$ gcc hello.c -o hello
$ ./hello
Hello, world!
Boom!
But wait… What has happened? How has it gone from being quite an understandable high level program into being something that your processor can understand and run? Let’s go through what’s happening step by step.
GCC is doing a tonne of things behind the scenes in the gcc hello.c -o hello command. It is compiling your C code into assembly, optimising lots in the process, then it is creating “object files” out of your assembly (usually in a format called ELF on Linux platforms), then it is linking those object files together into an executable file (again, executable ELF format). At this point we have the hello executable and it is in a well-known format with lots of cross-machine considerations baked in.
After we run the executable, the “loader” comes into play. The loader figures out where in memory to put your code, it figures out whether it needs to mess about with any of the pointers in the file, it figures out if the file needs any dynamic libraries linked to it at runtime and all sorts of mental shit like that. Don’t worry if none of this makes sense, we’re going to go into it in good time.
Compiling is a difficult bit of the process and it’s why compilers used to cost you an arm and a leg before Stallman came along with the GNU Compiler Collection (GCC). Commercial compilers do still exist but the free world has standardised on GCC or LLVM, it seems. I won’t go into a discussion as to which is better because I honestly don’t know enough to comment :)
If you want to see the assembly output of the hello.c program, you can run the following command:
$ gcc -S hello.c
This command will create a file called hello.s, which contains assembly code. If you’ve never worked with assembly code before, this step is going to be a bit of an eye opener. The file generated will be long, difficult to read and probably different to mine depending on your platform.
Now is not the time or place to teach assembly. If you want to learn, this book is a brilliant place to start. I will, however, point out a little bit of weirdness in the file. Do you see stuff like this?
EH_frame0:
Lsection_eh_frame:
Leh_frame_common:
Lset0 = Leh_frame_common_end-Leh_frame_common_begin
    .long Lset0
Leh_frame_common_begin:
    .long 0
    .byte 1
    .asciz "zR"
    .byte 1
    .byte 120
    .byte 16
    .byte 1
    .byte 16
    .byte 12
    .byte 7
    .byte 8
    .byte 144
    .byte 1
    .align 3
I was initially curious as to what this was as well, so I checked out Stack Overflow and came across a really great explanation of what this bit means, which you can read here.
Also, notice the following:
callq _puts
The assembly program is calling puts instead of printf. This is an example of the kind of optimisation GCC will do for you, even on the default level of “no optimisation” (the -O0 flag on the command line). printf is a really heavy function, due to having to deal with a large range of format codes. puts is far less heavy. I could only find the NetBSD version of it. puts itself is very small and it delegates to __sfvwrite, the code of which is here. If you want more information on how GCC will optimise printf, this is a great article.
Also, if assembler is a bit new to you, one thing to note is that this post is using GAS (GNU Assembler) syntax. There are different assemblers out there; a lot of people like the Netwide Assembler (NASM), which has a more human-friendly syntax.
GAS suffixes its instructions with a letter that describes what “word size” we’re dealing with. Above, you’ll see we used callq. The q stands for “quad”, which is a 64-bit value. Here are other suffixes you may run into:
- b: byte, 8 bits
- w: word, 16 bits
- l: long, 32 bits
- q: quad, 64 bits
By comparison, turning assembly instructions into machine code is pretty simple. Compiling is a much more difficult step than assembling is. Assembly instructions are often a 1 to 1 mapping into machine code.
At the end of the assembling stage, you would expect to have a file that just contained binary instructions, right? Sadly that’s not quite the case. The linker and the operating system need to know a lot more about your code than just the instructions. To facilitate passing this required meta-information there are a variety of binary file formats. A very common one on *nix systems is ELF: Executable and Linkable Format.
Your program will be broken up into lots of sections. For example, a section called .text contains your program code. A section called .bss stores statically allocated variables (globals, essentially) that are not given a starting value, and thus get zeroed. A section called .rodata contains read-only data, including the string literals you use in your program. In our hello.c example, the string "Hello, world!\n" will go into .rodata. (The similarly named .strtab section is a string table too, but it holds the names of symbols rather than the strings your program prints.)
This article, from issue 13 of Linux Journal in 1995, gives a really good overview of the ELF format from one of the people who created it. It’s quite in depth and I didn’t understand everything he said (still not sure on relocations), but it’s very interesting to see the motivations behind the format.
Coming back from the previous tangent, let’s think about linking. When you compile multiple files, the .c files get compiled into .o files. When I first started writing C code, one thing that continuously baffled me was how a .c file referenced a function in another .c file. You only reference .h files in a .c file, so how did it know what code to run?
The way it works is by creating a symbol table. There are a multitude of types of symbols in an executable file, but the general gist is that a symbol is a named reference to something. The nm utility allows you to inspect an executable file’s symbol table. Here’s some example output:
$ nm hello
0000000100001048 B _NXArgc
0000000100001050 B _NXArgv
0000000100001060 B ___progname
0000000100000000 A __mh_execute_header
0000000100001058 B _environ
                 U _exit
0000000100000ef0 T _main
                 U _puts
0000000100001000 d _pvars
                 U dyld_stub_binder
0000000100000eb0 T start
Look at the symbols labelled with the letter U, which stands for “undefined”. We have _exit, _puts and dyld_stub_binder. The _exit symbol is operating system specific and will be the routine that knows how to return control back to the OS once your program has finished, the _puts symbol is very important for our program and exists in whatever libc we have, and dyld_stub_binder is an entry point for resolving dynamic loads. All of these symbols are “unresolved”, which means if you try and run the program and no suitable match is found for them, your program will fail.
So when you create an object file, the reason you include the header is so that the compiler knows what each function looks like. A header only declares functions, and every declared function you actually call becomes an unresolved symbol in the object file. The process of linking multiple object files together does the job of finding the function that matches each symbol and tying them together in the final executable.
To demonstrate this, consider the following C file:
#include <stdio.h>

extern void test(void);

int main(int argc, char* argv[]) {
    test(); /* without a call, no reference to test ends up in the object file */
    printf("Hello, world!\n");
    return 0;
}
Compiling this file into an object file and then inspecting the contents will show you the following:
$ gcc -c hello.c
$ nm hello.o
0000000000000050 r EH_frame0
000000000000003b r L_.str
0000000000000000 T _main
0000000000000068 R _main.eh
                 U _puts
                 U _test
We now have an unresolved symbol called _test! The linker will expect to find that somewhere else and, if it does not, will throw a bit of a hissy fit. Trying to link this file on its own complains about 2 unresolved symbols, _test and _puts. Linking it against libc complains about one unresolved symbol, _test.
Unfortunately, because we don’t actually have a definition for test() we can’t use it. This may sound confusing, seeing as we defer the linking of puts() until runtime. Why can’t we just do the same with test()? Build an executable file and let the loader/linker try and figure it out at runtime?
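In the linking process you need to specify where the linker will be able to find things on the target system. Even a dynamically linked symbol has to be found at link time: the linker needs to see puts() in a library so that it can record where the loader should look for it at runtime. Nothing we link against provides test(), so there is nothing to record. Let’s step through the original hello.c example, doing each of the compilation steps ourselves: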
$ gcc -c hello.c
This creates hello.o with an unresolved _puts symbol.
$ ld hello.o
This craps out. We need to give it more information. At this point I’m going to mention that I’m on a Mac system and am about to reference libraries that have different names on a Linux system. As a general rule here, you can replace the .dylib extension with .so:
$ ld hello.o /usr/lib/libc.dylib
This still craps out. Check out this error message:
ld: entry point (start) undefined. Usually in crt1.o for inferred architecture x86_64
What the hell? This is a really good error to come across and learn about, though. It leads us nicely into the next section.
Wait, didn’t we finish the last section with an object file that wouldn’t link for some arcane reason? Yes, we did. But getting to a point where we can successfully link it requires us to know a little bit more about how our program starts running when it’s loaded into memory.
Before every program starts, things need to be set up for it. The operating system takes care of things such as a stack and a set of page tables for accessing virtual memory, but the C runtime needs some preparation of its own too. We need to “bootstrap” our process and set up a good environment for it to run in. The runtime’s share of this setup is usually done in a file called crt0.o.
When you started learning programming and you used a language that got compiled, one of the first things you learned was that your program’s entry point is main(), right? The true story is that your program doesn’t start in main, it starts in start (spelled _start on Linux). This detail is abstracted away from you by the OS and the toolchain, though, in the form of the crt0.o file.
The osdev wiki shows a great example of a simple crt0.o file that I’ll copy here:
.section .text

.global _start
_start:
    # Set up end of the stack frame linked list.
    movq $0, %rbp
    pushq %rbp
    movq %rsp, %rbp

    # We need those in a moment when we call main.
    pushq %rsi
    pushq %rdi

    # Prepare signals, memory allocation, stdio and such.
    call initialize_standard_library

    # Run the global constructors.
    call _init

    # Restore argc and argv.
    popq %rdi
    popq %rsi

    # Run main
    call main

    # Terminate the process with the exit code.
    movl %eax, %edi
    call exit
07/08/2013 UPDATE: In a previous version of this post I got this bit totally wrong, confusing the 32-bit x86 calling convention with the x86-64 calling convention. Thanks to Craig in the comments for pointing it out :) The below should now be correct.
The line that’s probably most interesting there is where main is called. This is the entry point into your code. Before it happens, there is a lot of setup. Also notice that argc and argv handling is done in this file, but it assumes that the loader has put the values into registers beforehand.
Why, you might ask, do argc and argv live in %rdi and %rsi before being passed to your main function? Why are those registers so special?
The reason is something called a “calling convention”. This convention details how arguments should be passed to a function call before it happens. The calling convention in x86-64 C is a little bit tricky but the explanation (taken from here) is as follows:
Once arguments are classified, the registers get assigned (in left-to-right order) for passing as follows:
- If the class is MEMORY, pass the argument on the stack.
- If the class is INTEGER, the next available register of the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used.
For example, take this C code:
int add(int a, int b) {
    return a + b;
}

int main(int argc, char* argv[]) {
    add(1, 12);
    return 0;
}
The assembler that would call that function goes something like this:
movq $1, %rdi
movq $12, %rsi
call add
The $12 and $1 there are the literal, decimal values being passed to the function. Easy peasy :) The convention isn’t something that needs to be followed in your own assembly code. You’re free to put arguments wherever you want, but if you want to interact with existing library functions then you need to do as the Romans do.
With all of this said and done, how do we correctly link and run our hello.o file? Like so:
$ ld hello.o /usr/lib/libc.dylib /usr/lib/crt1.o -o hello
$ ./hello
Hello, world!
Hey! I thought you said it was crt0.o? It can be… crt1.o is a file with exactly the same purpose but it has more in it. crt0.o didn’t exist on my system, only crt1.o did. I guess it’s an OS decision. Here’s a short mailing list post that talks about it.
Interestingly, inspecting the symbol table of the executable we just linked together shows this:
$ nm hello
0000000000002058 B _NXArgc
0000000000002060 B _NXArgv
                 U ___keymgr_dwarf2_register_sections
0000000000002070 B ___progname
                 U __cthread_init_routine
0000000000001eb0 T __dyld_func_lookup
0000000000001000 A __mh_execute_header
0000000000001d9a T __start
                 U _atexit
0000000000002068 B _environ
                 U _errno
                 U _exit
                 U _mach_init_routine
0000000000001d40 T _main
                 U _puts
                 U dyld_stub_binder
0000000000001e9c T dyld_stub_binding_helper
0000000000001d78 T start
Notice that _puts is still marked as undefined, even though we linked against libc. The reason is that .dylib and .so files (they have the same job, but on Mac they have the .dylib extension and probably a different internal format) are dynamic or “shared” libraries. They will tell the linker that they are to be linked dynamically, at runtime, rather than statically, at build time. The crt*.o files are normal objects, and link statically, which is why the start symbol has an address in the above symbol table.
You return a number from main() and then your program is done, right? Not quite. There is still a lot of work to be done. For starters, your exit code needs to be propagated up to any parent processes that may be anticipating your death. The exit code tells them something about how your program finished. Exactly what it tells them is entirely up to you, but the standard is that 0 means everything was okay, anything non-zero (up to a max of 255) signifies that an error occurred.
There is also a lot of OS cleanup that happens when your program dies. Things like tidying up file descriptors and deallocating any heap memory you may have forgotten to free() before you returned. You should totally get into the habit of cleaning up yourself, though!
So that’s about the extent of my knowledge on how your code gets turned into a running program. I know I missed some bits out, oversimplified some things and I was probably wrong in places. If you can correct me on any point, or have anything illuminating about how non-x86 or non-ELF systems do the above tasks, I would love to have a discussion about it in the comments :)