Subsection 1.3.1 What is Source Sode
Source code is a set of instructions for computers that is meant to be read and written by humans.
Here’s an example of source code, in the C programming language, for a simple, but complete, program.
In order to run this program, it must be compiled into machine code using a program called a compiler. First, we save the program into a file called hello.c. Then, we compile it:
gcc -o hello hello.c
The compilation command is gcc, which stands for “GNU Compiler Collection.” The flag -o sets the name of the program that we are about to generate; here, we’ve decided to call it hello. The last argument is the name of the source file that we want to compile (hello.c). After compiling the program, you should be able to run it. To run the program, type the following at the prompt:
./hello
This says “run the program called hello that is in the current directory.” When run, this program will print "Hello World!" 100 times.
At this point, you have two files in your directory: hello.c, the source code, and hello, the program binary. That binary is a piece of that machine code. You can open it with a program called hexdump that will let you see the binary in a hexadecimal form. You can do this yourself on the command line:
hexdump hello
We’ve reproduced some of what it looks like when hello is viewed in hexdump after hello.c has been compiled by gcc:
0000000 457f 464c 0101 0001 0000 0000 0000 0000
0000010 0002 0003 0001 0000 8300 0804 0034 0000
0000020 0820 0000 0000 0000 0034 0020 0008 0028
0000030 001e 001b 0006 0000 0034 0000 8034 0804
0000040 8034 0804 0100 0000 0100 0000 0005 0000
0000050 0004 0000 0003 0000 0134 0000 8134 0804
0000060 8134 0804 0013 0000 0013 0000 0004 0000
0000070 0001 0000 0001 0000 0000 0000 8000 0804
0000080 8000 0804 0518 0000 0518 0000 0005 0000
0000090 1000 0000 0001 0000 0518 0000 9518 0804
00000a0 9518 0804 00fc 0000 0104 0000 0006 0000
00000b0 1000 0000 0002 0000 052c 0000 952c 0804
00000c0 952c 0804 00c8 0000 00c8 0000 0006 0000
00000d0 0004 0000 0004 0000 0148 0000 8148 0804
00000e0 8148 0804 0044 0000 0044 0000 0004 0000
00000f0 0004 0000 e550 6474 04a4 0000 84a4 0804
0000100 84a4 0804 001c 0000 001c 0000 0004 0000
0000110 0004 0000 e551 6474 0000 0000 0000 0000
0000120 0000 0000 0000 0000 0000 0000 0006 0000
0000130 0004 0000 6c2f 6269 6c2f 2d64 696c 756e
0000140 2e78 6f73 322e 0000 0004 0000 0010 0000
0000150 0001 0000 4e47 0055 0000 0000 0002 0000
0000160 0006 0000 0012 0000 0004 0000 0014 0000
0000170 0003 0000 4e47 0055 ac29 394b 26bf 01f1
0000180 e396 f820 3c24 f98c 8c5a 8909 0002 0000
0000190 0004 0000 0001 0000 0005 0000 2000 2000
00001a0 0000 0000 0004 0000 4bad c0e3 0000 0000
00001b0 0000 0000 0000 0000 0000 0000 0001 0000
00001c0 0000 0000 0000 0000 0020 0000 002e 0000
00001d0 0000 0000 0000 0000 0012 0000 0029 0000
00001e0 0000 0000 0000 0000 0012 0000 001a 0000
00001f0 848c 0804 0004 0000 0011 000f 5f00 675f
That’s only a small chunk of the program binary. The full binary is much larger — even though the source code that produces this binary is only two lines long.
As you can see, there’s a huge difference between source code, which is intended to be read and written by humans, and binary code, which is intended to be read and written by computer processors.
This difference is a crucial one for programmers who need to modify a computer program. Let’s say you wanted to change the program to say “Open source is awesome!!!”. With access to the source code, making this change is trivial, even for a novice programmer. Without access to the source code, making this change would be incredibly difficult. And this for two lines of code.
Subsection 1.3.2 On Decompilation
Compilation transforms a high-level language into a low-level language like machine code or byte code. Decompilation is the reverse. The process of reverse engineering a program from its binary code to source code is called decompiling, and the level of difficulty of decompiling a program differs rather significantly by the programming language of the source code. C code is notoriously difficult to decompile, while Java is much easier, partly because of far fewer compiler optimizations and partly because metadata, including details such as data types and method names, remain intact during compilation.
The challenges of decompilation and the potential security risks inherent in programs that are closed source led the Research Directorate of the US National Security Agency (NSA) to develop a suite of open source reverse engineering and decompilation tools called
Ghidra. That this work was developed by one of the top cybersecurity agencies in the world and that they released and maintain it as open source software stands as a very strong testimonial to both the potential security concerns inherent in closed source software and to the potential security benefits of open source software.
Note that copyright law generally provides the copyright holder with a set of exclusive rights. Decompilation is carried out by making copies of the software program, and so to be in compliance with law, one must consider the copyright before undertaking decompilation. Laws in both the United States and Europe do permit decompilation to a limited extent and for limited specific purposes. Laws in other areas of the world will vary, but reverse engineering a program for the purpose of reselling it under a different name and license would likely open you up to legal battles – the authors of this text strongly caution you against such action.
Prerequisite: Prior knowledge of programming in the C language.
In order to successfully complete the exercises below, it is important to have a reading-level understanding of the C programming language. This includes knowledge of the syntax, basic programming concepts, and the ability to modify and compile C source code.
Checkpoint 1.3.2. Optional (Very Difficult) Exercise – Change the binary code.
Change the binary code to print out “OSS Rocks!” instead of “Hello World!”. Spend no more than half a day on this exercise.
This is actually a very tricky exercise, and it could take you a fair bit of time. We included it here because you might be curious and want to go poking around in the binary. Under most flavors of Linux you should be able to find or install a program called hexedit. To get you started, you use TAB to switch from hex to ASCII, / to search, and F2 to save your changes. You can read the rest of the documentation for hexedit by reading the manpage, which you can get to by typing man hexedit
on the command line, or pressing F1 while running hexedit.