Loading an ELF

This tutorial walks through loading an ELF file and harnessing the code inside it.

Executable and Linkable Format (ELF) is the file format Linux and most other Unix-descended operating systems use to store native code. An ELF file contains the code and data for one executable or library, plus metadata describing what the program loader must do to set up that code and data within a process.
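To make the "metadata" concrete, here is a minimal sketch that decodes a few fields of the ELF file header using only Python's struct module. It assumes a 64-bit little-endian ELF, as produced for amd64. The function name parse_elf64_header is hypothetical and is not part of SmallWorld.

```python
import struct

ELF_MAGIC = b"\x7fELF"


def parse_elf64_header(data: bytes) -> dict:
    """Decode a few fields of a little-endian ELF64 file header."""
    if data[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    # e_ident[4] is the class (2 = 64-bit), e_ident[5] the byte order (1 = little).
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles little-endian ELF64")
    # Fields following the 16-byte e_ident array:
    # e_type, e_machine, e_version, e_entry, e_phoff, e_shoff,
    # e_flags, e_ehsize, e_phentsize, e_phnum
    (e_type, e_machine, _version, e_entry, e_phoff, _shoff,
     _flags, _ehsize, e_phentsize, e_phnum) = struct.unpack_from(
        "<HHIQQQIHHH", data, 16
    )
    return {
        "type": e_type,            # 2 = ET_EXEC (fixed), 3 = ET_DYN (relocatable)
        "machine": e_machine,      # 62 = EM_X86_64
        "entry": e_entry,          # virtual address of the entry point
        "phoff": e_phoff,          # file offset of the program header table
        "phentsize": e_phentsize,  # size of one program header
        "phnum": e_phnum,          # number of program headers
    }
```

Calling it on the first 64 bytes of a file, for example parse_elf64_header(open(path, "rb").read(64)), returns the same type and entry point that readelf reports.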

Consider the example tests/elf/elf.amd64.elf.s:

    .text
    .globl _start
    .type _start, @function
_start:
    # Load argc
    mov     0x8(%rsp),%rdi

    # If argc != 2, leave.
    cmp     $2,%rdi
    jne     .L2

    # Load argv
    mov     0x10(%rsp),%rdi
    # Load argv[1]
    mov     0x8(%rdi),%rdi

    
    # i = 0
    mov     $0,%rax
.L3:
    # for(i = 0; argv[1][i] != '\0'; i++);
    cmpb    $0,(%rdi,%rax)
    je      .L1
    add     $1,%rax
    jmp     .L3
    
.L2:
    # Failure; return -1
    mov     $-1,%rax
.L1:
    # Leave, by any means necessary
    ret
    .size _start, .-_start

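This very simple program performs the equivalent of strlen on argv[1]. It is not totally correct: it expects a return address at the top of the stack and a pointer to argv above argc, which is not how Linux lays out the stack at process entry, and it leaves via ret rather than an exit system call. You can build it into elf.amd64.elf by running the following: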

cd smallworld/tests
make elf/elf.amd64.elf

Warning

Unlike previous tests, this one requires the GNU assembler (as) to build.

The build will only work correctly on an amd64 host; on any other platform, as will target the wrong architecture.
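For reference, the control flow of the assembly above can be modeled in a few lines of Python. This is only a sketch for comparing against harness results; strlen_model is a hypothetical name, and argv is assumed to be a list of NUL-terminated byte strings.

```python
from typing import List


def strlen_model(argv: List[bytes]) -> int:
    """Mirror the control flow of _start on NUL-terminated byte strings."""
    argc = len(argv)
    if argc != 2:
        return -1        # .L2: failure path, rax = -1
    s = argv[1]
    i = 0                # mov $0,%rax
    while s[i] != 0:     # .L3: scan until the NUL terminator
        i += 1
    return i             # .L1: the count is left in rax
```

Note that the model, like the assembly, never checks an upper bound; a string without a terminator would run off the end.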

In order to harness code contained in an ELF, we need to do at least the following:

  • Follow the metadata in the ELF to unpack the memory image inside

  • Set execution to start at the correct place

Using the ELF Loader

SmallWorld includes a model of the basic features of the Linux kernel’s ELF loader. To exercise it, you need to use Executable.from_elf(), described in Memory Objects.

ELFs can contain code that’s intended to be loaded at a specific position, or that can be loaded at any address (position-independent). If our example is position independent, we will need to specify a load address.

Let’s take a look at our example, using the command readelf -l elf.amd64.elf:

$ readelf -l elf.amd64.elf

Elf file type is EXEC (Executable file)
Entry point 0x1001120
There are 4 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000001000040 0x0000000001000040
                 0x00000000000000e0 0x00000000000000e0  R      0x8
  LOAD           0x0000000000000000 0x0000000001000000 0x0000000001000000
                 0x0000000000000120 0x0000000000000120  R      0x1000
  LOAD           0x0000000000000120 0x0000000001001120 0x0000000001001120
                 0x000000000000002f 0x000000000000002f  R E    0x1000
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000001000000  RW     0x0

 Section to Segment mapping:
  Segment Sections...
   00     
   01     
   02     .text 
   03

This lists the different “program headers”, or instructions to the OS describing how to load the program. The LOAD headers define blocks of memory allocated in the process.
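As a rough illustration of what a loader reads from this table, the sketch below decodes ELF64 program headers with struct and keeps only the LOAD entries. It assumes a little-endian ELF64 file; load_segments is a hypothetical helper, not SmallWorld's implementation.

```python
import struct

PT_LOAD = 1
# p_type, p_flags, p_offset, p_vaddr, p_paddr, p_filesz, p_memsz, p_align
PHDR_FORMAT = "<IIQQQQQQ"


def load_segments(data: bytes, phoff: int, phentsize: int, phnum: int) -> list:
    """Return (vaddr, memsz, offset, filesz, flags) for each LOAD header."""
    segments = []
    for i in range(phnum):
        (p_type, p_flags, p_offset, p_vaddr, _paddr,
         p_filesz, p_memsz, _align) = struct.unpack_from(
            PHDR_FORMAT, data, phoff + i * phentsize
        )
        if p_type == PT_LOAD:
            segments.append((p_vaddr, p_memsz, p_offset, p_filesz, p_flags))
    return segments
```

For each returned entry, a loader copies filesz bytes from the file at offset to vaddr, and zero-fills the rest of the memsz bytes.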

The file type EXEC, together with the fact that the segment for file offset zero is loaded at address 0x1000000, tells us that this is a fixed-position ELF. We do not need to provide a load address when calling Executable.from_elf(), and we need to avoid memory around address 0x1000000, or risk clobbering our ELF.

binpath = "elf.amd64.elf"
with open(binpath, "rb") as f:
    # from_elf() needs to take the file handle as an argument.
    #
    # The "platform" argument is optional;
    # if absent, the elf loader will infer the platform from the ELF header.
    # if present, the elf loader will error if the ELF is for a different platform.
    code = smallworld.state.memory.code.Executable.from_elf(
        f, platform=platform
    )
    machine.add(code)
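One way to check that another region stays clear of the ELF image is a simple interval test. The sketch below uses the LOAD addresses and sizes from the readelf output above and assumes the loader maps whole 4 KiB pages, as the Linux loader does; the stack range is an assumed base of 0x2000 and size of 0x4000. Both helper names are hypothetical.

```python
PAGE = 0x1000


def page_span(vaddr: int, memsz: int, page: int = PAGE) -> tuple:
    """Round a segment out to the page boundaries that contain it."""
    start = vaddr & ~(page - 1)
    end = (vaddr + memsz + page - 1) & ~(page - 1)
    return start, end


def overlaps(a: tuple, b: tuple) -> bool:
    """True if half-open ranges a and b share at least one address."""
    return a[0] < b[1] and b[0] < a[1]


# LOAD segments from the readelf output (VirtAddr, MemSiz)
elf_spans = [page_span(0x1000000, 0x120), page_span(0x1001120, 0x2F)]

# Assumed stack placement: base 0x2000, size 0x4000
stack_span = (0x2000, 0x2000 + 0x4000)

clobbers = any(overlaps(stack_span, span) for span in elf_spans)
```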

Finding the entrypoint

We need to find the start of our code within the ELF.

In our assembly, the code is contained in the _start function symbol. This is a special symbol for Linux programs; it defines the program entrypoint, which is exposed in the ELF metadata, and can be accessed via ElfExecutable.entrypoint.

entrypoint = code.entrypoint
cpu.rip.set(entrypoint)

Adding Bounds

We can use an ELF’s metadata to identify executable regions of memory, and put them “in-bounds” for emulation.

This does not happen automatically, since a harness may want to restrict execution to a narrower subset of memory than “everything executable in the ELF.”

Here, we are fine defining all code in the ELF as in-bounds:

for bound in code.bounds:
    machine.add_bound(bound[0], bound[1])
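If a harness does want a narrower region, the bounds can be clipped before they are added. The following is a minimal sketch, assuming each bound is a (start, end) pair with an exclusive end; clip_bounds is a hypothetical helper and not part of SmallWorld.

```python
def clip_bounds(bounds, lo: int, hi: int) -> list:
    """Intersect each (start, end) bound with the window [lo, hi)."""
    clipped = []
    for start, end in bounds:
        s, e = max(start, lo), min(end, hi)
        if s < e:  # drop bounds that fall entirely outside the window
            clipped.append((s, e))
    return clipped
```

The resulting pairs could then be passed to machine.add_bound() in place of the entries of code.bounds.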

Putting it all together

Combined, this can be found in the script tests/elf/elf.amd64.py:

import logging
import sys

import smallworld

# Set up logging and hinting
smallworld.logging.setup_logging(level=logging.INFO)

# Define the platform
platform = smallworld.platforms.Platform(
    smallworld.platforms.Architecture.X86_64, smallworld.platforms.Byteorder.LITTLE
)

# Create a machine
machine = smallworld.state.Machine()

# Create a CPU
cpu = smallworld.state.cpus.CPU.for_platform(platform)
machine.add(cpu)

# Load and add code into the state
filename = (
    __file__.replace(".py", ".elf")
    .replace(".angr", "")
    .replace(".panda", "")
    .replace(".pcode", "")
)
with open(filename, "rb") as f:
    code = smallworld.state.memory.code.Executable.from_elf(f, platform=platform)
    machine.add(code)

# Set entrypoint from the ELF
if code.entrypoint is None:
    raise ValueError("ELF has no entrypoint")
cpu.rip.set(code.entrypoint)

# Create a stack and add it to the state
stack = smallworld.state.memory.stack.Stack.for_platform(platform, 0x2000, 0x4000)
machine.add(stack)

# Push a string onto the stack
string = sys.argv[1].encode("utf-8")
string += b"\0"
string += b"\0" * (16 - (len(string) % 16))

stack.push_bytes(string, None)
str_addr = stack.get_pointer()

# Push argv
stack.push_integer(0, 8, None)  # NULL terminator
stack.push_integer(str_addr, 8, None)  # pointer to string
stack.push_integer(0x10101010, 8, None)  # Bogus pointer to argv[0]

# Push address of argv
argv = stack.get_pointer()
stack.push_integer(argv, 8, None)

# Push argc
stack.push_integer(2, 8, None)

# Push a fake return address.
# It doubles as an exit point, so emulation stops when _start returns.
exitpoint = code.entrypoint + code.get_symbol_size("_start") - 4
machine.add_exit_point(exitpoint)
stack.push_integer(exitpoint, 8, None)

# Configure the stack pointer
sp = stack.get_pointer()
cpu.rsp.set(sp)

# Emulate
emulator = smallworld.emulators.UnicornEmulator(platform)

# Use code bounds from the ELF
emulator.add_exit_point(0)
for bound in code.bounds:
    machine.add_bound(bound[0], bound[1])

machine.emulate(emulator)

Here, we load the code from our ELF and set the program counter to the entrypoint. We also configure a stack with the layout the program expects: a fake return address at the top, followed by argc and then a pointer to argv.
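The stack layout can be checked with a toy model that does not depend on SmallWorld. The sketch below assumes a descending stack with base 0x2000 and size 0x4000, mirroring the order of pushes in the script; ToyStack and the placeholder return address are hypothetical stand-ins for illustration only.

```python
import struct


class ToyStack:
    """Descending stack over a bytearray; a stand-in, not SmallWorld's Stack."""

    def __init__(self, base: int, size: int):
        self.base = base
        self.mem = bytearray(size)
        self.sp = base + size  # empty stack: pointer at the top

    def push_bytes(self, data: bytes) -> int:
        self.sp -= len(data)
        off = self.sp - self.base
        self.mem[off:off + len(data)] = data
        return self.sp

    def push_u64(self, value: int) -> int:
        return self.push_bytes(struct.pack("<Q", value))

    def read_u64(self, addr: int) -> int:
        return struct.unpack_from("<Q", self.mem, addr - self.base)[0]


stack = ToyStack(0x2000, 0x4000)

# String data, NUL-terminated and padded to 16 bytes
string = b"foobar\0"
string += b"\0" * (16 - (len(string) % 16))
str_addr = stack.push_bytes(string)

# argv array: NULL terminator, argv[1], bogus argv[0]
stack.push_u64(0)
stack.push_u64(str_addr)
argv = stack.push_u64(0x10101010)

stack.push_u64(argv)        # pointer to argv
stack.push_u64(2)           # argc
stack.push_u64(0xDEADBEEF)  # placeholder return address
rsp = stack.sp

# What the assembly reads:
argc_seen = stack.read_u64(rsp + 0x8)       # mov 0x8(%rsp),%rdi
argv_seen = stack.read_u64(rsp + 0x10)      # mov 0x10(%rsp),%rdi
argv1_seen = stack.read_u64(argv_seen + 8)  # mov 0x8(%rdi),%rdi
```

Reading back through the same offsets the assembly uses confirms that argc, argv, and argv[1] land where _start expects them.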

The program's final ret has no real caller to return to, so we push a fake return address and register it as an exit point; emulation halts there, and the result is left in rax.

Here is what running the harness looks like:

$ python3 elf.amd64.py foobar
[+] starting emulation at 0x1001120
[+] emulation complete

For the input “foobar”, the expected result in rax is 6, the length of the string; seeing that value confirms that we have harnessed elf.amd64.elf completely.