Month: June 2020

Dynamic Linking

Static vs. dynamic linking usually comes down to our tolerance for the size of the final executable. A static executable contains all code necessary to run, so the operating system loads the executable into memory, and it's off to the races. However, if we keep duplicating code, such as printf, in every executable, it starts to use up more and more space. A dynamic executable, on the other hand, only stores stubs. Whenever we want to call printf, the program goes out to the dynamic linker and loads the code essentially on demand. So, we sacrifice a tiny bit of speed for a much smaller executable.


Contents

  1. What is Linking?
  2. Finding Symbols
  3. Static Libraries (archives)
  4. Dynamic Libraries (shared objects)
  5. Analyzing a Shared Program
  6. Unresolved Symbols at Run Time
  7. Procedure Linkage Table (plt)
  8. Global Offset Table (got)
  9. Loading and Updating the GOT
  10. Conclusion and Further Reading
  11. Video

What is Linking?

When I hear someone talk about compiling their program into an executable, they are really glossing over several stages. The definition of compiling is to produce something by combining information collected from different sources. In computing, we generally think of compiling as turning a higher-level language, such as C, into lower-level code, such as assembly.

The final stage before we get an executable is the linking stage. This is where we link (hence the name) all sources together to produce one coherent executable. This is also where all outstanding symbols need to be resolved. Symbol is just a fancy term for the name of a function or global variable.

We can take a look at object code, which is what we get after we assemble but before we link. Here’s an object dump of an example program.

#include <stdio.h>
#include <stdlib.h>
void some_func(int a, int b);

int main(int argc, char *argv[]) {
	if (argc < 3) {
		printf("Not enough arguments.\n");
		return -1;
	}

	int a = atoi(argv[1]);
	int b = atoi(argv[2]);

	some_func(a, b);

	return 0;
}

This code produces the following object code, which I disassemble using objdump for RISC-V.

Notice that the function some_func has been prototyped, but has not been defined. This will be the responsibility of the linker to find the symbol some_func and add it into our program. Notice what happens when I try to link this program without ever defining some_func.

/opt/riscv_1/lib/gcc/riscv64-unknown-linux-gnu/9.2.0/../../../../riscv64-unknown-linux-gnu/bin/ld: /tmp/ccmgFc75.o:
in function `.L2': test.c:(.text+0x62): undefined reference to `some_func'
collect2: error: ld returned 1 exit status

The linker is looking for the symbol some_func, but it cannot find it, so we get an undefined reference. We know this happened at the link stage because the error ends with "ld returned 1 exit status", and ld is the linker.

We also can see that the address of the function main is 0. This is because we haven't linked the program yet; our object code just contains bubbles of code, which the linker will place into our executable at certain locations.


Finding Symbols

If we use the command nm, which is used to list symbols in an object file, we can see all of the unresolved symbols. Our object code is looking for these, but it doesn’t need to know where they are until we have a full executable program.

You can see that in this object file, the linker has a little bit of work to do. It must find atoi, puts, and some_func, which are flagged with U for undefined symbols. When we invoke the linker, we will specify certain libraries, such as -lc (the C library), where most of these symbols will be found. Our some_func has never been defined, so our linker cannot succeed until we define it somewhere.


Static Libraries (archives)

Archive files generally end in .a and contain code and other information that will be added to the final executable. Essentially, an archive file is just a collection of object files in one file. Therefore, when we link against an archive, we copy the actual code from those object files into our executable. The nice thing about archive files is that we don't need to add ALL symbols into our executable; we only need the symbols that are actually used.

Usually, we have archive versions of all libraries to allow for static executables. These executables are those that don’t need any additional loading to run. In other words, these are self-contained files. The linker stage will pull in the code directly from the .a file into the executable, and all is done. The more code that the linker pulls in, the larger the executable will be.

Let’s go ahead and define some_func and see how linking works.

#include <stdio.h>

void some_func(int a, int b)
{
	printf("Data = %d\n", a * b);
}

It’s very simple, but the point is to have some code that the linker can pull in. Recall that the linker gave us an “undefined reference” error because we didn’t define some_func. Now, we have it defined, so let’s see what happens.

riscv64-unknown-linux-gnu-gcc -static -o test test2.o test.o

Notice that I’m using gcc still. This will automatically invoke the linker and pull in the necessary start files. If we invoke the linker directly, we have to define _start and tell the program how to start, or we would have to directly specify the start files.

I specify -static so that gcc will pull in only .a files. When we are done, the file test will be a fully-contained, static executable. If we take a look at this file, we can see that all of the “stuff” we need for this executable makes it a fairly large file.

-rwxr-xr-x 1 smarz smarz 4687920 Jun 1 09:16 test

Yes, that's 4,687,920 bytes, or about 4.7 megabytes. However, if we look at the symbol table, we will find that NO symbol is unresolved; therefore, this is a self-contained executable. If we load this with our ELF loader, no external resources will need to be pulled in.

Our linker must exhaustively follow every possible route and pull in those symbols even though they may never be called. We can see the symbol table is enormous due to all of the calls and global variables (such as errno).

00000000000101b0 T abort
000000000006bb38 S __abort_msg
0000000000029aa8 t add_alias2.isra.0.part.0
00000000000497c6 t add_fdes
00000000000297ea t add_module.isra.0
000000000003aa4e t add_name_to_object.isra.0
000000000003ab5c t add_path.isra.0.constprop.0
000000000006b8d0 d adds.8114
000000000002fe2e T __add_to_environ
00000000000431b6 t add_to_global
000000000001adb4 t adjust_wide_data
000000000006bba0 V __after_morecore_hook
0000000000012d3a t alias_compare
000000000002338a W aligned_alloc
000000000006bb80 s aligned_heap_area
000000000003ff04 t allocate_dtv
000000000003a40c T __alloc_dir
000000000006ca38 b any_objects_regis

Dynamic Libraries (shared objects)

Dynamic libraries end in .so, which stands for shared object. These libraries contain code that will not be added directly into the executable. Instead, a program called the dynamic linker is responsible for taking code from the .so files and adding it into an executing program. We can even load symbols ourselves at run time by linking with -ldl, the dynamic loading library.

We can think of the term dynamic as run-time. That is, we don’t actually load the code into our program until the program actually runs. We can see this with something as simple as printf. We can examine our executable and we don’t see printf’s code. Instead, we see printf’s stub. This stub will then be replaced by the dynamic loader when the program executes.

When we link with a dynamic, shared object, the symbols that can be added at run time will remain unresolved. We put the symbols' names into a table just so we know they're out there somewhere. However, with shared objects, we can now get that unresolved reference (or symbol) error at run time! If you run Arch Linux and have ever compiled anything yourself, you might've run into this phenomenon.


Making a Shared Library

Let’s go ahead and turn our test2 file into a shared object. This is fairly easy to do with gcc:

riscv64-unknown-linux-gnu-gcc -fPIC -shared -o libtest2.so test2.o

The switch -fPIC stands for position-independent code. That means the generated code cannot assume it will be loaded at any particular address; references to symbols outside the library go through a table of offsets instead. This is required since the library, or specific symbols in the library, can be loaded at any location by the dynamic linker.

Then, we can compile our test program to link to this shared library:

riscv64-unknown-linux-gnu-gcc -o test test.o -L. -ltest2

In the command above, I specify the library search path with -L, so -L. means to look in the current directory. Then, I specify the library to link with using -ltest2. GCC will automatically prepend lib and append .so to make libtest2.so.


Analyzing A Shared Program

When we link, the name of our library is recorded in the executable. We can see these shared library dependencies using the ldd command. However, the dynamic linker still has to find the library itself at run time.

smarz@DOMO:~ $ ./test
./test: error while loading shared libraries: libtest2.so: cannot open shared object file: No such file or directory

So, when I run my program, the dynamic linker looks at a set of paths to find the libraries. This is analogous to the PATH environment variable, except in Linux (and some UNIXes), we use LD_LIBRARY_PATH for libraries. Shown below, if I change this path so the dynamic linker can find my library, the program functions properly:

smarz@DOMO:~ $ LD_LIBRARY_PATH=/home/smarz ./test 10 20
Data = 200

Unresolved at Run Time

When we link a program, many of the symbols will remain unresolved. This tells us what symbols the dynamic linker is responsible for loading from the shared objects (libraries). We can see these unresolved symbols using the nm command:

Partial list of symbols in our test program

You can see that puts is unresolved (capital U letter), and so is some_func, since we now put it in a shared library. These symbols have stubs that will eventually invoke the interpreter to load the actual code for that function.

The dynamic linker's job is to resolve these symbols when we first use them. In fact, we can take a look at the produced ELF executable to see what it wants to use to resolve symbols.

You can see the INTERP section (interpreter), which requests the dynamic linker by name. If we run this dynamic linker, we can see its purpose.

To resolve these symbols, the linker puts a loader stub in place of the actual function. When we run our program, some_func's assembly instructions are nowhere to be found in it. Instead, when we call some_func, we are referred to the procedure linkage table (PLT). The PLT stub then looks up where the function is loaded by referring to the global offset table (GOT). At first, the address in the GOT points at the code that loads the symbol (the dynamic linker). After the dynamic linker loads our some_func code, it changes the GOT entry to reflect where it put some_func in memory.

A flow chart representing dynamically linked functions.

Procedure Linkage Table (PLT)

If we take a look at the functions and other symbols we're going to use in our program, we notice that they're very, very short, and they look almost identical to each other. This is because each is a stub in the procedure linkage table, or PLT, which forwards the call to the real function.

Our some_func loads from -1028(t3) and jumps to that location. The value of t3 is the global offset table (GOT), which I cover below. We can see that -1028(t3) is the address 12018, which is right in the meat of the GOT.

So, a call goes to the stub, which loads an address from the global offset table and jumps to it. If the procedure, such as puts or some_func, has been called before, the GOT entry holds the address of its actual code. Otherwise, the GOT entry points back into the procedure linkage table, at a stub whose job is to invoke the interpreter and load the code.


Global Offset Table (GOT)

The global offset table is a large table of addresses that we load from and jump through. Here is the table in our test program.

Don’t worry about the assembly instructions–these aren’t instructions, but objdump is trying to disassemble them anyway.


Loading and Updating the GOT

Notice that all functions originally point to 0x0000_0000_0001_04a0, which goes back to the procedure linkage table. This is where the interpreter routine is located which loads the symbols. When the symbol is loaded by the interpreter, it will modify the global offset table to reflect where it loaded it into memory. We can look at this by directly referencing the memory being used.

Let’s modify our program to see what we get here:

#include <stdio.h>
#include <stdlib.h>
void some_func(int a, int b);

int main(int argc, char *argv[]) {
	if (argc < 3) {
		printf("Not enough arguments.\n");
		return -1;
	}

	int a = atoi(argv[1]);
	int b = atoi(argv[2]);
	volatile unsigned long *ptr = (unsigned long *)0x12018;

	printf("0x%016lx = 0x%016lx\n", ptr, *ptr);

	some_func(a, b);

	printf("0x%016lx = 0x%016lx (some_func is at 0x%016lx)\n", ptr, *ptr, some_func);

	return 0;
}

In our modified program, we look at 0x12018, which was where our some_func stub was located in the global offset table (remember some_func@plt referred us to this location in the GOT).

Notice that at first the global offset table points to 0x104e0, which is the code used to invoke the interpreter, which in turn loads the symbols. After we call some_func, the global offset table has been updated with the memory address 0x400081b460. This is where our dynamic linker loaded the code for our some_func function.

Since I updated the program, some of the offsets are no longer what they were originally. However, to show you that the some_func symbol itself points into the procedure linkage table, let's take a look at the updated object dump:

When I print out some_func’s memory address, we get 0x10510, which is where it is located in the procedure linkage table. However, if we look at the table, notice that it grabs the address of the actual function from the global offset table. In this case, the stub is at 0x104e0. After some_func is loaded (after its first invocation), the global offset table is updated to reflect 0x400081b460, which is where the actual assembly instructions are located for some_func.


Conclusion

If we break down what’s happening here, there is no magic happening. For dynamic symbols, we just have a table that stores their addresses. The table’s location is fixed, so we can compile that directly into our program. However, what the table refers to is dynamic, so it can be updated by the dynamic linker ld.so with wherever it put our code. That’s it!


Video

Hardware Floating Point

This is an article in the larger tutorial The Adventures of OS: Making a RISC-V Operating System using Rust.

Up until now, we’ve only been using the integer instructions. However, now that we want to support user processes, we must be able to support the hardware floating-point unit (FPU). In RISC-V, this is fairly simple, but it can lead to some trouble if we’re not careful.

First, the floating-point unit must be controlled through the mstatus register–more specifically the FS bits (bits 14 and 13).

Recall that we used this register to change processor modes (MPP). However, now we care about this register to control the floating point unit. These two bits support four different states of the floating point unit:

FS[1:0]  Description
00       FPU off.
01       FPU on, initial state.
10       FPU on, no register changes since the FPU was reset.
11       FPU on, some registers have changed since the FPU was reset.

If we try to load or store (fld or fsd) a floating point register while the FPU is off, it will trap to the operating system. So, we need to make sure we have the floating point unit turned on and ready to go before we do any work with it. We can add a bit in the switch_to_user function we wrote to turn on the FPU into the initial state whenever we context switch to another process.

.global switch_to_user
switch_to_user:
   csrw mscratch, a0   # Keep the trap frame pointer in mscratch
   ld   a1, 520(a0)    # TrapFrame.pc
   ld   a2, 512(a0)    # TrapFrame.satp
   ld   a3, 552(a0)    # TrapFrame.mode

   # MPIE (1 << 7) | SPIE (1 << 5) | FS = 01, initial state (1 << 13)
   li   t0, 1 << 7 | 1 << 5 | 1 << 13

   slli a3, a3, 11     # Shift the mode into the MPP bits [12:11]
   or   t0, t0, a3
   csrw mstatus, t0
   csrw mepc, a1
   csrw satp, a2
   li   t1, 0xaaa      # Enable software, timer, and external interrupts
   csrw mie, t1
   la   t2, m_trap_vector
   csrw mtvec, t2
   mv   t6, a0

   # Restore all 32 floating point registers (FS is now non-zero)
   .set i, 0
   .rept 32
      load_fp %i
      .set i, i+1
   .endr

   # Restore the 31 general purpose registers (x0 is hard-wired to zero)
   .set i, 1
   .rept 31
      load_gp %i, t6
      .set i, i+1
   .endr
   mret

The first thing we do is tack on 1 << 13, which sets the FS bits to 01, which is the initial state. Unlike MPP (machine previous privilege), this immediately takes effect. Then, we restore the registers from the TrapFrame structure. Conveniently, this follows the 32 general purpose, integer registers. You can see our macro load_fp gets expanded to the following:

.altmacro
.set NUM_GP_REGS, 32  # Number of registers per context
.set REG_SIZE, 8   # Register size (in bytes)
.macro load_fp i, basereg=t6
	fld	f\i, ((NUM_GP_REGS+(\i))*REG_SIZE)(\basereg)
.endm

We skip \(8\times 32=256\) bytes to get past all 32 of the GP registers. Now we're at the register we want. Just like ld loads a memory location into a general purpose register, fld loads a memory location into a floating point register. REMEMBER: the FPU must be turned on (\(\text{FS}\neq 0\)) at this point via the FS bits.

Restore or Save?

We use the FS bits to not only control the FPU, but to also see what happened. The RISC-V privileged specification recommends the following for the FPU during traps:

Handling Traps

The reason the floating-point unit has more than an on/off state is because it allows us to only save the registers if they’ve changed. If the user process never even used the FPU, why save those registers? However, the granularity of the floating point unit is such that if only 1 of the 32 registers changes, we have to save all 32 of them.

So, we need to add code that checks the state of the FPU to see if we need to save the registers. We can do this by reading the FS bits:

   csrr t1, mstatus
   srli t0, t1, 13    # Move FS down to bits [1:0]
   andi t0, t0, 3     # Isolate the two FS bits
   li   t3, 3
   bne  t0, t3, 1f    # If FS != 11 (dirty), skip the save
   .set i, 0
   .rept 32
      save_fp %i, t5
      .set i, i+1
   .endr
1:

Above, we read the mstatus register, shift it right 13 places, and mask it with 3, which is binary 11. This isolates the two FS bits so we can read their value. If these two bits are NOT 11 (recall this means the registers were written to), then we skip saving the floating point registers by jumping forward to the numeric label 1 (hence 1f).

To look at the TrapFrame structure, we see the following in Rust (cpu.rs):

#[repr(C)]
#[derive(Clone, Copy)]
pub struct TrapFrame {
   pub regs:   [usize; 32], // 0 - 255
   pub fregs:  [usize; 32], // 256 - 511
   pub satp:   usize,       // 512 - 519
   pub pc:     usize,       // 520
   pub hartid: usize,       // 528
   pub qm:     usize,       // 536
   pub pid:    usize,       // 544
   pub mode:   usize,       // 552
}

Our fregs (floating-point registers) follow the 32 general purpose registers. So, the only math we need to do is \(8\times 32=256\).

This is really it! We can use the floating point unit in the kernel, provided we turn it on before we use any of the registers. Again, if we try to use the FPU while the FS bits are 00, we will get a trap.

Video

Files By Name

This is part of a larger tutorial: The Adventures of OS: Making a RISC-V Operating System using Rust.

We wrote the code in our Minix 3 file system to read an inode and then read the direct, indirect, doubly-indirect, and triply-indirect pointers that point to the data relevant to the given inode. However, we were required to do a stat in order to see what inode we needed to get. Obviously, this isn’t how file systems actually work. Instead, we want to search for a given inode by its name.

Recall that in Minix 3, we have a directory entry structure, which in Rust looks as follows.

#[repr(C)]
pub struct DirEntry {
	pub inode: u32,
	pub name:  [u8; 60]
}

The code above shows that we have a 4-byte inode number and a 60-byte name. This is how we link a name to an inode. In fact, we can have multiple names link to the same inode; these are called hard links. This is also why we don't have a delete filesystem function. Instead, it is called unlink. When we unlink, we break one name-to-inode linkage. Whenever there are no more links left, the file is considered deleted.

Where do we find these DirEntry structures? That’s fairly simple. Recall that whenever we read a file, we go to the 10 “zone” pointers. For a regular file, when we go to the ends of each pointer, we find the actual data that belongs to the file. However, if the file in question is a directory, which we would know by looking at the inode’s mode, then these pointers point to a bunch of DirEntry structures.

For a block size of 1,024 bytes, we have \(1,024 / 64 = 16\) total DirEntries per block. Recall that we have a 4-byte inode and a 60-byte name, which is why each structure takes 64 bytes.

Another issue we run into is searching the directory entries. This is a very expensive task since we only get 16 entries per block. Therefore, we need to cache the inodes. In other words, we need to put something in memory to make look-ups quicker. Fortunately, this is fairly straightforward to do.

Our game plan is to look for all directory entries in the file system and store them in some format where lookups are quick. We don't really care about insertion or deletion times, but we will be doing a ton of lookups. So, let's add some code to cache inodes. We will start whenever we initialize a file system, where we are given the block device that backs it. That way, we can use our virtio block driver to read and write it.

static mut MFS_INODE_CACHE: [Option<BTreeMap<String, Inode>>; 8] =
	[None, None, None, None, None, None, None, None];

We don't quite have a virtual filesystem (VFS) yet, so we're doing something much like DOS drive letters. Essentially, we address a block device by number, such as 8: (remember C: in DOS?). Then, within that block device, we have our underlying file system. Currently, we cannot mount one file system into the directory structure of another. With a VFS, we would have a single map that stores the entire file system tree. Instead of mapping strings to Inodes, we would map strings to some structure that tells us which block device to go to and then the Inode or some identifying value.

Notice that I’m using the BTreeMap here. This is built-in to Rust’s alloc::collections library, which then hooks to the global allocator we wrote a loooooooong time ago.

We will be storing the entries by path–this is our key. Then, we will store the Inode of that entry as the value. This should give us all the information we need.

So, now we need a game plan to actually cache the inodes. First, we need to know the directory structure. Recall that in Minix 3, we only know the name of the given file in the DirEntry structure. We don’t know the full path, so that’s something we will need to keep up with ourselves. So, here’s our init() function in Rust.

pub fn init(bdev: usize) {
	if unsafe { MFS_INODE_CACHE[bdev - 1].is_none() } {
		let mut btm = BTreeMap::new();
		let cwd = String::from("/");
		// Let's look at the root (inode #1)
		Self::cache_at(&mut btm, &cwd, 1, bdev);
		unsafe {
			MFS_INODE_CACHE[bdev - 1] = Some(btm);
		}
	}
	else {
		println!(
		         "KERNEL: Initialized an already initialized \
		          filesystem {}",
		         bdev
		);
	}
}

As you can see above, we allocate a string (String::from("/")), which is the root directory. Then, we look at each directory entry in the root. If we hit yet another directory, we recursively do the same thing until there are no more directories. So, our cache will store both directories and regular files: their paths and inodes, not the actual data.

As a side note, Rust requires that we add unsafe {} because this is a mutable static variable, meaning that it is global and can be changed, hence, it is not thread-safe unless we added locking.

So, let’s take a look at the function that actually does the work. This is the Self::cache_at function that you see above.

fn cache_at(btm: &mut BTreeMap<String, Inode>, cwd: &String, inode_num: u32, bdev: usize) {
   let ino = Self::get_inode(bdev, inode_num).unwrap();
   // Round the inode's size up to a whole number of blocks.
   let mut buf = Buffer::new(((ino.size + BLOCK_SIZE - 1) & !(BLOCK_SIZE - 1)) as usize);
   let dirents = buf.get() as *const DirEntry;
   let sz = Self::read(bdev, &ino, buf.get_mut(), BLOCK_SIZE, 0);
   let num_dirents = sz as usize / size_of::<DirEntry>();
   // We start at 2 because the first two entries are . and ..
   for i in 2..num_dirents {
      unsafe {
         let ref d = *dirents.add(i);
         let d_ino = Self::get_inode(bdev, d.inode).unwrap();
         let mut new_cwd = String::with_capacity(120);
         for i in cwd.bytes() {
            new_cwd.push(i as char);
         }
         // Add a directory separator between this inode and the next.
         // If we're the root (inode 1), we don't want to double up the
         // frontslash, so only do it for non-roots.
         if inode_num != 1 {
            new_cwd.push('/');
         }
         for i in 0..60 {
            if d.name[i] == 0 {
               break;
            }
            new_cwd.push(d.name[i] as char);
         }
         new_cwd.shrink_to_fit();
         if d_ino.mode & S_IFDIR != 0 {
            // This is a directory, cache these. This is a recursive call, which I don't really like.
            Self::cache_at(btm, &new_cwd, d.inode, bdev);
         }
         else {
            btm.insert(new_cwd, d_ino);
         }
      }
   }
}

What we do here is look at the inode that we're initially given (remember, the root is inode number 1). We first read the given inode; it should be a directory's inode, since we don't call cache_at on regular files. Recall that each of these inodes has blocks filled with DirEntry structures (64 bytes each).

If we find a directory, we recursively call cache_at with the new inode and new path. That will again search through all of the directory entries, and if there are any more directories, it will yet again recursively look through all of them. Otherwise, if we find a file, we just store the path, and we don’t recurse.

Pay no mind to my attempts to make Rust treat character arrays (u8 arrays, more accurately) as strings. I couldn't find a clean way, so instead, I make a string and push each character from the array onto it, one by one.

Using The Cache

Now that we’ve cached the inodes, it’s time to make it do something. Luckily, since BTreeMap is built-in to Rust, we can just use its .get() member function. If this function finds the key, it will return a Some(Inode). Otherwise, it will return None. We can use this to see if the file exists or not. So, let’s add some code to our open() filesystem function.

pub fn open(bdev: usize, path: &str) -> Result<Inode, FsError> {
   if let Some(cache) = unsafe { MFS_INODE_CACHE[bdev - 1].take() }
   {
      let ret;
      if let Some(inode) = cache.get(path) {
         ret = Ok(*inode);
      }
      else {
         ret = Err(FsError::FileNotFound);
      }
      unsafe {
         MFS_INODE_CACHE[bdev - 1].replace(cache);
      }
      ret
   }
   else {
      Err(FsError::FileNotFound)
   }
}

All we do here is grab exclusive access to the cache, look for the provided path, and then give back access to the cache. If we found the Inode, we copy it and return it wrapped in an Ok(). Otherwise, we wrap the cause of the error in an Err(). Notice I returned FileNotFound if we couldn’t get access to the cache. Perhaps this makes FNF ambiguous.

Now, to initialize our cache, we run init() in a process context. Then, we can grab the inode by using open(). Finally, if we want to read the data in it, we can use read(). Notice that the read() fs function has changed to accept an Inode reference. This helps keep things a little bit cleaner. This also gives us a flow: we cache inodes, we open a file to get its Inode structure, we read based on this Inode structure. For now, close doesn’t do anything because there is nothing really to close. For reads, we don’t care. For writes, the close would do something, such as make sure all writes have completed.


Video