
Assembly’s Perspective of C

Posted to: https://blog.stephenmarz.com/2020/05/20/assemblys-perspective/

If you are living under a rock, quarantined against COVID-19, here’s quite a long, and somewhat technical, post for your consumption. This post will talk about how C and assembly dance together–the tango? waltz? who knows?

I usually write about operating systems, but I have not been able to write an operating system completely in one language. It probably can be done, but for my part, I write the bootloader and some utilities purely in assembly.


Contents

  1. How C Relates to Assembly
  2. Why Assembly?
  3. C Calling Conventions
  4. Register Sizes
  5. Caller vs Callee saved registers
  6. C and Assembly Code
  7. Pointers and References
  8. Structures
  9. Arrays
  10. Globals
  11. Conditions and loops
  12. Data types
  13. Conclusion

How C Relates to Assembly

I will be using the RISC-V architecture to describe how C translates into assembly. The nice thing about C is that it has relatively few abstractions, so we can take a C statement and translate it into assembly.

I teach my students to develop their logic in a higher-level language (mainly C++) and then translate into assembly. Assembly requires you to keep several things in your head at once, even to write a simple for loop. Luckily, there are some macros in assembly to make it a bit easier.

Let’s not forget that when we compile C code, we get assembly files. You can see the intermediate assembly file by using the -S switch to gcc; otherwise, gcc typically puts the intermediate assembly in a temporary file under /tmp/ and deletes it when it is done.

gcc -S test.c

Technically, C doesn’t have to compile into assembly, but as far as I know, nearly all compilers go through some lower-level representation rather than emitting machine code (object code) directly.

So, to get code into machine instructions, we follow these steps: the compiler translates C into assembly, the assembler translates the assembly into object code (machine code), and the linker combines the object files into an executable.

An executable is a special type of machine code file. Executables are usually in a specific format that the operating system can understand. Inside of the executable is a section called the text section, which is where the machine code is stored. In Linux, this format is called the Executable and Linkable Format, or ELF. In Windows, it is called the Portable Executable, or PE.


Why Assembly?

Assembly adds several abstractions on top of raw machine code, including macros, labels, and pseudo-instructions. Yes, that's right, assembly will do many things for you. However, the nice thing about assembly is that one instruction is more-or-less really one instruction (except maybe with pseudo-instructions). That is, we don't let any compiler tell us how it wants to do things. If we want to make the addition operator (+) preferred over the multiplication operator (*), we can do it. Why would we do that? Uhh.. Ask someone else.
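
For example, here are a few common RISC-V pseudo-instructions and one possible expansion for each. This is just a sketch: the assembler picks the actual encoding, and li in particular can expand to more than one instruction for large immediates.

li   t0, 100      # addi t0, zero, 100
mv   a0, a1       # addi a0, a1, 0
not  t0, t1       # xori t0, t1, -1
neg  t0, t1       # sub  t0, zero, t1
j    some_label   # jal  zero, some_label
ret               # jalr zero, 0(ra)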

Another reason we use assembly over a high-level language is to use instructions that C or C++ cannot really express. For example, the advanced vector extensions (AVX) for an Intel processor don't have a 1-to-1 conversion from C. Now, many compilers have done some work with this using what are known as intrinsics. (EDITED): These intrinsics are thin C wrappers around the underlying assembly instructions. So, we can't get by with C doing the hard part for us. Whereas int i = 10 * z; will produce the assembly we need to get the result, the intrinsics are C syntax for assembly instructions. So, all in all, we still need to know what the assembly is doing.

An optimizing compiler might be able to make some improvements to the assembly, but let's take a look at two different ways to multiply a matrix by a vector using SSE on an Intel processor. In one, we use the intrinsics in C, and in the other, we write the assembly directly.

#include <stdio.h>
#include <pmmintrin.h>

void calc_intrin(float result[], float matrix[], float vector[]);
void calc_asm(float result[], float matrix[], float vector[]);

int main() {
   float vec[] = {1.0, 10.0, 100.0, 1000.0};
   float mat[] = {2.0, 0.0,  0.0,  0.0,
                  0.0, 2.2,  0.0,  0.0,
                  0.0, 0.0, 22.2,  0.0,
                  0.0, 0.0,  0.0, 22.22};
   float result[4];

   calc_intrin(result, mat, vec);
   printf("%5.3f %5.3f %5.3f %5.3f\n", result[0], result[1], result[2], result[3]);

   calc_asm(result, mat, vec);
   printf("%5.3f %5.3f %5.3f %5.3f\n", result[0], result[1], result[2], result[3]);
   return 0;
}
void calc_intrin(float result[], float matrix[], float vector[])
{
   int row;
   __m128 vec = _mm_loadu_ps(vector);
   for (row = 0;row < 4;row++) {
      __m128 rowvec = _mm_loadu_ps(&matrix[row * 4]);
      __m128 rowvec2 = _mm_mul_ps(vec, rowvec);
      __m128 rowvec3 = _mm_hadd_ps(rowvec2, rowvec2);
      __m128 rowvec4 = _mm_hadd_ps(rowvec3, rowvec3);
      _mm_store_ss(&result[row], rowvec4);
   }
}

The code above multiplies a 4×4 matrix with a 4-element vector. If you look at how a matrix is multiplied by a vector, you can see that each row can be handled independently, which is what I'm doing up here. So, first, we loadu (load unaligned) using ps, which stands for packed single. This means that it will grab four 32-bit floats and put them into a single 128-bit register. Then we multiply the matrix row with the vector, which gives us the four element-wise products. Then, we need to add all four products together to get a single value, which is that element of the result vector. Each haddps instruction adds adjacent pairs of elements, so to sum all four, we run the instruction twice.

You can see that in C, this looks fairly easy to do, but we still need to know what each instruction is doing. If I write the assembly directly, as below, I can group certain operations and use the larger register set to my advantage.

.intel_syntax noprefix
.section .text
.global calc_asm
calc_asm:
   # System V AMD64 calling convention: rdi = result, rsi = matrix, rdx = vector
   # Load the four rows of the 4x4 matrix (16 bytes each, unaligned).
   movupd xmm0, [rsi + 0]
   movupd xmm1, [rsi + 16]
   movupd xmm2, [rsi + 32]
   movupd xmm3, [rsi + 48]

   # Load the 4-element vector.
   movupd xmm4, [rdx]

   # Multiply each matrix row element-wise with the vector.
   mulps xmm0, xmm4
   mulps xmm1, xmm4
   mulps xmm2, xmm4
   mulps xmm3, xmm4

   # Two horizontal adds per register sum the four products into one value.
   haddps xmm0, xmm0
   haddps xmm0, xmm0

   haddps xmm1, xmm1
   haddps xmm1, xmm1

   haddps xmm2, xmm2
   haddps xmm2, xmm2

   haddps xmm3, xmm3
   haddps xmm3, xmm3

   # Store the low single-precision float of each register into result[0..3].
   movss [rdi + 0], xmm0
   movss [rdi + 4], xmm1
   movss [rdi + 8], xmm2
   movss [rdi + 12], xmm3
   ret

The code above generates the following output (both versions compute the same result):

2.000 22.000 2220.000 22220.000
2.000 22.000 2220.000 22220.000

Either way you slice it, we're writing assembly, whether in a .S file or through the intrinsics, such as _mm_hadd_ps.


C Calling Conventions

C has a standard for most architectures called the ABI, which stands for application binary interface. We use this standard so that any language can talk to another, as well as to the architecture, provided they use the standard. Among other things, the C calling convention specifies which registers get which values whenever we call a function. Furthermore, it specifies which registers can be destroyed by the called function and which must be preserved. The RISC-V specification has an Assembly Programmer's chapter, which defines the following for the registers.

Register    ABI Name   Description                          Saver
x0          zero       Hard-wired zero                      --
x1          ra         Return address                       Caller
x2          sp         Stack pointer                        Callee
x3          gp         Global pointer                       --
x4          tp         Thread pointer                       --
x5-x7       t0-t2      Temporaries                          Caller
x8          s0/fp      Saved register / frame pointer       Callee
x9          s1         Saved register                       Callee
x10-x11     a0-a1      Function arguments / return values   Caller
x12-x17     a2-a7      Function arguments                   Caller
x18-x27     s2-s11     Saved registers                      Callee
x28-x31     t3-t6      Temporaries                          Caller

You can see that the registers have two names, one that starts with x and the other that starts with a, s, t, and some others. These are referring to the same registers, but x just gives you a number for that register. The a stands for argument, the t stands for temporary, and the s stands for saved. These are called ABI names for the registers, and then the description tells you what the ABI recommends each register should do. Finally, notice that there is a column called Saver. Again, we have to know which registers we’re responsible for.

Register Sizes

The load and store instructions can have one of four suffixes: b, h, w, or d, which stands for byte, halfword, word, and doubleword, respectively. A word in RISC-V is 32 bits or 4 bytes.

Each register in RISC-V is either 32 bits (for RV32) or 64 bits (for RV64). So, even though we load a single byte from memory, it still lands in a full-width register. With the vanilla load instructions, the value is sign-extended from its original size up to the register width.
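
As a small sketch (assuming RV64 and that some data already sits at the top of the stack), the four load sizes look like this:

lb t0, 0(sp)   # load 1 byte, sign-extended to fill the 64-bit register
lh t1, 0(sp)   # load 2 bytes (halfword), sign-extended
lw t2, 0(sp)   # load 4 bytes (word), sign-extended
ld t3, 0(sp)   # load 8 bytes (doubleword), fills the register exactly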

Caller vs Callee Saved

Saving registers is important whenever we call other functions, such as scanf and printf in the example below. A caller-saved register means that when we call scanf or printf, we are NOT guaranteed that the value we put in that register is the same when the function returns. The argument and temporary registers follow this convention.

A callee-saved register means that when we call scanf and printf, we are guaranteed that the value we put in these registers is the same when scanf and printf return. This means that scanf and printf are allowed to use the saved registers, but they are required to restore their old values before they return.

Generally, all argument and temporary registers are free to be destroyed by a called function, but all saved registers must be preserved (saved and later restored). For all of this to work, the compiler must know which registers it is responsible for.
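
Here is a minimal sketch of what that looks like in practice, using a hypothetical function keep_value that must hold its argument across a call to printf (some_string is assumed to be defined elsewhere). Because printf may clobber every a and t register, we park the value in the callee-saved s1, which we in turn must save and restore ourselves:

.section .text
.global keep_value       # hypothetical function, for illustration only
keep_value:
   addi sp, sp, -16
   sd ra, 0(sp)          # printf will overwrite ra, so save our return address
   sd s1, 8(sp)          # we want to use s1, so save the caller's copy first
   mv s1, a0             # park the argument in a callee-saved register
   la a0, some_string    # assumed format string defined elsewhere
   mv a1, s1
   call printf           # may destroy a0-a7 and t0-t6, but must preserve s1
   mv a0, s1             # s1 still holds our value after the call
   ld s1, 8(sp)          # restore the caller's s1
   ld ra, 0(sp)
   addi sp, sp, 16
   ret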


C and Assembly Code

When I teach my students how to encode (from assembly to machine) and decode (from machine to assembly) instructions, it is easy to see that each instruction encodes into exactly 32 bits in RISC-V (not counting the compressed instructions). However, I'm not exactly sure how I would teach that using C. Let's take a simple C program and see how it would translate into RISC-V assembly.

#include <stdio.h>
int main() {
   int i;
   scanf("%d", &i);
   printf("You gave %d\n", i);
   return 0;
}

This translates into RISC-V assembly as:

.section .rodata
scanf_string: .asciz "%d"
printf_string: .asciz "You gave %d\n"

.section .text
.global main
main:
   addi sp, sp, -16
   sd ra, 0(sp)
   la a0, scanf_string
   addi a1, sp, 8
   call scanf
   lw a1, 8(sp)
   la a0, printf_string
   call printf
   li a0, 0 # return 0
   ld ra, 0(sp)
   addi sp, sp, 16
   jr ra # same as ret instruction

Running the code above and giving 100 as the input to scanf produces the following output:

You gave 100

First, in C, string literals are read-only global constants. The directive .asciz above means that we have an ASCII string with a zero byte on the end of it. This is how C builds a string: it knows the starting memory address and goes character-by-character until it hits a zero. So, in other words, C-style strings are arrays of characters.
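
As a sketch of that character-by-character walk, here is a hypothetical my_strlen written directly in RISC-V assembly (assuming the string's address arrives in a0):

.section .text
.global my_strlen       # hypothetical function, for illustration only
my_strlen:
   mv t0, a0            # t0 walks the string
1:
   lbu t1, 0(t0)        # load one character
   beq t1, zero, 1f     # stop at the terminating zero byte
   addi t0, t0, 1       # move to the next character
   j 1b
1:
   sub a0, t0, a0       # length = end address - start address
   ret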

.section .rodata
scanf_string: .asciz "%d"
printf_string: .asciz "You gave %d\n"

So, the portion above labels these strings as scanf_string and printf_string. A label is just like a variable–it’s just a nice name for a memory location.

.section .text
.global main
main:

The .text section is also called the code section, and it is in this section we place CPU instructions. The directive .global main makes the label, main, viewable outside of this assembly file. This is necessary because the linker will look for main. Then, we have a label called main. You can see that a function is just simply a memory label where we can find the CPU instructions for that given function.

In C, every function must have a unique name. So, a C function’s name directly corresponds with the assembly’s memory label. In this case, main is the label for int main().

addi sp, sp, -16
sd ra, 0(sp)

The sp register stores the stack pointer. The stack is memory storage for local variables. Whenever we allocate memory from the stack, we decrement the stack pointer. Then, we own everything between the new value and the original value of the stack pointer. So, in the case above, main now owns sp - 16 through (but not including) the original sp. Again, sp is just a memory address.

The instruction sd stands for store-doubleword. In this case, a double word is 64 bits or 8 bytes. We are storing a register called ra which stands for return address. There is only one RA register, so whenever we call another function, such as scanf and printf, it will get overwritten. The purpose of this register is so printf and scanf can find their way back to main. Since these functions can be called from anywhere, it must have a return path…hence the return address.

la a0, scanf_string
addi a1, sp, 8
call scanf

Before we make a function call, we have to set up the argument registers. Luckily, RISC-V makes this easy. The first argument goes into register a0, the second into a1, and so forth. For scanf, the first argument is a pointer to a string. So, we use la (load address), which is actually a pseudo-instruction, to load the address of scanf_string into the first argument register, a0.

The second parameter is &i in C, which is the memory address where we want to store what we scan. In this case, i is a local variable, so we put it on the stack. Since bytes 0, 1, 2, …, 7 are taken by the saved return address, the next available offset is 8. So, we compute sp + 8 (with addi) to get the memory location where scanf should put the value.

lw a1, 8(sp)
la a0, printf_string
call printf

The code above loads a word, which is 32 bits or 4 bytes, from sp + 8, which recall is where scanf scanned our integer. This goes into the second argument, a1. Then, we load the address of the printf_string and store that memory address into the register a0.

Pointers and References

What do pointers and references mean in assembly? In C, we know that a pointer is a 4- or 8-byte value (depending on the architecture) that stores a memory address. So, it's like inception, where a memory address stores a memory address. In assembly, we get a memory address inside of our argument registers instead of a value. Here's an example.

void get_input(int *value) {
   printf("Enter a value: ");
   scanf("%d", value);
}

Recall that the arguments are stored in a0, a1, …, a7. So, the int *value will be in a0.

.section .rodata
printf_string: .asciz "Enter a value: "
scanf_string: .asciz "%d"

.section .text
.global get_input
get_input:
   addi sp, sp, -16
   sd ra, 0(sp)
   sd a0, 8(sp)
   la a0, printf_string
   call printf
   la a0, scanf_string
   ld a1, 8(sp)
   call scanf
   ld ra, 0(sp)
   addi sp, sp, 16
   ret

Notice that we store a0 into 8(sp). This is a memory location. So, when we ld a1, 8(sp) for the scanf call, a1 will be populated with the memory address that value points to, and hence, scanf will store the value inside of there. Notice that even though it is an integer pointer, we use ld and sd (doubleword). This is because we’re not storing the value, we’re storing the memory address, which on a 64-bit machine is always 64 bits (8 bytes).

To demonstrate pointers, let's do something simple, except we'll take pointers instead of values:

int work(int *left, int *right) {
   return *left + *right;
}

While we pass addresses around, nothing really changes. However, when we dereference to get the value out of the memory address, we must use a load instruction.

.section .text
.global work
work:
   # int *left is in a0
   # int *right is in a1
   # Use lw to dereference both left and right
   lw a0, (a0)
   lw a1, (a1)
   # Now, add them together
   add a0, a0, a1
   ret

Whatever we leave in the a0 register is what is returned. That’s why our ABI calls the a0 register the “function arguments / return value”.


Structures

A structure is just multiple pieces of data laid out in contiguous memory. The hard part comes from the fact that the C compiler tries to make memory accesses as efficient as possible by aligning each field to its size. I will show this below.

struct SomeStruct {
   int a;
   int b;
   int c;
};

Above, when we allocate memory for SomeStruct, we need 12 bytes (one integer is 4 bytes). In memory, bytes 0, 1, 2, 3 are taken by a, 4, 5, 6, 7 are taken by b, and 8, 9, 10, 11 are taken by c. So, in assembly, we can just look at the offsets 0, 4, and 8 and use lw (load word) to load from this structure.

So, let’s use this structure to see what it looks like in assembly.

int work(struct SomeStruct *s) {
   return s->a + s->b * s->c;
}

In assembly, we would do the following:

.section .text
.global work
work:
   # s->a is 0(a0)
   # s->b is 4(a0)
   # s->c is 8(a0)
   lw t0, 0(a0)  # s->a
   lw t1, 4(a0)  # s->b
   lw t2, 8(a0)  # s->c
   mul t1, t1, t2
   add a0, t0, t1
   ret

The a0 register stores the memory address where the structure starts. Then, we calculate offsets for each one of the elements a, b, and c. However, this is an easy example. There are two other examples we must explore: (1) heterogeneous data types and (2) structures in arrays.

Heterogeneous Data Types

What happens when we have multiple data types inside of a structure? The C compiler tries to make the memory controller’s work easy, so it will pad the structure so that when we access a field in a structure, it only takes one request from the memory controller. With heterogeneous data types, this means we have to waste a little bit of memory.

Take the following structure for example:

struct SomeStruct {
   char a;
   int b;
   short c;
};

The structure above has 1-byte, 4-byte, and 2-byte fields, so does that mean this structure is \(1+4+2=7\) bytes? The answer is no; the reason is padding. C will pad each field so that \(\text{offset}\%\text{size}=0\). That is, when we look at the offset of a field, it must be evenly divisible by the size of that field.

When I teach my students, I have them draw a name, offset, size table (NOS table). In here, we can see what’s going on, and it makes it easy to translate into assembly.

Name      Offset   Size
char a    0        1
int b     4        4
short c   8        2

We can see with the table above that int b starts at offset 4. So what happened to bytes 1, 2, and 3? Those are wasted in the form of padding. So, now we can figure out the offset of each element, so the following is now translatable:

int work(struct SomeStruct *s) {
   return s->a + s->b * s->c;
}

We saw that with three integers in a structure, we had offsets 0, 4, and 8. Well, lookie here! They are the same. The only thing we need to change is how we load. We look at the size column. Recall that 1 byte is a byte (lb/sb), 4 bytes is a word (lw/sw) and 2 bytes is a halfword (lh/sh).

.section .text
.global work
work:
   lb t0, 0(a0)  # s->a
   lw t1, 4(a0)  # s->b
   lh t2, 8(a0)  # s->c
   mul t1, t1, t2
   add a0, t1, t0
   ret

Structures in Arrays

C allows any data structure to be stored in an array. An array is simply a contiguous block of memory holding elements of the same type; however, those elements can themselves be structures. For an integer-only array a with 5 elements, element 0 sits at offset 0, element 1 at offset 4, element 2 at offset 8, element 3 at offset 12, and element 4 at offset 16.

This gives us the formula \(\text{address}=\text{base}+\text{index}\times\text{size}\). So, if our base is a (let's say 0 for now), then a[4] would be \(0+4\times 4=16\).
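
In assembly, that formula is just a shift and an add. Here is a sketch, assuming the base address of a is in a0 and a runtime index is in a1 (with a constant index like a[4], this collapses into a single lw t0, 16(a0)):

slli t1, a1, 2    # offset = index * 4 (the size of an int)
add t1, a0, t1    # address = base + offset
lw t0, 0(t1)      # load a[index]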

For structures, there can also be padding at the end of the structure. This trailing padding exists so that every element of an array stays properly aligned. Remember that we need offset % size to be zero for every field, so we pad the end of the structure until its total size is a multiple of the size of its largest field.

Let’s see how a structure would look in memory.

struct My {
   char a;
   long b;
   int *c;
};
struct My m[5];

int main() {
   m[0].a = 'A';
   m[3].b = 22;
   return 0;
}

We have an array of five of these structures, and each element has fields a, b, and c. On a 64-bit machine, a sits at offset 0, bytes 1 through 7 are padding, b sits at offset 8, and c sits at offset 16, so each element of the array takes 24 bytes. That means m[0].a lives at byte offset 0 from the start of m, and m[3].b lives at byte offset \(3\times 24+8=80\).
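
As a sketch, assuming the base address of m is already in a0 and the RV64 offsets above, the two assignments become:

li t0, 'A'
sb t0, 0(a0)     # m[0].a: element 0 * 24 bytes, plus offset 0 of field a
li t0, 22
sd t0, 80(a0)    # m[3].b: element 3 * 24 bytes, plus offset 8 of field b (a long)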


Globals

Recall that C generally uses the stack for local variables. However, C allows for global variables. Globals are stored in one of three sections: .rodata, .data, or .bss. The .rodata (read only data) stores global constants, the .data stores global, initialized variables, and the .bss stores global, uninitialized variables or global variables initialized to 0.

The BSS section is the weirdest one here. This section is automatically cleared to 0 by the operating system. So, any uninitialized variable actually has a defined value–0.

We can see these sections in action by looking at a C example:

#include <stdio.h>
int data = 100;
const int rodata = 200;
int bss;
int main() {
   printf("%d %d %d\n", data, rodata, bss);
   return 0;
}

If we look at where these variables end up in the compiled program (for example, by dumping its sections with objdump), we can see that our rodata constant has the memory address 0x2004 and holds the value 0x00_00_00_c8, which is 200, while our data variable has the memory address 0x4030 and holds 0x00_00_00_64, which is 100. Finally, our bss variable is placed at memory address 0x4038 and will have the value 0. The .bss section only records where the variable is located and how big it is, not its contents; the operating system looks at this section and zeroes the memory when the program is loaded.
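
Here is a sketch of how the compiler might place these three variables into sections (directives follow the GNU assembler; real compiler output adds alignment, size, and type directives):

.section .rodata
rodata: .word 200        # const int rodata = 200;

.section .data
data: .word 100          # int data = 100;

.section .bss
bss: .zero 4             # int bss; only the size is recorded, no bytes are stored in the file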


Conditions and Loops

Conditions are made by using a branch instruction. Some architectures support adding conditions to each instruction, but that is becoming rarer as architectures advance. Instead, we branch to a different label to either jump over code or jump into code. Let's take a look at a for loop to see how it would look in assembly.

for (int i = 100;i >= 0;i--) {
   printf("%d\n", i);
}

The assembly equivalent would be something like the following:

.section .rodata
printf_string: .asciz "%d\n"
.section .text
.global main
main:
   addi sp, sp, -16
   sd s0, 0(sp)
   sd ra, 8(sp)
   li s0, 100 # int i = 100
1:
   blt s0, zero, 1f  # if i < 0, break the loop
   la a0, printf_string
   mv a1, s0
   call printf
   addi s0, s0, -1 # i--
   j 1b
1:
   ld s0, 0(sp)
   ld ra, 8(sp)
   addi sp, sp, 16 # give back the stack space we reserved
   ret

So, if statements and loops conditionally execute code by using a branch instruction. In RISC-V, our branch instruction compares two registers. If those registers meet the condition, we go to the label (1f = label 1 forward, 1b = label 1 backwards). If the registers don’t meet the condition, the instruction has no effect and we move to the next instruction.
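
An if/else works the same way, just without the backwards jump. Here is a sketch of if (a0 < a1) { a0 = 1; } else { a0 = 2; } using the same numeric labels:

   bge a0, a1, 1f   # if NOT (a0 < a1), jump to the else branch
   li a0, 1         # then-branch
   j 2f             # skip over the else branch
1:
   li a0, 2         # else-branch
2: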


Data Types

The purpose of data types in C (when considering assembly) is to make sure we choose the correct load/store size (byte, halfword, word, or doubleword); however, we also need to choose the correct operations. Unsigned integers zero-extend when a smaller data size is widened into a larger one. For example:

unsigned char a = 100;
unsigned int b = (unsigned int)a;

The variable ‘a’ is only 8 bits; however, it widens to 32 bits when assigned to b. There are two ways we can widen a small data size into a larger one. We can zero extend, by filling the newly gained upper bits with zeroes, or we can sign extend, by taking the sign bit and replicating it across those upper bits. The choice depends on whether the data type is signed or unsigned. In assembly, each choice requires a different instruction. For example:

lb t0, 0(sp)
lbu t1, 0(sp)

The ‘lb’ (load byte) instruction will sign extend the 8-bit value at sp + 0 into the 64-bit register t0. The ‘lbu’ (load byte unsigned) instruction will zero extend the 8-bit value at sp + 0 into the 64-bit register t1. In RV64, we're always dealing with 64-bit registers, and the compiler makes the distinction between char, short, int, and long when emitting these instructions.

Another question is what to do with a right shift (>>). There are actually two right shifts: logical and arithmetic. A logical right shift is used for unsigned data types, and zeroes are inserted into the most significant bits as the value shifts right. An arithmetic right shift is used for signed data types, and the sign bit is duplicated into the most significant bits every time we shift right.

unsigned int a = 64;
int b = -64;

a = a >> 1;
b = b >> 1;

The variable ‘a’ above will be logically right shifted one place, which will give us 32. The variable ‘b’ will be arithmetically right shifted one place. If we did a logical shift, this would turn into a positive number. Since we duplicated the sign bit, ‘b’ becomes -32.

li t0, 64
li t1, -64
srli t0, t0, 1
srai t1, t1, 1

Conclusion and Further Reading

All in all, the compiler, when emitting its assembly code, must make these decisions. One data type of note is char. The C standard leaves it implementation-defined whether a plain char is signed or unsigned. In fact, I noticed a difference between GNU's RISC-V toolchain and GNU's Intel/AMD toolchain: the former treats a plain char as unsigned (zero extends), whereas the latter treats it as signed (sign extends).

To see what a compiler is doing, it helps to look at the assembly code. GNU Binutils has objdump, which can disassemble machine code back into assembly code. There may well come a time when reading that assembly is the only way to see what a compiler is really doing. Who knows?

I was pretty light on the specifics when it comes to compiling. I recommend my colleague and friend’s blog about compilers here: http://austinhenley.com/

Operating Systems Blog

https://osblog.stephenmarz.com/