
The RISC-V APLIC’s New Features

Contents

  1. Repository
  2. Introduction
  3. The APLIC
  4. Conclusion

Repository

This blog series refers to the code written here: https://github.com/sgmarz/riscv_msi.

The APLIC specification (still in draft) is part of the Advanced Interrupt Architecture (AIA) specification, and it is kept here: https://github.com/riscv/riscv-aia.

I am using AIA specification version 0.3.0-draft.31 to write this article.


Introduction

The advanced platform level interrupt controller (APLIC) is an advanced version of SiFive’s PLIC. The main advancement is that it supports sending interrupts via message. So, when paired with an incoming MSI controller (IMSIC), the APLIC can send messages just like any other hardware device.

The main purpose of the original PLIC was to aggregate, prioritize, and send hardware interrupt signals. It provided a method for an operating system to claim, handle, and complete an interrupt request. The PLIC sent a notification to a specific HART (hardware thread, essentially a CPU core) through an “external interrupt” pin. That HART would then determine what interrupted it by reading from a specific PLIC register called the claim register. This returned a number identifying the device; usually, this number corresponds to the specific wire connecting a hardware device to the PLIC.


The Advanced Platform Level Interrupt Controller (APLIC)

The new APLIC is not backwards compatible with the SiFive PLIC. Some of the concepts and features are the same, but the register file and mappings are different. Technically, a system may be implemented without a full APLIC, but the AIA documentation specifically states that “[f]ull conformance to the Advanced Interrupt Architecture requires the APLIC.”

First, the APLIC registers are laid out as follows:

#[repr(C)]
struct Aplic {
    pub domaincfg: u32,           // Domain CSR that controls how this APLIC functions
    pub sourcecfg: [u32; 1023],   // Source configuration for 1023 interrupts
    _reserved1: [u8; 0xBC0],

    pub mmsiaddrcfg: u32,         // Machine-level MSI address (for APLIC to write MSIs)
    pub mmsiaddrcfgh: u32,
    pub smsiaddrcfg: u32,         // Supervisor-level MSI address
    pub smsiaddrcfgh: u32,
    _reserved2: [u8; 0x30],

    pub setip: [u32; 32],         // Bitset to set pending interrupts (32 IRQS per element)
    _reserved3: [u8; 92],

    pub setipnum: u32,            // Sets a pending interrupt by number
    _reserved4: [u8; 0x20],

    pub clrip: [u32; 32],         // Bitset to clear pending interrupts (opposite of setip)
    _reserved5: [u8; 92],

    pub clripnum: u32,            // Clears a pending interrupt by number
    _reserved6: [u8; 32],

    pub setie: [u32; 32],         // Bitset to enable interrupts
    _reserved7: [u8; 92],

    pub setienum: u32,            // Enable an interrupt by number
    _reserved8: [u8; 32],

    pub clrie: [u32; 32],         // Bitset to disable interrupts (opposite of setie)
    _reserved9: [u8; 92],

    pub clrienum: u32,            // Disable an interrupt by number
    _reserved10: [u8; 32],

    pub setipnum_le: u32,         // Set an interrupt pending by number always little end first
    pub setipnum_be: u32,         // Set an interrupt pending by number always big end first
    _reserved11: [u8; 4088],

    pub genmsi: u32,              // Used to generate MSIs
    pub target: [u32; 1023],      // Target control per interrupt
}

Domain Configuration Register (domaincfg)

Domain Configuration Register (32-bits)

There are three usable fields in this register, interrupt enable (IE, bit 8), delivery mode (DM, bit 2), and big endian (BE, bit 0).

The most significant byte of domaincfg is, for all intents and purposes, a byte order mark. It always reads 0x80 so that we can test whether the register is big end first or little end first. For example, if we execute a “load word” (the lw instruction in RISC-V) and we get 0x80 in the first (lowest-addressed) byte, the APLIC is in big-endian mode. If we don’t, it is little endian, and that byte we loaded contains the delivery mode and big-endian bits.
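To make that byte-order probe concrete, here is a minimal sketch (a hypothetical helper, not from the repo) that classifies the word an ordinary little-endian `lw` of domaincfg would return:

```rust
// Hypothetical helper: classify the domaincfg byte order from a word
// loaded with a little-endian `lw`. Bits 31:24 of domaincfg always read
// as 0x80 in the register's own byte order.
fn domaincfg_is_big_endian(word: u32) -> bool {
    // With a little-endian load, the lowest-addressed byte of the
    // register lands in bits 7:0 of the loaded word. If that byte is
    // 0x80, the register is stored big-end first.
    word & 0xFF == 0x80
}

fn main() {
    // Little-end-first register: 0x80 ends up in the high byte of the load.
    assert!(!domaincfg_is_big_endian(0x8000_0104));
    // Big-end-first register: 0x80 ends up in the low byte of the load.
    assert!(domaincfg_is_big_endian(0x0401_0080));
}
```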

The interrupt enable bit enables the APLIC to send interrupts (1 = enabled, 0 = disabled). This doesn’t mean that the interrupt will necessarily be heard, but instead, it only means the APLIC can send interrupts by triggering a pending bit.

The delivery mode bit allows the APLIC to be configured to send normal interrupts, like the old PLIC, or to send interrupts as messages (MSIs). If the DM bit is set to 0, it will send direct interrupts, like the old PLIC. If the DM bit is set to 1, it will send MSIs instead.

The big endian bit allows the APLIC to write in big endian (BE = 1) or to write messages in little endian (BE = 0). However, the BE bit also affects the order of the multibyte domain configuration register too. The most significant byte is set to 0x80 on purpose to act as sort of a byte order mark.

We can configure a Rust function to set the domain register. All fields in this register are Boolean (yes/no, true/false, 1/0).

impl Aplic {
    pub fn set_domaincfg(&mut self, bigendian: bool, msimode: bool, enabled: bool) {
        // The Rust standard library guarantees that converting a bool
        // into a u32 yields 1 for true and 0 for false
        let enabled = u32::from(enabled);
        let msimode = u32::from(msimode);
        let bigendian = u32::from(bigendian);
        self.domaincfg = (enabled << 8) | (msimode << 2) | bigendian;
    }
}

Source Configuration Registers (sourcecfg[u32; 1023])

If bit 10 (delegate) is 0, SM is in bits 2:0. If D=1, bits 9:0 mark the child index.

There is one sourcecfg register for every interrupt possible. Recall that interrupt 0 is not possible, so interrupt 1’s source configuration register is in sourcecfg[0]. This is why the example Rust code I provided subtracts 1 from the interrupt source’s value.

The delegate bit (bit 10) can be read to determine if the given interrupt has been delegated. This bit is read/write. If we write a 1 into this field, it delegates it to a child domain. If that particular source does not have a child domain, this bit will always be read 0.

The source mode bits (bits 2:0) control how the interrupt is triggered for those interrupts not delegated. If an interrupt is delegated (D=1), the bits 9:0 describe the index of the child it was delegated to. If the interrupt is not delegated (D=0), then each interrupt source may be configured to trigger a “pending” interrupt by one of the following.
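As a quick illustration, the two forms of a raw sourcecfg value could be decoded like this (hypothetical helpers, not from the repo):

```rust
// Hypothetical decode helpers for a raw sourcecfg value.
fn sourcecfg_is_delegated(v: u32) -> bool {
    // Bit 10 is the delegate (D) bit.
    v & (1 << 10) != 0
}

// Only meaningful when the delegate bit is set: bits 9:0 are the child index.
fn sourcecfg_child(v: u32) -> u32 {
    v & 0x3FF
}

// Only meaningful when the delegate bit is clear: bits 2:0 are the source mode (SM).
fn sourcecfg_mode(v: u32) -> u32 {
    v & 0x7
}

fn main() {
    let delegated = (1 << 10) | 5; // delegated to child index 5
    assert!(sourcecfg_is_delegated(delegated));
    assert_eq!(sourcecfg_child(delegated), 5);
    assert_eq!(sourcecfg_mode(6), 6); // SM = 6 is Level1
}
```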

3-bit “SM” value | Name     | Description
-----------------+----------+--------------------------------------------------------------------
0                | Inactive | The interrupt source cannot generate an interrupt and is inactive.
1                | Detached | The interrupt can only be generated by writing directly to the APLIC.
2, 3             | Reserved | Reserved.
4                | Edge1    | The interrupt is asserted on a rising edge (from 0 to 1).
5                | Edge0    | The interrupt is asserted on a falling edge (from 1 to 0).
6                | Level1   | The interrupt is asserted when high (device IRQ pin is asserted).
7                | Level0   | The interrupt is asserted when low (device IRQ pin is deasserted).

An inactive interrupt source cannot generate an interrupt through the APLIC, and we cannot manually assert an interrupt by writing to the interrupt pending register. A detached interrupt source cannot be triggered by the device, however we can manually write to the MMIO APLIC registers to assert the interrupt.

#[repr(u32)]
enum SourceModes {
    Inactive = 0,
    Detached = 1,
    RisingEdge = 4,
    FallingEdge = 5,
    LevelHigh = 6,
    LevelLow = 7,
}
impl Aplic {
    pub fn set_sourcecfg(&mut self, irq: u32, mode: SourceModes) {
        assert!(irq > 0 && irq < 1024);
        self.sourcecfg[irq as usize - 1] = mode as u32;
    }

    pub fn sourcecfg_delegate(&mut self, irq: u32, child: u32) {
        assert!(irq > 0 && irq < 1024);
        self.sourcecfg[irq as usize - 1] = 1 << 10 | (child & 0x3ff);
    }
}

Set/Clear Pending/Enable Interrupt (setip/clrip) Registers

These registers control whether an interrupt is pending. The APLIC will set these bits itself whenever an enabled interrupt is triggered; however, we can manually set an interrupt as a pending interrupt.

These registers work differently from the IMSIC registers: there is one set of registers to set a pending interrupt and another set to clear a pending interrupt.

The setipnum and clripnum registers function much the same way as the following Rust functions that use the setip/clrip registers.

impl Aplic {
    pub fn set_ip(&mut self, irq: u32, pending: bool) {
        assert!(irq > 0 && irq < 1024);
        let irqidx = irq as usize / 32;
        let irqbit = irq as usize % 32;
        if pending {
            // self.setipnum = irq;
            self.setip[irqidx] = 1 << irqbit;
        } else {
            // self.clripnum = irq;
            self.clrip[irqidx] = 1 << irqbit;
        }
    }
}

The interrupt enable registers act much like the interrupt pending registers, except they allow the interrupts to be signaled. If an interrupt is NOT enabled, then that interrupt is masked and cannot be triggered.

impl Aplic {
    pub fn set_ie(&mut self, irq: u32, enabled: bool) {
        assert!(irq > 0 && irq < 1024);
        let irqidx = irq as usize / 32;
        let irqbit = irq as usize % 32;
        if enabled {
            // self.setienum = irq;
            self.setie[irqidx] = 1 << irqbit;
        } else {
            // self.clrienum = irq;
            self.clrie[irqidx] = 1 << irqbit;
        }
    }
}

Generate MSI Register (genmsi)

There are two read/write fields and one read-only field in the genmsi register. The genmsi register can be used to trigger an MSI write, even though writing directly to the IMSIC is more efficient.

The hart index selects the HART you want to send an MSI to, and the EIID (external interrupt identifier) is the value to write to the IMSIC. Usually the EIID is the same number as the interrupt you want to trigger.
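A sketch of how those two writable fields might be packed, assuming the hart index occupies bits 31:18 and the EIID bits 10:0 (a hypothetical helper; check the field layout in the spec version you target):

```rust
// Hypothetical encoder for a genmsi write: hart index in bits 31:18,
// EIID in bits 10:0. The Busy bit is read-only, so software never sets it.
fn genmsi_value(hart: u32, eiid: u32) -> u32 {
    (hart << 18) | (eiid & 0x7FF)
}

fn main() {
    // Send EIID 10 to hart 1.
    assert_eq!(genmsi_value(1, 10), (1 << 18) | 10);
}
```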


Target Control Registers (target[u32; 1023])

The target control registers have two different forms based on how the APLIC is configured. If the APLIC is configured in direct delivery mode, then the register contains a HART index and a priority. Smaller priorities are higher, so 10 has a higher priority than 20.

impl Aplic {
    pub fn set_target_direct(&mut self, irq: u32, hart: u32, prio: u32) {
        assert!(irq > 0 && irq < 1024);
        self.target[irq as usize - 1] = (hart << 18) | (prio & 0xFF);
    }
}

In MSI delivery mode, the register contains a HART index, guest index, and external interrupt identifier (EIID).

impl Aplic {
    pub fn set_target_msi(&mut self, irq: u32, hart: u32, guest: u32, eiid: u32) {
        assert!(irq > 0 && irq < 1024);
        self.target[irq as usize - 1] = (hart << 18) | (guest << 12) | eiid;
    }
}

Interrupt Delivery Control

As I mentioned previously, the APLIC may be configured in MSI or direct mode. In MSI mode, the messages are handled by the incoming MSI controller (IMSIC); however, in direct mode, the APLIC itself will control interrupts. This is done through the interrupt delivery control (IDC) section of the APLIC.

For the virt machine on QEMU, the interrupt delivery control registers are memory mapped to 0x. The registers are laid out as follows.

Offset | Size (bytes) | Register Name
-------+--------------+--------------
0      | 4            | idelivery
4      | 4            | iforce
8      | 4            | ithreshold
24     | 4            | topi
28     | 4            | claimi

Each IDC is for an individual HART and is 32 bytes. Therefore, HART #0’s registers start at offset 0, HART #1’s registers start at offset 32, and so forth.
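That per-HART layout is simple to compute (hypothetical helper, not from the repo):

```rust
// Each interrupt delivery control (IDC) structure is 32 bytes, one per HART.
fn idc_base(idc_region_base: usize, hart: usize) -> usize {
    idc_region_base + hart * 32
}

fn main() {
    assert_eq!(idc_base(0, 0), 0);  // HART 0 starts at offset 0
    assert_eq!(idc_base(0, 1), 32); // HART 1 starts at offset 32
    assert_eq!(idc_base(0, 3), 96); // HART 3's idelivery sits at offset 96
}
```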

The idelivery register enables the APLIC’s direct delivery mode. If this register is set to 1, then the APLIC may deliver interrupts, otherwise, if this register is set to 0, then the APLIC does not deliver interrupts.

The idelivery register.

The iforce register forces the APLIC to deliver interrupt #0 (a spurious interrupt). If we write 1 to this register and the APLIC’s interrupts are enabled, this will signal an interrupt.

The iforce register

The ithreshold register sets the priority threshold an interrupt must beat to be heard. A threshold of 0 means that all interrupts can be heard. If the value is non-zero, then any interrupt priority at or above that value is masked, and hence not heard. As with everything else here, the lower the number, the higher the priority; don’t let 0 fool you, it’s just a special value that unmasks all priorities. So a threshold of 1 means all interrupts are masked, whereas a threshold of 2 means that only priority 1 can be heard.
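The masking rule can be captured in a few lines (a hypothetical helper mirroring the description above):

```rust
// ithreshold semantics: 0 unmasks everything; a nonzero threshold T masks
// every priority >= T (remember: numerically lower = higher priority).
fn is_heard(threshold: u32, priority: u32) -> bool {
    threshold == 0 || priority < threshold
}

fn main() {
    assert!(is_heard(0, 1234)); // 0 unmasks all priorities
    assert!(!is_heard(1, 1));   // threshold 1 masks everything
    assert!(is_heard(2, 1));    // threshold 2 hears only priority 1
    assert!(!is_heard(2, 2));
}
```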

The ithreshold register.

The topi register holds the interrupt number that is the highest priority and is enabled. The register is split into two pieces: bits 25:16 hold the interrupt number and bits 7:0 hold the interrupt priority.

The topi and claimi registers

The claimi register is the same as topi except that reading it claims the top interrupt: when we read from this register, the pending bit for the corresponding interrupt is cleared to 0.
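Decoding a raw topi/claimi value might look like this (hypothetical helper, not from the repo):

```rust
// topi/claimi layout: bits 25:16 = interrupt number, bits 7:0 = priority.
fn decode_topi(v: u32) -> (u32, u32) {
    let irq = (v >> 16) & 0x3FF;
    let prio = v & 0xFF;
    (irq, prio)
}

fn main() {
    let raw = (10 << 16) | 3; // interrupt 10 at priority 3
    assert_eq!(decode_topi(raw), (10, 3));
}
```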

The OS in the repo I made only uses the MSI delivery mode, since this direct delivery control works much like the old PLIC’s.


Conclusion

QEMU’s virt machine connects the UART to external IRQ #10, so we can use the APLIC to send messages whenever the UART receiver has data. This requires us to set up the APLIC using the registers above.

In the code below, I split between the machine and supervisor mode and delegate the UART to the supervisor APLIC (index 0).

pub fn aplic_init() {
    // The root APLIC
    let mplic = Aplic::as_mut(AplicMode::Machine);
    // The delegated child APLIC
    let splic = Aplic::as_mut(AplicMode::Supervisor);

    // Enable both the machine and supervisor APLICs
    mplic.set_domaincfg(false, true, true);
    splic.set_domaincfg(false, true, true);

    // Write messages to IMSIC_S
    mplic.set_msiaddr(AplicMode::Supervisor, crate::imsic::IMSIC_S);

    // Delegate interrupt 10 to child 0, which is APLIC_S
    // Interrupt 10 is the UART. So, whenever the UART receives something
    // into its receiver buffer register, it triggers an IRQ #10 to the APLIC.
    mplic.sourcecfg_delegate(10, 0);

    // The EIID is the value that is written to the MSI address
    // When we read TOPEI in IMSIC, it will give us the EIID if it
    // has been enabled.
    splic.set_target_msi(10, 0, 0, 10);

    // Level high means to trigger the message delivery when the IRQ is
    // asserted (high).
    splic.set_sourcecfg(10, SourceModes::LevelHigh);

    // The order is important. QEMU will not allow enabling of the IRQ
    // unless the source configuration is set properly.
    // mplic.set_irq(10, true);
    splic.set_ie(10, true);
}

We can now handle the UART in the IMSIC trap handler whenever the UART sends a message.

pub fn imsic_handle(pm: PrivMode) {
    let msgnum = imsic_pop(pm);
    match msgnum {
        0 => println!("Spurious 'no' message."),
        2 => println!("First test triggered by MMIO write successful!"),
        4 => println!("Second test triggered by EIP successful!"),
        10 => console_irq(),
        _ => println!("Unknown msi #{}", msgnum),
    }
}

The code above forwards any message #10 to console_irq, which in turn pops a value off of the receiver buffer register of the UART device.

When all is said and done, we can start getting UART messages:

RISC-V Is Getting MSIs!

Contents

  1. Overview
  2. Repository
  3. Message Signaled Interrupts (MSI)
  4. Incoming MSI Controller (IMSIC)
  5. Conclusion
  6. What’s Next

Overview

Message signaled interrupts, or MSIs, describe a way to signal an interrupt without a dedicated interrupt request (IRQ) pin. One of the most prevalent uses for MSIs is the PCI bus, and the PCI specification defines the MSI and MSI-X standards. The potential benefits include: (1) a reduced number of direct wires from the device to the CPU or interrupt controller, (2) improved signaling performance, mainly by forcing in-band signals by design, and (3) improved guest/host signaling for virtualized environments.

Wikipedia’s article on MSIs can be found here: https://en.wikipedia.org/wiki/Message_Signaled_Interrupts


Repository

All of the code for this post can be found here: https://github.com/sgmarz/riscv_msi. Use tag “msi”.

The code is written for RV32I in Rust. I originally wrote it for RV64GC, but everything else I wrote is also for RV64GC, so I figured I should branch out and broaden my horizons.

The AIA draft manual, written by John Hauser, can be found here: https://github.com/riscv/riscv-aia. This blog post uses version 0.3.0-draft.31, tagged on 13-Jun-2022.


Message Signaled Interrupts (MSI)

An MSI is triggered by a “message”, which is a fancy term for a “memory write”. In fact, we can trigger a message by simply dereferencing an MMIO pointer.

// Note that 0xdeadbeef is not really a valid message.
// The AIA specifies messages 1 through 2047 are valid due to the number
// of registers available. But, the following is just an example.
use core::ptr::write_volatile;

let message = 0xdeadbeef_u32;
// QEMU's 'virt' machine attaches the M-mode IMSIC for HART 0 to 0x2400_0000.
// The AIA specifies that this must be a word write; a double word will cause a trap.
unsafe {
    write_volatile(0x2400_0000 as *mut u32, message);
}

Memory Mapped IO Addresses for Interrupt Files

The code above writes to the MMIO address 0x2400_0000, which is where QEMU’s virt machine connects the M-mode IMSIC for HART 0. The S-mode IMSIC for HART 0 is connected to 0x2800_0000. Each HART is a page away from each other, meaning the M-mode IMSIC for HART 1 is at 0x2400_1000, and the S-mode IMSIC for HART 1 is 0x2800_1000.
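Under QEMU’s virt layout described above, the per-HART interrupt-file addresses can be computed like so (hypothetical helpers; the base addresses are QEMU-specific, not architectural):

```rust
// QEMU 'virt' bases (platform-specific, not architectural): M-mode IMSICs
// start at 0x2400_0000, S-mode IMSICs at 0x2800_0000, one 4 KiB page per HART.
const IMSIC_M_BASE: usize = 0x2400_0000;
const IMSIC_S_BASE: usize = 0x2800_0000;

fn imsic_m_addr(hart: usize) -> usize {
    IMSIC_M_BASE + hart * 0x1000
}

fn imsic_s_addr(hart: usize) -> usize {
    IMSIC_S_BASE + hart * 0x1000
}

fn main() {
    assert_eq!(imsic_m_addr(1), 0x2400_1000);
    assert_eq!(imsic_s_addr(1), 0x2800_1000);
}
```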

For many embedded systems, these values would come from a specification or from an open firmware (OF) package that contained a flattened device tree (FDT) that specifies an MMIO address and what’s connected to it. Here’s an example of QEMU’s virt FDT as plain text. For the repo, I hard coded the MMIO addresses instead of reading a device tree.

imsics@24000000 {
   phandle = <0x09>;
   riscv,ipi-id = <0x01>;
   riscv,num-ids = <0xff>;
   reg = <0x00 0x24000000 0x00 0x4000>;
   interrupts-extended = <0x08 0x0b 0x06 0x0b 0x04 0x0b 0x02 0x0b>;
   msi-controller;
   interrupt-controller;
   compatible = "riscv,imsics";
};

The organization that standardizes device tree formats can be found here: https://www.devicetree.org/

More information about device trees in Linux can be found here: https://www.kernel.org/doc/html/latest/devicetree/usage-model.html

IMSICs were added to QEMU’s virt machine fairly recently, so you may need to clone and build your own QEMU. QEMU’s repository can be found here: https://github.com/qemu.

Triggering an Interrupt by Sending a Message

After a device writes a word to a specific MMIO address, the interrupt is triggered. This means that a device does not need a wire connecting it to an IRQ controller, such as RISC-V’s platform-level interrupt controller (PLIC). Instead, as long as the device can make a memory write, it can trigger an interrupt.

Even though triggering a message is that simple, we need a mechanism to enable and prioritize these messages. There might be some circumstances where we don’t want to hear certain messages. This is where the incoming MSI controller or IMSIC comes into play.


Incoming MSI Controller (IMSIC)

To be able to support MSIs, some device needs to be able to take memory writes and turn them into interrupts. Furthermore, the device needs to provide a mechanism to enable/disable and to prioritize interrupts just like a regular interrupt controller. This is done by the incoming MSI controller (IMSIC) device.

WARNING: The Advanced Interrupt Architecture (AIA) manual is still a work-in-progress, and already, there have been major changes that removed or added CSRs or other pertinent information. So, some of the code and tables might be outdated.

The register mechanism for the IMSIC consists of several control and status registers (CSRs) as well as internal registers accessible through a selection mechanism.

Newly Added Registers

The AIA defines several new CSRs separated between the machine and supervisor modes.

Register Name | Register Number | Description
--------------+-----------------+-------------------------------------------------------
MISELECT      | 0x350           | Machine register select
SISELECT      | 0x150           | Supervisor register select
MIREG         | 0x351           | A R/W view of the register selected by MISELECT
SIREG         | 0x151           | A R/W view of the register selected by SISELECT
MTOPI         | 0xFB0           | Machine top-level interrupt
STOPI         | 0xDB0           | Supervisor top-level interrupt
MTOPEI        | 0x35C           | Machine top-level external interrupt (requires IMSIC)
STOPEI        | 0x15C           | Supervisor top-level external interrupt (requires IMSIC)
New CSRs defined for the AIA.

The registers MISELECT and MIREG allow us to select a register by writing its number into the MISELECT register. Then the MIREG will represent the selected register. For example, if we read from MIREG, we read from the selected register, and if we write to MIREG, we write to the selected register.

There are four selectable registers, each with machine and supervisor versions. For example, if we write to SISELECT, we will view the supervisor version of the register through SIREG.

Register Name      | MISELECT/SISELECT | Description
-------------------+-------------------+---------------------------------------
EIDELIVERY         | 0x70              | External Interrupt Delivery Register
EITHRESHOLD        | 0x72              | External Interrupt Threshold Register
EIP0 through EIP63 | 0x80 through 0xBF | External Interrupt Pending Registers
EIE0 through EIE63 | 0xC0 through 0xFF | External Interrupt Enable Registers
Registers selectable by MISELECT/SISELECT and readable/writeable via MIREG/SIREG.

The first thing we need to do is enable the IMSIC itself. This is done through a register called EIDELIVERY, the external interrupt delivery register. This register may contain one of three values:

Value       | Description
------------+-------------------------------------------------
0           | Interrupt delivery is disabled
1           | Interrupt delivery is enabled
0x4000_0000 | Optional interrupt delivery via a PLIC or APLIC
Values written to the EIDELIVERY register.

Enabling the IMSIC

So, we need to write 1 (Interrupt delivery is enabled) into the EIDELIVERY register:

// First, enable the interrupt file
// 0 = disabled
// 1 = enabled
// 0x4000_0000 = use PLIC instead
imsic_write(MISELECT, EIDELIVERY);
imsic_write(MIREG, 1);

The EITHRESHOLD register sets a threshold that an interrupt’s priority must be below to be heard. If an interrupt has a priority less than the value in EITHRESHOLD, it will be “heard” (unmasked). Otherwise, it is masked and will not be heard. For example, an EITHRESHOLD of 5 permits only messages 1, 2, 3, and 4 to be heard. Message 0 is reserved to mean “no message”.

Since a higher threshold opens more messages, messages with a lower number have a higher priority.

// Set the interrupt threshold.
// 0 = enable all interrupts
// P = enable < P only
imsic_write(MISELECT, EITHRESHOLD);
// Only hear 1, 2, 3, and 4
imsic_write(MIREG, 5);

Interrupt Priorities

The AIA documentation uses the message itself as the priority. So, a message of 1 has a priority of 1, whereas a message of 1234 has a priority of 1234. This is more convenient since we can control messages directly. However, since each message number has associated enable and pending bits, there is a limit to the highest numbered interrupt. The specification has a maximum of \(32\times64 - 1 = 2,047\) total messages (we subtract one to remove 0 as a valid message).

Enabling Messages

The EIE register controls whether a message is enabled or disabled. For RV64, these registers are 64-bits wide, but still take up two adjacent register numbers. So, for RV64, only even numbered registers are selectable (e.g., EIE0, EIE2, EIE4, …, EIE62). If you try to select an odd numbered EIE, you will get an invalid instruction trap. This took me many hours to figure out even though the documentation states this is the desired behavior. For RV32, the EIE registers are only 32-bits, and EIE0 through EIE63 are all selectable.

The EIE register is a bitset. If the bit for a corresponding message is 1, the message is unmasked and enabled; otherwise it is masked and disabled. For RV64, messages 0 through 63 are all in EIE0[63:0]: the bit position is the message number. We can use the following formulas to determine which register to select for RV64:

// Enable a message number for machine mode (RV64)
fn imsic_m_enable(which: usize) {
    let eiebyte = EIE0 + 2 * which / 64;
    let bit = which % 64;

    imsic_write(MISELECT, eiebyte);
    let reg = imsic_read(MIREG);
    imsic_write(MIREG, reg | 1 << bit);
}

RV32 behaves much the same, except we don’t have to scale it by 2.

// Enable a message number for machine mode (RV32)
fn imsic_m_enable(which: usize) {
    let eiebyte = EIE0 + which / 32;
    let bit = which % 32;

    imsic_write(MISELECT, eiebyte);
    let reg = imsic_read(MIREG);
    imsic_write(MIREG, reg | 1 << bit);
}

With the code above, we can now enable the messages we want to hear. The following example enables messages 2, 4, and 10.

imsic_m_enable(2);
imsic_m_enable(4);
imsic_m_enable(10);

Pending Messages

The EIP registers behave in the exact same way as the EIE registers except that a bit of 1 means that that particular message is pending, meaning a write to the IMSIC with that message number was sent. The EIP register is read/write. If we read from it, we can determine which messages are pending. If we write to it, we can manually trigger an interrupt message by writing a 1 to the corresponding message bit.

// Trigger a message by writing to EIP for Machine mode in RV64
fn imsic_m_trigger(which: usize) {
    let eipbyte = EIP0 + 2 * which / 64;
    let bit = which % 64;

    imsic_write(MISELECT, eipbyte);
    let reg = imsic_read(MIREG);
    imsic_write(MIREG, reg | 1 << bit);
}

Testing

Now that we can enable the delivery as well as individual messages, we can trigger them in one of two ways: (1) write the message directly to the MMIO address, or (2) set the interrupt pending bit of the corresponding message to 1.

unsafe {
    // We are required to write only 32 bits.
    // Write the message directly to MMIO to trigger
    write_volatile(0x2400_0000 as *mut u32, 2);
}
// Set the EIP bit to trigger
imsic_m_trigger(2);

Message Traps

Whenever an unmasked message is sent to an enabled IMSIC, it will come to the specified HART as an external interrupt. For the machine-mode IMSIC, this will come as asynchronous cause 11, and for the supervisor-mode IMSIC, this will come as asynchronous cause 9.

When we receive an interrupt due to a message being delivered, we will need to “pop” off the top level pending interrupt by reading from the MTOPEI or STOPEI registers depending on the privilege mode. This will give us a value where bits 26:16 contain the message number and bits 10:0 contain the interrupt priority. Yes, the message number and message priority are the same number, so we can choose either.
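Since the two fields carry the same number, a decode helper is almost redundant, but it shows the layout (hypothetical helper, not from the repo):

```rust
// MTOPEI/STOPEI layout: bits 26:16 = message number, bits 10:0 = priority.
// In the AIA, the two fields hold the same value.
fn decode_topei(v: u32) -> (u32, u32) {
    ((v >> 16) & 0x7FF, v & 0x7FF)
}

fn main() {
    let raw = (10 << 16) | 10; // message 10
    assert_eq!(decode_topei(raw), (10, 10));
}
```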

// Pop the top pending message
fn imsic_m_pop() -> u32 {
    let ret: u32;
    unsafe {
        // 0x35C is the MTOPEI CSR.
        asm!("csrrw {retval}, 0x35C, zero", retval = out(reg) ret);
    }
    // Message number starts at bit 16
    ret >> 16
}

My compiler does not support the names of the CSRs in this specification, so I used the CSR number instead. That is why you see 0x35C instead of mtopei, but they mean the same thing.

When we read from the MTOPEI register (0x35C), it will give us the message number of the highest priority message. The instruction csrrw in the code snippet above will atomically read the value of the CSR into the return register and then store the value zero into the CSR.

When we write zero into the MTOPEI register (0x35C), we are telling the IMSIC that we are “claiming” that we are handling the topmost message, which clears the EIP bit for the corresponding message number.

/// Handle an IMSIC trap. Called from `trap::rust_trap`
pub fn imsic_m_handle() {
    let msgnum = imsic_m_pop();
    match msgnum {
        0 => println!("Spurious message (MTOPEI returned 0)"),
        2 => println!("First test triggered by MMIO write successful!"),
        4 => println!("Second test triggered by EIP successful!"),
        _ => println!("Unknown msi #{}", msgnum),
    }
}

Message 0 is not a valid message since when we pop from the MTOPEI, a 0 signals “no interrupt”.

Example Output

If you run the repo with a new enough QEMU, you should see the test messages printed after a successful run.


Conclusion

In this post, we saw the new registers added to support the incoming MSI controller (IMSIC) for RISC-V. We also enabled the IMSIC delivery, the individual messages, and handled two ways of sending a message: via MMIO directly or by setting the corresponding EIP bit. Finally, we handled the interrupt from the trap.


What’s Next?

The second part of the AIA manual includes the new Advanced Platform-Level Interrupt Controller or APLIC. We will examine this system as well as write drivers to start looking at how this new APLIC can signal using wires or messages.

After the APLIC, we will write a driver for a PCI device and use it to send MSI-X messages.

Five Tips to Writing RISC-V Assembly

Writing assembly is itself an art. When C, C++, or any other language is compiled, the compiler determines the art of writing assembly. This time, however, we will look at some of the techniques and decisions we can make when writing assembly ourselves.

We will use RISC-V to see how to design logic, write up the logic, and translate the logic into assembly.


Tip 1: Design the Logic in a Comfortable Language

This is the hardest step to get right with my students. Many students want to sit down and write the complete package. However, if you’re not comfortable with assembly, this is a doomed approach. Instead, to divide the logic from the language, we have to write in a language we understand.

If a student doesn’t know C or some lower-level language, then I suggest they write in pseudocode. Too high of a language will make the translation harder and too low of a language will make the logic design harder. So, I recommend C or C++, or some language at that level.

When translating, it is helpful to have some sort of editor that can place them side-by-side. Keeping a list of instructions in your brain is difficult, especially if you’re translating a complex program.

Atom’s “split” windows.

Tip 2: Take Small Bites

Many of my students try to write the entire program from start to finish without testing anything in between. I teach incremental programming, especially for new learners. The point is to test when you get a portion of logic completed. This could be as simple as getting a for loop done, or shifting to scale a pointer.

One way you can test is to link your C or C++ program with your assembly program. You can do this by prototyping the name of the assembly function in C++ and switching between the two. You need to keep the two names different, otherwise your linker won’t be happy. I usually put a “c” in front of my C functions to differentiate. I don’t change the assembly names since, in the end, we want the assembly functions to replace the C functions.

Combining assembly-written and C-written functions.

With what I’ve done above, we can call show to run the assembly function or cshow to run the C function. In C++, the names of the functions are mangled to allow for overloading. However, you can turn this off by doing the following:

extern "C" { // Turn off name mangling
    void show(int *x);
}

The extern “C” will tell C++ that the functions follow the C “calling convention”. We really care that name mangling is turned off so that we can just create a label named “show” and have our function.


Tip 3: Know Your Role

As Dwayne “The Rock” Johnson always said, “know your role”. It is important to know what C/C++ was doing for us versus what assembly doesn’t do for us. This includes order of operations. For example, 4 + 3 * 4 will automatically order the operations to perform multiplication first, followed by addition. However, in assembly, we must choose the multiply instruction first followed by the addition instruction. There is no “reordering” performed for us.
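To mirror what the assembly forces us to do, we can write the expression one “instruction” per line; here is a tiny illustration in Rust (hedged: Rust standing in for hand-ordered assembly, with each temporary playing the role of one instruction’s destination register):

```rust
// 4 + 3 * 4, written one "instruction" per line, the way assembly forces us to.
fn eval() -> i32 {
    let t0 = 3 * 4;  // mul: multiply FIRST (operator precedence done by hand)
    let t1 = 4 + t0; // add: then add
    t1
}

fn main() {
    // Same result the compiler would produce for 4 + 3 * 4.
    assert_eq!(eval(), 16);
}
```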


Tip 4: Know How to Call a Function

Most ISAs come with a calling convention manual, such as ARM’s and RISC-V’s. These lay down some ground rules to make calling a function work across all languages. Luckily, the “ABI” names of the RISC-V registers really lend to what they mean. Here are some of the rules we need to abide by.

  • Integer arguments go in a0-a7, floating point arguments go in fa0-fa7.
  • Allocate by subtracting from the stack pointer, deallocate by adding it back.
  • The stack must be allocated in chunks of 16 (the RISC-V ABI requires the stack pointer to stay 16-byte aligned).
  • All a (argument) and t (temporary) registers must be considered destroyed after a function call. Any time I see the “call” pseudo-instruction, I start thinking about saving registers using the stack.
  • All s (saved) registers can be considered saved after a function call.
  • If you use any of the saved registers, you must restore their original value before returning.
  • Return data back to the caller through the a0 register.
  • Don’t forget to save the one and only RA (return address) register any time you see a call instruction (jal or jalr) in your function.

Obviously, I’ve reinterpreted the rules in a more informal way, but you get the gist of what’s going on here.

.global main
main:
    addi    sp, sp, -16     # Keep sp 16-byte aligned per the ABI
    sd      ra, 0(sp)
    la      a0, test_solve
    call    solve
    mv      a0, zero
    ld      ra, 0(sp)
    addi    sp, sp, 16
    ret

You can see from the code above that we allocate our stack frame first, save the registers that need saving, execute, and then undo everything before returning.


Tip 5: Document!

Writing assembly from C or another language will have you writing multiple lines of assembly for every single line of C. This can get confusing and utterly frustrating when you’re trying to debug your program. So, I always write the C code as a comment above the assembly and then pull it apart, commenting each individual step.

# used |= 1 << ( x[i * 9 + col] - 1)
    li      t0, 9
    mul     t1, s3, t0          # t1 = i * 9
    add     t1, t1, s2          # t1 = i * 9 + col
    slli    t2, t1, 2           # Scale by 4
    add     t2, t2, s6          # x + i * 9 + col
    lw      t3, 0(t2)           # x[i * 9 + col]
    addi    t3, t3, -1          # x[i * 9 + col] - 1
    li      t4, 1
    sll     t4, t4, t3          # 1 << (x[i * 9 + col] - 1)
    or      s5, s5, t4          # used |= ...

You can see from the code above that I have the original C code (first comment) and then an inline comment for each piece. This keeps me honest about the order of operations and that I’m doing each step correctly.

Writing Pong in Rust for my OS Written in Rust

This post is part of a larger effort you can view over here: https://osblog.stephenmarz.com.

Pong being played on my custom RISC-V OS in Rust!

Video

Contents

  1. Overview
  2. Application Programmer’s Interface (API)
  3. Starting Routines
  4. System Calls
  5. Drawing Primitives
  6. Event Handling
  7. Start Our Game
  8. Game Loop
  9. PLAY

Overview

We last left off writing a graphics driver and an event driver for our operating system. We also added several system calls to handle drawing primitives as well as handling keyboard and mouse inputs. We are now going to use those to animate the simple game of pong. Just like hello world is the test of every programming language, pong is a test of all of our graphics and event systems.


Application Programmer’s Interface (API)

Since we’re writing in user space, we need to use the system calls we developed to (1) enumerate the framebuffer, (2) invalidate sections of the framebuffer, (3) read keyboard/click events, and (4) read motion events–mouse or tablet.

For Rust, we are still using the riscv64gc-unknown-none-elf target, which doesn’t contain a runtime. Now, riscv64gc-unknown-linux-gnu does exist; however, it uses Linux system calls, which we haven’t implemented yet. So, we are required to create our own system call interface as well as our own runtime!


Starting Routines

Many programmers know that ELF allows our entry point to start anywhere. However, most linkers will look for a symbol called _start. So, in Rust, we have to create this symbol. Unfortunately, after many attempts, I could not get it to work fully in Rust. So, instead, I used the global_asm! macro to import an assembly file. All it does is call main and then invoke the exit system call when main returns.

.section .text.init
.global _start
_start:
.option push
.option norelax
	la	gp, __global_pointer$
.option pop
	li		a0, 0
	li		a1, 0
	call	main
	# Exit system call after main
	li	a7, 93
	ecall
.type _start, function
.size _start, .-_start

I failed to mention the global pointer. Since global variables can be stored far from the code that uses them, we dedicate a register that points into the global data area, so globals can be reached with a single offset. Luckily, most linkers provide a symbol called __global_pointer$ (yes, with the $) that we load into the gp (global pointer) register. I’m not going to cover relaxation here, but the .option norelax directives are necessary so the assembler doesn’t relax the la into a gp-relative reference before gp has actually been set.

This is all that we need to get into Rust. However, since we’re still in baremetal Rust, we have to define the same symbols we did for our OS, including the panic and abort handlers.

#![no_std]
#![feature(asm, panic_info_message, lang_items, start, global_asm)]

#[lang = "eh_personality"]
extern "C" fn eh_personality() {}

#[panic_handler]
fn panic(info: &core::panic::PanicInfo) -> ! {
	print!("Aborting: ");
	if let Some(p) = info.location() {
		println!("line {}, file {}: {}", p.line(), p.file(), info.message().unwrap());
	} else {
		println!("no information available.");
	}
	abort();
}
#[no_mangle]
extern "C" fn abort() -> ! {
	loop {
		unsafe {
			asm!("wfi");
		}
	}
}

This allows our code to at least compile, but as you can see we need to import the assembly file as well as define a main.

global_asm!(include_str!("start.S"));
#[start]
fn main(_argc: isize, _argv: *const *const u8) -> isize {
    0
}

Now that we have an entry point for Rust, we can create the println and print macros to make our environment feel more Rusty.

#[macro_export]
macro_rules! print
{
	($($args:tt)+) => ({
			use core::fmt::Write;
			let _ = write!(crate::syscall::Writer, $($args)+);
			});
}
#[macro_export]
macro_rules! println
{
	() => ({
		   print!("\r\n")
		   });
	($fmt:expr) => ({
			print!(concat!($fmt, "\r\n"))
			});
	($fmt:expr, $($args:tt)+) => ({
			print!(concat!($fmt, "\r\n"), $($args)+)
			});
}

System Calls

We need to create an API for system calls, since that will be how we get and put certain data to our operating system. Generally, the runtime will establish this for us, but again, we’re in baremetal Rust.

use core::fmt::{Write, Error};
use crate::event::Event;

pub struct Writer;

impl Write for Writer {
	fn write_str(&mut self, out: &str) -> Result<(), Error> {
		for c in out.bytes() {
			putchar(c);
		}
		Ok(())
	}
}

pub fn putchar(c: u8) -> usize {
    syscall(2, c as usize, 0, 0, 0, 0, 0, 0)
}

pub fn sleep(tm: usize) {
    let _ = syscall(10, tm, 0, 0, 0, 0, 0, 0);
}

pub fn get_fb(which_fb: usize) -> usize {
    syscall(1000, which_fb, 0, 0, 0, 0, 0, 0)
}

pub fn inv_rect(d: usize, x: usize, y: usize, w: usize, h: usize) {
    let _ = syscall(1001, d, x, y, w, h, 0, 0);
} 

pub fn get_keys(x: *mut Event, y: usize) -> usize {	
    syscall(1002, x as usize, y, 0, 0, 0, 0, 0)
}

pub fn get_abs(x: *mut Event, y: usize) -> usize {	
    syscall(1004, x as usize, y, 0, 0, 0, 0, 0)
}

pub fn get_time() -> usize {
     syscall(1062, 0, 0, 0, 0, 0, 0, 0)
}

pub fn syscall(sysno: usize, a0: usize, a1: usize, a2: usize, a3: usize, a4: usize, a5: usize, a6: usize) -> usize {
    let ret;
    unsafe {
    asm!("ecall",
        in ("a7") sysno,
        in ("a0") a0,
        in ("a1") a1,
        in ("a2") a2,
        in ("a3") a3,
        in ("a4") a4,
        in ("a5") a5,
        in ("a6") a6,
        lateout("a0") ret);
    }
    ret
}

MAN, I love the new Rust asm! It is very clear what’s happening when you look at it. Usually when I use inline assembly, register selection is extremely important. This makes it rather foolproof–dare I actually say that?

Just like we did for our operating system, we use the Write trait to hook into the format macro that is given to us by Rust.


Drawing Primitives

Ok, with the system calls and the startup code in start.S, we now have everything we need to start programming the game. So, let’s make some drawing primitives. Since this is pong, we only really care about drawing rectangles.

#[repr(C)]
#[derive(Clone,Copy)]
pub struct Pixel {
    pub r: u8,
    pub g: u8,
    pub b: u8,
    pub a: u8,
}

pub type Color = Pixel;

impl Pixel {
    pub fn new(r: u8, g: u8, b: u8) -> Self {
        Self {
            r, g, b, a: 255
        }
    }
}

pub struct Vector {
    pub x: i32,
    pub y: i32
}

impl Vector {
    pub fn new(x: i32, y: i32) -> Self {
        Self {
            x,  y
        }
    }
}

pub struct Rectangle {
    pub x: i32,
    pub y: i32,
    pub width: i32,
    pub height: i32,
}

impl Rectangle {
    pub fn new(x: i32, y: i32, width: i32, height: i32) -> Self {
        Self {
            x, y, width, height
        }
    }
}
pub struct Framebuffer {
    pixels: *mut Pixel
}

impl Framebuffer {
    pub fn new(pixels: *mut Pixel) -> Self {
        Self { pixels }
    }
    pub fn set(&mut self, x: i32, y: i32, pixel: &Pixel) {
        unsafe {
            // Also reject negative coordinates, or we'd index out of bounds.
            if x >= 0 && x < 640 && y >= 0 && y < 480 {
                let v = (y * 640 + x) as isize;
                self.pixels.offset(v).write(*pixel);
            }
        }
    }
    pub fn fill_rect(&mut self, rect: &Rectangle, color: &Pixel) {
        let row_start = rect.y;
        let row_finish = row_start + rect.height;
        let col_start = rect.x;
        let col_finish = col_start + rect.width;
        for row in row_start..row_finish {
            for col in col_start..col_finish {
                self.set(col, row, color);
            }
        }
    }
}

pub fn lerp(value: i32, mx1: i32, mx2: i32) -> i32 {
    // Map value proportionally from [0, mx1] onto [0, mx2].
    // Multiply before truncating; casting r to i32 first would almost always give 0.
    let r = (value as f64) / (mx1 as f64);
    (r * mx2 as f64) as i32
}

Now we have a pixel structure, a vector, a rectangle, and a framebuffer. The pixel layout comes from the operating system, so it is important that we control the structure’s exact layout, which is what #[repr(C)] guarantees.


Event Handling

We can draw, so now we need to be able to handle input. We can create an event structure to handle this.

#[repr(C)]
#[derive(Copy, Clone)]
pub struct Event {
    pub event_type: u16,
	pub code: u16,
	pub value: u32
} 

impl Event {
	pub fn empty() -> Self {
		Self {
			event_type: 0,
			code: 0,
			value: 0
		}
	}
	pub fn new(event_type: u16, code: u16, value: u32) -> Self {
		Self {
			event_type,
			code,
			value
		}
	}
}

// Key codes
pub const KEY_RESERVED: u16 = 0;
pub const KEY_ESC: u16 = 1;
pub const KEY_1: u16 = 2;
pub const KEY_2: u16 = 3;
// ... CLIP ...
pub const KEY_END: u16 = 107;
pub const KEY_DOWN: u16 = 108;

// mouse buttons

pub const BTN_MOUSE: u16 = 0x110;
pub const BTN_LEFT: u16 = 0x110;
pub const BTN_RIGHT: u16 = 0x111;
pub const BTN_MIDDLE: u16 = 0x112;


// mouse movement

pub const ABS_X: u16 = 0x00;
pub const ABS_Y: u16 = 0x01;
pub const ABS_Z: u16 = 0x02;

Many of the constants came from libevdev’s input-event-codes.h. Unfortunately, that’s C, so a little Python script converted them into Rust.

Just like the Pixel structure, the Event structure’s layout is defined by our operating system, so we are required to control it ourselves (hence #[repr(C)]).


Start Our Game

const MAX_EVENTS: usize = 25;
const GAME_FRAME_TIMER: usize = 1000;

#[start]
fn main(_argc: isize, _argv: *const *const u8) -> isize {
	use drawing::Framebuffer;
	use drawing::Pixel;

	let ufb = syscall::get_fb(6) as *mut Pixel;
	let mut fb = Framebuffer::new(ufb);
	let background_color = drawing::Pixel::new(25, 36, 100);
	let mut event_list = [event::Event::empty(); MAX_EVENTS];
	let event_list_ptr = event_list.as_mut_ptr();

	let player_color = drawing::Pixel::new(255, 0, 0);
	let npc_color = drawing::Pixel::new(0, 255, 0);
	let ball_color = drawing::Pixel::new(255, 255, 255);

	let mut game = pong::Pong::new(player_color, npc_color, ball_color, background_color);
	// GAME LOOP HERE
	println!("Goodbye :)");
	0
}

You can see the first thing we do is grab a framebuffer. If you recall from the operating system tutorial, our operating system will map the pixels into our application’s memory space. We can update the pixels as we see fit, but to actually realize it on the screen, we must invalidate the given pixels, which is a separate system call in our OS.

You can see we have a structure called Pong. This structure contains the routines we need to make this something of a game. We have a timing function called advance_frame, and we have a way to get the game onto the screen using the draw function.

This is pretty gross, but here’s the Pong structure’s advance frame and draw routines (the rest can be found on GitHub):

impl Pong {
	pub fn advance_frame(&mut self) {
		if !self.paused {
            self.move_ball(self.ball_direction.x, self.ball_direction.y);
			let miss = 
			if self.ball.location.x < 40 {
				// This means we're in the player's paddle location. Let's
				// see if this is a hit or a miss!
				let paddle = (self.player.location.y, self.player.location.y + self.player.location.height);
				let ball = (self.ball.location.y, self.ball.location.y + self.ball.location.height);

				if paddle.0 <= ball.0 && paddle.1 >= ball.0 {
					false
				}
				else if paddle.0 <= ball.1 && paddle.1 >= ball.1 {
					false
				}
				else {
					true
				}
			}
			else {
				false
			};
			if miss {
				self.reset();
				self.paused = true;
			}
			else {
				if self.ball.location.x < 40 || self.ball.location.x > 580 {
					self.ball_direction.x = -self.ball_direction.x;
				}
				if self.ball.location.y < 20 || self.ball.location.y > 430 {
					self.ball_direction.y = -self.ball_direction.y;
				}
				let new_loc = self.ball.location.y - self.npc.location.height / 2;
				self.npc.location.y = if new_loc > 0 { new_loc } else { 0 };
			}
		}
	}
	pub fn draw(&self, fb: &mut Framebuffer) {
		fb.fill_rect(&Rectangle::new(0, 0, 640, 480), &self.bgcolor);
		fb.fill_rect(&self.player.location, &self.player.color);
		fb.fill_rect(&self.npc.location, &self.npc.color);
		fb.fill_rect(&self.ball.location, &self.ball.color);
	}
}

So our game will move the ball given the direction vector every frame due to advance_frame being called from our game loop. We also have the ability to pause the game (or unpause). Finally, we have very basic collision detection for the player’s paddle.


Game Loop

Now we need to put all of this together using a game loop, which will handle input events as well as advance the animation.

  // GAME LOOP
  'gameloop: loop {
		// handle mouse buttons and keyboard inputs
		// println!("Try get keys");
		let num_events = syscall::get_keys(event_list_ptr, MAX_EVENTS);
		for e in 0..num_events {
			let ref ev = event_list[e];
			// println!("Key {}  Value {}", ev.code, ev.value);
			// Value = 1 if key is PRESSED or 0 if RELEASED
			match ev.code {
				event::KEY_Q => break 'gameloop,
				event::KEY_R => game.reset(),
				event::KEY_W | event::KEY_UP => game.move_player(-20),
				event::KEY_S | event::KEY_DOWN => game.move_player(20),
				event::KEY_SPACE => if ev.value == 1 { 
					game.toggle_pause();
					if game.is_paused() {
						println!("GAME PAUSED");
					}
					else {
						println!("GAME UNPAUSED")
					}
				},
				_ => {}
			}
		}
		game.advance_frame();
		game.draw(&mut fb);
		syscall::inv_rect(6, 0, 0, 640, 480);
		syscall::sleep(GAME_FRAME_TIMER);
	}

PLAY

Our event loop uses W (or the up arrow) to move the paddle up and S (or the down arrow) to move it down. Space toggles the pause, R resets the game, and Q quits.

Hmmm. When I started writing this, it seemed more impressive in my mind. Oh well, have fun!

Why did I catch s*** for my use of the RISC-V SFENCE.VMA instruction?

This post is part of a longer OS tutorial which can be found here: https://osblog.stephenmarz.com

Contents

  1. Introduction
  2. What is SATP?
  3. What is SFENCE.VMA?
  4. What is happening?
  5. The Translation Lookaside Buffer
  6. Conclusion
  7. References

Introduction

My last post garnered some attention from those telling me that I “forgot” to execute an SFENCE.VMA after writing to the SATP register–some with more tact than others. This post is here to clarify why I did what I did, and to clarify what the specification actually requires.


What is SATP?

The supervisor address translation and protection (SATP) register is a register that tells the MMU what mode it’s in, what address space it is working in, and where to find the first level page table in RAM (this would be level 2 for Sv39).

The SATP Register Fields

The SATP register stores three pieces of information: the MODE, the address space identifier (ASID), and the physical page number (PPN).

The MODE

If MODE=0, then the MMU is turned off, and addresses are not translated (virtual addresses are used directly as physical addresses).

If MODE=8, we’re in Sv39 (Supervisor 39-bit) mode which means that our virtual addresses are 39-bits long.

There are other modes, but I won’t cover them here.

The ASID

The address space identifier tags translations with a per-address-space identifier, so entries from different page tables can coexist in the TLB.

The PPN

The physical page number is the upper 44 bits of a 56-bit physical address; shifting it left by 12 gives the address where we can find the first-level page table.


What is SFENCE.VMA?

Spelled out, the name means supervisor fence, virtual memory address. In real terms, it flushes the translation caches so that the MMU will “see” new changes in memory. This means the MMU will be forced to go look in RAM, where the page tables are stored.

The SFENCE.VMA instruction has several different ways to execute it, as shown in the specification. This should clue us in that executing SFENCE.VMA every time we write to SATP might not be so cut and dried.


What is Happening?

So, why is this not straightforward? The issue is that “walking” a page table–meaning going into RAM and translating a virtual address into a physical address–is not a fast process. There are multiple levels of page tables, and several 8-byte entries that need to be dereferenced by the memory controller.

We can speed these walks up by “caching” frequently translated addresses in a table. The table holds pairs of virtual page numbers and physical page numbers, so translation becomes a table lookup instead of dereferencing several levels of page tables.

This caching can be speculative. If the MMU doesn’t speculate, then the first time we translate an address, that address will not be in the TLB (the cache table) and we will get what is known as a compulsory miss. If we speculate, we can predict the addresses that will be used and when the memory controller isn’t doing anything else, we can load the virtual address and physical address into cache.

This speculation is one of the reasons for SFENCE.VMA. Another reason is due to the fact that when we translate a page using the MMU, it stores the most recent translations in the TLB as well.


The Translation Lookaside Buffer

The translation lookaside buffer, or TLB, is a fancy term for the MMU cache. It stores the most recent translations to exploit temporal locality–that is, we are likely to translate the same address again in the near future. So, instead of walking the page tables all over again, we just look in the TLB.

The TLB has several entries, and with RISC-V, each entry can be tagged with an address space identifier (ASID). The ASID allows the TLB to hold entries from more than just the most recently used page table. This has always been a problem with TLBs, including on Intel/AMD processors: writing to the x86 MMU register (called CR3, for control register 3) causes a TLB flush. That is NOT the case when writing to the SATP register (the MMU register in RISC-V).

The specification just gives the general rules for a manufacturer to follow. Therefore, a manufacturer can implement its MMU and TLB however it wants, as long as it complies with RISC-V’s privileged specification.


Conclusion

The RISC-V specification doesn’t make it very clear, but you can see clarification on the spec’s GitHub repository. If the MODE is not 0, then the MMU is allowed to speculate, meaning it can pre-populate the TLB based on the addresses it thinks will need to be translated in the near future. The specification allows this, but the MMU cannot raise a page fault if a speculative translation turns out to be invalid.

So, bottom line: SFENCE.VMA should NOT be executed every time SATP is changed. That would cause TLB thrashing, since every context switch requires changing the SATP register to the kernel page table, scheduling, and then changing SATP to the newly scheduled process’s page table.

Instead, the SFENCE.VMA instruction should be invoked when one or more of the following occur:

  1. When software recycles an ASID (i.e., reassociates it with a different page table), it should first change satp to point to the new page table using the recycled ASID, then execute SFENCE.VMA with rs1=x0 and rs2 set to the recycled ASID. Alternatively, software can execute the same SFENCE.VMA instruction while a different ASID is loaded into satp, provided the next time satp is loaded with the recycled ASID, it is simultaneously loaded with the new page table.
  2. If the implementation does not provide ASIDs, or software chooses to always use ASID 0, then after every satp write, software should execute SFENCE.VMA with rs1=x0. In the common case that no global translations have been modified, rs2 should be set to a register other than x0 but which contains the value zero, so that global translations are not flushed.
  3. If software modifies a non-leaf PTE, it should execute SFENCE.VMA with rs1=x0. If any PTE along the traversal path had its G bit set, rs2 must be x0; otherwise, rs2 should be set to the ASID for which the translation is being modified.
  4. If software modifies a leaf PTE, it should execute SFENCE.VMA with rs1 set to a virtual address within the page. If any PTE along the traversal path had its G bit set, rs2 must be x0; otherwise, rs2 should be set to the ASID for which the translation is being modified.
  5. For the special cases of increasing the permissions on a leaf PTE and changing an invalid PTE to a valid leaf, software may choose to execute the SFENCE.VMA lazily. After modifying the PTE but before executing SFENCE.VMA, either the new or old permissions will be used. In the latter case, a page fault exception might occur, at which point software should execute SFENCE.VMA in accordance with the previous bullet point.

Unfortunately, you have to dig through the issues and updates to the specification on GitHub to find out some of this information. I have provided links in references below.


References

Dynamic Linking

Static vs dynamic linking usually has to do with our tolerance for the size of our final executable. A static executable contains all code necessary to run the executable, so the operating system loads the executable into memory, and it’s off to the races. However, if we keep duplicating code over and over again, such as printf, then it starts to use up more and more space. So, a dynamic executable means that we only store stubs in the executable. Whenever we want to access printf, it goes out to a dynamic linker and loads the code essentially on demand. So, we sacrifice a tiny bit of speed for a much smaller executable.


Contents

  1. What is Linking
  2. Finding Symbols
  3. Static Libraries (archives)
  4. Dynamic Libraries (shared objects)
  5. Analyzing a Shared Program
  6. Unresolved Symbols at Run Time
  7. Procedure Linkage Table (plt)
  8. Global Offset Table (got)
  9. Loading and Updating the GOT
  10. Conclusion and Further Reading
  11. Video

What is Linking?

When I hear someone talk about compiling their program into an executable, they are really glossing over several stages. To compile is to produce something by combining information collected from different sources. In computing, we generally think of compiling as turning a higher-level language, such as C, into lower-level code, such as assembly.

The final stage before we get an executable is the linking stage. This is where we link (hence the name) all sources together to produce one coherent executable. This is also where all outstanding symbols need to be resolved. A symbol is just a fancy term for the name of a function or global variable.

We can take a look at object code, which is what we get after we assemble but before we link. Here’s an example program we’ll assemble and inspect.

#include <stdio.h>
#include <stdlib.h>
void some_func(int a, int b);

int main(int argc, char *argv[]) {
	if (argc < 3) {
		printf("Not enough arguments.\n");
		return -1;
	}

	int a = atoi(argv[1]);
	int b = atoi(argv[2]);

	some_func(a, b);

	return 0;
}

This code produces the following object code, which I disassemble using objdump for RISC-V.

Notice that the function some_func has been prototyped, but has not been defined. This will be the responsibility of the linker to find the symbol some_func and add it into our program. Notice what happens when I try to link this program without ever defining some_func.

/opt/riscv_1/lib/gcc/riscv64-unknown-linux-gnu/9.2.0/../../../../riscv64-unknown-linux-gnu/bin/ld: /tmp/ccmgFc75.o:
in function `.L2': test.c:(.text+0x62): undefined reference to `some_func'
collect2: error: ld returned 1 exit status

The linker is looking for the symbol some_func, but it cannot find it, so we get an undefined reference. We know this is at the linker stage because the error is “ld returned 1 exit status”. The “ld” means “linker”.

We can also see that the address of the function main is 0; this is because we haven’t linked the program yet. Our object code just contains bubbles of code, which the linker will place into our executable at their final locations.


Finding Symbols

If we use the command nm, which is used to list symbols in an object file, we can see all of the unresolved symbols. Our object code is looking for these, but it doesn’t need to know where they are until we have a full executable program.

You can see that in this object file, the linker has a little bit of work to do. It must find atoi, puts, and some_func, which are flagged as U for undefined symbols. When we invoke the linker, we will specify certain libraries, such as -lc (the C library), in which most of these symbols will be found. Our some_func has never been defined, so our linker cannot succeed until we define it somewhere.


Static Libraries (archives)

Archive files generally end in .a and contain code and other information that will be added to the final executable. Essentially, archive files are just a collection of object files into one file. Therefore, when we link to an archive file, we extract the actual code from the object code into our executable. The nice thing about archive files is that we don’t need to add ALL symbols into our executable–we only need those symbols that need to be used.

Usually, we have archive versions of all libraries to allow for static executables. These executables are those that don’t need any additional loading to run. In other words, these are self-contained files. The linker stage will pull in the code directly from the .a file into the executable, and all is done. The more code that the linker pulls in, the larger the executable will be.

Let’s go ahead and define some_func and see how linking works.

#include <stdio.h>

void some_func(int a, int b)
{
	printf("Data = %d\n", a * b);
}

It’s very simple, but the point is to have some code that the linker can pull in. Recall that the linker gave us an “undefined reference” error because we didn’t define some_func. Now, we have it defined, so let’s see what happens.

riscv64-unknown-linux-gnu-gcc -static -o test test2.o test.o

Notice that I’m using gcc still. This will automatically invoke the linker and pull in the necessary start files. If we invoke the linker directly, we have to define _start and tell the program how to start, or we would have to directly specify the start files.

I specify -static so that gcc will pull in only .a files. When we are done, the file test will be a fully-contained, static executable. If we take a look at this file, we can see that all of the “stuff” we need for this executable makes it a fairly large file.

-rwxr-xr-x 1 smarz smarz 4687920 Jun 1 09:16 test

Yes, that’s 4,687,920 bytes, or about 4.7 megabytes. However, if we look at the symbol table, we will find that NO symbol is unresolved, and therefore this is a self-contained executable. If we load this with our ELF loader, no external resources will need to be pulled in.

Our linker must exhaustively follow every possible route and pull in those symbols even though they may never be called. We can see the symbol table is enormous due to all of the calls and global variables (such as errno).

00000000000101b0 T abort
000000000006bb38 S __abort_msg
0000000000029aa8 t add_alias2.isra.0.part.0
00000000000497c6 t add_fdes
00000000000297ea t add_module.isra.0
000000000003aa4e t add_name_to_object.isra.0
000000000003ab5c t add_path.isra.0.constprop.0
000000000006b8d0 d adds.8114
000000000002fe2e T __add_to_environ
00000000000431b6 t add_to_global
000000000001adb4 t adjust_wide_data
000000000006bba0 V __after_morecore_hook
0000000000012d3a t alias_compare
000000000002338a W aligned_alloc
000000000006bb80 s aligned_heap_area
000000000003ff04 t allocate_dtv
000000000003a40c T __alloc_dir
000000000006ca38 b any_objects_regis

Dynamic Libraries (shared objects)

Dynamic libraries end in .so, which stands for shared object. These libraries contain code that will not be added directly into the executable. Instead, a program called the dynamic linker will be responsible for taking code from the .so files and adding it to an executing program. We can also load symbols ourselves at run time using the -ldl (dynamic loading) library.

We can think of the term dynamic as run-time. That is, we don’t actually load the code into our program until the program actually runs. We can see this with something as simple as printf. We can examine our executable and we don’t see printf’s code. Instead, we see printf’s stub. This stub will then be replaced by the dynamic loader when the program executes.

When we link with a dynamic, shared object, the symbols that can be added at run time will remain unresolved. Their names are put into a table just so we know they’re out there somewhere. However, with shared objects, we can now hit an unresolved reference (or symbol) error at run time! If you run arch-linux and have ever compiled anything yourself, you might’ve run into this phenomenon.


Making a Shared Library

Let’s go ahead and turn our test2 file into a shared object. This is fairly easy to do with gcc:

riscv64-unknown-linux-gnu-gcc -fPIC -shared -o libtest2.so test2.o

The switch -fPIC stands for position-independent code. That means the code cannot rely on being loaded at a fixed address; references that leave the library go through a table of offsets instead. This is required since the library, or specific symbols in the library, can be loaded at any location by the dynamic linker.

Then, we can compile our test program to link to this shared library:

riscv64-unknown-linux-gnu-gcc -o test test.o -L. -ltest2

In the command above, I specify the library search path with -L, so -L. means to look in the current directory. Then, I specify the library to link with using -ltest2. GCC will automatically prepend lib and append .so to make libtest2.so.


Analyzing A Shared Program

First, recall that we compiled in only the name of our library, not its location. We can see the shared libraries an executable needs using the ldd command.

smarz@DOMO:~ $ ./test
./test: error while loading shared libraries: libtest2.so: cannot open shared object file: No such file or directory

So, when I run my program, it looks at a certain path to find your libraries. This is analogous to the PATH environment variable, except in Linux (and some UNIXes), we use LD_LIBRARY_PATH. Shown below, if I change my path so the dynamic linker can find my library, it functions properly:

smarz@DOMO:~ $ LD_LIBRARY_PATH=/home/smarz ./test 10 20
Data = 200

Unresolved at Run Time

When we link a program, many of the symbols will remain unresolved. This tells us what symbols the dynamic linker is responsible for loading from the shared objects (libraries). We can see these unresolved symbols using the nm command:

Partial list of symbols in our test program

You can see that puts is unresolved (capital U letter), and so is some_func, since we now put it in a shared library. These symbols have stubs that will eventually invoke the interpreter to load the actual code for that function.

The dynamic linker’s job is to resolve these symbols when we approach them. In fact, we can take a look at the produced ELF executable to see what it wants to use to resolve symbols.

You can see the INTERP section (interpreter), which requests the dynamic linker by name. If we run this dynamic linker, we can see its purpose.

To resolve these symbols, the linker puts a loader stub in place of the actual function. When we run our program, some_func’s assembly instructions are nowhere to be found in our program. Instead, when we make a function call to some_func, it refers us to the procedure linkage table (PLT). The procedure linkage table then looks up where the function is loaded by referring to the global offset table (GOT). At first, the address in the GOT points at the code that loads the function (the dynamic linker). After the dynamic linker loads our some_func code, it changes the GOT entry to reflect where it put some_func in memory.

A flow chart representing dynamically linked functions.

Procedure Linkage Table (PLT)

If we take a look at the functions and other symbols we’re going to use in our program, we notice that they’re very, very short, and they look almost identical to each other. This is because each one is a stub that jumps through the procedure linkage table, the PLT.

Our some_func loads from -1028(t3) and jumps to that location. The value of t3 is the global offset table (GOT), which I cover below. We can see that -1028(t3) is the address 12018, which is right in the meat of the GOT.

So, we go to the stub, which loads an address from the global offset table. If the procedure has been called before, that GOT entry already holds the address of the real code for the procedure, such as puts or some_func. Otherwise, it points back at a stub routine whose job is to invoke the interpreter and load the code.


Global Offset Table (GOT)

The global offset table is a large table of addresses that our stubs load through. Here is the table in our test program.

Don’t worry about the assembly instructions–these aren’t instructions, but objdump is trying to disassemble them anyway.


Loading and Updating the GOT

Notice that all functions originally point to 0x0000_0000_0001_04a0, which goes back to the procedure linkage table. This is where the interpreter routine is located which loads the symbols. When the symbol is loaded by the interpreter, it will modify the global offset table to reflect where it loaded it into memory. We can look at this by directly referencing the memory being used.

Let’s modify our program to see what we get here:

#include <stdio.h>
#include <stdlib.h>
void some_func(int a, int b);

int main(int argc, char *argv[]) {
	if (argc < 3) {
		printf("Not enough arguments.\n");
		return -1;
	}

	int a = atoi(argv[1]);
	int b = atoi(argv[2]);
	volatile unsigned long *ptr = (unsigned long *)0x12018;

	printf("0x%016lx = 0x%016lx\n", (unsigned long)ptr, *ptr);

	some_func(a, b);

	printf("0x%016lx = 0x%016lx (some_func is at 0x%016lx)\n", (unsigned long)ptr, *ptr, (unsigned long)some_func);

	return 0;
}

In our modified program, we look at 0x12018, which was where our some_func stub was located in the global offset table (remember some_func@plt referred us to this location in the GOT).

Notice that at first the global offset table points to 0x104e0, which is the code used to invoke the interpreter which in turn loads the symbols. Notice that when we load some_func, the global offset table has been updated with the memory address 0x400_0081_b460. This is where our dynamic linker loaded the code for our some_func function.

Since I updated the function, some of the offsets are no longer what they were for our original program. However, to show you that the some_func function itself points to the procedure linkage table, let’s take a look at the updated object dump:

When I print out some_func’s memory address, we get 0x10510, which is where it is located in the procedure linkage table. However, if we look at the table, notice that it grabs the address of the actual function from the global offset table. In this case, the stub is at 0x104e0. After some_func is loaded (after its first invocation), the global offset table is updated to reflect 0x400081b460, which is where the actual assembly instructions are located for some_func.


Conclusion

If we break down what’s happening here, there is no magic happening. For dynamic symbols, we just have a table that stores their addresses. The table’s location is fixed, so we can compile that directly into our program. However, what the table refers to is dynamic, so it can be updated by the dynamic linker ld.so with wherever it put our code. That’s it!



Assembly’s Perspective of C

Posted to: https://blog.stephenmarz.com/2020/05/20/assemblys-perspective/

If you are living under a rock, quarantined against COVID-19, here’s quite a long, and somewhat technical, post for your consumption. This post will talk about how C and assembly dance together–the tango? waltz? who knows?

I usually write about operating systems, but I have not been able to write an operating system completely in one language. It probably can be done, but for me, the bootloader and some utilities are written purely in assembly.


Contents

  1. How C Relates to Assembly
  2. Why Assembly?
  3. C Calling Conventions
  4. Register Sizes
  5. Caller vs Callee saved registers
  6. C and Assembly Code
  7. Pointers and References
  8. Structures
  9. Arrays
  10. Globals
  11. Conditions and loops
  12. Data types
  13. Conclusion

How C Relates to Assembly

I will be using the RISC-V architecture to describe how C translates into assembly. The nice thing about C is that it has relatively few abstractions, so we can take a statement of C and translate it into assembly.

I teach my students to develop their logic in a higher-level language (mainly C++) and then translate into assembly. Assembly requires you to keep several things in your head at once, even to write a simple for loop. Luckily, there are some macros in assembly to make it a bit easier.

Let’s not forget that when we compile C code, we get assembly files. You can see the intermediate assembly file by using the -S switch to gcc. Otherwise, gcc will typically put the assembly file in the /tmp/ directory.

gcc -S test.c

Technically, C doesn’t have to compile into assembly, but as far as I know, nearly all C compilers emit assembly (or some other lower-level language), and not machine code (object code) directly.

So, to get C code into machine instructions, we follow a pipeline: the compiler turns C source into assembly, the assembler turns assembly into object code, and the linker combines object files into an executable.

An executable is a special type of machine code. Executables are usually in a specific format that the operating system can understand. However, inside of the executable is a section called the text section. This is where the machine code is stored. In Linux, this is called an executable and linkable format or ELF file. In Windows, this is called a portable executable or PE.


Why Assembly?

Assembly itself has several abstractions on top of the raw instructions, including macros, labels, and pseudo-instructions. Yes, that’s right, assembly will do many things for you. However, the nice thing about assembly is that one instruction more-or-less is really one instruction (except maybe with pseudo-instructions). That is, we don’t let any compiler tell us how it wants to do things. If we want to make the addition operator (+) preferred over the multiplication operator (*), we can do it. Why would we do that? Uhh.. Ask someone else.

Another reason we use assembly over a high-level language is to use instructions that C or C++ cannot really use. For example, the advanced vector extensions (AVX) for an Intel processor don’t have a 1-to-1 conversion from C. Now, many compilers have done some work with this using what are known as intrinsics. (EDITED): These intrinsics are assembly wrappers on top of C. So, we can’t get by with C doing the hard part for us. Whereas int i = 10 * z; will produce the assembly we need to get the result, the intrinsics are C syntax for assembly instructions. So, all in all, we still need to know what the assembly is doing.

An optimizing compiler might be able to make some improvements with the assembly, but let’s take a look at two different ways to do matrix multiplication using SSE on Intel. In one, we use the intrinsic instructions in C; in the other, we use assembly directly.

#include <stdio.h>
#include <pmmintrin.h>

void calc_intrin(float result[], float matrix[], float vector[]);
void calc_asm(float result[], float matrix[], float vector[]);

int main() {
   int row, col;
   float vec[] = {1.0, 10.0, 100.0, 1000.0};
   float mat[] = {2.0, 0.0, 0.0, 0.0,
		       0.0, 2.2, 0.0, 0.0,
		       0.0, 0.0, 22.2, 0.0,
		       0.0, 0.0, 0.0, 22.22};
   float result[4];

   calc_intrin(result, mat, vec);
   printf("%5.3f %5.3f %5.3f %5.3f\n", result[0], result[1], result[2], result[3]);

   calc_asm(result, mat, vec);
   printf("%5.3f %5.3f %5.3f %5.3f\n", result[0], result[1], result[2], result[3]);
   return 0;
}
void calc_intrin(float result[], float matrix[], float vector[])
{
   int row;
   __m128 vec = _mm_loadu_ps(vector);
   for (row = 0;row < 4;row++) {
      __m128 rowvec = _mm_loadu_ps(&matrix[row * 4]);
      __m128 rowvec2 = _mm_mul_ps(vec, rowvec);
      __m128 rowvec3 = _mm_hadd_ps(rowvec2, rowvec2);
      __m128 rowvec4 = _mm_hadd_ps(rowvec3, rowvec3);
      _mm_store_ss(&result[row], rowvec4);
   }
}

The code above multiplies a 4×4 matrix with a 4-element vector. However, if you look at how to multiply a matrix by a vector, you see that we can parallelize each row, which is what I’m doing up here. So, first, we loadu (load unaligned) using ps, which stands for packed single. This means that it will grab 4, 32-bit floats and put them into a single 128-bit register. Then we multiply the matrix row with the vector, which gives us the 4 element-wise products. Then, we need to add all four elements together to get a single value, which is the result of the vector at that element. Each haddps instruction adds adjacent pairs of elements together. To get all four elements added, we run the instruction twice.

You can see that in C, this looks fairly easy to do, but we still need knowledge of what the instruction is doing. If I write the assembly, as below, I can group certain operations and use the larger register set to my advantage. See below.

.intel_syntax noprefix
.section .text
.global calc_asm
calc_asm:
   movupd xmm0, [rsi + 0]
   movupd xmm1, [rsi + 16]
   movupd xmm2, [rsi + 32]
   movupd xmm3, [rsi + 48]

   movupd xmm4, [rdx]

   mulps xmm0, xmm4
   mulps xmm1, xmm4
   mulps xmm2, xmm4
   mulps xmm3, xmm4

   haddps xmm0, xmm0
   haddps xmm0, xmm0

   haddps xmm1, xmm1
   haddps xmm1, xmm1

   haddps xmm2, xmm2
   haddps xmm2, xmm2

   haddps xmm3, xmm3
   haddps xmm3, xmm3

   movss [rdi + 0], xmm0
   movss [rdi + 4], xmm1
   movss [rdi + 8], xmm2
   movss [rdi + 12], xmm3
   ret

The code above generates the following output:

Either way you slice it, we’re writing assembly, either in a .S file or as intrinsics, such as _mm_hadd_ps.


C Calling Conventions

C has a standard for most architectures called the ABI, which stands for application binary interface. We use this standard so that any language can talk to another, as well as the architecture, provided they use the standard. Some things about the C calling conventions include what registers get what values whenever we call a function. Furthermore, it specifies which registers can be destroyed by the called function and which must be preserved. In RISC-V, the actual RISC-V specification has an Assembly Programmer section, which defines the following for the registers.

You can see that the registers have two names, one that starts with x and the other that starts with a, s, t, and some others. These are referring to the same registers, but x just gives you a number for that register. The a stands for argument, the t stands for temporary, and the s stands for saved. These are called ABI names for the registers, and then the description tells you what the ABI recommends each register should do. Finally, notice that there is a column called Saver. Again, we have to know which registers we’re responsible for.

Register Sizes

The load and store instructions can have one of four suffixes: b, h, w, or d, which stand for byte, halfword, word, and doubleword, respectively. A word in RISC-V is 32 bits or 4 bytes.

Each register in RISC-V is either 32 bits (for RV32) or 64 bits (for RV64). So, even though we load a byte from memory, it is still being stored into a bigger register. With the vanilla load instructions (lb, lh, lw), the value is sign-extended from its memory size up to the full register width.

Caller vs Callee Saved

Saving registers is important because we call both scanf and printf. A caller saved register means that when we call scanf and printf, we are NOT guaranteed that the value we put in that register is the same as when the function returns. Most argument and temporary registers follow this convention.

A callee-saved register means that when we call scanf and printf, we are guaranteed that the value we put in these registers is the same when scanf and printf return. This means that scanf and printf are allowed to use the saved registers, but they are required to restore their old values before they return.

Generally, all argument and temporary registers are free to be destroyed by a function, but all saved registers must be saved. For all of this to work, C must be aware of what to do with these registers.


C and Assembly Code

When I teach my students how to encode (from assembly to machine) and decode (from machine to assembly) instructions, it is easy to see that each instruction encodes into exactly 32-bits in RISC-V (not counting the compressed instructions). However, I’m not exactly sure how I would teach that using C. Let’s take a simple C program and see how it would translate into RISC-V assembly.

#include <stdio.h>
int main() {
   int i;
   scanf("%d", &i);
   printf("You gave %d\n", i);
   return 0;
}

This translates into RISC-V assembly as:

.section .rodata
scanf_string: .asciz "%d"
printf_string: .asciz "You gave %d\n"

.section .text
.global main
main:
   addi sp, sp, -16
   sd ra, 0(sp)
   la a0, scanf_string
   addi a1, sp, 8  # a1 = &i (sp + 8)
   call scanf
   lw a1, 8(sp)
   la a0, printf_string
   call printf
   li a0, 0 # return 0 from main
   ld ra, 0(sp)
   addi sp, sp, 16
   jr ra # same as ret instruction

The code above produces the following output. I gave 100 as an input for scanf:

First, in C, all strings are read-only global constants. The directive .asciz above means that we have an ASCII string with a zero on the end of it. This is how C builds a string. So, C knows the starting memory address and goes character-by-character until it hits a zero. In other words, C-style strings are arrays of characters.

.section .rodata
scanf_string: .asciz "%d"
printf_string: .asciz "You gave %d\n"

So, the portion above labels these strings as scanf_string and printf_string. A label is just like a variable–it’s just a nice name for a memory location.

.section .text
.global main
main:

The .text section is also called the code section, and it is in this section we place CPU instructions. The directive .global main makes the label, main, viewable outside of this assembly file. This is necessary because the linker will look for main. Then, we have a label called main. You can see that a function is just simply a memory label where we can find the CPU instructions for that given function.

In C, every function must have a unique name. So, a C function’s name directly corresponds with the assembly’s memory label. In this case, main is the label for int main().

addi sp, sp, -16
sd ra, 0(sp)

The sp register stores the stack pointer. The stack is memory storage for local variables. Whenever we allocate memory from the stack, we decrement. Then, we own everything between what we decremented and the original value of the stack pointer. So, in the case above, main now owns sp – 16 through (but not including) sp. Again, sp is just a memory address.

The instruction sd stands for store-doubleword. In this case, a double word is 64 bits or 8 bytes. We are storing a register called ra which stands for return address. There is only one RA register, so whenever we call another function, such as scanf and printf, it will get overwritten. The purpose of this register is so printf and scanf can find their way back to main. Since these functions can be called from anywhere, it must have a return path…hence the return address.

la a0, scanf_string
addi a1, sp, 8
call scanf

Before we make a function call, we have to make sure the argument registers are set up. Luckily, RISC-V makes this easy. The first argument goes into register a0, the second into a1, and so forth. For scanf, the first argument is a pointer to a string. So, we use the la (load address) instruction, which is actually a pseudo-instruction, to load the address of scanf_string into the first argument register, a0.

The second parameter is &i in C, which is the memory address where we want to store what we scan. In this case, i is a local variable, so we store it on the stack. Since bytes 0, 1, 2, …, 7 are being taken by the RA register, the next available offset we have is 8. So, we can add sp + 8 to get the memory location where we can put the value we scan.

lw a1, 8(sp)
la a0, printf_string
call printf

The code above loads a word, which is 32 bits or 4 bytes, from sp + 8, which recall is where scanf scanned our integer. This goes into the second argument, a1. Then, we load the address of the printf_string and store that memory address into the register a0.

Pointers and References

What do pointers and references mean in assembly? In C, we know that a pointer is an 8-byte (or 4-byte, on 32-bit systems) value that stores a memory address. So, it’s like inception, where a memory address stores a memory address. In assembly, we get a memory address inside of our argument registers instead of a value. Here’s an example.

void get_input(int *value) {
   printf("Enter a value: ");
   scanf("%d", value);
}

Recall that the arguments are stored in a0, a1, …, a7. So, the int *value will be in a0.

.section .rodata
printf_string: .asciz "Enter a value: "
scanf_string: .asciz "%d"

.section .text
.global get_input
get_input:
   addi sp, sp, -16
   sd ra, 0(sp)
   sd a0, 8(sp)
   la a0, printf_string
   call printf
   la a0, scanf_string
   ld a1, 8(sp)
   call scanf
   ld ra, 0(sp)
   addi sp, sp, 16
   ret

Notice that we store a0 into 8(sp). This is a memory location. So, when we ld a1, 8(sp) for the scanf call, a1 will be populated with the memory address that value points to, and hence, scanf will store the value inside of there. Notice that even though it is an integer pointer, we use ld and sd (doubleword). This is because we’re not storing the value, we’re storing the memory address, which on a 64-bit machine is always 64 bits (8 bytes).

To demonstrate pointers, let’s do something simple, except let’s take references instead of values:

int work(int *left, int *right) {
   return *left + *right;
}

While we pass addresses around, nothing really changes. However, when we dereference to get the value out of the memory address, we must use a load instruction.

.section .text
.global work
work:
   # int *left is in a0
   # int *right is in a1
   # Use lw to dereference both left and right
   lw a0, (a0)
   lw a1, (a1)
   # Now, add them together
   add a0, a0, a1
   ret

Whatever we leave in the a0 register is what is returned. That’s why our ABI calls the a0 register the “function arguments / return value”.


Structures

A structure is just multiple pieces of data laid out in contiguous memory. The hard part comes from the fact that the C compiler will try to make memory accesses as efficient as possible by aligning fields to their sizes. I will show this below.

struct SomeStruct {
   int a;
   int b;
   int c;
};

Above, when we allocate memory for SomeStruct, we need 12 bytes (one integer is 4 bytes). In memory, bytes 0, 1, 2, 3 are taken by a, 4, 5, 6, 7 are taken by b, and 8, 9, 10, 11 are taken by c. So, in assembly, we can just look at the offsets 0, 4, and 8 and use lw (load word) to load from this structure.

So, let’s use this structure to see what it looks like in assembly.

int work(struct SomeStruct *s) {
   return s->a + s->b * s->c;
}

In assembly, we would do the following:

.section .text
.global work
work:
   # s->a is 0(a0)
   # s->b is 4(a0)
   # s->c is 8(a0)
   lw t0, 0(a0)  # s->a
   lw t1, 4(a0)  # s->b
   lw t2, 8(a0)  # s->c
   mul t1, t1, t2
   add a0, t0, t1
   ret

The a0 register stores the memory address where the structure starts. Then, we calculate offsets for each one of the elements a, b, and c. However, this is an easy example. There are two other examples we must explore: (1) heterogeneous data types and (2) structures in arrays.

Heterogeneous Data Types

What happens when we have multiple data types inside of a structure? The C compiler tries to make the memory controller’s work easy, so it will pad the structure so that when we access a field in a structure, it only takes one request from the memory controller. With heterogeneous data types, this means we have to waste a little bit of memory.

Take the following structure for example:

struct SomeStruct {
   char a;
   int b;
   short c;
};

The structure above has 1-byte, 4-byte, and 2-byte fields, so does that mean this structure is \(1+4+2=7\) bytes? The answer is no; the reason is padding. C will pad each element so that \(\text{offset}\%\text{size}=0\). That is, when we look at the offset of the element, it must be evenly divisible by the size of the field.

When I teach my students, I have them draw a name, offset, size table (NOS table). In here, we can see what’s going on, and it makes it easy to translate into assembly.

Name      Offset   Size
char a    0        1
int b     4        4
short c   8        2

We can see with the table above that int b starts at offset 4. What happened to bytes 1, 2, and 3? Those are wasted in the form of padding. Now that we can figure out the offset of each element, the following is translatable:

int work(struct SomeStruct *s) {
   return s->a + s->b * s->c;
}

We saw that with three integers in a structure, we had offsets 0, 4, and 8. Well, lookie here! They are the same. The only thing we need to change is how we load. We look at the size column. Recall that 1 byte is a byte (lb/sb), 4 bytes is a word (lw/sw) and 2 bytes is a halfword (lh/sh).

.section .text
.global work
work:
   lb t0, 0(a0)  # s->a
   lw t1, 4(a0)  # s->b
   lh t2, 8(a0)  # s->c
   mul t1, t1, t2
   add a0, t1, t0
   ret

Structures in Arrays

C allows any data structure to be stored in an array. An array is simply a contiguous sequence of like elements, and those elements can themselves be structures. The following diagram shows how an integer-only array with 5 elements would look in memory:

You can see the formula \(\text{address}=\text{base}+\text{offset}\times\text{size}\). So, if our base is a (let’s say 0 for now), then a[4] would be \(0+4\times 4=16\).

For structures, there can also be padding at the end of the structure. This tail padding exists so that every element of an array starts properly aligned; remember that we need offset % size to be zero for each field. To find the amount of padding at the end, we pad the structure’s total size until it is a multiple of the size of its largest field.

Let’s see how a structure would look in memory.

struct My {
	char a;
	long b;
	int *c;
};
struct My m[5];
m[0].a = 'A';
m[3].b = 22;

We have an array of five of these structures. Each element has fields a, b, and c. Below is a diagram of what our memory looks like:


Globals

Recall that C generally uses the stack for local variables. However, C allows for global variables. Globals are stored in one of three sections: .rodata, .data, or .bss. The .rodata (read only data) stores global constants, the .data stores global, initialized variables, and the .bss stores global, uninitialized variables or global variables initialized to 0.

The BSS section is the weirdest one here. This section is automatically cleared to 0 by the operating system. So, any uninitialized variable actually has a defined value–0.

We can see these sections in action by looking at a C example:

#include <stdio.h>
int data = 100;
const int rodata = 200;
int bss;
int main() {
   printf("%d %d %d\n", data, rodata, bss);
   return 0;
}

If we look at what this produces, we see the following:

We can see that our rodata constant has the memory address of 0x2004 and has the value 0x00_00_00_c8, which is 200, and our data variable has the memory address 0x4030 and has the value 0x00_00_00_64, which is 100. Finally, our bss variable is stored at memory address 0x4038 and will have the value 0. The BSS section in the file only records where the variable is located, not what is in it. The operating system itself will look at this section and assign all of its variables the value 0.


Conditions and Loops

Conditions are made by using a branch instruction. Some architectures support adding conditions to each instruction, but that is becoming more rare as architectures advance. Instead, we branch to a different label to either jump over code or jump into code. Let’s take a look at a for loop to see how it would look in assembly.

for (int i = 100;i >= 0;i--) {
   printf("%d\n", i);
}

The assembly equivalent would be something like the following:

.section .rodata
printf_string: .asciz "%d\n"
.section .text
.global main
main:
addi sp, sp, -16
sd s0, 0(sp)
sd ra, 8(sp)
li s0, 100 # int i = 100
1:
   blt s0, zero, 1f  # if i < 0, break the loop
   la a0, printf_string
   mv a1, s0
   call printf
   addi s0, s0, -1 # i--
   j 1b
1:
ld s0, 0(sp)
ld ra, 8(sp)
addi sp, sp, 16 # restore the stack pointer
li a0, 0 # return 0 from main
ret

So, if statements and loops conditionally execute code by using a branch instruction. In RISC-V, our branch instruction compares two registers. If those registers meet the condition, we go to the label (1f = label 1 forward, 1b = label 1 backwards). If the registers don’t meet the condition, the instruction has no effect and we move to the next instruction.


Data Types

The purpose of data types in C (when considering assembly) is to make sure we choose the correct load/store size (byte, halfword, word, and doubleword), however we also need to choose the correct operations. Most unsigned integers will zero-extend when a smaller data size is widened into a larger data size. For example:

unsigned char a = 100;
unsigned int b = (unsigned int)a;

The variable ‘a’ is only 8 bits, however it will widen to 32 bits. There are two ways we can widen a small data size into a larger data size. We can zero extend, by filling the newly added bits with zeroes, or we can sign extend, by taking the sign bit and replicating it over the new bits. The choice depends on whether the data type is signed or unsigned. However, in assembly, it actually requires a different instruction. For example:

lb t0, 0(sp)
lbu t1, 0(sp)

The ‘lb’ (load byte) instruction will sign extend an 8-bit value in sp + 0 into a 64-bit register t0. The ‘lbu’ (load byte unsigned) instruction will zero extend an 8-bit value in sp + 0 into a 64-bit register t1. In RISC-V, we’re always dealing with a 64-bit register. C will make the distinction when emitting instructions between char, short, int, and longs.

Another decision is what to do with a right shift (>>). There are actually two right shifts–logical and arithmetic. A logical right shift is used for unsigned data types, and zeroes are inserted in the most significant bit positions whenever bits are shifted right. An arithmetic right shift is used for signed data types, and the sign bit is duplicated every time we shift right.

unsigned int a = 64;
int b = -64;

a = a >> 1;
b = b >> 1;

The variable ‘a’ above will be logically right shifted one place, which will give us 32. The variable ‘b’ will be arithmetically right shifted one place. If we did a logical shift, this would turn into a positive number. Since we duplicated the sign bit, ‘b’ becomes -32.

li t0, 64
li t1, -64
srli t0, t0, 1 # logical shift right: t0 = 32
srai t1, t1, 1 # arithmetic shift right: t1 = -32

Conclusion and Further Reading

All in all, the compiler, when emitting its assembly code, must make these decisions. One data type of note is char. The C standard leaves it implementation-defined whether a plain char is signed or unsigned. In fact, I noticed a difference between GNU’s RISC-V toolchain and GNU’s Intel/AMD toolchain: one will sign extend where the other zero extends.

To see what a compiler is doing, it helps to look at the assembly code. GNU Binutils has objdump, which can disassemble machine code back into assembly code. And there may come a time when we have to read that disassembly to see what a certain compiler did with our C code. Who knows?

I was pretty light on the specifics when it comes to compiling. I recommend my colleague and friend’s blog about compilers here: http://austinhenley.com/