Author: Stephen Marz

The RISC-V APLIC’s New Features

Contents

  1. Repository
  2. Introduction
  3. The APLIC
  4. Conclusion

Repository

This blog series refers to the code written here: https://github.com/sgmarz/riscv_msi.

The APLIC specification (still in draft) is part of the Advanced Interrupt Architecture (AIA) specification, and it is kept here: https://github.com/riscv/riscv-aia.

I am using AIA specification version 0.3.0-draft.31 to write this article.


Introduction

The advanced platform level interrupt controller (APLIC) is an advanced version of SiFive’s PLIC. The main advancement is that it supports sending interrupts via message. So, when paired with an incoming MSI controller (IMSIC), the APLIC can send messages just like any other hardware device.

The main purpose of the original PLIC was to aggregate, prioritize, and send hardware interrupt signals. It provided a method for an operating system to claim, handle, and complete an interrupt request. The PLIC sent a notification to a specific HART (hardware thread — pretty much a CPU core) through an “external interrupt” pin. That HART would then determine what interrupted it by reading from a specific PLIC register called the claim register. This would return a number identifying the device. Usually, this number is a specific wire connecting a hardware device to the PLIC.


The Advanced Platform Level Interrupt Controller (APLIC)

The new APLIC is not backwards compatible with the SiFive PLIC. Some of the concepts and features are the same, but the register file and mappings are different. Technically, a system may be implemented without a full APLIC, but the AIA documentation specifically states that “[f]ull conformance to the Advanced Interrupt Architecture requires the APLIC.”

First, the APLIC registers are laid out as follows:

#[repr(C)] // The layout must match the hardware's register map exactly
struct Aplic {
    pub domaincfg: u32,           // Domain CSR that controls how this APLIC functions
    pub sourcecfg: [u32; 1023],   // Source configuration for 1023 interrupts
    _reserved1: [u8; 0xBC0],

    pub mmsiaddrcfg: u32,         // Machine-level MSI address (for APLIC to write MSIs)
    pub mmsiaddrcfgh: u32,
    pub smsiaddrcfg: u32,         // Supervisor-level MSI address
    pub smsiaddrcfgh: u32,
    _reserved2: [u8; 0x30],

    pub setip: [u32; 32],         // Bitset to set pending interrupts (32 IRQS per element)
    _reserved3: [u8; 92],

    pub setipnum: u32,            // Sets a pending interrupt by number
    _reserved4: [u8; 0x20],

    pub clrip: [u32; 32],         // Bitset to clear pending interrupts (opposite of setip)
    _reserved5: [u8; 92],

    pub clripnum: u32,            // Clears a pending interrupt by number
    _reserved6: [u8; 32],

    pub setie: [u32; 32],         // Bitset to enable interrupts
    _reserved7: [u8; 92],

    pub setienum: u32,            // Enable an interrupt by number
    _reserved8: [u8; 32],

    pub clrie: [u32; 32],         // Bitset to disable interrupts (opposite of setie)
    _reserved9: [u8; 92],

    pub clrienum: u32,            // Disable an interrupt by number
    _reserved10: [u8; 32],

    pub setipnum_le: u32,         // Set an interrupt pending by number always little end first
    pub setipnum_be: u32,         // Set an interrupt pending by number always big end first
    _reserved11: [u8; 4088],

    pub genmsi: u32,              // Used to generate MSIs
    pub target: [u32; 1023],      // Target control per interrupt
}

Domain Configuration Register (domaincfg)

Domain Configuration Register (32-bits)

There are three usable fields in this register: interrupt enable (IE, bit 8), delivery mode (DM, bit 2), and big endian (BE, bit 0).

The most significant byte of this register is, for all intents and purposes, a byte order mark. It is hardwired to 0x80 so that we can test whether the register reads big end first or little end first. For example, if we execute a “load word” (lw instruction in RISC-V) and the first byte we get back is 0x80, the machine is reading the register in big endian. If we don’t, it is little endian, and that first byte instead contains the delivery mode and big-endian bits.
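That check can be sketched in Rust. Here the `bytes` array stands in for the four bytes returned by the load, in address order; this helper is illustrative and is not part of the repo:

```rust
/// Decide a domain's byte order from a raw read of domaincfg.
/// `bytes` holds the four bytes at the register's address, in address
/// order. The APLIC hardwires 0x80 into the most significant byte as a
/// byte order mark, so if the lowest-addressed byte reads 0x80, the
/// domain is big endian; otherwise it is little endian.
fn domain_is_big_endian(bytes: [u8; 4]) -> bool {
    bytes[0] == 0x80
}
```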

The interrupt enable bit enables the APLIC to send interrupts (1 = enabled, 0 = disabled). This doesn’t mean that the interrupt will necessarily be heard, but instead, it only means the APLIC can send interrupts by triggering a pending bit.

The delivery mode bit allows the APLIC to be configured to send normal interrupts, like the old PLIC, or to send interrupts as messages (MSIs). If the DM bit is set to 0, it will send direct interrupts, like the old PLIC. If the DM bit is set to 1, it will send MSIs instead.

The big endian bit makes the APLIC write messages in big endian (BE = 1) or little endian (BE = 0). The BE bit also affects the byte order of the multibyte domain configuration register itself, which is why the most significant byte is set to 0x80 on purpose: it acts as a sort of byte order mark.

We can configure a Rust function to set the domain register. All fields in this register are Boolean (yes/no, true/false, 1/0).

impl Aplic {
    pub fn set_domaincfg(&mut self, bigendian: bool, msimode: bool, enabled: bool) {
        // Rust library assures that converting a bool into u32 will use
        // 1 for true and 0 for false
        let enabled = u32::from(enabled);
        let msimode = u32::from(msimode);
        let bigendian = u32::from(bigendian);
        self.domaincfg = (enabled << 8) | (msimode << 2) | bigendian;
    }
}

Source Configuration Registers (sourcecfg[u32; 1023])

If bit 10 (delegate) is 0, SM is in bits 2:0. If D=1, bits 9:0 mark the child index.

There is one sourcecfg register for every possible interrupt. Interrupt 0 is not a valid interrupt, so interrupt 1’s source configuration register is in sourcecfg[0]. This is why the example Rust code I provide subtracts 1 from the interrupt source’s number.

The delegate bit (bit 10) can be read to determine if the given interrupt has been delegated. This bit is read/write. If we write a 1 into this field, it delegates it to a child domain. If that particular source does not have a child domain, this bit will always be read 0.

The source mode bits (bits 2:0) control how the interrupt is triggered for those interrupts not delegated. If an interrupt is delegated (D=1), the bits 9:0 describe the index of the child it was delegated to. If the interrupt is not delegated (D=0), then each interrupt source may be configured to trigger a “pending” interrupt by one of the following.

3-bit “SM” value   Name       Description
0                  Inactive   The interrupt source cannot generate an interrupt and is inactive.
1                  Detached   The interrupt can only be generated by writing directly to the APLIC.
2, 3               Reserved   Reserved for future use.
4                  Edge1      The interrupt is asserted on a rising edge (from 0 to 1).
5                  Edge0      The interrupt is asserted on a falling edge (from 1 to 0).
6                  Level1     The interrupt is asserted when high (device IRQ pin is asserted).
7                  Level0     The interrupt is asserted when low (device IRQ pin is deasserted).

An inactive interrupt source cannot generate an interrupt through the APLIC, and we cannot manually assert an interrupt by writing to the interrupt pending register. A detached interrupt source cannot be triggered by the device, however we can manually write to the MMIO APLIC registers to assert the interrupt.

#[repr(u32)]
enum SourceModes {
    Inactive = 0,
    Detached = 1,
    RisingEdge = 4,
    FallingEdge = 5,
    LevelHigh = 6,
    LevelLow = 7,
}
impl Aplic {
    pub fn set_sourcecfg(&mut self, irq: u32, mode: SourceModes) {
        assert!(irq > 0 && irq < 1024);
        self.sourcecfg[irq as usize - 1] = mode as u32;
    }

    pub fn sourcecfg_delegate(&mut self, irq: u32, child: u32) {
        assert!(irq > 0 && irq < 1024);
        self.sourcecfg[irq as usize - 1] = 1 << 10 | (child & 0x3ff);
    }
}

Set/Clear Pending/Enable Interrupt (setip/clrip) Registers

These registers control whether an interrupt is pending. The APLIC will set these bits itself whenever an enabled interrupt is triggered; however, we can manually set an interrupt as a pending interrupt.

These registers are unlike the IMSIC registers. There is a set of registers to set a pending interrupt and a set of registers to clear a pending interrupt.

The setipnum and clripnum registers function much the same way as the following Rust functions that use the setip/clrip registers.

impl Aplic {
    pub fn set_ip(&mut self, irq: u32, pending: bool) {
        assert!(irq > 0 && irq < 1024);
        let irqidx = irq as usize / 32;
        let irqbit = irq as usize % 32;
        if pending {
            // self.setipnum = irq;
            self.setip[irqidx] = 1 << irqbit;
        } else {
            // self.clripnum = irq;
            self.clrip[irqidx] = 1 << irqbit;
        }
    }
}

The interrupt enable registers act much like the interrupt pending registers, except they allow the interrupts to be signaled. If an interrupt is NOT enabled, then that interrupt is masked and cannot be triggered.

impl Aplic {
    pub fn set_ie(&mut self, irq: u32, enabled: bool) {
        assert!(irq > 0 && irq < 1024);
        let irqidx = irq as usize / 32;
        let irqbit = irq as usize % 32;
        if enabled {
            // self.setienum = irq;
            self.setie[irqidx] = 1 << irqbit;
        } else {
            // self.clrienum = irq;
            self.clrie[irqidx] = 1 << irqbit;
        }
    }
}

Generate MSI Register (genmsi)

There are two read/write fields and one read-only field in the genmsi register. The genmsi register can be used to trigger an MSI write, even though writing directly to the IMSIC is more efficient.

The hart index is the HART that you want to send an MSI to, and the EIID (external interrupt identifier) is the value to write to the IMSIC. Usually the EIID is the same number as the interrupt you want to trigger.
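The repo doesn’t wrap genmsi in a helper, but a sketch of packing its value, assuming the draft’s field layout (hart index in bits 31:18, EIID in bits 10:0, with bit 12 being the read-only Busy flag), might look like this:

```rust
/// Pack a genmsi value: hart index in bits 31:18, EIID in bits 10:0.
/// Bit 12 (Busy) is read-only, so we never set it. The field positions
/// are taken from the AIA draft; double-check them against the version
/// you target before relying on this.
fn genmsi_value(hart: u32, eiid: u32) -> u32 {
    (hart << 18) | (eiid & 0x7FF)
}
```

Writing the packed value to the genmsi register would then ask the APLIC to generate the MSI.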


Target Control Registers (target[u32; 1023])

The target control registers have two different forms based on how the APLIC is configured. If the APLIC is configured in direct delivery mode, then the register contains a HART index and a priority. Smaller priorities are higher, so 10 has a higher priority than 20.

impl Aplic {
    pub fn set_target_direct(&mut self, irq: u32, hart: u32, prio: u32) {
        assert!(irq > 0 && irq < 1024);
        self.target[irq as usize - 1] = (hart << 18) | (prio & 0xFF);
    }
}

In MSI delivery mode, the register contains a HART index, guest index, and external interrupt identifier (EIID).

impl Aplic {
    pub fn set_target_msi(&mut self, irq: u32, hart: u32, guest: u32, eiid: u32) {
        assert!(irq > 0 && irq < 1024);
        self.target[irq as usize - 1] = (hart << 18) | (guest << 12) | eiid;
    }
}

Interrupt Delivery Control

As I mentioned previously, the APLIC may be configured in MSI or direct mode. In MSI mode, the messages are handled by the incoming MSI controller (IMSIC); however, in direct mode, the APLIC itself will control interrupts. This is done through the interrupt delivery control (IDC) section of the APLIC.

For the virt machine on QEMU, the interrupt delivery control registers are memory mapped to 0x. The registers are laid out as follows.

Offset   Size (bytes)   Register Name
0        4              idelivery
4        4              iforce
8        4              ithreshold
24       4              topi
28       4              claimi

Each IDC is for an individual HART and is 32 bytes. Therefore, HART #0’s registers start at offset 0, HART #1’s registers start at offset 32, and so forth.
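The offset arithmetic can be sketched as follows. `idc_base` is whatever base address your platform maps the IDC region to; the helper is illustrative, not code from the repo:

```rust
/// Address of an IDC register for a given HART. Each HART's IDC block
/// is 32 bytes, so HART n's block starts at idc_base + n * 32. The
/// register offsets come from the table above (idelivery = 0,
/// iforce = 4, ithreshold = 8, topi = 24, claimi = 28).
fn idc_reg_addr(idc_base: usize, hart: usize, reg_offset: usize) -> usize {
    idc_base + hart * 32 + reg_offset
}
```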

The idelivery register enables the APLIC’s direct delivery mode. If this register is set to 1, then the APLIC may deliver interrupts, otherwise, if this register is set to 0, then the APLIC does not deliver interrupts.

The idelivery register.

The iforce register forces the APLIC to deliver an interrupt with identity #0. If we write 1 to this register and the APLIC’s interrupts are enabled, this will signal an interrupt.

The iforce register

The ithreshold register sets the priority threshold an interrupt must pass to be heard. A threshold of 0 means that all interrupts can be heard. If the value is non-zero, then any interrupt with a priority at or above that value is masked, and hence not heard. As with everything else here, the lower the number, the higher the priority. Don’t let 0 fool you; it’s just a special value that unmasks all priorities. A threshold of 1 therefore masks all interrupts, whereas a threshold of 2 means that only priority 1 can be heard.

The ithreshold register.
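The masking rule can be captured in a small predicate. This is a sketch of the rule as described above, not code from the repo:

```rust
/// Whether an interrupt of priority `prio` passes the ithreshold.
/// A threshold of 0 unmasks every priority; otherwise any priority at
/// or above the threshold is masked (lower number = higher priority).
fn passes_ithreshold(prio: u32, threshold: u32) -> bool {
    threshold == 0 || prio < threshold
}
```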

The topi register holds the interrupt number that is the highest priority and is enabled. The register is split into two pieces: bits 25:16 hold the interrupt number and bits 7:0 hold the interrupt priority.

The topi and claimi registers

The claimi register is the same as topi except that reading it signals that we are claiming the top interrupt. When we read from this register, the pending bit for the given interrupt will be cleared to 0.
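Decoding a raw topi or claimi read is just a couple of shifts and masks. This is an illustrative helper assuming the bit layout above:

```rust
/// Split a raw topi/claimi value into (interrupt number, priority).
/// Bits 25:16 hold the interrupt number and bits 7:0 hold the priority.
fn decode_topi(raw: u32) -> (u32, u32) {
    ((raw >> 16) & 0x3FF, raw & 0xFF)
}
```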

The OS in the repo I made uses only the MSI delivery mode, since this direct delivery control works much like the old PLIC’s.


Conclusion

QEMU’s virt machine connects the UART to external IRQ #10, so we can use the APLIC to send messages whenever the UART receiver has data. This requires us to set up the APLIC using the registers above.

In the code below, I split between the machine and supervisor mode and delegate the UART to the supervisor APLIC (index 0).

pub fn aplic_init() {
    // The root APLIC
    let mplic = Aplic::as_mut(AplicMode::Machine);
    // The delegated child APLIC
    let splic = Aplic::as_mut(AplicMode::Supervisor);

    // Enable both the machine and supervisor PLICS
    mplic.set_domaincfg(false, true, true);
    splic.set_domaincfg(false, true, true);

    // Write messages to IMSIC_S
    mplic.set_msiaddr(AplicMode::Supervisor, crate::imsic::IMSIC_S);

    // Delegate interrupt 10 to child 0, which is APLIC_S
    // Interrupt 10 is the UART. So, whenever the UART receives something
    // into its receiver buffer register, it triggers an IRQ #10 to the APLIC.
    mplic.sourcecfg_delegate(10, 0);

    // The EIID is the value that is written to the MSI address
    // When we read TOPEI in IMSIC, it will give us the EIID if it
    // has been enabled.
    splic.set_target_msi(10, 0, 0, 10);

    // Level high means to trigger the message delivery when the IRQ is
    // asserted (high).
    splic.set_sourcecfg(10, SourceModes::LevelHigh);

    // The order is important. QEMU will not allow enabling of the IRQ
    // unless the source configuration is set properly.
    // mplic.set_irq(10, true);
    splic.set_ie(10, true);
}

We can now write the UART handler, which runs whenever the UART’s message arrives at the IMSIC.

pub fn imsic_handle(pm: PrivMode) {
    let msgnum = imsic_pop(pm);
    match msgnum {
        0 => println!("Spurious 'no' message."),
        2 => println!("First test triggered by MMIO write successful!"),
        4 => println!("Second test triggered by EIP successful!"),
        10 => console_irq(),
        _ => println!("Unknown msi #{}", msgnum),
    }
}

The code above forwards any message #10 to console_irq, which in turn pops a value off of the receiver buffer register of the UART device.

When all is said and done, we can start getting UART messages:

RISC-V Is Getting MSIs!

Contents

  1. Overview
  2. Repository
  3. Message Signaled Interrupts (MSI)
  4. Incoming MSI Controller (IMSIC)
  5. Conclusion
  6. What’s Next

Overview

Message signaled interrupts, or MSIs, describe a way to signal an interrupt without a dedicated interrupt request (IRQ) pin. One of the most prevalent uses for MSIs is the PCI bus, and the PCI specification defines the MSI and MSI-X standards. The potential benefits include: (1) a reduced number of direct wires from the device to the CPU or interrupt controller, (2) improved signaling performance, mainly by forcing in-band signals by design, and (3) improved guest/host signaling for virtualized environments.

Wikipedia’s article on MSIs can be found here: https://en.wikipedia.org/wiki/Message_Signaled_Interrupts


Repository

All of the code for this post can be found here: https://github.com/sgmarz/riscv_msi. Use tag “msi”.

The code is written for RV32I in Rust. I originally wrote it for RV64GC, but everything else I wrote is also for RV64GC, so I figured I should branch out and broaden my horizons.

The AIA draft manual, written by John Hauser, can be found here: https://github.com/riscv/riscv-aia. This blog post uses version 0.3.0-draft.31, tagged on 13-Jun-2022.


Message Signaled Interrupts (MSI)

An MSI is triggered by a “message”, which is a fancy term for a “memory write”. In fact, we can trigger a message by simply dereferencing an MMIO pointer.

// Note that 0xdeadbeef is not really a valid message.
// The AIA specifies messages 1 through 2047 are valid due to the number of registers
// available. But, the following is just an example.
use core::ptr::write_volatile;
let message = 0xdeadbeef;
// QEMU's 'virt' machine attaches the M-mode IMSIC for HART 0 to 0x2400_0000.
// The AIA specifies that this must be a word write; a double word will cause a trap.
unsafe {
    write_volatile(0x2400_0000 as *mut u32, message);
}

Memory Mapped IO Addresses for Interrupt Files

The code above writes to the MMIO address 0x2400_0000, which is where QEMU’s virt machine connects the M-mode IMSIC for HART 0. The S-mode IMSIC for HART 0 is connected to 0x2800_0000. Each HART is a page away from each other, meaning the M-mode IMSIC for HART 1 is at 0x2400_1000, and the S-mode IMSIC for HART 1 is 0x2800_1000.
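Those bases and the one-page stride can be folded into two small helpers. This is a sketch for QEMU’s virt machine only; real hardware would take these addresses from the device tree:

```rust
/// MMIO address of HART `hart`'s M-mode interrupt file on QEMU virt.
/// Files are one 4 KiB page apart, starting at 0x2400_0000.
fn imsic_m_addr(hart: usize) -> usize {
    0x2400_0000 + hart * 0x1000
}

/// MMIO address of HART `hart`'s S-mode interrupt file on QEMU virt,
/// starting at 0x2800_0000.
fn imsic_s_addr(hart: usize) -> usize {
    0x2800_0000 + hart * 0x1000
}
```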

For many embedded systems, these values would come from a specification or from an open firmware (OF) package that contains a flattened device tree (FDT) specifying each MMIO address and what’s connected to it. Here’s an example of QEMU’s virt FDT as plain text. For the repo, I hard coded the MMIO addresses instead of reading a device tree.

imsics@24000000 {
   phandle = <0x09>;
   riscv,ipi-id = <0x01>;
   riscv,num-ids = <0xff>;
   reg = <0x00 0x24000000 0x00 0x4000>;
   interrupts-extended = <0x08 0x0b 0x06 0x0b 0x04 0x0b 0x02 0x0b>;
   msi-controller;
   interrupt-controller;
   compatible = "riscv,imsics";
};

The organization that standardizes device tree formats can be found here: https://www.devicetree.org/

More information about device trees in Linux can be found here: https://www.kernel.org/doc/html/latest/devicetree/usage-model.html

IMSICs were added to QEMU’s virt machine fairly recently, so you may need to clone and build your own QEMU. QEMU’s repository can be found here: https://github.com/qemu.

Triggering an Interrupt by Sending a Message

After a device writes a word to a specific MMIO address, the interrupt is triggered. This means that devices do not need a wire connecting them to an IRQ controller, such as RISC-V’s platform-level interrupt controller (PLIC). Instead, as long as the device can make a memory write, it can trigger an interrupt.

Even though triggering a message is that simple, we need a mechanism to enable and prioritize these messages. There might be some circumstances where we don’t want to hear certain messages. This is where the incoming MSI controller or IMSIC comes into play.


Incoming MSI Controller (IMSIC)

To be able to support MSIs, some device needs to be able to take memory writes and turn them into interrupts. Furthermore, the device needs to provide a mechanism to enable/disable and to prioritize interrupts just like a regular interrupt controller. This is done by the incoming MSI controller (IMSIC) device.

WARNING: The Advanced Interrupt Architecture (AIA) manual is still a work-in-progress, and already, there have been major changes that removed or added CSRs or other pertinent information. So, some of the code and tables might be outdated.

The register mechanism for the IMSIC consists of several control and status registers (CSRs) as well as internal registers accessible through a selection mechanism.

Newly Added Registers

The AIA defines several new CSRs separated between the machine and supervisor modes.

Register Name   Register Number   Description
MISELECT        0x350             Machine register select
SISELECT        0x150             Supervisor register select
MIREG           0x351             A R/W view of the selected register in MISELECT
SIREG           0x151             A R/W view of the selected register in SISELECT
MTOPI           0xFB0             Machine top-level interrupt
STOPI           0xDB0             Supervisor top-level interrupt
MTOPEI          0x35C             Machine top-level external interrupt (requires IMSIC)
STOPEI          0x15C             Supervisor top-level external interrupt (requires IMSIC)

New CSRs defined for the AIA.

The registers MISELECT and MIREG allow us to select a register by writing its number into the MISELECT register. Then the MIREG will represent the selected register. For example, if we read from MIREG, we read from the selected register, and if we write to MIREG, we write to the selected register.

There are four selectable registers, each with machine and supervisor versions. For example, if we write to SISELECT, we will view the supervisor version of the selected register.

Register Name        MISELECT/SISELECT   Description
EIDELIVERY           0x70                External Interrupt Delivery Register
EITHRESHOLD          0x72                External Interrupt Threshold Register
EIP0 through EIP63   0x80 through 0xBF   External Interrupt Pending Registers
EIE0 through EIE63   0xC0 through 0xFF   External Interrupt Enable Registers

Registers selectable by MISELECT/SISELECT and readable/writeable via MIREG/SIREG.

The first thing we need to do is enable the IMSIC itself. This is done through the EIDELIVERY (“external interrupt delivery”) register. This register may contain one of three values:

Value         Description
0             Interrupt delivery is disabled
1             Interrupt delivery is enabled
0x4000_0000   Optional interrupt delivery via a PLIC or APLIC

Values written to the EIDELIVERY register.

Enabling the IMSIC

So, we need to write 1 (Interrupt delivery is enabled) into the EIDELIVERY register:

// First, enable the interrupt file
// 0 = disabled
// 1 = enabled
// 0x4000_0000 = use PLIC instead
imsic_write(MISELECT, EIDELIVERY);
imsic_write(MIREG, 1);

The EITHRESHOLD register sets a threshold that an interrupt’s priority must fall below before it can be heard. If an interrupt has a priority less than the value in EITHRESHOLD, it is “heard”, or unmasked. Otherwise, it is masked and will not be heard. For example, an EITHRESHOLD of 5 permits only messages 1, 2, 3, and 4 to be heard. Message 0 is reserved to mean “no message”.

Since a higher threshold opens more messages, messages with a lower number have a higher priority.

// Set the interrupt threshold.
// 0 = enable all interrupts
// P = enable < P only
imsic_write(MISELECT, EITHRESHOLD);
// Only hear 1, 2, 3, and 4
imsic_write(MIREG, 5);

Interrupt Priorities

The AIA documentation uses the message itself as the priority. So, a message of 1 has a priority of 1, whereas a message of 1234 has a priority of 1234. This is more convenient since we can control messages directly. However, since each message number has associated enable and pending bits, there is a limit to the highest numbered interrupt. The specification has a maximum of \(32 \times 64 - 1 = 2047\) total messages (we subtract one to remove 0 as a valid message).

Enabling Messages

The EIE register controls whether a message is enabled or disabled. For RV64, these registers are 64-bits wide, but still take up two adjacent register numbers. So, for RV64, only even numbered registers are selectable (e.g., EIE0, EIE2, EIE4, …, EIE62). If you try to select an odd numbered EIE, you will get an invalid instruction trap. This took me many hours to figure out even though the documentation states this is the desired behavior. For RV32, the EIE registers are only 32-bits, and EIE0 through EIE63 are all selectable.

The EIE register is a bitset. If the bit for a corresponding message is 1, then it is unmasked and enabled; otherwise it is masked and disabled. For RV64, messages 0 through 63 are all in EIE0[63:0], where the bit position is the message number. We can use the following formulas to determine which register to select for RV64:

// Enable a message number for machine mode (RV64)
fn imsic_m_enable(which: usize) {
    let eiebyte = EIE0 + 2 * (which / 64);
    let bit = which % 64;

    imsic_write(MISELECT, eiebyte);
    let reg = imsic_read(MIREG);
    imsic_write(MIREG, reg | 1 << bit);
}

RV32 behaves much the same, except we don’t have to scale it by 2.

// Enable a message number for machine mode (RV32)
fn imsic_m_enable(which: usize) {
    let eiebyte = EIE0 + which / 32;
    let bit = which % 32;

    imsic_write(MISELECT, eiebyte);
    let reg = imsic_read(MIREG);
    imsic_write(MIREG, reg | 1 << bit);
}

With the code above, we can now enable the messages we want to hear. The following example enables messages 2, 4, and 10.

imsic_m_enable(2);
imsic_m_enable(4);
imsic_m_enable(10);

Pending Messages

The EIP registers behave in the exact same way as the EIE registers except that a bit of 1 means that that particular message is pending, meaning a write to the IMSIC with that message number was sent. The EIP register is read/write. If we read from it, we can determine which messages are pending. If we write to it, we can manually trigger an interrupt message by writing a 1 to the corresponding message bit.

// Trigger a message by writing to EIP for Machine mode in RV64
fn imsic_m_trigger(which: usize) {
    let eipbyte = EIP0 + 2 * (which / 64);
    let bit = which % 64;

    imsic_write(MISELECT, eipbyte);
    let reg = imsic_read(MIREG);
    imsic_write(MIREG, reg | 1 << bit);
}

Testing

Now that we can enable the delivery as well as individual messages, we can trigger them one of two ways: (1) write the message directly to the MMIO address or (2) set the interrupt pending bit of the corresponding message to 1.

unsafe {
    // We are required to write only 32 bits.
    // Write the message directly to MMIO to trigger
    write_volatile(0x2400_0000 as *mut u32, 2);
}
// Set the EIP bit to trigger
imsic_m_trigger(2);

Message Traps

Whenever an unmasked message is sent to an enabled IMSIC, it will come to the specified HART as an external interrupt. For the machine-mode IMSIC, this will come as asynchronous cause 11, and for the supervisor-mode IMSIC, this will come as asynchronous cause 9.

When we receive an interrupt due to a message being delivered, we will need to “pop” off the top level pending interrupt by reading from the MTOPEI or STOPEI registers depending on the privilege mode. This will give us a value where bits 26:16 contain the message number and bits 10:0 contain the interrupt priority. Yes, the message number and message priority are the same number, so we can choose either.
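Pulling both fields out of a raw MTOPEI/STOPEI read is a shift and a mask. This is an illustrative helper assuming the bit layout above:

```rust
/// Split a raw MTOPEI/STOPEI value into (message number, priority).
/// Bits 26:16 hold the message number and bits 10:0 the priority; the
/// AIA makes them the same value, so either field will do.
fn decode_topei(raw: u32) -> (u32, u32) {
    ((raw >> 16) & 0x7FF, raw & 0x7FF)
}
```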

// Pop the top pending message
fn imsic_m_pop() -> u32 {
    let ret: u32;
    unsafe {
        // 0x35C is the MTOPEI CSR.
        asm!("csrrw {retval}, 0x35C, zero", retval = out(reg) ret);
    }
    // Message number starts at bit 16
    ret >> 16
}

My compiler does not support the names of the CSRs in this specification, so I used the CSR number instead. That is why you see 0x35C instead of mtopei, but they mean the same thing.

When we read from the MTOPEI register (0x35C), it will give us the message number of the highest priority message. The instruction csrrw in the code snippet above will atomically read the value of the CSR into the return register and then store the value zero into the CSR.

When we write zero into the MTOPEI register (0x35C), we are telling the IMSIC that we are “claiming” that we are handling the topmost message, which clears the EIP bit for the corresponding message number.

/// Handle an IMSIC trap. Called from `trap::rust_trap`
pub fn imsic_m_handle() {
    let msgnum = imsic_m_pop();
    match msgnum {
        0 => println!("Spurious message (MTOPEI returned 0)"),
        2 => println!("First test triggered by MMIO write successful!"),
        4 => println!("Second test triggered by EIP successful!"),
        _ => println!("Unknown msi #{}", msgnum),
    }
}

Message 0 is not a valid message since when we pop from the MTOPEI, a 0 signals “no interrupt”.

Example Output

If you run the repo with a new enough QEMU, you should see the output from the match arms above after a successful test.


Conclusion

In this post, we saw the new registers added to support the incoming MSI controller (IMSIC) for RISC-V. We also enabled the IMSIC delivery, the individual messages, and handled two ways of sending a message: via MMIO directly or by setting the corresponding EIP bit. Finally, we handled the interrupt from the trap.


What’s Next?

The second part of the AIA manual includes the new Advanced Platform-Level Interrupt Controller or APLIC. We will examine this system as well as write drivers to start looking at how this new APLIC can signal using wires or messages.

After the APLIC, we will write a driver for a PCI device and use it to send MSI-X messages.

Five Tips to Writing RISC-V Assembly

Writing assembly is itself an art. When C, C++, or any other language is compiled, the compiler determines the art of writing assembly. However, this time, we will look at some of the techniques and decisions we can make to write the assembly ourselves.

We will use RISC-V to see how to design logic, write up the logic, and translate the logic into assembly.


Tip 1: Design the Logic in a Comfortable Language

This is the hardest step to get right with my students. Many students want to sit down and write the complete package. However, if you’re not comfortable with assembly, this is a doomed approach. Instead, to divide the logic from the language, we have to write in a language we understand.

If a student doesn’t know C or some lower-level language, then I suggest they write in pseudocode. Too high of a language will make the translation harder and too low of a language will make the logic design harder. So, I recommend C or C++, or some language at that level.

When translating, it is helpful to have some sort of editor that can place them side-by-side. Keeping a list of instructions in your brain is difficult, especially if you’re translating a complex program.

Atom’s “split” windows.

Tip 2: Take Small Bites

Many of my students try to write the entire program from start to finish without testing anything in between. I teach incremental programming, especially for new learners. The point is to test when you get a portion of logic completed. This could be as simple as getting a for loop done, or shifting to scale a pointer.

One way you can test is to link your C or C++ program with your assembly program. You can do this by prototyping the name of the assembly function in C++ and switching between the two. You need to keep the two function names different, otherwise your linker won’t be happy. I usually put a “c” in front of my C functions to differentiate. I don’t change the assembly names since, in the end, we want the assembly functions to replace the C functions.

Combining assembly-written and C-written functions.

With what I’ve done above, we can call show to run the assembly function or cshow to run the C function. In C++, the names of the functions are mangled to allow for overloading. However, you can turn this off by doing the following:

extern "C" { // Turn off name mangling
    void show(int *x);
}

The extern “C” will tell C++ that the functions follow the C “calling convention”. We really care that name mangling is turned off so that we can just create a label named “show” and have our function.


Tip 3: Know Your Role

As Dwayne “The Rock” Johnson always said, “know your role”. It is important to know what C/C++ was doing for us versus what assembly won’t do for us. This includes the order of operations. For example, in 4 + 3 * 4, the compiler automatically orders the operations to perform the multiplication first, followed by the addition. However, in assembly, we must issue the multiply instruction first, followed by the addition instruction. There is no “reordering” performed for us.
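For instance, a hand translation of 4 + 3 * 4 has to sequence the multiply before the add explicitly (the register choices here are arbitrary):

```
# t0 = 4 + 3 * 4
li      t1, 3
li      t2, 4
mul     t1, t1, t2      # t1 = 3 * 4
addi    t0, t1, 4       # t0 = 4 + (3 * 4)
```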


Tip 4: Know How to Call a Function

Most ISAs come with a calling convention manual, and ARM and RISC-V are no exceptions. These just lay down some ground rules to make calling a function work across all languages. Luckily, the “ABI” names of the RISC-V registers really lend themselves to what they mean. Here are some of the rules we need to abide by.

  • Integer arguments go in a0-a7, floating point arguments go in fa0-fa7.
  • Allocate by subtracting from the stack pointer, deallocate by adding it back.
  • The stack must be allocated in multiples of 8 bytes (the RISC-V calling convention formally requires the stack pointer to stay 16-byte aligned at call boundaries).
  • All a (argument) and t (temporary) registers must be considered destroyed after a function call. Any time I see the “call” pseudo-instruction, I start thinking about saving registers using the stack.
  • All s (saved) registers can be considered saved after a function call.
  • If you use any of the saved registers, you must restore their original value before returning.
  • Return data back to the caller through the a0 register.
  • Don’t forget to save the one and only RA (return address) register any time you see a call instruction (jal or jalr) in your function.

Obviously, I’ve reinterpreted the rules in a more informal way, but you get the gist of what’s going on here.

.global main
main:
    addi    sp, sp, -8
    sd      ra, 0(sp)
    la      a0, test_solve
    call    solve
    mv      a0, zero
    ld      ra, 0(sp)
    addi    sp, sp, 8
    ret

You can see from the code above that we allocate our stack frame first, save the registers that need saving, do the work, and then undo everything before returning.


Tip 5: Document!

Writing assembly from C or another language will have you writing multiple lines of assembly for every single line of C. This can get confusing and utterly frustrating when you’re trying to debug your program. So, I always write the C code as a comment above the assembly and then pull it apart, commenting each step as I go.

# used |= 1 << ( x[i * 9 + col] - 1)
    li      t0, 9
    mul     t1, s3, t0          # t1 = i * 9
    add     t1, t1, s2          # t1 = i * 9 + col
    slli    t2, t1, 2           # Scale by 4
    add     t2, t2, s6          # x + i * 9 + col
    lw      t3, 0(t2)           # x[i * 9 + col]
    addi    t3, t3, -1          # x[i * 9 + col] - 1
    li      t4, 1
    sll     t4, t4, t3          # 1 << (x[i * 9 + col] - 1)
    or      s5, s5, t4          # used |= ...

You can see from the code above, I have the original C code (first comment) and then inline comments for each piece. This keeps me honest when it comes to order of operations and ensures I’m doing each step correctly.

Writing Pong in Rust for my OS Written in Rust

This post is part of a larger effort you can view over here: https://osblog.stephenmarz.com.

Pong being played on my custom RISC-V OS in Rust!

Video

Contents

  1. Overview
  2. Application Programmer’s Interface (API)
  3. Starting Routines
  4. System Calls
  5. Drawing Primitives
  6. Event Handling
  7. Start Our Game
  8. Game Loop
  9. PLAY

Overview

We last left off writing a graphics driver and an event driver for our operating system. We also added several system calls to handle drawing primitives as well as handling keyboard and mouse inputs. We are now going to use those to animate the simple game of pong. Just like hello world is the test of every programming language, pong is a test of all of our graphics and event systems.


Application Programmer’s Interface (API)

Since we’re writing in user space, we need to use the system calls we developed to (1) enumerate the framebuffer, (2) invalidate sections of the framebuffer, (3) read keyboard/click events, and (4) read motion events–mouse or tablet.

For Rust, we are still using the riscv64gc-unknown-none-elf target, which doesn’t contain a runtime. Now, riscv64gc-unknown-linux-gnu does exist; however, it uses Linux system calls, which we haven’t implemented yet. So, we are required to create our own system call interface as well as our own runtime!


Starting Routines

Many programmers know that ELF allows our entry point to start anywhere. However, most linkers will look for a symbol called _start. So, in Rust, we have to create this symbol. Unfortunately, after many attempts, I could not get it to work fully in Rust. So, instead, I used the global_asm! macro to import an assembly file. All it does is call main and invokes the exit system call when main returns.

.section .text.init
.global _start
_start:
.option push
.option norelax
	la	gp, __global_pointer$
.option pop
	li		a0, 0
	li		a1, 0
	call	main
	# Exit system call after main
	li	a7, 93
	ecall
.type _start, function
.size _start, .-_start

I failed to mention the global pointer. Since global variables can be stored far away from the code, we use a register that points at a known spot among them so they can be reached with short offsets. Luckily, most linkers provide a symbol called __global_pointer$ (yes, even with the $) that we put into the gp (global pointer) register. I’m not going to cover linker relaxation, but this is necessary for it.

This is all that we need to get into Rust. However, since we’re still in baremetal Rust, we have to define the same symbols we did for our OS, including the panic and abort handlers.

#![no_std]
#![feature(asm, panic_info_message, lang_items, start, global_asm)]

#[lang = "eh_personality"]
extern "C" fn eh_personality() {}

#[panic_handler]
fn panic(info: &core::panic::PanicInfo) -> ! {
	print!("Aborting: ");
	if let Some(p) = info.location() {
		println!("line {}, file {}: {}", p.line(), p.file(), info.message().unwrap());
	} else {
		println!("no information available.");
	}
	abort();
}
#[no_mangle]
extern "C" fn abort() -> ! {
	loop {
		unsafe {
			asm!("wfi");
		}
	}
}

This allows our code to at least compile, but as you can see we need to import the assembly file as well as define a main.

global_asm!(include_str!("start.S"));
#[start]
fn main(_argc: isize, _argv: *const *const u8) -> isize {
    0
}

Now we have an entry point for Rust, so we can create the println and print macros to make Rust feel more Rusty for us.

#[macro_export]
macro_rules! print
{
	($($args:tt)+) => ({
			use core::fmt::Write;
			let _ = write!(crate::syscall::Writer, $($args)+);
			});
}
#[macro_export]
macro_rules! println
{
	() => ({
		   print!("\r\n")
		   });
	($fmt:expr) => ({
			print!(concat!($fmt, "\r\n"))
			});
	($fmt:expr, $($args:tt)+) => ({
			print!(concat!($fmt, "\r\n"), $($args)+)
			});
}

System Calls

We need to create an API for system calls, since that will be how we get and put certain data to our operating system. Generally, the runtime will establish this for us, but again, we’re in baremetal Rust.

use core::fmt::{Write, Error};
use crate::event::Event;

pub struct Writer;

impl Write for Writer {
	fn write_str(&mut self, out: &str) -> Result<(), Error> {
		for c in out.bytes() {
			putchar(c);
		}
		Ok(())
	}
}

pub fn putchar(c: u8) -> usize {
    syscall(2, c as usize, 0, 0, 0, 0, 0, 0)
}

pub fn sleep(tm: usize) {
    let _ = syscall(10, tm, 0, 0, 0, 0, 0, 0);
}

pub fn get_fb(which_fb: usize) -> usize {
    syscall(1000, which_fb, 0, 0, 0, 0, 0, 0)
}

pub fn inv_rect(d: usize, x: usize, y: usize, w: usize, h: usize) {
    let _ = syscall(1001, d, x, y, w, h, 0, 0);
} 

pub fn get_keys(x: *mut Event, y: usize) -> usize {	
    syscall(1002, x as usize, y, 0, 0, 0, 0, 0)
}

pub fn get_abs(x: *mut Event, y: usize) -> usize {	
    syscall(1004, x as usize, y, 0, 0, 0, 0, 0)
}

pub fn get_time() -> usize {
     syscall(1062, 0, 0, 0, 0, 0, 0, 0)
}

pub fn syscall(sysno: usize, a0: usize, a1: usize, a2: usize, a3: usize, a4: usize, a5: usize, a6: usize) -> usize {
    let ret;
    unsafe {
    asm!("ecall",
        in ("a7") sysno,
        in ("a0") a0,
        in ("a1") a1,
        in ("a2") a2,
        in ("a3") a3,
        in ("a4") a4,
        in ("a5") a5,
        in ("a6") a6,
        lateout("a0") ret);
    }
    ret
}

MAN, I love the new Rust asm! It is very clear what’s happening when you look at it. Usually when I use inline assembly, register selection is extremely important. This makes it rather foolproof–dare I actually say that?

Just like we did for our operating system, we use the Write trait to hook into the format macro that is given to us by Rust.


Drawing Primitives

Ok, with the system calls and the startup code in start.S, we now have everything we need to start programming the game. So, let’s make some drawing primitives. Since this is pong, we only really care about drawing rectangles.

#[repr(C)]
#[derive(Clone,Copy)]
pub struct Pixel {
    pub r: u8,
    pub g: u8,
    pub b: u8,
    pub a: u8,
}

pub type Color = Pixel;

impl Pixel {
    pub fn new(r: u8, g: u8, b: u8) -> Self {
        Self {
            r, g, b, a: 255
        }
    }
}

pub struct Vector {
    pub x: i32,
    pub y: i32
}

impl Vector {
    pub fn new(x: i32, y: i32) -> Self {
        Self {
            x,  y
        }
    }
}

pub struct Rectangle {
    pub x: i32,
    pub y: i32,
    pub width: i32,
    pub height: i32,
}

impl Rectangle {
    pub fn new(x: i32, y: i32, width: i32, height: i32) -> Self {
        Self {
            x, y, width, height
        }
    }
}
pub struct Framebuffer {
    pixels: *mut Pixel
}

impl Framebuffer {
    pub fn new(pixels: *mut Pixel) -> Self {
        Self { pixels }
    }
    pub fn set(&mut self, x: i32, y: i32, pixel: &Pixel) {
        unsafe {
            if x < 640 && y < 480 {
                let v = (y * 640 + x) as isize;
                self.pixels.offset(v).write(*pixel);
            }
        }
    }
    pub fn fill_rect(&mut self, rect: &Rectangle, color: &Pixel) {
        let row_start = rect.y;
        let row_finish = row_start + rect.height;
        let col_start = rect.x;
        let col_finish = col_start + rect.width;
        for row in row_start..row_finish {
            for col in col_start..col_finish {
                self.set(col, row, color);
            }
        }
    }
}

pub fn lerp(value: i32, mx1: i32, mx2: i32) -> i32 {
    let r = (value as f64) / (mx1 as f64);
    // Scale by mx2 before truncating; truncating r first would always give 0.
    (r * mx2 as f64) as i32
}

Now we have a pixel structure, a vector, a rectangle, and a framebuffer. The pixel layout comes from the operating system, so it is important that we control the structure’s exact memory layout, which necessitates #[repr(C)].
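
To exercise these primitives off the OS, we can back a framebuffer with a plain Vec on a hosted target. This is just a sketch of the same row-major indexing (index = y * 640 + x), plus a lerp that multiplies by mx2 before truncating to an integer; truncating the ratio first would always produce 0 whenever value < mx1.

```rust
const WIDTH: i32 = 640;
const HEIGHT: i32 = 480;

#[derive(Clone, Copy, PartialEq, Debug)]
struct Pixel { r: u8, g: u8, b: u8, a: u8 }

// A hosted stand-in for the pointer-backed Framebuffer.
struct VecFramebuffer { pixels: Vec<Pixel> }

impl VecFramebuffer {
    fn new() -> Self {
        Self { pixels: vec![Pixel { r: 0, g: 0, b: 0, a: 255 }; (WIDTH * HEIGHT) as usize] }
    }
    // Same bounds check and row-major index as the pointer-based set().
    fn set(&mut self, x: i32, y: i32, p: Pixel) {
        if (0..WIDTH).contains(&x) && (0..HEIGHT).contains(&y) {
            self.pixels[(y * WIDTH + x) as usize] = p;
        }
    }
    fn get(&self, x: i32, y: i32) -> Pixel {
        self.pixels[(y * WIDTH + x) as usize]
    }
}

// Scale by mx2 *before* converting back to an integer.
fn lerp(value: i32, mx1: i32, mx2: i32) -> i32 {
    ((value as f64 / mx1 as f64) * mx2 as f64) as i32
}
```

With this, lerp(320, 640, 480) yields 240: the midpoint of a 640-wide range maps to the midpoint of a 480-wide range.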


Event Handling

We can draw, so now we need to be able to handle input. We can create an event structure to handle this.

#[repr(C)]
#[derive(Copy, Clone)]
pub struct Event {
    pub event_type: u16,
	pub code: u16,
	pub value: u32
} 

impl Event {
	pub fn empty() -> Self {
		Self {
			event_type: 0,
			code: 0,
			value: 0
		}
	}
	pub fn new(event_type: u16, code: u16, value: u32) -> Self {
		Self {
			event_type,
			code,
			value
		}
	}
}

// Key codes
pub const KEY_RESERVED: u16 = 0;
pub const KEY_ESC: u16 = 1;
pub const KEY_1: u16 = 2;
pub const KEY_2: u16 = 3;
// ... CLIP ...
pub const KEY_END: u16 = 107;
pub const KEY_DOWN: u16 = 108;

// mouse buttons

pub const BTN_MOUSE: u16 = 0x110;
pub const BTN_LEFT: u16 = 0x110;
pub const BTN_RIGHT: u16 = 0x111;
pub const BTN_MIDDLE: u16 = 0x112;


// mouse movement

pub const ABS_X: u16 = 0x00;
pub const ABS_Y: u16 = 0x01;
pub const ABS_Z: u16 = 0x02;

Many of the constants came from libevdev’s input-event-codes.h. Unfortunately, that’s in C, so a little Python script converted it to Rust.

Just like the Pixel structure, the Event structure is defined by our operating system, so we are required to control it ourselves (hence #[repr(C)]).
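
Because the kernel writes these bytes and user space reads them, the two must agree on the exact layout. A quick hosted sanity check (my own, not part of the OS) of what #[repr(C)] gives us for two u16 fields followed by a u32:

```rust
#[repr(C)]
#[derive(Copy, Clone)]
pub struct Event {
    pub event_type: u16,
    pub code: u16,
    pub value: u32,
}

// With #[repr(C)], the layout is fixed: 2 + 2 + 4 = 8 bytes, 4-byte aligned.
pub const EVENT_SIZE: usize = core::mem::size_of::<Event>();
pub const EVENT_ALIGN: usize = core::mem::align_of::<Event>();
```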


Start Our Game

const MAX_EVENTS: usize = 25;
const GAME_FRAME_TIMER: usize = 1000;

#[start]
fn main(_argc: isize, _argv: *const *const u8) -> isize {
	use drawing::Framebuffer;
	use drawing::Pixel;

	let ufb = syscall::get_fb(6) as *mut Pixel;
	let mut fb = Framebuffer::new(ufb);
	let background_color = drawing::Pixel::new(25, 36, 100);
	let mut event_list = [event::Event::empty(); MAX_EVENTS];
	let event_list_ptr = event_list.as_mut_ptr();

	let player_color = drawing::Pixel::new(255, 0, 0);
	let npc_color = drawing::Pixel::new(0, 255, 0);
	let ball_color = drawing::Pixel::new(255, 255, 255);

	let mut game = pong::Pong::new(player_color, npc_color, ball_color, background_color);
	// GAME LOOP HERE
	println!("Goodbye :)");
	0
}

You can see the first thing we do is grab a framebuffer. If you recall from the operating system tutorial, our operating system will map the pixels into our application’s memory space. We can update the pixels as we see fit, but to actually realize it on the screen, we must invalidate the given pixels, which is a separate system call in our OS.

You can see we have a structure called Pong. This structure contains the routines we need to make this somewhat of a game. We have a timing function called advance_frame, and we have a way to get the game onto the screen using the draw function.

This is pretty gross, but here are the Pong structure’s advance_frame and draw routines (the rest can be found on GitHub):

impl Pong {
	pub fn advance_frame(&mut self) {
		if !self.paused {
            self.move_ball(self.ball_direction.x, self.ball_direction.y);
			let miss = 
			if self.ball.location.x < 40 {
				// This means we're in the player's paddle location. Let's
				// see if this is a hit or a miss!
				let paddle = (self.player.location.y, self.player.location.y + self.player.location.height);
				let ball = (self.ball.location.y, self.ball.location.y + self.ball.location.height);

				if paddle.0 <= ball.0 && paddle.1 >= ball.0 {
					false
				}
				else if paddle.0 <= ball.1 && paddle.1 >= ball.1 {
					false
				}
				else {
					true
				}
			}
			else {
				false
			};
			if miss {
				self.reset();
				self.paused = true;
			}
			else {
				if self.ball.location.x < 40 || self.ball.location.x > 580 {
					self.ball_direction.x = -self.ball_direction.x;
				}
				if self.ball.location.y < 20 || self.ball.location.y > 430 {
					self.ball_direction.y = -self.ball_direction.y;
				}
				let new_loc = self.ball.location.y - self.npc.location.height / 2;
				self.npc.location.y = if new_loc > 0 { new_loc } else { 0 };
			}
		}
	}
	pub fn draw(&self, fb: &mut Framebuffer) {
		fb.fill_rect(&Rectangle::new(0, 0, 640, 480), &self.bgcolor);
		fb.fill_rect(&self.player.location, &self.player.color);
		fb.fill_rect(&self.npc.location, &self.npc.color);
		fb.fill_rect(&self.ball.location, &self.ball.color);
	}
}

So our game will move the ball given the direction vector every frame due to advance_frame being called from our game loop. We also have the ability to pause the game (or unpause). Finally, we have very basic collision detection for the player’s paddle.
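
That collision test is really just an interval-overlap check on the y axis. Here is a hosted sketch of the same logic (the function name and parameters are mine, not from the game code):

```rust
// Returns true when the ball gets past the paddle (a miss), mirroring the
// check in advance_frame: the ball is caught if either of its vertical
// edges lies within the paddle's vertical span.
fn ball_missed(paddle_top: i32, paddle_height: i32, ball_top: i32, ball_height: i32) -> bool {
    let paddle = (paddle_top, paddle_top + paddle_height);
    let ball = (ball_top, ball_top + ball_height);
    let top_caught = paddle.0 <= ball.0 && paddle.1 >= ball.0;
    let bottom_caught = paddle.0 <= ball.1 && paddle.1 >= ball.1;
    !(top_caught || bottom_caught)
}
```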


Game Loop

Now we need to put all of this together using a game loop, which will handle input events as well as advance the animation.

  // GAME LOOP
  'gameloop: loop {
		// handle mouse buttons and keyboard inputs
		// println!("Try get keys");
		let num_events = syscall::get_keys(event_list_ptr, MAX_EVENTS);
		for e in 0..num_events {
			let ref ev = event_list[e];
			// println!("Key {}  Value {}", ev.code, ev.value);
			// Value = 1 if key is PRESSED or 0 if RELEASED
			match ev.code {
				event::KEY_Q => break 'gameloop,
				event::KEY_R => game.reset(),
				event::KEY_W | event::KEY_UP => game.move_player(-20),
				event::KEY_S | event::KEY_DOWN => game.move_player(20),
				event::KEY_SPACE => if ev.value == 1 { 
					game.toggle_pause();
					if game.is_paused() {
						println!("GAME PAUSED");
					}
					else {
						println!("GAME UNPAUSED")
					}
				},
				_ => {}
			}
		}
		game.advance_frame();
		game.draw(&mut fb);
		syscall::inv_rect(6, 0, 0, 640, 480);
		syscall::sleep(GAME_FRAME_TIMER);
	}

PLAY

Our event loop uses W (or the up arrow) to move the paddle up and S (or the down arrow) to move it down. Space toggles the pause, and R resets the game.

Hmmm. When I started writing this, it seemed more impressive in my mind. Oh well, have fun!

Why did I catch s*** for my use of the RISC-V SFENCE.VMA instruction?

This post is part of a longer OS tutorial which can be found here: https://osblog.stephenmarz.com

Contents

  1. Introduction
  2. What is SATP?
  3. What is SFENCE.VMA?
  4. What is happening?
  5. The Translation Lookaside Buffer
  6. Conclusion
  7. References

Introduction

My last post garnered some attention by those telling me that I “forgot” to execute an SFENCE.VMA after I wrote to the SATP register–some with more tact than others. This post is here to clarify why I did what I did, and to clarify what the specification actually tells us needs to be done.


What is SATP?

The supervisor address translation and protection (SATP) register is a register that tells the MMU what mode it’s in, what address space it is working in, and where to find the first level page table in RAM (this would be level 2 for Sv39).

The SATP Register Fields

The SATP register stores three pieces of information, the MODE, the address space identifier (ASID), and the physical page number (PPN).

The MODE

If the MODE=0, then the MMU is turned off and addresses are not translated.

If MODE=8, we’re in Sv39 (Supervisor 39-bit) mode which means that our virtual addresses are 39-bits long.

There are other modes, but I won’t cover them here.

The ASID

The address space identifier tags each translation with an identifier for the address space (for example, the process page table) it belongs to.

The PPN

The physical page number is the upper 44-bits of a 56-bit memory address where we can find the first level page table.
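
Putting the three fields together, here is a sketch of packing an RV64 satp value. The field positions (MODE in bits 63:60, ASID in bits 59:44, PPN in bits 43:0) come from the privileged specification; the helper itself is mine:

```rust
const SATP_MODE_SV39: u64 = 8;

// Pack MODE[63:60] | ASID[59:44] | PPN[43:0] into a single satp value.
fn satp_value(mode: u64, asid: u64, ppn: u64) -> u64 {
    (mode << 60) | ((asid & 0xFFFF) << 44) | (ppn & 0xFFF_FFFF_FFFF)
}
```

Writing this value with MODE=8 turns translation on in Sv39 mode; MODE=0 turns it off regardless of the other fields.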


What is SFENCE.VMA?

In absolute terms, SFENCE.VMA means supervisor fence, virtual memory address. In real terms, it flushes the address-translation caches so that the MMU will “see” new changes in memory. This means that the MMU will be forced to look in RAM, where the page tables are stored.

The SFENCE.VMA instruction has several different ways to execute it as shown in the specification. This should clue us in to the fact that executing SFENCE.VMA every time we write to SATP might not be so cut and dry.


What is Happening?

So, why is this not straightforward? The issue is that “walking” a page table–meaning going into RAM and translating a virtual address into a physical address–is not a fast process. There are multiple levels of page tables, and several 8-byte entries that need to be dereferenced by the memory controller.

We can speed these walks up by “caching” frequently translated addresses into a table. The table has the virtual address number and the physical address number, so translation is just a table lookup instead of dereferencing several levels of page tables.

This caching can be speculative. If the MMU doesn’t speculate, then the first time we translate an address, that address will not be in the TLB (the cache table) and we will get what is known as a compulsory miss. If we speculate, we can predict the addresses that will be used and when the memory controller isn’t doing anything else, we can load the virtual address and physical address into cache.

This speculation is one of the reasons for SFENCE.VMA. Another reason is due to the fact that when we translate a page using the MMU, it stores the most recent translations in the TLB as well.


The Translation Lookaside Buffer

The translation lookaside buffer or TLB is a fancy term for the MMU cache. It stores the most recent translations to exploit temporal locality–that is, the chance that we will translate the same address again in the near future is high. So, instead of having to walk the page tables all over again, we just look in the TLB.

The TLB has several entries, and with RISC-V, each entry stores an address space identifier (ASID). The address space identifier allows the TLB to hold entries from more than just the most recently used page table. This has always been a problem with TLBs, including on Intel/AMD processors: writing to their MMU register (called CR3, for control register #3) causes a TLB flush. This is NOT the case when writing to the SATP register (the MMU register in RISC-V).

The specification just gives the general rules for a manufacturer to use. Therefore, the manufacturer can choose how they want to implement their MMU and TLB as long as it complies with RISC-V’s privileged specification. Here’s a simple implementation of a TLB that complies with the privileged specification.
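
To make that concrete, here is a toy, software-level TLB keyed by (ASID, virtual page number). This is purely illustrative (a real TLB is a small hardware structure, not a hash map), but it shows why ASIDs let translations from several address spaces coexist:

```rust
use std::collections::HashMap;

// A toy ASID-aware TLB: a hit avoids a slow page-table walk.
struct Tlb {
    entries: HashMap<(u16, u64), u64>, // (asid, vpn) -> ppn
}

impl Tlb {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }
    fn insert(&mut self, asid: u16, vpn: u64, ppn: u64) {
        self.entries.insert((asid, vpn), ppn);
    }
    fn lookup(&self, asid: u16, vpn: u64) -> Option<u64> {
        self.entries.get(&(asid, vpn)).copied()
    }
    // Like SFENCE.VMA with rs1=x0 and rs2 holding an ASID:
    // flush only that address space's entries.
    fn sfence_asid(&mut self, asid: u16) {
        self.entries.retain(|&(a, _), _| a != asid);
    }
}
```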


Conclusion

The RISC-V specification doesn’t make it very clear, but you can find clarification on the spec’s GitHub repository. If the MODE is not 0, then the MMU is allowed to speculate, meaning it can pre-populate the TLB based on the addresses it thinks will need to be translated in the near future. The specification allows this, but the MMU cannot throw a page fault if a speculative translation turns out to be invalid.

So, bottom line: SFENCE.VMA should NOT be executed every time SATP is changed. Doing so causes TLB thrashing, since every context switch requires changing the SATP register to the kernel page table, scheduling, and then changing the SATP register to the newly scheduled process’s page table.

Instead, the SFENCE.VMA instruction should be invoked when one or more of the following occur:

  1. When software recycles an ASID (i.e., reassociates it with a different page table), it should first change satp to point to the new page table using the recycled ASID, then execute SFENCE.VMA with rs1=x0 and rs2 set to the recycled ASID. Alternatively, software can execute the same SFENCE.VMA instruction while a different ASID is loaded into satp, provided the next time satp is loaded with the recycled ASID, it is simultaneously loaded with the new page table.
  2. If the implementation does not provide ASIDs, or software chooses to always use ASID 0, then after every satp write, software should execute SFENCE.VMA with rs1=x0. In the common case that no global translations have been modified, rs2 should be set to a register other than x0 but which contains the value zero, so that global translations are not flushed.
  3. If software modifies a non-leaf PTE, it should execute SFENCE.VMA with rs1=x0. If any PTE along the traversal path had its G bit set, rs2 must be x0; otherwise, rs2 should be set to the ASID for which the translation is being modified.
  4. If software modifies a leaf PTE, it should execute SFENCE.VMA with rs1 set to a virtual address within the page. If any PTE along the traversal path had its G bit set, rs2 must be x0; otherwise, rs2 should be set to the ASID for which the translation is being modified.
  5. For the special cases of increasing the permissions on a leaf PTE and changing an invalid PTE to a valid leaf, software may choose to execute the SFENCE.VMA lazily. After modifying the PTE but before executing SFENCE.VMA, either the new or old permissions will be used. In the latter case, a page fault exception might occur, at which point software should execute SFENCE.VMA in accordance with the previous bullet point.

Unfortunately, you have to dig through the issues and updates to the specification on GitHub to find out some of this information. I have provided links in references below.


References

Back That ‘S’ Up: Moving to RISC-V’s Supervisor Mode

This is a continuation of an ongoing theme which is started here: https://osblog.stephenmarz.com.

Contents

  1. What is Supervisor Mode?
  2. Why Supervisor Mode?
  3. Complications while in Supervisor Mode
  4. Complications with Interrupts
  5. Conclusion


What is Supervisor Mode?

My original OS Blog (see here: https://osblog.stephenmarz.com) ran the operating system in RISC-V’s machine mode, which is the most privileged mode in the RISC-V architecture. It has access to all control and status registers, runs in physical memory only, and has no restrictions placed upon it by the CPU.

I recently ran across the OpenSBI (Open Supervisor Binary Interface), listed on Github here: https://github.com/riscv/opensbi. It’s been around a little while, but it seeks to be an interface between the operating system and the machine–the low-level system.

Currently, OpenSBI has a legacy interface that abstracts the UART device for you, as well as a HART (hardware thread–RISC-V’s name for a CPU core) management system–start a hart at a given address, stop a hart, get the status of a hart, etc.

The supervisor mode, known as S mode in RISC-V, is two privilege levels below machine mode, as shown in the RISC-V specification below. This means that our operating system uses the OpenSBI set of utilities for the very low-level things, which OpenSBI abstracts away. This makes designing an operating system for the myriad of boards out there a bit easier.

RISC-V Privilege Levels

Why Supervisor Mode?

Why run the operating system at S mode? Well, much like a high-level language relies on an application binary interface, or ABI, the operating system can rely on a certain set of utilities given by a RISC-V “BIOS”, for lack of a better term.

This allows our operating system to abstract much of the machine architecture away. Instead of relying on the actual system’s specification, we can program the operating system using more or less the RISC-V specification only.

To understand what’s going on, let’s take a look at the user level. This is where applications live. Whenever a user application runs afoul of the “rules”, such as dereferencing an invalid memory location or executing an illegal instruction, the CPU will trap to the machine mode, unless the trap is delegated lower. Luckily for us, OpenSBI delegates user traps to the supervisor mode, so our operating system can handle it.

Now, let’s move up one level. What happens when the operating system runs afoul of the rules? In most respects, the system crashes. An illegal instruction in machine mode will trap into machine mode. This can potentially cause a loop of traps in machine mode, as one compounds another.

So, as U-mode is to S-mode, S-mode is to M-mode, meaning that if the operating system runs at S-mode and messes with the CPU, then OpenSBI will handle the trap. Usually, this means trying to bring the hart back into a stable, known state.


Complications while in Supervisor Mode

I wrote my operating system in machine mode to make it easier to understand the RISC-V architecture. Now, switching to S-mode has complicated some things. One of the major issues I have found is that the hart’s unique id is in the mhartid register. That little m in front of it means that it is a machine-mode register. Since we’re at a lower level, we’re not allowed to access any of the machine mode registers. If we try, we will get an illegal instruction.

This means that we have to keep track of our own core id. This makes context switching and putting applications on a different core slightly more complicated. We can’t just read the mhartid register and know our hart id. Instead, we have to follow the OpenSBI interface. This is easier said than done!

Now that we don’t have access to any of the machine-mode registers, we have to use their supervisor-level counterparts. Luckily, RISC-V gives us “views” of the same register with an s in front of it. For example, mstatus becomes sstatus. The neat thing is that if we write to sstatus, the machine-only bits are masked, and we can only set the supervisor bits. This means we can’t switch ourselves into machine mode while in supervisor mode by simply setting the MPP bits of mstatus (bits 12 and 11).

Machine Status Register Bits

The sstatus register is the exact same register as the mstatus register, but when we make writes to it, the CPU will not change any machine-only bits. This is called a “view” of the machine status register. Here’s the supervisor status register. Take note of the machine bits that are masked.

Supervisor Status Register Bits

Notice that bits 12 and 11 (Machine Previous Privilege Mode [MPP]) are WPRI, which stands for Write-Preserve, Read-Ignore. Write-preserve means that if we write to bits 12 and 11, it will preserve their original value, which essentially prevents us from writing to MPP in supervisor mode. Read-ignore means that if we try to read bits 12 and 11, we won’t get the actual data. Instead, it ignores the data and usually will give us 0.

This changes the way we switch from supervisor mode into user mode. If we put 1 into SPP (bit 8 of sstatus), then when we execute sret (supervisor return), this bit becomes the new privilege level; recall that level 1 is supervisor mode. If we put 0 into bit 8, then after an sret, we will be in user mode, which is level 0.
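
In code, choosing where sret lands is a single bit flip. Here is a sketch (the bit position comes from the spec; the helper names are mine):

```rust
const SPP_BIT: u64 = 1 << 8; // sstatus.SPP: previous privilege (0 = U, 1 = S)

// Clear SPP so that sret drops the hart into user mode (level 0).
fn sstatus_for_user(sstatus: u64) -> u64 {
    sstatus & !SPP_BIT
}

// Set SPP so that sret stays in supervisor mode (level 1).
fn sstatus_for_supervisor(sstatus: u64) -> u64 {
    sstatus | SPP_BIT
}
```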


Complications with Interrupts

Interrupts trigger a pin on the CPU to cause a trap. This is usually in response to something, such as a timer, a software interrupt, or an external, platform-level interrupt, such as a keyboard input or wifi notification.

The interrupts I’m concerned about are at the platform level. RISC-V has a specification for the platform-level interrupt controller, or PLIC. In this, we can configure the PLIC to trap in supervisor mode or even in machine mode. Since the PLIC is directly connected to the CPU, we, at supervisor mode, can tell the PLIC where to send interrupts. This makes our job a little bit harder since there are many different configurations of multiple-harts, different hart configurations and so on.

To demonstrate this, here’s the memory map of the PLIC on a 5-hart CPU where the 0-hart only has M mode, and harts 1, 2, 3, and 4 have both M and S mode.

FU540 PLIC Memory Map

As you can see, our stride isn’t the same for every hart, so we will have to configure our operating system nearly at the machine level. Take a look at the virt CPU in QEMU for another example (see qemu/virt.h at master · qemu/qemu on GitHub). So, I can’t go off of just configuring each hart to enable an interrupt; I have to specify the mode where I want the interrupt to be trapped. Furthermore, each hart is not necessarily the same, as you can see in the diagram above. The FU540 uses hart 0 (called the “monitor core”), which runs in machine mode only, to supervise the other cores.

Traps also require us to context switch: saving the current set of registers into the context running on the hart, scheduling the next context, and loading that context onto the hart. This was fairly simple in machine mode for one reason–the MMU is turned off automatically in machine mode. This is not the case in supervisor mode. Furthermore, the MMU register, called supervisor address translation and protection (SATP), takes effect immediately: the moment I set the mode, the MMU turns on. This can be a problem because I have to juggle certain registers. Take a look at the trap handler written in RISC-V assembly below.

Supervisor Address Translation and Protection Register Handler

(Updated screenshot): I originally had sstatus instead of sscratch. The point of csrrw is to make an atomic swap of sscratch into t6 and the old value of t6 into sscratch. This allows us to keep both values. As a side note, this is actually the second time I’ve done this. My typing fingers just like sstatus better than sscratch.

As you can see, we have to be careful not to destroy a register in our context. I usually use the t6 register since it is register number 31, the last register in an ascending loop. In the code above, I’m making sure that no memory accesses are made after the SATP register is written. Remember, it’s immediate: as soon as I write to the SATP register and set the mode, translation is up and running, since we’re in supervisor mode.

This leads us to a little bit of a problem. Unless we map this handler, it will not be able to execute–and we still need to get to sret. Recall that X (execute) is one of our bits and so is the U (user) bit. So, how do we handle this? We will see in the next post. Stay tuned.


Conclusion

I’m still working on migrating my machine-mode operating system into a supervisor-mode operating system. This is a work in progress, so I encourage you to keep up to date on this blog!

Thanks for reading!

Getting Graphical Output from our Custom RISC-V Operating System in Rust

An operating system is supposed to make our job easier when using graphics, in addition to everything else it does for us. In this post, we will be writing a GPU (graphics processing unit) driver using the VirtIO specification. Here, we will allow user applications to treat a portion of RAM as the screen–what is commonly known as a framebuffer.


Contents

  1. Overview
  2. Pixels and Resolution
  3. The GPU VirtIO Device
  4. Initialization
  5. Invalidation and Transfer
  6. Device Responses
  7. User Space
  8. Simple Graphics API
  9. Conclusions and Further Reading

Overview

We command the virtual GPU (virtio-gpu) by sending certain commands to the host (the device). The guest (the OS driver) has an allocation of RAM that becomes the framebuffer. The driver then tells the device, “hey, here’s the RAM that we’re going to use to store pixel information.”

The RAM is contiguous in our OS, but according to the specification, this isn’t strictly required. When we update, we will give the device a rectangle, and everything that falls within that rectangle will be copied to the host–we don’t want to keep copying the entire buffer over and over again.

We will be using the virtio protocol that we used for the block driver here, so I won’t rehash the general virtio protocol. However, the device-specific structures are a bit different, so we’ll cover that part more in depth.


Pixels and Resolution

A framebuffer must be large enough to store \(\text{width}\times\text{height}\times\text{pixel size}\) bytes. There are \(\text{width}\times\text{height}\) pixels. Each pixel has 1-byte red, green, blue, and alpha channels, so each pixel is exactly 4 bytes with the configuration we’re going to specify.

The framebuffer for our junior GPU driver is going to support a fixed resolution of \(640\times 480\). If you’re a child of the 90s, you saw this resolution a lot. In fact, my first computer, a Laser Pal 386, had a 16-color monitor with a resolution of 640 pixels wide by 480 pixels tall.

Each pixel is made of red, green, and blue elements so close together that by varying the intensity of these three channels, we can change the perceived color. The closer we get to our monitors, the easier an individual pixel is to see.

Pixels on a Viewsonic VX2770SMH-LED monitor.

You can see these little squares. If you squint enough, you can see that they aren’t pure white. Instead, you can see bits of red, blue, and green. That’s because each one of these little squares is subdivided into three colors: yep, red, green, and blue! To make white, these pixels are turned up to 11 (get the joke?). To make black, we turn off all three channels of that pixel.

The resolution refers to how many of these squares are on our monitor. This is a 1920×1080 monitor, meaning there are 1920 of these squares going left to right and 1080 of them from top to bottom. All in all, we have \(1920\times 1080=2,073,600\) pixels. Each one of these pixels is expressed using 4 bytes in the framebuffer, meaning we need \(2,073,600\times 4=8,294,400\) bytes in RAM to store the pixel information.

You can see why I limited our resolution to 640×480, which only requires \(640\times 480\times 4=1,228,800\) bytes–a bit over a megabyte.
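The arithmetic above is worth wrapping in a helper so the driver and the syscall layer agree on it. A quick sketch (the function name is mine):

```rust
const BYTES_PER_PIXEL: usize = 4; // one byte each for R, G, B, A

/// Total bytes the framebuffer needs for a given resolution.
fn framebuffer_bytes(width: usize, height: usize) -> usize {
    width * height * BYTES_PER_PIXEL
}
```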


The GPU VirtIO Device

The GPU device requires us to read a more up-to-date VirtIO specification. I’ll be reading from version 1.1, a copy of which you can get here: https://docs.oasis-open.org/virtio/virtio/v1.1/virtio-v1.1.html–specifically chapter 5.7, “GPU Device”. This is an unaccelerated 2D device, meaning that we must use the CPU to actually form the framebuffer, and then we transfer that CPU-built framebuffer to the host GPU, which is responsible for drawing it to the screen.

The device uses a request/response system, where we, the driver, issue a command to request something from the host (the GPU). We add a bit of extra memory to our request so that the host can formulate its response. When the GPU interrupts us, we can take a look at this response memory location to see what the GPU told us. This is much like the status field in the block driver, where the block device tells us the status of our last request.

Each request starts with a Command Header, which in Rust looks as follows:

#[repr(C)]
struct CtrlHeader {
	ctrl_type: CtrlType,
	flags: u32,
	fence_id: u64,
	ctx_id: u32,
	padding: u32
}

The header is common for all requests and all responses. We can differentiate by the CtrlType enumeration, which is:

#[repr(u32)]
enum CtrlType {
	/* 2d commands */
	CmdGetDisplayInfo = 0x0100,
	CmdResourceCreate2d,
	CmdResourceUnref,
	CmdSetScanout,
	CmdResourceFlush,
	CmdTransferToHost2d,
	CmdResourceAttachBacking,
	CmdResourceDetachBacking,
	CmdGetCapsetInfo,
	CmdGetCapset,
	CmdGetEdid,
	/* cursor commands */
	CmdUpdateCursor = 0x0300,
	CmdMoveCursor,
	/* success responses */
	RespOkNoData = 0x1100,
	RespOkDisplayInfo,
	RespOkCapsetInfo,
	RespOkCapset,
	RespOkEdid,
	/* error responses */
	RespErrUnspec = 0x1200,
	RespErrOutOfMemory,
	RespErrInvalidScanoutId,
	RespErrInvalidResourceId,
	RespErrInvalidContextId,
	RespErrInvalidParameter,
}

I took this directly from the specification, but Rust-ified the names to avoid getting yelled at by the linter.
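One thing to be careful of when Rust-ifying the enumeration: after an explicit discriminant, Rust assigns successive values automatically, so the variant order must match the specification exactly. A trimmed-down sketch (only the first few 2D commands; the name Ctrl2d is mine) shows the values that fall out:

```rust
#[repr(u32)]
enum Ctrl2d {
    CmdGetDisplayInfo = 0x0100,
    CmdResourceCreate2d, // 0x0101
    CmdResourceUnref,    // 0x0102 (the spec's RESOURCE_UNREF)
    CmdSetScanout,       // 0x0103
    CmdResourceFlush,    // 0x0104
}
```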

Pixel Formats

Recall that the framebuffer is just a bunch of bytes in memory. We need to put a structure behind the framebuffer so the host (the GPU) knows how to interpret our sequence of bytes. There are several formats, but all in all, they just rearrange the red, green, blue, and alpha channels. All are exactly 4 bytes, which makes the stride–the spacing from one pixel to the next–the same: 4 bytes.

#[repr(u32)]
enum Formats {
	B8G8R8A8Unorm = 1,
	B8G8R8X8Unorm = 2,
	A8R8G8B8Unorm = 3,
	X8R8G8B8Unorm = 4,
	R8G8B8A8Unorm = 67,
	X8B8G8R8Unorm = 68,
	A8B8G8R8Unorm = 121,
	R8G8B8X8Unorm = 134,
}

The type, unorm, is an 8-bit (1-byte) unsigned value from 0 through 255, where 0 represents no intensity, 255 represents full intensity, and a number in between is a linear interpolation between the two. Since there are three color channels (and one alpha), that gives us \(256\times 256\times 256=16,777,216\) different colors.

For this tutorial, I selected R8G8B8A8Unorm = 67, which has red first, green second, blue third, and alpha fourth. This is a common ordering, so I’ll select it to make it easy to follow along.

Our selected format makes the pixel structure look as follows:

Recall that the individual components R, G, B, and A are each one byte apiece, so each Pixel referred to by (x, y) is 4 bytes. This is why our memory pointer is a Pixel structure instead of a byte.
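The Pixel structure itself hasn’t been shown yet, so here is my sketch of it. The #[repr(C)] attribute keeps the R, G, B, A byte order in memory exactly as the chosen format dictates:

```rust
#[repr(C)]
#[derive(Clone, Copy)]
struct Pixel {
    r: u8, // red
    g: u8, // green
    b: u8, // blue
    a: u8, // alpha
}
```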


Initialization

Just like all other virtio devices, we set up the virtqueues first and then work on device-specific initialization. In my code, I directly copied and pasted from the block driver into the GPU driver. The only thing I added to the Device structure was the framebuffer and its dimensions.

pub struct Device {
	queue:        *mut Queue,
	dev:          *mut u32,
	idx:          u16,
	ack_used_idx: u16,
	framebuffer:  *mut Pixel,
	width:        u32,
	height:       u32,
}

The specification tells us to do the following in order to initialize the device and get things ready to draw. I Rust-ified some of the content to match our enumerations.

Create a framebuffer and configure scanout

  1. Create a host resource using CmdResourceCreate2d.
  2. Allocate a framebuffer from guest ram, and attach it as backing storage to the resource just created, using CmdResourceAttachBacking.
  3. Use CmdSetScanout to link the framebuffer to a display scanout.

A Request Structure

Recall that our request and response come packaged together. We will put them in separate descriptors, but when we get a response back from the device, it is easier to free the request and response with a single call. So, in Rust, I created the Request structure to support this.

struct Request<RqT, RpT> {
	request: RqT,
	response: RpT,
}
impl<RqT, RpT> Request<RqT, RpT> {
	pub fn new(request: RqT) -> *mut Self {
		let sz = size_of::<RqT>() + size_of::<RpT>();
		let ptr = kmalloc(sz) as *mut Self;
		unsafe {
			(*ptr).request = request;
		}
		ptr
	}
}
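Outside of the kernel, the same trick can be sketched with Box standing in for kmalloc–one allocation that holds the request and leaves room for the device to write its response, so one free releases both. This version also zero-initializes the response (the kernel version leaves it for the device to fill in):

```rust
#[repr(C)]
struct Request<RqT, RpT> {
    request: RqT,
    response: RpT,
}

impl<RqT, RpT: Default> Request<RqT, RpT> {
    /// One allocation for both halves, so a single free releases both.
    fn new(request: RqT) -> Box<Self> {
        Box::new(Self {
            request,
            response: RpT::default(),
        })
    }
}
```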

Step 1: Create host resource

let rq = Request::new(ResourceCreate2d {
	hdr: CtrlHeader {
		ctrl_type: CtrlType::CmdResourceCreate2d,
		flags: 0,
		fence_id: 0,
		ctx_id: 0,
		padding: 0,
	},
	resource_id: 1,
	format: Formats::R8G8B8A8Unorm,
	width: dev.width,
	height: dev.height,
});
let desc_c2d = Descriptor {
	addr: unsafe { &(*rq).request as *const ResourceCreate2d as u64 },
	len: size_of::<ResourceCreate2d>() as u32,
	flags: VIRTIO_DESC_F_NEXT,
	next: (dev.idx + 1) % VIRTIO_RING_SIZE as u16,
};
let desc_c2d_resp = Descriptor {
	addr: unsafe { &(*rq).response as *const CtrlHeader as u64 },
	len: size_of::<CtrlHeader>() as u32,
	flags: VIRTIO_DESC_F_WRITE,
	next: 0,
};
unsafe {
	let head = dev.idx;
	(*dev.queue).desc[dev.idx as usize] = desc_c2d;
	dev.idx = (dev.idx + 1) % VIRTIO_RING_SIZE as u16;
	(*dev.queue).desc[dev.idx as usize] = desc_c2d_resp;
	dev.idx = (dev.idx + 1) % VIRTIO_RING_SIZE as u16;
	(*dev.queue).avail.ring[(*dev.queue).avail.idx as usize % VIRTIO_RING_SIZE] = head;
	(*dev.queue).avail.idx = (*dev.queue).avail.idx.wrapping_add(1);
}

All we’re really telling the GPU here is our resolution and the format of the framebuffer. When we create this, the host gets to configure itself, such as allocating an identical buffer to make transfers from our OS.

Step 2: Attach framebuffer backing

let rq = Request3::new(AttachBacking {
	hdr: CtrlHeader {
		ctrl_type: CtrlType::CmdResourceAttachBacking,
		flags: 0,
		fence_id: 0,
		ctx_id: 0,
		padding: 0,
	},
	resource_id: 1,
	nr_entries: 1,
},
MemEntry {
	addr: dev.framebuffer as u64,
	length: dev.width * dev.height * size_of::<Pixel>() as u32,
	padding: 0, 
}
);
let desc_ab = Descriptor {
	addr: unsafe { &(*rq).request as *const AttachBacking as u64 },
	len: size_of::<AttachBacking>() as u32,
	flags: VIRTIO_DESC_F_NEXT,
	next: (dev.idx + 1) % VIRTIO_RING_SIZE as u16,
};
let desc_ab_mementry = Descriptor {
	addr: unsafe { &(*rq).mementries as *const MemEntry as u64 },
	len: size_of::<MemEntry>() as u32,
	flags: VIRTIO_DESC_F_NEXT,
	next: (dev.idx + 2) % VIRTIO_RING_SIZE as u16,
};
let desc_ab_resp = Descriptor {
	addr: unsafe { &(*rq).response as *const CtrlHeader as u64 },
	len: size_of::<CtrlHeader>() as u32,
	flags: VIRTIO_DESC_F_WRITE,
	next: 0,
};
unsafe {
	let head = dev.idx;
	(*dev.queue).desc[dev.idx as usize] = desc_ab;
	dev.idx = (dev.idx + 1) % VIRTIO_RING_SIZE as u16;
	(*dev.queue).desc[dev.idx as usize] = desc_ab_mementry;
	dev.idx = (dev.idx + 1) % VIRTIO_RING_SIZE as u16;
	(*dev.queue).desc[dev.idx as usize] = desc_ab_resp;
	dev.idx = (dev.idx + 1) % VIRTIO_RING_SIZE as u16;
	(*dev.queue).avail.ring[(*dev.queue).avail.idx as usize % VIRTIO_RING_SIZE] = head;
	(*dev.queue).avail.idx = (*dev.queue).avail.idx.wrapping_add(1);
}

The backing is exposed to the GPU through the MemEntry structure. This essentially is a physical address in guest RAM. The MemEntry, aside from padding, is just a pointer and a length.
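The MemEntry structure never appears in the listings above; per the virtio 1.1 specification (struct virtio_gpu_mem_entry) it looks like the following, and it is exactly 16 bytes:

```rust
#[repr(C)]
struct MemEntry {
    addr: u64,   // guest-physical address of the backing pages
    length: u32, // length of the backing region, in bytes
    padding: u32,
}
```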

Notice that I created a new structure called Request3. This is because this step requires three separate descriptors: (1) the header, (2) the mementry, (3) the response, whereas usually we only need two descriptors. Our structure is much like a normal Request, but it includes the mementries.

struct Request3<RqT, RmT, RpT> {
   request: RqT,
   mementries: RmT,
   response: RpT,
}
impl<RqT, RmT, RpT> Request3<RqT, RmT, RpT> {
   pub fn new(request: RqT, meminfo: RmT) -> *mut Self {
      let sz = size_of::<RqT>() + size_of::<RmT>() + size_of::<RpT>();
      let ptr = kmalloc(sz) as *mut Self;
      unsafe {
         (*ptr).request = request;
         (*ptr).mementries = meminfo;
      }
      ptr
   }
}

Step 3: Set Scanout

let rq = Request::new(SetScanout {
	hdr: CtrlHeader {
		ctrl_type: CtrlType::CmdSetScanout,
		flags: 0,
		fence_id: 0,
		ctx_id: 0,
		padding: 0,
	},
	r: Rect::new(0, 0, dev.width, dev.height),
	resource_id: 1,
	scanout_id: 0,
});
let desc_sso = Descriptor {
	addr: unsafe { &(*rq).request as *const SetScanout as u64 },
	len: size_of::<SetScanout>() as u32,
	flags: VIRTIO_DESC_F_NEXT,
	next: (dev.idx + 1) % VIRTIO_RING_SIZE as u16,
};
let desc_sso_resp = Descriptor {
	addr: unsafe { &(*rq).response as *const CtrlHeader as u64 },
	len: size_of::<CtrlHeader>() as u32,
	flags: VIRTIO_DESC_F_WRITE,
	next: 0,
};
unsafe {
	let head = dev.idx;
	(*dev.queue).desc[dev.idx as usize] = desc_sso;
	dev.idx = (dev.idx + 1) % VIRTIO_RING_SIZE as u16;
	(*dev.queue).desc[dev.idx as usize] = desc_sso_resp;
	dev.idx = (dev.idx + 1) % VIRTIO_RING_SIZE as u16;
	(*dev.queue).avail.ring[(*dev.queue).avail.idx as usize % VIRTIO_RING_SIZE] = head;
	(*dev.queue).avail.idx = (*dev.queue).avail.idx.wrapping_add(1);
}

When we want to write to a buffer, we will refer to it by its scanout number. If we had two scanouts, we could draw on one while the other is displayed to the screen. This is called double-buffering, but for our purposes, we don’t do this. Instead, we draw on the same framebuffer, then transfer certain portions for the GPU to update the display.

After we signal QueueNotify–the virtio register that acts as a “GO” button–the GPU will create a new buffer internally, set the backing store, and point the scanout number at this buffer. We now have an initialized framebuffer!


Invalidation and Transfer

We now have memory that contains pixels. However, we have our own memory, and the GPU has its own memory. So, to get ours to the GPU, it needs to be transferred. We set the backing store during initialization, so we now only have to refer to what we want updated by its scanout number.

Invalidation is important, since updating the entire screen every time we make a change is very expensive. In fact, if we transfer our entire screen, we need to transfer \(640\times 480\times 4=1,228,800\) bytes. For framerates, such as 20 or 30 frames per second, we need to transfer this number of bytes 20 or 30 times a second!

Instead of transferring everything, we invalidate certain portions of the framebuffer, and the GPU will only copy over those Pixels that fall within the invalidated region, whose coordinates are defined by a Rect structure.

#[repr(C)]
#[derive(Clone, Copy)]
pub struct Rect {
	pub x: u32,
	pub y: u32,
	pub width: u32,
	pub height: u32,
}
impl Rect {
	pub const fn new(x: u32, y: u32, width: u32, height: u32) -> Self {
		Self {
			x, y, width, height
		}
	}
}

Notice that this Rect is defined by an upper-left coordinate (x, y) and then a width and height. Rectangles can be described either by two corner coordinates, (x1, y1) and (x2, y2), or by an initial coordinate plus a width and height. The spec describes the latter, but when I try to invalidate and transfer, it sometimes appears that the rectangle is being treated as the former. Oh well, more testing I guess…
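Converting between the two conventions is simple enough. Here is a hypothetical helper (from_corners is my name, not the spec’s), with the Rect from above repeated so the sketch stands alone:

```rust
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Rect {
    pub x: u32,
    pub y: u32,
    pub width: u32,
    pub height: u32,
}

impl Rect {
    /// Build an (x, y, width, height) rectangle from two corners,
    /// assuming (x1, y1) is the upper-left and (x2, y2) the lower-right.
    pub fn from_corners(x1: u32, y1: u32, x2: u32, y2: u32) -> Self {
        Self {
            x: x1,
            y: y1,
            width: x2 - x1,
            height: y2 - y1,
        }
    }
}
```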

Invalidating

Invalidating is just transferring the data from the guest (driver) to the host (GPU). This only copies the memory; to update the screen, we then execute a flush command.

pub fn transfer(gdev: usize, x: u32, y: u32, width: u32, height: u32) {
	if let Some(mut dev) = unsafe { GPU_DEVICES[gdev - 1].take() } {
	let rq = Request::new(TransferToHost2d {
		hdr: CtrlHeader {
			ctrl_type: CtrlType::CmdTransferToHost2d,
			flags: 0,
			fence_id: 0,
			ctx_id: 0,
			padding: 0,
		},
		r: Rect::new(x, y, width, height),
		offset: 0,
		resource_id: 1,
		padding: 0,
	});
	let desc_t2h = Descriptor {
		addr: unsafe { &(*rq).request as *const TransferToHost2d as u64 },
		len: size_of::<TransferToHost2d>() as u32,
		flags: VIRTIO_DESC_F_NEXT,
		next: (dev.idx + 1) % VIRTIO_RING_SIZE as u16,
	};
	let desc_t2h_resp = Descriptor {
		addr: unsafe { &(*rq).response as *const CtrlHeader as u64 },
		len: size_of::<CtrlHeader>() as u32,
		flags: VIRTIO_DESC_F_WRITE,
		next: 0,
	};
	unsafe {
		let head = dev.idx;
		(*dev.queue).desc[dev.idx as usize] = desc_t2h;
		dev.idx = (dev.idx + 1) % VIRTIO_RING_SIZE as u16;
		(*dev.queue).desc[dev.idx as usize] = desc_t2h_resp;
		dev.idx = (dev.idx + 1) % VIRTIO_RING_SIZE as u16;
		(*dev.queue).avail.ring[(*dev.queue).avail.idx as usize % VIRTIO_RING_SIZE] = head;
		(*dev.queue).avail.idx = (*dev.queue).avail.idx.wrapping_add(1);
	}
	// Step 5: Flush
	let rq = Request::new(ResourceFlush {
		hdr: CtrlHeader {
			ctrl_type: CtrlType::CmdResourceFlush,
			flags: 0,
			fence_id: 0,
			ctx_id: 0,
			padding: 0,
		},
		r: Rect::new(x, y, width, height),
		resource_id: 1,
		padding: 0,
	});
	let desc_rf = Descriptor {
		addr: unsafe { &(*rq).request as *const ResourceFlush as u64 },
		len: size_of::<ResourceFlush>() as u32,
		flags: VIRTIO_DESC_F_NEXT,
		next: (dev.idx + 1) % VIRTIO_RING_SIZE as u16,
	};
	let desc_rf_resp = Descriptor {
		addr: unsafe { &(*rq).response as *const CtrlHeader as u64 },
		len: size_of::<CtrlHeader>() as u32,
		flags: VIRTIO_DESC_F_WRITE,
		next: 0,
	};
	unsafe {
		let head = dev.idx;
		(*dev.queue).desc[dev.idx as usize] = desc_rf;
		dev.idx = (dev.idx + 1) % VIRTIO_RING_SIZE as u16;
		(*dev.queue).desc[dev.idx as usize] = desc_rf_resp;
		dev.idx = (dev.idx + 1) % VIRTIO_RING_SIZE as u16;
		(*dev.queue).avail.ring[(*dev.queue).avail.idx as usize % VIRTIO_RING_SIZE] = head;
		(*dev.queue).avail.idx = (*dev.queue).avail.idx.wrapping_add(1);
	}
	// Run Queue
	unsafe {
		dev.dev
		.add(MmioOffsets::QueueNotify.scale32())
		.write_volatile(0);
		GPU_DEVICES[gdev-1].replace(dev);
	}
	}
}

So, our transfer first tells the host that we’ve updated a certain portion of the framebuffer, which is specified as x, y, width, and height. Then we do what is called a resource flush to get the GPU to commit all transfers to the screen.


Device Responses

This is a fairly easy section. Most of the device responses come in the form of RespOkNoData, which is just an acknowledgment that the device handled the request. Also, notice that unlike the block driver, we don’t have watchers here. This allows us to update the screen asynchronously.


User space

The whole point of this is to get a user space application drawing stuff to the screen. Generally, we wouldn’t give the full framebuffer to any user space application that wants it, but for our purposes, we can live with it for now. Instead, we would have a window manager delegate certain rectangles of the framebuffer to different applications. The window manager would also be responsible for handling events and sending the appropriate events to the GUI application.

System Calls

To allow our userspace applications to use the GPU, we need two system calls. The first returns a pointer to the framebuffer. Recall that we first must map the framebuffer into the userspace application’s MMU table. This is why we allocated pages instead of using kmalloc.

let dev = (*frame).regs[Registers::A0 as usize];
(*frame).regs[Registers::A0 as usize] = 0;
if dev > 0 && dev <= 8 {
if let Some(p) = gpu::GPU_DEVICES[dev - 1].take() {
	let ptr = p.get_framebuffer() as usize;
	// Grab the dimensions before handing the device back.
	let num_pages = (p.get_width() * p.get_height() * 4) as usize / PAGE_SIZE;
	gpu::GPU_DEVICES[dev - 1].replace(p);
	if (*frame).satp >> 60 != 0 {
		let p = get_by_pid((*frame).pid as u16);
		let table = ((*p).get_table_address()
					 as *mut Table)
					.as_mut()
					.unwrap();
		for i in 0..num_pages {
				let vaddr = 0x3000_0000 + (i << 12);
				let paddr = ptr + (i << 12);
				map(table, vaddr, paddr, EntryBits::UserReadWrite as i64, 0);
			}
		}
		(*frame).regs[Registers::A0 as usize] = 0x3000_0000;
	}
}

As you can see above, we grab the framebuffer from the GPU device and map it to 0x3000_0000. Currently, I calculate the number of pages for the framebuffer, which is \(\frac{640\times 480\times 4}{4,096}=300\). So, we need exactly 300 pages for this resolution.
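The division comes out even here only because \(640\times 480\times 4\) is a multiple of the 4,096-byte page size. For arbitrary resolutions, a rounding-up version is safer; a small sketch:

```rust
const PAGE_SIZE: usize = 4096;

/// Number of pages needed to hold `bytes`, rounding up so a framebuffer
/// that isn't an exact multiple of the page size still fits.
fn fb_pages(bytes: usize) -> usize {
    (bytes + PAGE_SIZE - 1) / PAGE_SIZE
}
```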

Now we have a framebuffer, and the userspace application can write what it wants into this memory location. However, a write doesn’t immediately update the screen. Recall that we must transfer and then flush to get the results drawn to the screen. This is where our second system call comes into play.

let dev = (*frame).regs[Registers::A0 as usize];
let x = (*frame).regs[Registers::A1 as usize] as u32;
let y = (*frame).regs[Registers::A2 as usize] as u32;
let width = (*frame).regs[Registers::A3 as usize] as u32;
let height = (*frame).regs[Registers::A4 as usize] as u32;

gpu::transfer(dev, x, y, width, height);

I showed the transfer function above, which just makes two requests: (1) CmdTransferToHost2d and (2) CmdResourceFlush. When the userspace application makes this system call, the results will be flushed to the screen and hence become visible to the user. I don’t error check in the system call itself. The transfer function will error check the device, and the device will error check the x, y, width, and height. So, if any of these are incorrect, the transfer will silently fail, and nothing will be updated on the screen.


Simple Graphics API

To see something displayed on the screen, we need to be able to draw the simplest of things: rectangles. If we make a rectangle’s width or height small enough, we can even draw straight lines–horizontal or vertical!

Drawing Rectangles

We are given a contiguous piece of memory in row-major format. That means that we exhaust each column in a row before we move to the next row. So, framebuffer[0] and framebuffer[1] are columns 0 and 1 of row 0. The calculation is fairly straightforward: to get to the next row, we must go one past the last column. So, the formula for a pixel’s index becomes:

$$\text{index}=\text{row}\times \text{width}+\text{column}$$

struct Pixel {
	unsigned char r;
	unsigned char g;
	unsigned char b;
	unsigned char a;
};
void set_pixel(Pixel *fb, u32 x, u32 y, Pixel &color) {
   // x is column, y is row
   if (x < 640 && y < 480) {
      fb[y * 640 + x] = color;
   }
}

So, the function above writes to a single Pixel. This structure is a 4-byte structure containing red, green, blue, and alpha bytes. However, we want two different types of rectangle drawing: fill and stroke. Fill will fill the area of the rectangle with the given Pixel structure (color) whereas stroke is just the outline of a rectangle.

void fill_rect(Pixel *fb, u32 x, u32 y, u32 width, u32 height, Pixel &color) {
   for (u32 row = y; row < (y+height);row++) {
      for (u32 col = x; col < (x+width);col++) {
         set_pixel(fb, col, row, color);
      }
   }
}
void stroke_rect(Pixel *fb, u32 x, u32 y, u32 width, u32 height, Pixel &color, u32 size) {
   // Essentially fill the four sides.
   // Top
   fill_rect(fb, x, y, width, size, color);
   // Bottom
   fill_rect(fb, x, y + height, width, size, color);
   // Left
   fill_rect(fb, x, y, size, height, color);
   // Right
   fill_rect(fb, x + width, y, size, height + size, color);
}

Trigonometry

Of course, when I tried to brag about drawing rectangles to a friend of mine, he mentioned the following.

Oh no…I don’t have cos/sin/tan or anything like that in my OS. I couldn’t say no, and I couldn’t be beaten by a simple cosine, right? Challenge accepted.

I ended up writing a cosine function based on an infinite series, but he took it several steps further and wrote several ways and benchmarked them to see which was better in terms of memory footprint, accuracy, and speed (see link below in Conclusions and Further Reading). Here’s mine:

f64 cos(f64 angle_degrees) {
	// Convert degrees to radians.
	f64 x = 3.14159265359 * angle_degrees / 180.0;
	// Maclaurin series: cos(x) = 1 - x^2/2! + x^4/4! - x^6/6! + ...
	f64 result = 1.0;
	f64 inter = 1.0;
	f64 num = x * x;
	for (int i = 1;i <= 6;i++) {
		u64 comp = 2 * i;
		u64 den = comp * (comp - 1);
		// After this, inter holds x^(2i) / (2i)!
		inter *= num / den;
		if ((i & 1) == 0) {
			result += inter;
		}
		else {
			result -= inter;
		}
	}
	return result;
}

This is a truncation of an infinite series, so we can get more accuracy with more terms. As a compromise, the for loop’s bound, i <= 6, sets the number of terms, and 6 terms gives us alright accuracy for graphics–at least from what I can visually tell on a \(640\times 480\) screen.
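To convince yourself the series is wired up correctly, the same algorithm can be transcribed into Rust (cos_deg is my name for it) and checked against values we know by heart:

```rust
/// Six-term Maclaurin series for cosine, taking the angle in degrees,
/// mirroring the C version above.
fn cos_deg(angle_degrees: f64) -> f64 {
    let x = std::f64::consts::PI * angle_degrees / 180.0;
    let num = x * x;
    let mut result = 1.0;
    let mut inter = 1.0;
    for i in 1u64..=6 {
        let comp = 2 * i;
        let den = (comp * (comp - 1)) as f64;
        inter *= num / den; // inter is now x^(2i) / (2i)!
        if i % 2 == 0 {
            result += inter;
        } else {
            result -= inter;
        }
    }
    result
}
```

The error grows with the angle (the first dropped term scales as \(x^{14}/14!\)), which is why accuracy near 360° is noticeably worse than near 0°.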


Testing

Now, the fun part. Let’s see if this works! Here’s our userspace code.

int main() {   
   Pixel *fb = (Pixel *)syscall_get_fb(6);
   Pixel blue_color = {0, 0, 255, 255};
   Pixel red_color = {255, 0, 0, 255};
   Pixel green_color = {0, 255, 0, 255};
   Pixel white_color = {255, 255, 255, 255};
   Pixel orange_color = {255, 150, 0, 255};

   fill_rect(fb, 0, 0, 640, 480, white_color);
   stroke_rect(fb, 10, 10, 20, 20, blue_color, 5);
   stroke_rect(fb, 50, 50, 40, 40, green_color, 10);
   stroke_rect(fb, 150, 150, 140, 140, red_color, 15);
   fill_rect(fb, 10, 300, 500, 100, orange_color);
   syscall_inv_rect(6, 0, 0, 640, 480);
   return 0;
}

And here’s the result!

Let’s add in our cosine function and see what happens!

void draw_cosine(Pixel *fb, u32 x, u32 y, u32 width, u32 height, Pixel &color) {
   for (u32 i = 0; i < width;i++) {
      f64 fy = -cos(i % 360);
      f64 yy = fy / 2.0 * height;
      u32 nx = x + i;
      u32 ny = yy + y;
      fill_rect(fb, nx, ny, 2, 2, color);
      }
}

That’s looking good.


Conclusions and Further Reading

Our operating system is starting to look more and more like a normal operating system. We still need an input system so that we can interact with our operating system, but that’ll be the next thing we tackle.

Sometime in the future, we will compile newlib so that we have a standard library in userspace. Right now, we’re forced to write our own functions.

For a great read regarding cosine and the challenges with it, head on over to Dr. Austin Henley’s blog on cosine.

Hooking up our Custom OS to a Standard Library

The standard library contains a ton of code that we don’t want to write ourselves, including printf, scanf, math functions, and so forth. So, we need to make sure our operating system can link to this library and everything “just works”. This post will show you how I linked our operating system to a standard library, newlib, and the trials and tribulations encountered in doing so.


Contents

  1. Introduction
  2. The Library
  3. Our Kernel
  4. Guarantees
  5. Starting and Ending
  6. Conclusion

Introduction

Libraries allow programmers to start writing programs at a higher level. If anyone remembers back to the 80s, when personal computers would boot into a BASIC editor, you essentially had to write your programs from scratch.

The term library just means some place to store code. Generally, this code is an object file of compiled and assembled source code. Shared objects can also be loaded on-demand, but this requires a little bit of extra support from a dynamic linker. See my post about dynamic linking here: Dynamic Linking.


The Library

Many of the most useful routines will be written once and then stored into either a shared object (so) or an archive (a). We know that libraries allow us to pull in already-written code, but there is a more fundamental reason for a library–to make interfacing with the operating system easy.

If you’ve read my blog posts on a RISC-V operating system using Rust, you will know how an application can make a request of the operating system itself. Generally, this is done through system calls, and each system call is given a number. For example, in Linux on an x86-64 system, system call #0 is the read system call and #60 is exit. However, nothing really says that we have to use these numbers.

If we take a look at libgloss, which is a low-level library written for newlib, we can see the following standard:

libgloss syscall.h file

System Calls

So, as you can see, the job of a low-level library is to make sure the arguments are in the right location, the system call number is in the correct register, and that the system call is actually made. For RISC-V, the final instruction we execute is ecall, for “environment call”. This instruction will jump us into the operating system. The OS will handle this by first understanding that it is handling a system call. Second, it will look at the system call number and route it to the correct routine.

In my OS blog, you can see I do the same thing here: https://osblog.stephenmarz.com/ch7.html

We can see this in terms of a common function, such as printf(). This function will first use the application to form a full string. So, something like printf("Hello %s", "Stephen") will cause the function to build a string that reads "Hello Stephen\0". The final step is for printf() to print it to the stdout file handle. Barring any redirections, this is generally the console. So, the library function(s) must eventually call the write system call with the file descriptor (number) first, a pointer to the string second, and finally the number of bytes to write third.

At the end of a system call, control is handed back to the user application. In this case, printf() will have to handle whatever return value is received from the write system call. If we perform a system call trace, or strace, we can see that indeed write is the end goal for printf.

In the output above, we can see that printf() did indeed call write with file descriptor 1 (standard output), the string buffer "Hello Stephen", and lastly a count of 13 characters. Notice that printf() must be able to find the NULL-terminator (\0) to count the number of characters to print.


Our Kernel

When we look at our kernel, we have to be able to handle how the library makes the system call. As you can see, most of these system calls follow the UNIX SysV convention, such as exit, read, write, and so forth, where everything is a file descriptor, including network sockets and actual files.

The example below shows how I implemented the open system call. You can see that since the application is using virtual memory, I am required to find out where that is in physical memory so the kernel can go find it. Mixing the two or crossing streams is never a good idea.

In the code above, I check to see if the path given by the user (the first parameter) is somewhere in the virtual file system. If it is, then we can create a new file descriptor and link it to that file. As you can tell, this system call is by no means completely able to handle all of the different types of files out there, including nodes (sockets, fifos, etc).


Guarantees

The C++ standard actually contains information about the standard library. This includes complexity guarantees and memory footprints.

The library would also be best to know about all of the exploits of the underlying architecture. For example, strcpy (string copy) can exploit the AVX or SSE extensions if the CPU supports them.

The C++ 2011 standard, which I found for free here: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2011/n3242.pdf, outlines the different standard functions that must be supported to be considered standard C++.

Here’s just one example of the requirements for the container library on page 717 (731 is the PDF page).


Starting and Ending

Another interesting part of libraries is that some portions of them can help the language function properly. These libraries are usually called runtimes. Most languages now have a runtime, which is code that is executed when the program runs, rather than when it is compiled or linked.

How many of you really knew that int main is not our real entry point for running a program? Instead, this belongs to a memory label called _start. This label is the ELF (executable and linkable format) entry point where the operating system will set your application to start running.

The _start routine will eventually call int main, but it first has to set up some things, including the command line arguments–or at the very least putting them into the int argc and char *argv[] parameters.

We can see the _start routine in the crt0.S assembly file; the “crt” stands for C runtime. The 0 marks the first runtime file, as more runtimes can be added in different files, such as crt1.S and so forth.

The _start routine

You can see in the code above, before main is even called, the BSS (uninitialized global variables) is cleared to 0, the global pointer (gp) is set, and the atexit functions (global termination functions) are registered. Finally, argc, argv, and envp all get their proper locations. For RISC-V, this is a0, a1, and a2.

You can see that the last thing to occur is the exit call. This will take the return value of main, which will be in the register a0. The tail instruction means that it will not return. Both call and tail are pseudo-instructions in RISC-V, which can be seen in the RISC-V specification:

tail writes the return address to the zero register, effectively discarding it.

Dynamic linking requires us to parse the executable and linkable format (ELF) a little bit more robustly than we did in our RISC-V OS in Rust. The dynamic linker requires an executable interpreter that lives somewhere on the storage device. You can actually look at an executable file to see what interpreter it will request when those dynamically-linked functions are called.

readelf -l myexec

You can see the executable file above is going to use /lib64/ld-linux-x86-64.so.2 to run this simple test executable that I compiled. We can actually run that interpreter and see what it says:

Invoking the dynamic linker directly

As you can see, INTERP is a particular program header that our operating system would have to handle by spawning a thread for the dynamic linker. However, this was way more than our custom operating system could handle. So, our operating system will require that the standard library be statically linked into our programs. This means that all of the functions and routines needed by the program have to be stored in the executable, which significantly increases its size depending on the number of library function calls made over the lifetime of the executable.


Conclusion

I haven’t even begun to dive deeply into all that goes into designing and creating a library. In my software engineering courses, I show that libraries can be a great tool to aggregate many different teams’ code into a single project. Keep in mind that I’m looking at this with a fairly low-level lens.

Most libraries have a particular architecture and operating system in mind. This leads to some challenges with a custom OS. In fact, more often than not, I ended up copying Linux’s conventions just to avoid pain later. In the grand scheme of things, this might not seem to be a big deal. However, it forces you to think a certain way about an operating system.

Getting a bit philosophical: Linux has changed over the years, but it still has a core philosophy. Many operating systems that were derived from it, including Android, are now being rendered obsolete by a new, refreshing look at operating systems. In fact, Google has started the Fuchsia project, which aims to build a modular operating system to replace Android on its mobile devices.

Talking with our new Operating System by Handling Input Events and Devices

Input devices give our operating system the ability to accept mouse and keyboard inputs from the graphical user interface (GUI). We originally used the UART (universal asynchronous receiver/transmitter) for console I/O, but we’re high class now–we have a GUI! So, we will use the virtio protocol to communicate with a mouse and keyboard.


Contents

  1. VirtIO Input Device Protocol
  2. Tablet vs. Mouse (absolute vs. relative positioning)
  3. Event and Status Queues
  4. Event System
  5. The Device’s Perspective
  6. Event Queues
  7. System Calls
  8. ABS Events
  9. The Final Product
  10. Additional Reading

VirtIO Input Device Protocol

Yet again, we are using the virtio protocol, but this time we’re targeting the input device (device id: 18). In particular, we will look at the keyboard inputs as well as mouse inputs. In our case, we’re going to use the mouse as a tablet–to get absolute positioning since we don’t own the entire screen.

So, we look for device ID 18 and redirect it to initialize in the input.rs file. We will actually have two devices linking here: the mouse and the keyboard. One thing about input is we get a stream of data from the input devices, whether that’s key codes or mouse inputs, so we need to make sure that we’re notified of every little input. On my system, QEMU offers RING_EVENT_IDX, which is a way to suppress notifications until a certain index. However, if we did this, our inputs would be suppressed until a certain number of inputs were made. So, we need to turn this off.

let mut host_features = ptr.add(MmioOffsets::HostFeatures.scale32()).read_volatile();
host_features &= !(1 << VIRTIO_F_RING_EVENT_IDX);
ptr.add(MmioOffsets::GuestFeatures.scale32()).write_volatile(host_features);

Tablet vs. Mouse (Absolute vs. Relative Positioning)

There are two different ways to handle input signals from a mouse or a tablet: (1) relative positioning and (2) absolute positioning. Relative positioning means that a mouse input gives you the change in position, whereas absolute positioning gives you the coordinates of the new position. So, there’s a little bit more math involved with relative positioning. In QEMU, a virtio mouse uses relative positioning whereas a virtio tablet uses absolute positioning. Both use the host’s mouse for input, so don’t let the terminology throw you off.
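To make the difference concrete, here is a small sketch (my own illustration, not code from the OS) of how a driver might track a cursor under each scheme. The names apply_relative and apply_absolute are made up for this example, and the clamp keeps a relative-mode cursor on a 640-wide screen.

```rust
// Illustration only: relative events carry a delta, absolute events
// carry the final coordinate outright.
fn apply_relative(pos: i32, delta: i32, max: i32) -> i32 {
    // Add the delta, then clamp so the cursor stays on screen.
    (pos + delta).clamp(0, max)
}

fn apply_absolute(_pos: i32, coordinate: i32) -> i32 {
    // The event already carries the new position; the old one is irrelevant.
    coordinate
}

fn main() {
    let mut x = 320;                  // cursor starts mid-screen (640 wide)
    x = apply_relative(x, -400, 639); // relative: clamped at the left edge
    println!("relative -> {}", x);    // relative -> 0
    x = apply_absolute(x, 100);       // absolute: jump straight to 100
    println!("absolute -> {}", x);    // absolute -> 100
}
```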

The interesting thing about how these events arrive is that we get either an ABS_X or ABS_Y event, and we then look at the event value for the coordinate. The coordinates are also in their own units. When I took a look, the coordinates went from 0 to 32,767, which is the maximum value of a signed, 16-bit integer. So, with our screen being 640×480, we need to scale values from the 0–32,767 range down into screen coordinates. To perform the conversion from input coordinates to screen coordinates, I applied the following equations: \(\text{screencoordx} = \frac{\text{evcoordx}}{32767}\times 640\) and \(\text{screencoordy} = \frac{\text{evcoordy}}{32767}\times 480\).

Essentially, the equation just takes where the current event coordinate is located in reference to the event coordinate space (0 – 32,767). That gives us a ratio. We then apply this ratio to a smaller coordinate space (0 – 640) or (0 – 480). Unfortunately, this equation requires floating point math. There is probably a better way that uses integers only, but by the time I started writing this, I hadn’t yet done it.

To visualize how this works, we divide both the 0–32,767 and the 0–640 coordinate spaces into the same number of equally spaced steps. Dividing a given X coordinate by 32,767 tells us which step we’re at. Since we divided both coordinate spaces equally, 0–640 has the same number of steps; it’s just that each step is much smaller. So, we multiply by 640 to convert back into that coordinate space. I know that’s not clear, so hopefully the following diagram will explain.

Both dashed lines have about an equal number of dashes, but because the 0-640 coordinate space is smaller, the distance between each dash and the dash itself are much closer together. So, when we divide the coordinate by 32,767, we are getting back which dash we’re at (which is a number between 0.0 and 1.0). This will equally correspond with the dash in the 0-640 coordinate space. However, to draw to the screen, we need to know the coordinate in that space. So, if we multiply the dash by 640, we get the X coordinate in 0-640 coordinate space.

The vertical axis, Y, works the same way, except that our screen space is 0–480, which means each dash, and the distance between dashes, is even closer together.

To see this in code form, here it is written in C++:

#include <cstdint>
using u32 = std::uint32_t; // these aliases come from the project's headers
using f64 = double;

constexpr u32 lerp(u32 val, u32 mx1, u32 mx2) {
   // mx1 is the FROM maximum coordinate and mx2 is the TO maximum coordinate
   f64 r = val / static_cast<f64>(mx1);
   return r * mx2;
}
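As mentioned above, the floating point can be avoided. One integer-only alternative (a sketch of my own, written in Rust since the kernel side is Rust; lerp_int is a made-up name) is to multiply first in a wider type and then divide, so the fractional part isn't lost:

```rust
// Integer-only remap from one coordinate space to another: widening to
// u64 before multiplying avoids both overflow and the f64 round trip.
fn lerp_int(val: u32, from_max: u32, to_max: u32) -> u32 {
    (val as u64 * to_max as u64 / from_max as u64) as u32
}

fn main() {
    // Map the middle of the event space (0..=32767) onto a 640-wide screen.
    println!("{}", lerp_int(16384, 32767, 640)); // 320
    println!("{}", lerp_int(32767, 32767, 640)); // 640
}
```

Integer division truncates toward zero, which matches what the float version produces once its result is converted back to an unsigned integer.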

Event and Status Queues

We have two different virtqueues with an input device: (1) the event queue and (2) the status queue. The event queue is populated by the device (mouse or keyboard) with an Event structure, shown below. This is unlike the other queues, where we add information to make a request. Since these devices simply feed us input, we need to provide memory locations where these events can go.

The Event structure is the only data structure passed between the input device and the operating system. This is also the only structure used for both the event and status queues.

#[repr(C)]
#[derive(Copy, Clone)]
pub struct Event {
    pub event_type: EventType,
    pub code: u16,
    pub value: u32,
}
#[repr(u16)]
#[derive(Copy, Clone)]
pub enum EventType {
    Syn = 0x00,
    Key = 0x01,
    Rel = 0x02,
    Abs = 0x03,
    Msc = 0x04,
    Sw = 0x05,
    Led = 0x11,
    Snd = 0x12,
    Rep = 0x14,
    Ff = 0x15,
    Pwr = 0x16,
    FfStatus = 0x17,
    Max = 0x1f,
}

The first field of the Event structure is a type. This type tells us what kind of input it was. A Key event is something that came from a button being pushed. This can be a keyboard input or the mouse being pressed. The Abs (absolute) input is a position input. Unlike Rel (relative), each abs input gives us the absolute coordinate. Our input system will give us the X and Y coordinates in different events. Every event has a code and a value. The code is a sub-type of the original event type. In the Abs type, this is going to be ABS_X or ABS_Y to report the x coordinate or y coordinate, respectively.
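To tie the fields together, here is a rough sketch of how a handler might interpret an event. The ABS_X (0x00) and ABS_Y (0x01) constants mirror the standard evdev codes; interpret_event and the trimmed-down types are made up for this example.

```rust
// Sketch: dispatch on the event type first, then on the code.
const ABS_X: u16 = 0x00; // evdev code for an absolute X coordinate
const ABS_Y: u16 = 0x01; // evdev code for an absolute Y coordinate

#[derive(Copy, Clone)]
enum EventType { Key, Abs }

#[derive(Copy, Clone)]
struct Event { event_type: EventType, code: u16, value: u32 }

fn interpret_event(ev: Event) -> String {
    match ev.event_type {
        EventType::Key => format!("key {} value {}", ev.code, ev.value),
        EventType::Abs if ev.code == ABS_X => format!("x = {}", ev.value),
        EventType::Abs if ev.code == ABS_Y => format!("y = {}", ev.value),
        EventType::Abs => format!("other abs code {}", ev.code),
    }
}

fn main() {
    let ev = Event { event_type: EventType::Abs, code: ABS_X, value: 16000 };
    println!("{}", interpret_event(ev)); // x = 16000
}
```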


Event System

The virtio event system is built on top of Linux’s evdev system, and in fact, most of the codes are exactly the same. We will get the same reports that evdev would give for key presses or mouse movements. The following is a partial (very partial) list of the key codes that evdev expects:

#define KEY_ESC  1
#define KEY_1    2
#define KEY_2    3
#define KEY_3    4
#define KEY_4    5
#define KEY_5    6
// ...
#define KEY_Q    16
#define KEY_W    17
#define KEY_E    18
#define KEY_R    19

So, if we want to detect the W key being pressed, we would look for event type Key and event code 17. In userspace, we can make a rough event handler by using a switch statement:

switch (ev.code) {
   case BTN_MOUSE:
      pressed = (ev.value & 1) == 1;
   break;
   case KEY_O:
      current_color = Pixel {255, 150, 0, 255};
   break;
   case KEY_B:
      current_color = Pixel {0, 0, 255, 255};
   break;
   case KEY_G:
      current_color = Pixel {0, 255, 0, 255};
   break;
   case KEY_R:
      current_color = Pixel {255, 0, 0, 255};
   break;
}

I will explain the rest of the code shortly, since we need some way of getting queued events from the operating system. For now, the code above should just show you how an event translates into human-understandable input.


The Device’s Perspective

The device, the keyboard or mouse, is connected to the virtio bus. It sends inputs to the virtio system, which in turn looks for a place to put the event.

The virtio spec is a little bit unclear, so I had to look at the virtio implementation in QEMU and the driver in Linux to fully understand what’s going on. I think the virtio spec is unclear because it leaves the libevdev implementation out, so I had to figure out what fuzz, res, and the rest mean.

The virtio bus will look in the available ring for the next buffer–the buffer is simply an 8-byte piece of memory that’s big enough to store an Event structure. When it finds this, it will “consume it” and put in the Event. This is why we need to make all event buffers device writeable.

unsafe fn repopulate_event(dev: &mut Device, buffer: usize) {
// Populate eventq with buffers, these must be at least the size of struct virtio_input_event.
	let desc = Descriptor {
		addr: dev.event_buffer.add(buffer) as u64,
		len: EVENT_SIZE as u32,
		flags: VIRTIO_DESC_F_WRITE,
		next: 0
	};
	let head = dev.event_idx as u16;
	(*dev.event_queue).desc[dev.event_idx as usize] = desc;
	dev.event_idx = (dev.event_idx + 1) % VIRTIO_RING_SIZE as u16;
	(*dev.event_queue).avail.ring[(*dev.event_queue).avail.idx as usize % VIRTIO_RING_SIZE] = head;
	(*dev.event_queue).avail.idx = (*dev.event_queue).avail.idx.wrapping_add(1);
}

As you can see above, I don’t make a request per se through the event queue. Instead, I’m just providing memory locations where the virtio system can place an event. This will happen every time the device consumes an event buffer, so I put it in a function that will be called over and over again.

Let’s take a look at what happens when the virtio input device interrupts us to tell us something happened.

let ref desc = queue.desc[elem.id as usize];
let event = (desc.addr as *const Event).as_ref().unwrap();
repopulate_event(dev, elem.id as usize);
dev.event_ack_used_idx = dev.event_ack_used_idx.wrapping_add(1);
match event.event_type {
	EventType::Abs => {
		let mut ev = ABS_EVENTS.take().unwrap();
		ev.push_back(*event);
		ABS_EVENTS.replace(ev);	
	},
	EventType::Key => {
		let mut ev = KEY_EVENTS.take().unwrap();
		ev.push_back(*event);
		KEY_EVENTS.replace(ev);	
	},
	_ => {
	}
}

The code above is a snippet of the whole function, but I want to highlight how we handle the event. We get an interrupt, and the used ring tells us which buffer was just used–for lack of a better term. We can then look at this buffer and retrieve the event. Notice that these events get pushed back. I use (*event) to copy the event instead of grabbing a reference to the event. Again, this is because we reuse the buffers in the virtio available ring.

Also, notice that any event not supported by our driver is silently dropped. This is how the specification tells us to handle unknown events.

One particular piece of code I’d like to draw your attention to is repopulate_event. We reuse the memory space over and over so that we don’t have to keep freeing and reallocating the buffer. However, the event system requires that the next available ring contain a descriptor with a memory address, so this is a dog chasing its tail. It isn’t difficult to do; we just have to put the exact same memory address back into the descriptor and point to it through the available ring.

Finally, notice that the events get pushed onto the KEY_EVENTS and ABS_EVENTS queues. Since we will be reusing the buffer, I made events clonable and copyable by using #[derive(Clone, Copy)].


Event Queues

Input events will generate an interrupt. Therefore, the operating system must queue events so that whenever the userspace program is ready, it can hand those events off to it. I chose to use a VecDeque due to its O(1) inserts and removals (amortized). We can also use push_back to put an event on the queue and pop_front to pull an event off.
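The queue discipline above can be sketched in a few lines; this is just standard-library usage, not code from the repository:

```rust
use std::collections::VecDeque;

// Events arrive at the back and are drained from the front (FIFO),
// preserving the order in which the device produced them.
fn main() {
    let mut queue: VecDeque<u32> = VecDeque::with_capacity(4);
    queue.push_back(10); // oldest event
    queue.push_back(20);
    queue.push_back(30);
    assert_eq!(queue.pop_front(), Some(10)); // drained oldest-first
    println!("{} events still queued", queue.len()); // 2 events still queued
}
```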

Our event queues must be global to stay resident in the OS:

pub static mut ABS_EVENTS: Option<VecDeque<Event>> = None;
pub static mut KEY_EVENTS: Option<VecDeque<Event>> = None;

This should look familiar if you’ve seen the process queue. We use the Option sum type so that we have a known value (None) at compile time and can create the queues later. When we create the queues, we have to judge the initial capacity:

ABS_EVENTS = Some(VecDeque::with_capacity(100));
KEY_EVENTS = Some(VecDeque::with_capacity(15));

I used a little bit of creative license for these event queues. When I moved my mouse, it generated nearly 100 events in about a second. My keyboard was about 2 to 3 keys per second. I think 100 and 15 will keep us from having to resize without wasting too much memory.

If we exceed the capacity of the event queue, the VecDeque will grow the ring buffer. However, we don’t place a check on it to avoid exhausting our memory. This is probably a very bad idea, but for now, it suffices. The only way, right now, to drain the event buffer is for a user space application to make a system call. So, if no user space application cares about events, we could easily overfill the buffer.
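One cheap mitigation, which is not in the repository but easy to sketch, is to cap the queue and drop the oldest event when it fills; push_bounded and MAX_QUEUED are made-up names for this illustration:

```rust
use std::collections::VecDeque;

// Sketch of a capped push: once the queue holds MAX_QUEUED events,
// discard the stalest one so memory stays bounded even with no reader.
const MAX_QUEUED: usize = 100;

fn push_bounded(queue: &mut VecDeque<u32>, event: u32) {
    if queue.len() >= MAX_QUEUED {
        queue.pop_front(); // drop the oldest event
    }
    queue.push_back(event);
}

fn main() {
    let mut q = VecDeque::new();
    for ev in 0..150u32 {
        push_bounded(&mut q, ev);
    }
    // Events 0..=49 were discarded; 50..=149 remain.
    println!("{} queued, oldest = {:?}", q.len(), q.front()); // 100 queued, oldest = Some(50)
}
```

Dropping the oldest mouse events is usually harmless, since a newer absolute coordinate supersedes an older one anyway.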


System Calls

The whole point here is to make user space applications recognize events. At first, I wrote a system call for abs and one for key that would retrieve the top event and return it. Due to the sheer number of mouse events, this proved to be VERY slow–painfully slow in fact. So, I changed the system call to download an array of events, which I will show you below.

Right now, I just assigned numbers 1002 and 1004 for these system calls. More realistically, we would use the open() and read() system calls to open a mouse or keyboard device. However, for now, we can at least see how the system works.

let mut ev = KEY_EVENTS.take().unwrap();
let max_events = (*frame).regs[Registers::A1 as usize];
let vaddr = (*frame).regs[Registers::A0 as usize] as *const Event;
if (*frame).satp >> 60 != 0 {
	let process = get_by_pid((*frame).pid as u16);
	let table = ((*process).get_table_address() as *mut Table).as_mut().unwrap();
	(*frame).regs[Registers::A0 as usize] = 0;
	for i in 0..if max_events <= ev.len() { max_events } else { ev.len() } {
		let paddr = virt_to_phys(table, vaddr.add(i) as usize);
		if paddr.is_none() {
			break;
		}
		let paddr = paddr.unwrap() as *mut Event;
		*paddr = ev.pop_front().unwrap();
		(*frame).regs[Registers::A0 as usize] += 1;
	}
}
KEY_EVENTS.replace(ev);

First, we need to grab the user process’s page table (if one exists). Instead of returning a single 8-byte event, we are now given a memory address to fill with events. This is also why we need to run virt_to_phys for every address we get: it is possible that the user space application exceeds its memory space, and we want to make sure we don’t crash inside our kernel.

From user space, we are given a buffer in A0 and a maximum number of events in A1. Since each event is 8 bytes, the buffer should be at least \(\text{max number of events}\times 8\) bytes. When we return, A0 will contain the number of events that the OS was able to put into the array.

In user space, our system call would look something like the following:

struct Event {
	u16 event_type;
	u16 code;
	u32 value;
} events[MAX_EVENTS];
//...
if ((num_events = syscall_get_key(events, MAX_EVENTS)) > 0) {
	for (u32 z = 0;z < num_events;z++) {
		Event &ev = events[z];
		switch (ev.code) {

So, events is an array of Event structures, and MAX_EVENTS is the size of the array (currently, I set it to 100). Notice that our system call will NOT block if there are no events. This is contrary to something like the select system call, where we can set it to block or not block.

In our function above, we process the events off of the event array. We really don’t need the if statement, since the for loop wouldn’t execute anyway if we didn’t have events, but I realized that when I copied the code into this blog post :(.

After I changed the system call to receive an array of events, we get a fully supported input system complete with key presses and mouse events.

ABS Events Caveat

Remember that we get an absolute event as either ABS_X or ABS_Y. That means we are required to keep up with the x and y coordinates ourselves. This is fairly straightforward, but it threw me for a loop the first time I encountered it. I thought that the value contained both the X and Y elements, like the X11 display server does. However, that is not the case: we get one coordinate per event. So, to keep track we do:

if ((num_events = syscall_get_abs(events, MAX_EVENTS)) < 1) {
	syscall_sleep(noevt_slptm);
	continue;
}
for (u32 z = 0;z < num_events;z++) {
	Event &ev = events[z];
	if (ev.code == ABS_X) {
		x = lerp(ev.value & 0x7fff, 32767, 640);
	}
	else if (ev.code == ABS_Y) {
		y = lerp(ev.value & 0x7fff, 32767, 480);
	}
	if (pressed) {
		fill_rect(fb, x, y, 5, 5, current_color);
	}
}

So, we have a variable x and a variable y that will keep track of where our mouse is moving. We also need to convert the coordinate from a space of 0 through 32767 to a space of 0 through 640 (width or x coordinate) or a space of 0 through 480 (height or y coordinate). That’s the purpose of the lerp (linear interpolation) function. It isn’t really a linear interpolation function in the nominal sense, but it does the trick.

Notice that if there are no events, I put the process to sleep. For many systems, this is how event processing works: instead of having the message queue block, we do the blocking ourselves. This lets us update graphics or do whatever we want while there are no events. Since our system call never blocks, we have to give the CPU a breather by putting our process to sleep.


Final Product

Let’s see how all of these systems work together to make a drawing application.

Screen capture of our GPU/event system.

You can see that our event input only has so many inputs per time unit. That is why when I move the mouse quicker, we get more distance per square. Also, I do not do anything to connect the squares, which is why each square is a discrete event. Finally, I can change colors by typing B (blue), O (orange), R (red) as shown in the event handler in user space:

switch (ev.code) {
		case BTN_MOUSE:
			pressed = (ev.value & 1) == 1;
		break;
		case KEY_O:
			current_color = Pixel {255, 150, 0, 255};
		break;
		case KEY_B:
			current_color = Pixel {0, 0, 255, 255};
		break;
		case KEY_G:
			current_color = Pixel {0, 255, 0, 255};
		break;
		case KEY_R:
			current_color = Pixel {255, 0, 0, 255};
		break;
		case KEY_W:
			if (ev.value == 0) { //released
				fill_rect(fb, 0, 0, 640, 480, white_color);
				syscall_inv_rect(6, 0, 0, 640, 480);
			}
		break;
		case KEY_Q:
			if (ev.value == 0) { // released
				fill_rect(fb, 0, 0, 640, 480, black_color);
				syscall_inv_rect(6, 0, 0, 640, 480);
			}
		break;
	}
}

Each event is discrete, so to keep a state, we need to track it ourselves, which is where our x and y coordinates come from as well as the pressed variable–which states whether the left mouse button is up (false) or down (true). We can use that to only draw when the mouse is held down:

if ((num_events = syscall_get_abs(events, MAX_EVENTS)) < 1) {
	syscall_sleep(noevt_slptm);
	continue;
}
for (u32 z = 0;z < num_events;z++) {
	Event &ev = events[z];
	if (ev.code == ABS_X) {
		x = lerp(ev.value & 0x7fff, 32767, 640);
	}
	else if (ev.code == ABS_Y) {
		y = lerp(ev.value & 0x7fff, 32767, 480);
	}
	if (pressed) {
		fill_rect(fb, x, y, 5, 5, current_color);
	}
}
if (pressed) {
	syscall_inv_rect(6, 0, 0, 640, 480);
}

We still have to keep up with the x and y coordinates every time we get an absolute event, but we only fill a rectangle and invalidate the screen when we have the mouse pressed down. Otherwise, the cursor just moves with nothing drawn to the screen.

As always, please head on over to my GitHub to look at the source for everything I discussed in this post.


Additional Reading

Dynamic Linking

Static vs dynamic linking usually has to do with our tolerance for the size of our final executable. A static executable contains all code necessary to run the executable, so the operating system loads the executable into memory, and it’s off to the races. However, if we keep duplicating code over and over again, such as printf, then it starts to use up more and more space. So, a dynamic executable means that we only store stubs in the executable. Whenever we want to access printf, it goes out to a dynamic linker and loads the code essentially on demand. So, we sacrifice a tiny bit of speed for a much smaller executable.


Contents

  1. What is Linking
  2. Finding Symbols
  3. Static Libraries (archives)
  4. Dynamic Libraries (shared objects)
  5. Analyzing a Shared Program
  6. Unresolved Symbols at Run Time
  7. Procedure Linkage Table (plt)
  8. Global Offset Table (got)
  9. Loading and Updating the GOT
  10. Conclusion and Further Reading
  11. Video

What is Linking?

When I hear someone talk about compiling their program into an executable, they are really glossing over several stages. The definition of compiling is to produce something by combining information collected from different sources. In computing, we generally think of compiling as turning a higher-level language, such as C, into lower-level code, such as assembly.

The final stage before we get an executable is the linking stage. This is where we link (hence the name) all sources together to produce one coherent executable. This is also where all outstanding symbols need to be resolved. “Symbol” is just a fancy term for the name of a function or global variable.

We can take a look at object code, which is what we get after we assemble but before we link. Here’s an object dump of an example program.

#include <stdio.h>
#include <stdlib.h>
void some_func(int a, int b);

int main(int argc, char *argv[]) {
	if (argc < 3) {
		printf("Not enough arguments.\n");
		return -1;
	}

	int a = atoi(argv[1]);
	int b = atoi(argv[2]);

	some_func(a, b);

	return 0;
}

This code produces the following object code, which I disassemble using objdump for RISC-V.

Notice that the function some_func has been prototyped, but has not been defined. It will be the linker’s responsibility to find the symbol some_func and add it to our program. Notice what happens when I try to link this program without ever defining some_func.

/opt/riscv_1/lib/gcc/riscv64-unknown-linux-gnu/9.2.0/../../../../riscv64-unknown-linux-gnu/bin/ld: /tmp/ccmgFc75.o:
in function `.L2':
test.c:(.text+0x62): undefined reference to `some_func'
collect2: error: ld returned 1 exit status

The linker is looking for the symbol some_func, but it cannot find it, so we get an undefined reference. We know this is at the linker stage because the error is “ld returned 1 exit status”. The “ld” means “linker”.

We can also see that the address of the function main is 0; this is because we haven’t linked the program yet. Our object code just contains bubbles of code, which will be placed into our executable at certain locations by the linker.


Finding Symbols

If we use the command nm, which is used to list symbols in an object file, we can see all of the unresolved symbols. Our object code is looking for these, but it doesn’t need to know where they are until we have a full executable program.

You can see that in this object file, the linker has a little bit of work to do. It must find atoi, puts, and some_func, which are flagged with U for undefined symbols. When we execute the linker, we will specify certain libraries, such as -lc (the C library), in which most of these symbols will be found. Our some_func has never been defined, so our linker cannot succeed until we define it somewhere.


Static Libraries (archives)

Archive files generally end in .a and contain code and other information that will be added to the final executable. Essentially, archive files are just a collection of object files in one file. Therefore, when we link to an archive file, we extract the actual code from the object code into our executable. The nice thing about archive files is that we don’t need to add ALL symbols to our executable; we only need the symbols that are actually used.

Usually, we have archive versions of all libraries to allow for static executables. These executables are those that don’t need any additional loading to run. In other words, these are self-contained files. The linker stage will pull in the code directly from the .a file into the executable, and all is done. The more code that the linker pulls in, the larger the executable will be.

Let’s go ahead and define some_func and see how linking works.

#include <stdio.h>

void some_func(int a, int b)
{
	printf("Data = %d\n", a * b);
}

It’s very simple, but the point is to have some code that the linker can pull in. Recall that the linker gave us an “undefined reference” error because we didn’t define some_func. Now, we have it defined, so let’s see what happens.

riscv64-unknown-linux-gnu-gcc -static -o test test2.o test.o

Notice that I’m using gcc still. This will automatically invoke the linker and pull in the necessary start files. If we invoke the linker directly, we have to define _start and tell the program how to start, or we would have to directly specify the start files.

I specify -static so that gcc will pull in only .a files. When we are done, the file test will be a fully-contained, static executable. If we take a look at this file, we can see that all of the “stuff” we need for this executable makes it a fairly large file.

-rwxr-xr-x 1 smarz smarz 4687920 Jun 1 09:16 test

Yes, that’s 4,687,920 bytes, or about 4.7 megabytes. However, if we look at the symbol table, we will find that NO symbol is unresolved, and therefore this is a self-contained executable. If we load this with our ELF loader, no external resources will need to be pulled in.

Our linker must exhaustively follow every possible route and pull in those symbols even though they may never be called. We can see the symbol table is enormous due to all of the calls and global variables (such as errno).

00000000000101b0 T abort
000000000006bb38 S __abort_msg
0000000000029aa8 t add_alias2.isra.0.part.0
00000000000497c6 t add_fdes
00000000000297ea t add_module.isra.0
000000000003aa4e t add_name_to_object.isra.0
000000000003ab5c t add_path.isra.0.constprop.0
000000000006b8d0 d adds.8114
000000000002fe2e T __add_to_environ
00000000000431b6 t add_to_global
000000000001adb4 t adjust_wide_data
000000000006bba0 V __after_morecore_hook
0000000000012d3a t alias_compare
000000000002338a W aligned_alloc
000000000006bb80 s aligned_heap_area
000000000003ff04 t allocate_dtv
000000000003a40c T __alloc_dir
000000000006ca38 b any_objects_regis

Dynamic Libraries (shared objects)

Dynamic libraries end in .so, which stands for shared object. These libraries contain code that will not be added directly into the executable. Instead, a program called the dynamic linker will be responsible for taking code from the .so files and adding it to an executing program. We can also load symbols ourselves using the -ldl (dynamic loading) library.

We can think of the term dynamic as run-time. That is, we don’t actually load the code into our program until the program actually runs. We can see this with something as simple as printf. We can examine our executable and we don’t see printf’s code. Instead, we see printf’s stub. This stub will then be replaced by the dynamic loader when the program executes.

When we link with a dynamic, shared object, those symbols that can be added at run time will remain unresolved. We will have the symbols’ names put into a table just so we know they’re out there somewhere. However, with shared objects, we can now resolve that unresolved reference (or symbol) at run time! If you run Arch Linux and have ever compiled anything yourself, you might’ve run into this phenomenon.


Making a Shared Library

Let’s go ahead and turn our test2 file into a shared object. This is fairly easy to do with gcc:

riscv64-unknown-linux-gnu-gcc -fPIC -shared -o libtest2.so test2.o

The switch -fPIC stands for position independent code. That means the code cannot assume it will be loaded at any particular address; references that leave the library go through a table of offsets. This is required since the library, or specific symbols in the library, can be loaded into any location by the dynamic linker.

Then, we can compile our test program to link to this shared library:

riscv64-unknown-linux-gnu-gcc -o test test.o -L. -ltest2

In the command above, I specify the library search path with -L, so -L. means to look in the current directory. Then, I specify the library to link with using -ltest2. GCC will automatically prepend lib and append .so to make libtest2.so.


Analyzing A Shared Program

First, the program must be able to find our library at run time. We can see which shared libraries an executable needs using the ldd command.

smarz@DOMO:~ $ ./test
./test: error while loading shared libraries: libtest2.so: cannot open shared object file: No such file or directory

So, when I run my program, it looks at certain paths to find its libraries. This is analogous to the PATH environment variable, except that in Linux (and some UNIXes), the variable is LD_LIBRARY_PATH. Shown below, if I change the path so the dynamic linker can find my library, the program functions properly:

smarz@DOMO:~ $ LD_LIBRARY_PATH=/home/smarz ./test 10 20
Data = 200

Unresolved Symbols at Run Time

When we link a program, many of the symbols will remain unresolved. This tells us what symbols the dynamic linker is responsible for loading from the shared objects (libraries). We can see these unresolved symbols using the nm command:

Partial list of symbols in our test program

You can see that puts is unresolved (capital U letter), and so is some_func, since we now put it in a shared library. These symbols have stubs that will eventually invoke the interpreter to load the actual code for that function.

The dynamic linker’s job is to resolve these symbols when they are first used. In fact, we can take a look at the produced ELF executable to see what it wants to use to resolve them.

You can see the INTERP section (interpreter), which requests the dynamic linker by name. If we run this dynamic linker, we can see its purpose.

To resolve these symbols, the linker puts a small loader stub in place of the actual function. When we run our program, some_func’s assembly instructions are nowhere to be found in it. Instead, a call to some_func refers us to the procedure linkage table (PLT). The PLT stub then looks up where the function is loaded by consulting the global offset table (GOT). At first, the address in the GOT points at code that invokes the dynamic linker. After the dynamic linker loads the code for some_func, it changes the GOT entry to reflect where it put some_func in memory.

A flow chart representing dynamically linked functions.

Procedure Linkage Table (PLT)

If we take a look at the functions and other symbols we’re going to use in our program, we notice that they’re very, very short, and they look almost identical to each other. This is because we have a stub, which loads the program from the procedure linkage table, PLT.

Our some_func loads from -1028(t3) and jumps to that location. The value of t3 is the global offset table (GOT), which I cover below. We can see that -1028(t3) is the address 12018, which is right in the meat of the GOT.
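For readers without the disassembly in front of them, a typical RISC-V PLT entry has roughly this shape (a sketch based on the usual binutils layout; the exact offsets vary per binary):

```
1:  auipc t3, %pcrel_hi(got_entry)   # t3 = upper bits of the GOT entry's address
    ld    t3, %pcrel_lo(1b)(t3)      # load the target address from the GOT
    jalr  t1, t3                     # jump there; t1 links back for the resolver
    nop
```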

So, we call the stub, which loads an address from the global offset table and jumps to it. If the procedure has been called before, the GOT entry already points at the real code for that procedure, such as puts or some_func. Otherwise, it points back into the PLT, at a stub whose job is to invoke the interpreter and load the code.


Global Offset Table (GOT)

The global offset table is a large table of addresses that our code can load from at run time. Here is the table in our test program.

Don’t worry about the assembly instructions; these aren’t really instructions, but objdump tries to disassemble them anyway.


Loading and Updating the GOT

Notice that all functions originally point to 0x0000_0000_0001_04a0, which goes back into the procedure linkage table. This is where the routine that invokes the interpreter to load the symbols lives. When the interpreter loads a symbol, it modifies the global offset table to reflect where the symbol now sits in memory. We can watch this happen by reading that memory directly.

Let’s modify our program to see what we get here:

#include <stdio.h>
#include <stdlib.h>
void some_func(int a, int b);

int main(int argc, char *argv[]) {
	if (argc < 3) {
		printf("Not enough arguments.\n");
		return -1;
	}

	int a = atoi(argv[1]);
	int b = atoi(argv[2]);
	// 0x12018 is the GOT entry for some_func in this particular binary.
	volatile unsigned long *ptr = (unsigned long *)0x12018;

	printf("0x%016lx = 0x%016lx\n", (unsigned long)ptr, *ptr);

	some_func(a, b);

	printf("0x%016lx = 0x%016lx (some_func is at 0x%016lx)\n",
	       (unsigned long)ptr, *ptr, (unsigned long)some_func);

	return 0;
}

In our modified program, we look at 0x12018, which is the GOT entry for some_func (remember, some_func@plt referred us to this location in the GOT).

Notice that at first the global offset table entry points to 0x104e0, which is the code used to invoke the interpreter, which in turn loads the symbols. Then notice that after we call some_func, the entry has been updated with the memory address 0x400081b460. This is where our dynamic linker loaded the code for our some_func function.

Since I updated the program, some of the offsets are no longer what they were in the original. However, to show you that calls to some_func still go through the procedure linkage table, let’s take a look at the updated object dump:

When I print out some_func’s memory address, we get 0x10510, which is where its entry sits in the procedure linkage table. However, if we look at the table, notice that it grabs the address of the actual function from the global offset table. At first, that address is the resolver stub at 0x104e0. After some_func is loaded (after its first invocation), the global offset table is updated to 0x400081b460, which is where the actual assembly instructions for some_func are located.


Conclusion

If we break down what’s happening here, there is no magic. For dynamic symbols, we simply have a table that stores their addresses. The table’s location is fixed, so it can be compiled directly into our program. What the table contains, however, is dynamic: the dynamic linker ld.so updates it with wherever it put our code. That’s it!
