Why did I catch s*** for my use of the RISC-V SFENCE.VMA instruction?

This post is part of a longer OS tutorial which can be found here: https://osblog.stephenmarz.com

Contents

  1. Introduction
  2. What is SATP?
  3. What is SFENCE.VMA?
  4. What is happening?
  5. The Translation Lookaside Buffer
  6. Conclusion
  7. References

Introduction

My last post garnered some attention from those telling me that I “forgot” to execute an SFENCE.VMA after I wrote to the SATP register, some with more tact than others. This post is here to clarify why I did what I did, and what the specification actually says needs to be done.


What is SATP?

The supervisor address translation and protection (SATP) register is a register that tells the MMU what mode it’s in, what address space it is working in, and where to find the first level page table in RAM (this would be level 2 for Sv39).

The SATP Register Fields

The SATP register stores three pieces of information: the MODE, the address space identifier (ASID), and the physical page number (PPN).

The MODE

If MODE=0, the MMU is turned off and addresses are not translated.

If MODE=8, we’re in Sv39 (Supervisor 39-bit) mode, which means that our virtual addresses are 39 bits long.

There are other modes, but I won’t cover them here.

The ASID

The address space identifier tags translations with a unique identifier, so that cached translations belonging to different address spaces can be told apart.

The PPN

The physical page number is the upper 44 bits of the 56-bit physical address where we can find the first level page table. The lower 12 bits are not stored because the table must be aligned to a 4 KiB page.
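To make the three fields concrete, here is a small sketch that packs them into a 64-bit SATP value. The bit positions (MODE in bits 63..60, ASID in bits 59..44, PPN in bits 43..0) follow the RV64 layout in the privileged specification; the function names `make_satp` and `split_satp` are made up for this illustration, and Python is used only to show the arithmetic, not to run on a RISC-V hart.

```python
# Field layout of the 64-bit satp register on RV64:
#   bits 63..60 = MODE, bits 59..44 = ASID, bits 43..0 = PPN
MODE_BARE = 0    # MMU off, no translation
MODE_SV39 = 8    # 39-bit virtual addresses
PAGE_SHIFT = 12  # 4 KiB pages


def make_satp(mode, asid, root_table_paddr):
    """Pack MODE, ASID and the root page table's physical address into satp."""
    if root_table_paddr % (1 << PAGE_SHIFT) != 0:
        raise ValueError("root page table must be 4 KiB aligned")
    ppn = root_table_paddr >> PAGE_SHIFT  # drop the 12 always-zero low bits
    if not (0 <= mode < 16 and 0 <= asid < (1 << 16) and 0 <= ppn < (1 << 44)):
        raise ValueError("field out of range")
    return (mode << 60) | (asid << 44) | ppn


def split_satp(satp):
    """Return (MODE, ASID, PPN) extracted from a satp value."""
    return (satp >> 60) & 0xF, (satp >> 44) & 0xFFFF, satp & ((1 << 44) - 1)
```

For example, `make_satp(MODE_SV39, 1, 0x8020_0000)` places PPN `0x80200` in the low bits, ASID 1 in the middle field, and 8 in the top four bits.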


What is SFENCE.VMA?

Literally, the name means supervisor fence, virtual memory address. In practical terms, it flushes the address-translation caches so that the MMU will “see” the new changes in memory. This means that the MMU will be forced to look in RAM, where the page tables are stored.

The SFENCE.VMA instruction can be executed in several different ways, as shown in the specification. This should clue us in to the fact that executing SFENCE.VMA every time we write to SATP might not be so cut and dried.


What is Happening?

So, why is this not straightforward? The issue is that “walking” a page table (going into RAM and translating a virtual address into a physical address) is not a fast process. There are multiple levels of page tables (three in Sv39), and an 8-byte entry has to be read from memory at each level.
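A rough model of an Sv39 walk shows where the cost comes from. In this sketch, `memory` is a hypothetical stand-in for RAM (a dict from physical address to 64-bit page table entry), and `walk` is a name invented here. Permission checks, the A/D bits, and the misaligned-superpage fault are deliberately left out, so treat it as an illustration rather than a faithful MMU.

```python
PAGE_SHIFT = 12
PTE_SIZE = 8   # each page table entry is 8 bytes
LEVELS = 3     # Sv39 has three levels
PTE_V, PTE_R, PTE_W, PTE_X = 1 << 0, 1 << 1, 1 << 2, 1 << 3


def vpn_indices(vaddr):
    """Return [VPN[0], VPN[1], VPN[2]], the three 9-bit table indices."""
    return [(vaddr >> (PAGE_SHIFT + 9 * level)) & 0x1FF for level in range(LEVELS)]


def walk(memory, root_ppn, vaddr):
    """Return (physical address, PTE reads), or (None, PTE reads) on a fault."""
    vpn = vpn_indices(vaddr)
    table = root_ppn << PAGE_SHIFT
    reads = 0
    for level in (2, 1, 0):
        pte = memory.get(table + vpn[level] * PTE_SIZE, 0)
        reads += 1                      # one memory access per level
        if not pte & PTE_V:
            return None, reads          # invalid entry: would be a page fault
        ppn = (pte >> 10) & ((1 << 44) - 1)
        if pte & (PTE_R | PTE_X):       # R or X set means this is a leaf
            offset_bits = PAGE_SHIFT + 9 * level
            offset = vaddr & ((1 << offset_bits) - 1)
            # Sketch only: real hardware faults on a misaligned superpage
            # instead of masking the low PPN bits like this.
            base = (ppn << PAGE_SHIFT) & ~((1 << offset_bits) - 1)
            return base | offset, reads
        table = ppn << PAGE_SHIFT       # non-leaf: descend one level
    return None, reads
```

A translation that ends in an ordinary 4 KiB page costs three dependent memory reads before the actual load or store can even start, which is the latency the TLB exists to hide.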

We can speed these walks up by “caching” frequently translated addresses in a table. The table holds the virtual page number and the physical page number, so translation is just a table lookup instead of dereferencing several levels of page tables.

This caching can be speculative. If the MMU doesn’t speculate, then the first time we translate an address, that address will not be in the TLB (the cache table) and we will get what is known as a compulsory miss. If we speculate, we can predict the addresses that will be used and, when the memory controller isn’t doing anything else, load the translation into the cache ahead of time.

This speculation is one of the reasons for SFENCE.VMA. Another reason is that when the MMU translates a page, it stores the most recent translations in the TLB as well.


The Translation Lookaside Buffer

The translation lookaside buffer, or TLB, is a fancy term for the MMU cache. It stores the most recent translations to exploit temporal locality; that is, we are likely to translate the same address again in the near future. So, instead of having to walk the page tables all over again, we just look in the TLB.

The TLB has several entries, and with RISC-V, each entry stores an address space identifier (ASID). The ASID allows the TLB to hold entries from more than just the most recently used page table. Holding only one address space at a time has always been a problem with TLBs, including on Intel/AMD processors, where writing to the MMU register (called CR3 for control register #3) will cause a TLB flush. This is NOT the case with RISC-V when writing to the SATP register (the MMU register in RISC-V).

The specification just gives the general rules for a manufacturer to use. Therefore, the manufacturer can choose how they want to implement their MMU and TLB as long as it complies with RISC-V’s privileged specification. Here’s a simple implementation of a TLB that complies with the privileged specification.
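As a rough model of that idea, the sketch below treats the TLB as a table keyed by ASID and virtual page number. It is a toy written for this discussion, not any vendor's hardware design: the class name `Tlb`, its methods, and the use of `None` to stand for the `x0` register are all assumptions, and a real TLB has a fixed size and a replacement policy.

```python
class Tlb:
    """Toy fully associative TLB whose entries are tagged with an ASID."""

    def __init__(self):
        # (asid, vpn) -> (ppn, is_global)
        self.entries = {}

    def insert(self, asid, vpn, ppn, is_global=False):
        self.entries[(asid, vpn)] = (ppn, is_global)

    def lookup(self, asid, vpn):
        """Return the cached PPN, or None on a miss."""
        for (tag, entry_vpn), (ppn, is_global) in self.entries.items():
            # Global entries match regardless of the current ASID.
            if entry_vpn == vpn and (is_global or tag == asid):
                return ppn
        return None

    def sfence_vma(self, vpn=None, asid=None):
        """Model SFENCE.VMA rs1, rs2; None stands for x0 (match everything).

        When an ASID is given, global entries are left in place.
        """
        def flushed(key, value):
            tag, entry_vpn = key
            _, is_global = value
            if vpn is not None and entry_vpn != vpn:
                return False
            if asid is not None and (is_global or tag != asid):
                return False
            return True

        self.entries = {k: v for k, v in self.entries.items()
                        if not flushed(k, v)}
```

With this model, entries for ASID 1 and ASID 2 can map the same virtual page to different physical pages at the same time, so switching SATP from one process to the other does not by itself require throwing anything away.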


Conclusion

The RISC-V specification doesn’t make it very clear, but you can find clarification in the spec’s GitHub repository. If the MODE is not 0, then the MMU is allowed to speculate, meaning it can pre-populate the TLB based on the addresses it thinks will need to be translated in the near future. The specification allows this, but the MMU cannot raise a page fault if a speculative translation is invalid.

So, bottom line: SFENCE.VMA should NOT be executed every time SATP is changed. Doing so will cause TLB thrashing, since on every context switch you need to change the SATP register to the kernel page table, schedule, and then change the SATP register to the newly scheduled process’s page table.
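A back-of-the-envelope simulation gives a feel for the cost. The sketch below is a simplification under stated assumptions: two processes with ASIDs 1 and 2 alternate on one hart, each touches the same few pages every time slice, the TLB is unbounded, and the detour through the kernel page table is ignored. The function name `count_walks` is invented for this example.

```python
def count_walks(switches, pages_per_process, flush_on_switch):
    """Count page table walks for two processes that alternate on one hart."""
    tlb = set()   # cached (asid, vpn) pairs; unbounded for simplicity
    walks = 0
    for i in range(switches):
        asid = 1 + (i % 2)      # the processes use ASID 1 and ASID 2
        if flush_on_switch:
            tlb.clear()         # models a full flush after each satp write
        for vpn in range(pages_per_process):
            if (asid, vpn) not in tlb:
                walks += 1      # miss: walk the page table, then cache it
                tlb.add((asid, vpn))
    return walks
```

With ten switches and four pages per process, flushing on every switch costs 40 walks, while relying on ASIDs costs only the 8 compulsory misses.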

Instead, the SFENCE.VMA instruction should be invoked when one or more of the following occur:

  1. When software recycles an ASID (i.e., reassociates it with a different page table), it should first change satp to point to the new page table using the recycled ASID, then execute SFENCE.VMA with rs1=x0 and rs2 set to the recycled ASID. Alternatively, software can execute the same SFENCE.VMA instruction while a different ASID is loaded into satp, provided the next time satp is loaded with the recycled ASID, it is simultaneously loaded with the new page table.
  2. If the implementation does not provide ASIDs, or software chooses to always use ASID 0, then after every satp write, software should execute SFENCE.VMA with rs1=x0. In the common case that no global translations have been modified, rs2 should be set to a register other than x0 but which contains the value zero, so that global translations are not flushed.
  3. If software modifies a non-leaf PTE, it should execute SFENCE.VMA with rs1=x0. If any PTE along the traversal path had its G bit set, rs2 must be x0; otherwise, rs2 should be set to the ASID for which the translation is being modified.
  4. If software modifies a leaf PTE, it should execute SFENCE.VMA with rs1 set to a virtual address within the page. If any PTE along the traversal path had its G bit set, rs2 must be x0; otherwise, rs2 should be set to the ASID for which the translation is being modified.
  5. For the special cases of increasing the permissions on a leaf PTE and changing an invalid PTE to a valid leaf, software may choose to execute the SFENCE.VMA lazily. After modifying the PTE but before executing SFENCE.VMA, either the new or old permissions will be used. In the latter case, a page fault exception might occur, at which point software should execute SFENCE.VMA in accordance with the previous bullet point.
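The operand choices in the list above can be condensed into a small lookup. This is only a sketch of the decision logic: the event labels, the function name `sfence_operands`, and the convention that the string `"x0"` means the zero register while an integer means "a register holding this value" are all assumptions made for illustration. The lazy case in item 5 does not change the operands, only when the instruction is executed, so it has no branch of its own.

```python
X0 = "x0"  # the hard-wired zero register


def sfence_operands(event, asid=0, vaddr=None, global_involved=False):
    """Return (rs1, rs2) for SFENCE.VMA.

    global_involved means: a PTE on the traversal path has its G bit set,
    or (for a satp write without ASIDs) global translations were modified.
    """
    if event == "recycle_asid":
        return (X0, asid)
    if event == "satp_write_no_asid":
        # A register that holds 0 (not x0 itself) keeps global entries cached.
        return (X0, X0 if global_involved else 0)
    rs2 = X0 if global_involved else asid
    if event == "nonleaf_pte":
        return (X0, rs2)
    if event == "leaf_pte":
        if vaddr is None:
            raise ValueError("a leaf PTE change needs a virtual address")
        return (vaddr, rs2)
    raise ValueError("unknown event: " + event)
```

For instance, changing a leaf PTE for ASID 3 at virtual address `0x1000` yields `(0x1000, 3)`, which flushes one page for one address space rather than the whole TLB.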

Unfortunately, you have to dig through the issues and updates to the specification on GitHub to find some of this information. I have provided links in the references below.


References