RVE - RISC-V Emulator

A deep dive into instruction decoding, CSRs, traps, ELF loading, Sv32 MMU, and Linux boot in a ~1000-line C++ RISC-V 32b emulator

Overview

RVE is a RISC-V emulator written in C++ that boots a Linux 6.1.14 rv32nommu kernel. This post walks through the full source - from how a 32-bit instruction word is parsed, through CSR shadowing and trap delegation, to the Sv32 MMU two-level page walk and Linux boot setup.

Despite being roughly 1000 lines of C++, the emulator is complete enough to run a real Linux kernel with a BusyBox userspace, handle supervisor/machine-mode privilege transitions, respond to timer and UART interrupts, and optionally render a framebuffer via SDL2.

Try It in Your Browser

The emulator has been compiled to WebAssembly, with WebGL for rendering, and runs entirely in-browser - no install required.

→ Live Demo


Part 1 - System Architecture

Before diving into any individual subsystem, it helps to understand how the pieces fit together.

Source Layout

The emulator is split across five source modules:

File                Responsibility
rv32.h / rv32.cpp   CPU state, memory access, CSRs, traps, CLINT, UART
emu.cpp             Instruction decode and dispatch (insSelect), instruction implementations
loader.cpp          ELF loading and Linux flat image loading
main.cpp            Entry point, headless vs. GUI mode
app.cpp             SDL2 + ImGui GUI shell

The CPU State

The CPU is modeled as a single RV32 class:

class RV32 {
  u32 clock;           // cycle counter
  u32 xreg[32];        // x0 – x31
  u32 pc;              // program counter
  u8  *mem;            // 128 MiB flat RAM
  csr_state csr;       // 4096 × u32
  clint_state clint;   // timer
  uart_state uart;     // serial
  bool reservation_en; // LR/SC
};

RAM is a single flat uint8_t heap. Address bit 31 distinguishes RAM from MMIO - any address with bit 31 set is RAM, everything below is a peripheral register. The physical offset into the 128 MiB buffer is simply addr & 0x7FFFFFFF.

Reset state: pc = 0x80000000, all xreg = 0, x11 = 0x1020 (DTB pointer for Linux), privilege = PRIV_MACHINE (3).

Physical Memory Map

0x00001020 – 0x00001FFF   DTB blob
0x02000000 – 0x0200BFFF   CLINT
0x10000000 – 0x10000007   UART 16550
0x80000000 – 0x87FFFFFF   128 MiB RAM

memGetByte first tests addr & 0x80000000. MMIO addresses fall into a switch table; all other accesses resolve directly to mem[addr & 0x7FFFFFFF]. No DMA, no PCI, no GPU - just CLINT and UART. That's all Linux needs for a console boot.
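
The decode rule is easy to model in isolation. The following is a minimal sketch, not the emulator's code: the names Region, classify and ramOffset are hypothetical, the constants come from the memory map above, and the RAM bounds check is an assumption (the post does not say whether memGetByte performs one).

```cpp
#include <cstdint>

// Hypothetical names; address ranges are copied from the memory map.
constexpr uint32_t kRamBit    = 0x80000000u;
constexpr uint32_t kRamSize   = 128u * 1024u * 1024u;
constexpr uint32_t kClintBase = 0x02000000u, kClintEnd = 0x0200BFFFu;
constexpr uint32_t kUartBase  = 0x10000000u, kUartEnd  = 0x10000007u;

enum class Region { Ram, Clint, Uart, Unmapped };

// Bit 31 selects RAM; everything below it is treated as a peripheral.
Region classify(uint32_t addr) {
  if (addr & kRamBit) {
    // Assumed bounds check: offsets past 128 MiB are reported as unmapped.
    return (addr & 0x7FFFFFFFu) < kRamSize ? Region::Ram : Region::Unmapped;
  }
  if (addr >= kClintBase && addr <= kClintEnd) return Region::Clint;
  if (addr >= kUartBase && addr <= kUartEnd) return Region::Uart;
  return Region::Unmapped;
}

// Offset into the flat RAM buffer for an address already known to be RAM.
uint32_t ramOffset(uint32_t addr) { return addr & 0x7FFFFFFFu; }
```

With this model, classify(0x10000005) lands in the UART range and ramOffset(0x80001000) yields 0x1000.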


Part 2 - Instruction Encoding

The 6 RISC-V Instruction Formats

All instructions in the base RV32 ISA are 32 bits wide and fall into one of six encoding formats. A key design principle: rs1 is always at [19:15] and rs2 at [24:20], so the register file can be read before the opcode is fully decoded.

Format   Key Fields                                         Used By
R        funct7 + rs2 + rs1 + fn3 + rd + opcode             add sub mul div and or xor sll sra
I        imm[11:0] + rs1 + fn3 + rd + opcode                addi lw jalr ecall csrrw
S        imm[11:5] + rs2 + rs1 + fn3 + imm[4:0] + opcode    sw sh sb
B        scattered imm bits + rs2 + rs1 + fn3 + opcode      beq bne blt bge bltu bgeu
U        imm[31:12] + rd + opcode                           lui auipc
J        scrambled imm[20:1] + rd + opcode                  jal

The S-type and B-type formats split the immediate across two fields to keep rs1/rs2 at their fixed positions - stores have no destination register, so the bits that would be rd are reused for the low immediate bits.

B-type branches reassemble bits from four separate locations: {imm[12], imm[11], imm[10:5], imm[4:1], 0}. J-type (JAL) is the most scrambled: {imm[20], imm[19:12], imm[11], imm[10:1], 0}. The scrambling is intentional - it maximizes bit overlap with other formats to reduce the critical path in hardware decoders.

Format Parsers - Bit Extraction

All six format parsers run at the top of insSelect() before any dispatch. The compiler eliminates dead computations for formats not used by the matched instruction:

FormatR parse_FormatR(u32 word) {
  FormatR ret;
  ret.rd  = (word >>  7) & 0x1f;
  ret.rs1 = (word >> 15) & 0x1f;
  ret.rs2 = (word >> 20) & 0x1f;
  ret.rs3 = (word >> 27) & 0x1f;
  return ret;
}

B-type immediates require sign-extension by checking the MSB and OR-ing in the sign mask:

ret.imm =
  (word & 0x80000000 ? 0xfffff000 : 0)
  | ((word << 4)  & 0x00000800)  // bit 11
  | ((word >> 20) & 0x000007e0)  // bits 10:5
  | ((word >> 7)  & 0x0000001e); // bits 4:1
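
The J-type immediate described earlier can be extracted with the same shift-and-mask pattern. This is a minimal sketch; decodeJImm is a hypothetical helper written for illustration and is not taken from the RVE source.

```cpp
#include <cstdint>

// Hypothetical helper: rebuild the sign-extended J-type (JAL) offset.
uint32_t decodeJImm(uint32_t word) {
  return ((word & 0x80000000u) ? 0xfff00000u : 0u)  // imm[20] plus sign extension
       | (word & 0x000ff000u)                        // imm[19:12], already in place
       | ((word >> 9) & 0x00000800u)                 // imm[11] from word bit 20
       | ((word >> 20) & 0x000007feu);               // imm[10:1] from word bits 30:21
}
```

For example, the encoding 0x0080006f (jal x0, 8) decodes to 8, and 0xffdff06f (jal x0, -4) decodes to 0xfffffffc.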

Part 3 - Instruction Decode and Dispatch

insSelect() - Multi-Stage Masked Dispatch

The 7-bit opcode alone isn't enough to fully identify a RISC-V instruction. Different instructions live at different opcode/funct3/funct7 combinations, so insSelect() uses seven progressive mask stages:

Mask         Bits Exposed        Examples
0x0000007f   opcode only         lui jal auipc
0x0000707f   opcode + funct3     addi lw beq csrrw
0xf800707f   AMO ops             amoswap amoadd
0xfc00707f   shift immediates    slli srli srai
0xfe00707f   R-type arithmetic   add sub and mul
0xfe007fff   sfence.vma          -
0xffffffff   exact match         ecall mret

Each stage is a switch over the masked instruction word. A match returns immediately; if no case fires, the next mask is applied and the next switch runs. CSR instructions get a pre-read before any dispatch - if the instruction looks like a CSR op ((ins_word & 0x73) == 0x73), the current CSR value is read into ins_FormatCSR.value so instructions receive the old value atomically.
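
The staged dispatch can be illustrated with a stripped-down decoder. The sketch below assumes only four of the seven masks and a handful of instructions, and it returns a mnemonic instead of executing anything; decodeSketch is a hypothetical name, not the real insSelect().

```cpp
#include <cstdint>
#include <string>

// Hypothetical, reduced model of the progressive masked dispatch.
// Each switch exposes more bits; the first matching stage returns.
std::string decodeSketch(uint32_t w) {
  switch (w & 0x0000007fu) {           // opcode only
    case 0x00000037: return "lui";
    case 0x00000017: return "auipc";
    case 0x0000006f: return "jal";
  }
  switch (w & 0x0000707fu) {           // opcode + funct3
    case 0x00000013: return "addi";
    case 0x00002003: return "lw";
    case 0x00000063: return "beq";
  }
  switch (w & 0xfe00707fu) {           // opcode + funct3 + funct7
    case 0x00000033: return "add";
    case 0x40000033: return "sub";
  }
  switch (w) {                         // exact match
    case 0x00000073: return "ecall";
    case 0x30200073: return "mret";
  }
  return "illegal";                    // no stage matched
}
```

Here 0x002081b3 (add x3, x1, x2) falls through the first two stages and matches in the third, while 0x00000073 only matches at the exact-match stage.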

The imp/run Macro DSL

Instruction implementations are defined with imp and registered with run:

// imp: define an instruction handler
#define imp(name, fmt_t, code) \
  void Emulator::emu_##name(u32 w, ins_ret *ret, fmt_t ins) { code }

// run: case label + dispatch + early return
#define run(name, opcode, insf)        \
  case opcode:                         \
    if (debugMode) ins_p(name)         \
    emu_##name(ins_word, &ret, insf);  \
    return ret;

Result helpers write into the ins_ret struct:

#define WR_RD(code)  { ret->write_reg = ins.rd; ret->write_val = AS_UNSIGNED(code); }
#define WR_PC(code)  { ret->pc_val = code; }
#define WR_CSR(code) { ret->csr_write = ins.csr; ret->csr_val = code; }

Implementations read like inline specifications:

imp(add, FormatR, {
  WR_RD(AS_SIGNED(cpu.xreg[ins.rs1]) + AS_SIGNED(cpu.xreg[ins.rs2]));
})
 
imp(beq, FormatB, {
  if (cpu.xreg[ins.rs1] == cpu.xreg[ins.rs2])
    WR_PC(cpu.pc + ins.imm);
})
 
imp(amoswap_w, FormatR, {  // rv32a atomic swap
  u32 tmp = cpu.memGetWord(cpu.xreg[ins.rs1]);
  cpu.memSetWord(cpu.xreg[ins.rs1], cpu.xreg[ins.rs2]);
  WR_RD(tmp)
})

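AS_SIGNED / AS_UNSIGNED reinterpret the same 32 bits through a pointer cast instead of a value conversion. Because int32_t and uint32_t are the signed and unsigned variants of one type, the cast does not violate strict aliasing.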

ins_ret - The Result Bus

Instructions don't modify CPU state directly. They populate an ins_ret struct; emulate() commits all side-effects after insSelect() returns:

typedef struct {
  u32 write_reg; // rd index
  u32 write_val; // rd value
  u32 pc_val;    // next PC
  u32 csr_write; // CSR addr
  u32 csr_val;   // CSR value
  Trap trap;     // exception
} ins_ret;

insReturnNoop() zeros the struct and sets pc_val = pc + 4. Handlers only populate fields they affect. write_reg == 0 is silently discarded at commit time - x0's hard-wired zero is enforced without special-casing inside individual implementations.
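
The commit step is small enough to sketch on its own. The types below (InsRetSketch, CpuSketch) and the functions noop and commit are hypothetical reductions for illustration; CSR writes and traps are left out, and the real emulate() may order things differently.

```cpp
#include <cstdint>

// Hypothetical reduced result struct: register write-back and next PC only.
struct InsRetSketch { uint32_t write_reg; uint32_t write_val; uint32_t pc_val; };
struct CpuSketch { uint32_t xreg[32] = {}; uint32_t pc = 0x80000000u; };

// Mirrors the described insReturnNoop(): nothing written, fall through to pc + 4.
InsRetSketch noop(const CpuSketch &cpu) { return {0, 0, cpu.pc + 4}; }

// Apply the side effects after the handler has returned.
void commit(CpuSketch &cpu, const InsRetSketch &ret) {
  if (ret.write_reg != 0)                       // writes to x0 are dropped here
    cpu.xreg[ret.write_reg & 31] = ret.write_val;
  cpu.pc = ret.pc_val;
}
```

Because the x0 check lives in commit, a handler can target rd = 0 freely and the register still reads as zero afterwards.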

After commit, handleIrqAndTrap(&ret) checks whether the instruction raised a trap and whether any interrupts are pending and enabled.


Part 4 - Privileged Architecture

Privilege Levels

Three privilege levels are implemented: Machine (3), Supervisor (1), User (0). The emulator boots in Machine mode.

#define PRIV_USER       0
#define PRIV_SUPERVISOR 1
#define PRIV_MACHINE    3
// stored in csr.privilege, changed by traps and xRET

mret / sret restore privilege from MSTATUS.MPP / SSTATUS.SPP and re-enable interrupts via MPIE → MIE. Delegation registers MIDELEG / MEDELEG control which privilege level handles each trap - typically, Linux runs with most traps delegated to S-mode so the kernel handles them without bouncing through M-mode firmware.

CSR Architecture

The CSR file is a flat 4096-entry u32 array. The address itself encodes privilege requirements:

  • Bits [9:8] - minimum privilege level required to access
  • Bits [11:10] == 0b11 - read-only; writes trap with IllegalInstruction

typedef struct {
  u32 data[4096];  // all CSRs by addr
  u32 privilege;   // current mode
} csr_state;
 
bool hasCsrAccessPrivilege(u32 addr) {
  u32 req = (addr >> 8) & 0x3;
  return req <= csr.privilege;
}
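
Both address-encoded rules can be combined into one check. This is a sketch with hypothetical helper names (csrMinPrivilege, csrIsReadOnly, csrAccessFaults); it only models the two bit fields listed above and says nothing about how RVE structures the real check.

```cpp
#include <cstdint>

// Bits [9:8]: lowest privilege level allowed to touch the CSR.
constexpr uint32_t csrMinPrivilege(uint32_t addr) { return (addr >> 8) & 0x3u; }

// Bits [11:10] == 0b11: the CSR is read-only.
constexpr bool csrIsReadOnly(uint32_t addr) { return ((addr >> 10) & 0x3u) == 0x3u; }

// True when the access should raise IllegalInstruction.
constexpr bool csrAccessFaults(uint32_t addr, uint32_t privilege, bool isWrite) {
  if (csrMinPrivilege(addr) > privilege) return true;  // not privileged enough
  return isWrite && csrIsReadOnly(addr);               // write to a read-only CSR
}
```

For instance, MSTATUS (0x300) needs privilege 3, SATP (0x180) needs privilege 1, and the cycle counter (0xC00) is readable from any mode but never writable.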

Key CSRs and their roles:

Address   Name      Purpose
0x300     MSTATUS   Global IE, MPP/SPP fields
0x303     MIDELEG   Delegate interrupts to S-mode
0x304     MIE       Machine interrupt enable
0x305     MTVEC     Trap handler address
0x341     MEPC      Return address after trap
0x342     MCAUSE    Trap cause code
0x344     MIP       Interrupt pending

Shadow Registers - SSTATUS ⊂ MSTATUS

SSTATUS, SIE, and SIP have no backing storage. They are computed on read as masked views of their M-mode counterparts:

u32 readCsrRaw(u32 addr) {
  switch (addr) {
  case CSR_SSTATUS:
    // SSTATUS is MSTATUS masked to S-visible bits only
    return csr.data[CSR_MSTATUS] & 0x000de162;
  case CSR_SIE:
    return csr.data[CSR_MIE] & 0x222;
  case CSR_SIP:
    return csr.data[CSR_MIP] & 0x222;
  case CSR_CYCLE: return clock;
  case CSR_TIME:  return clint.mtime_lo;
  default:
    return csr.data[addr & 0xfff];  // CSR addresses are 12 bits; data[] has 4096 entries
  }
}

The 0x000de162 mask exposes the S-visible fields of MSTATUS: MXR (19), SUM (18), XS (16:15), FS (14:13), SPP (8), UBE (6), SPIE (5), SIE (1). SD sits at bit 31 on RV32, outside this mask, so it is not reflected in SSTATUS here. M-mode fields (MPP, MPIE, MIE) are invisible to S-mode. The 0x222 mask (0b001000100010) exposes only the supervisor-visible interrupt bits within MIE/MIP: SEIP (9), STIP (5), SSIP (1).

Writes to SSTATUS/SIE/SIP merge back into the M-mode registers using the same masks - there is one source of truth.
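
The write path is described but not shown. A minimal sketch of such a merge, assuming a plain read-modify-write with the masks quoted above (mergeShadowWrite is a hypothetical name):

```cpp
#include <cstdint>

constexpr uint32_t kSstatusMask = 0x000de162u;  // mask quoted in the text
constexpr uint32_t kSieSipMask  = 0x222u;       // SEIP, STIP, SSIP

// Keep the M-only bits of the backing register, take the S-visible bits
// from the value written through the shadow CSR.
uint32_t mergeShadowWrite(uint32_t mReg, uint32_t sValue, uint32_t mask) {
  return (mReg & ~mask) | (sValue & mask);
}
```

Writing SIE = 1 through SSTATUS while MSTATUS holds MPP = 3, MPIE and MIE (0x1888) produces 0x188a: the M-mode fields survive and only bit 1 changes.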


Part 5 - Interrupts and Traps

Trap Handling

handleIrqAndTrap() runs after every instruction commit. It first checks for a synchronous trap from the instruction just executed, then scans MIP & MIE for pending interrupts. Synchronous traps take priority. IRQ scan order: MEIP → MSIP → MTIP → SEIP → SSIP → STIP - first match wins.

handleTrap() performs the full trap entry sequence:

  1. Determine target privilege - check MIDELEG/MEDELEG to see if the trap is delegated to S-mode
  2. Write trap registers - xEPC = pc, xCAUSE = type, xTVAL = bad address or instruction
  3. Jump to handler - read TVEC; if vectored mode (TVEC[1:0] != 0), jump to base + 4 × cause (the privileged spec applies this offset to interrupts only; synchronous exceptions enter at base)
  4. Update MSTATUS - MIE → MPIE, MIE = 0, current privilege → MPP, set new privilege

mret reverses step 4: MPIE → MIE, MPP → privilege.
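
Step 4 and its reversal come down to a few bit moves in MSTATUS. The sketch below covers the M-mode case only and uses hypothetical names (ModeState, enterMachineTrap, machineReturn). Setting MPIE to 1 and MPP to U on return follows the privileged spec; the post does not say whether RVE does the same.

```cpp
#include <cstdint>

constexpr uint32_t kMie = 1u << 3, kMpie = 1u << 7;
constexpr uint32_t kMppShift = 11, kMppMask = 3u << kMppShift;

struct ModeState { uint32_t mstatus; uint32_t privilege; };

// Trap entry into M-mode: MIE -> MPIE, MIE = 0, old privilege -> MPP.
ModeState enterMachineTrap(ModeState s) {
  uint32_t m = s.mstatus;
  m = (m & ~kMpie) | ((m & kMie) ? kMpie : 0u);
  m &= ~kMie;
  m = (m & ~kMppMask) | (s.privilege << kMppShift);
  return {m, 3u};
}

// mret: MPIE -> MIE, MPP -> privilege.
ModeState machineReturn(ModeState s) {
  uint32_t m = s.mstatus;
  uint32_t prev = (m & kMppMask) >> kMppShift;
  m = (m & ~kMie) | ((m & kMpie) ? kMie : 0u);
  m |= kMpie;      // spec behaviour (assumed here): MPIE is set to 1
  m &= ~kMppMask;  // spec behaviour (assumed here): MPP is reset to U
  return {m, prev};
}
```

Entering from S-mode with interrupts enabled leaves MSTATUS at 0x880 (MPIE set, MPP = 1); the matching mret restores S-mode with MIE set again.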

CLINT - Core-Local Interruptor

The CLINT provides a 64-bit memory-mapped timer and a software interrupt register:

0x02000000  MSIP        (software interrupt)
0x02004000  MTIMECMP lo
0x02004004  MTIMECMP hi
0x0200BFF8  MTIME lo
0x0200BFFC  MTIME hi

Timer interrupt flow: the OS writes MTIMECMP = MTIME + period. The emulator increments MTIME each cycle. When MTIME >= MTIMECMP, it sets MIP.MTIP = 1. On the next handleIrqAndTrap() call with MIE.MTIE set, a MachineTimerInterrupt trap fires. This is how Linux implements its scheduler tick.
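
The timer flow can be modelled with two 64-bit counters and one MIP bit. This is a sketch under stated assumptions: ClintSketch and clintTick are hypothetical names, the reset value of mtimecmp is made up, and clearing MTIP once the comparison no longer holds is an assumption the post does not confirm.

```cpp
#include <cstdint>

constexpr uint32_t kMipMtip = 1u << 7;  // machine timer interrupt pending

struct ClintSketch {
  uint64_t mtime = 0;
  uint64_t mtimecmp = UINT64_MAX;  // assumed reset value: never fires
};

// Advance the timer by one tick and return the updated MIP value.
uint32_t clintTick(ClintSketch &c, uint32_t mip) {
  c.mtime++;
  if (c.mtime >= c.mtimecmp)
    mip |= kMipMtip;   // deadline reached: raise MTIP
  else
    mip &= ~kMipMtip;  // assumed: MTIP drops once mtimecmp moves ahead again
  return mip;
}
```

With mtimecmp = 3, the third tick raises MTIP; rewriting mtimecmp to a later deadline, as the kernel's timer handler does, lowers it on the next tick.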

UART - Serial Interrupt Path

The UART 16550 is mapped at 0x10000000. Eight registers are packed into two u32 fields accessed via UART_GET1/2 shift macros. The IIR update rule:

  • RBR != 0 && IER.RXINT → IIR_RD_AVAILABLE (4)
  • THR == 0 && IER.THRE → IIR_THR_EMPTY (2)
  • Otherwise → IIR_NO_INTERRUPT (7)
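
The three rules above can be sketched as one function. The IIR values are the ones listed; the IER bit positions are the standard 16550 ones (bit 0 for received data, bit 1 for THR empty). The function name updateIir and the first-match ordering are assumptions.

```cpp
#include <cstdint>

constexpr uint8_t kIerRxInt = 0x01, kIerThre = 0x02;  // 16550 IER enable bits
constexpr uint8_t kIirThrEmpty = 2, kIirRdAvailable = 4, kIirNoInterrupt = 7;

// Evaluate the rules top to bottom; the first matching rule wins (assumed).
uint8_t updateIir(uint8_t rbr, uint8_t thr, uint8_t ier) {
  if (rbr != 0 && (ier & kIerRxInt)) return kIirRdAvailable;
  if (thr == 0 && (ier & kIerThre)) return kIirThrEmpty;
  return kIirNoInterrupt;
}
```

A received byte with both enables set reports IIR_RD_AVAILABLE; with nothing received and an empty THR the result is IIR_THR_EMPTY.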

When the UART has a pending interrupt, uart.interrupting = true. emu.cpp reads this flag and sets MIP.SEIP, which causes handleIrqAndTrap() to fire a SupervisorExternalInterrupt - routed through the standard IRQ delegation path.


Part 6 - Loading and Booting Linux

ELF Loading

loadElf() in loader.cpp reads the ELF32 header, iterates section headers, and copies SHT_PROGBITS sections into the emulated RAM buffer:

// Collect loadable sections
for (const auto &sh : sh_tbl) {
  if (sh.sh_type == SHT_PROGBITS) {
    ElfSection section{
      sh.sh_addr & 0x7FFFFFFF,  // strip bit 31 for physical offset
      sh.sh_offset,
      sh.sh_size
    };
    sections.push_back(section);
  }
}
 
// Copy sections into emulated RAM
for (auto &s : sections) {
  s.sData.resize(s.size);
  lseek(fd, s.offset, SEEK_SET);
  read(fd, s.sData.data(), s.size);
  std::copy(s.sData.begin(), s.sData.end(), data + s.addr_real);
}

ELF virtual addresses start at 0x80000000. Masking with 0x7FFFFFFF gives the physical offset into the 128 MiB flat buffer.

Linux Image Loading and Boot ABI

The Linux kernel image is not an ELF - it is a flat binary (Linux 6.1.14 rv32nommu + BusyBox initramfs, ~7–8 MiB). loadLinuxImage() reads it directly into data[0]:

// CPU reset state for Linux boot:
pc       = 0x80000000;  // → mem[0]
xreg[10] = 0x0000;      // a0 = hart ID (0)
xreg[11] = 0x1020;      // a1 = DTB pointer

The RISC-V Linux boot ABI requires exactly two things: a0 = hart ID and a1 = physical DTB address. The kernel reads the DTB at startup to discover the memory map, configure UART and CLINT drivers, and start the scheduler. No firmware layer (BBL/OpenSBI) is needed - the kernel boots directly in M-mode from instruction one.

Device Tree Blob

The DTB is exposed at physical address 0x1020 via a separate cpu.dtb pointer. It describes:

  • One hart at 100 MHz
  • 128 MiB RAM at 0x80000000
  • UART at 0x10000000
  • CLINT at 0x02000000

The DTB is the sole channel through which the kernel learns about the machine. No BIOS, no ACPI tables, no firmware calls - just a small binary blob at a known address.


Part 7 - The Sv32 MMU

The Sv32 MMU is the most complex single subsystem in the emulator. It is only activated by writing to SATP (0x180) with mode bit 31 set - for nommu Linux, it is never used.

SATP and Page Table Structure

mmuUpdate() is called on every csrw satp:

void mmuUpdate(u32 satp) {
  mmu.mode = (satp >> 31) & 1;
  // 0 = MMU_MODE_OFF (bare/physical)
  // 1 = MMU_MODE_SV32 (paged Sv32)
  mmu.ppn = satp & 0x3fffff;
  // root page dir PA = mmu.ppn × 4096
}

A 32-bit virtual address decomposes as: VPN[1] [31:22], VPN[0] [21:12], page offset [11:0].

PTE Format

Each page table entry (PTE) is 32 bits:

Field    Bits      Description
PPN[1]   [31:20]   Upper physical page number
PPN[0]   [19:10]   Lower physical page number
D        [7]       Dirty - page has been written
A        [6]       Accessed - page has been read or written
U        [4]       User page - accessible in U-mode
X        [3]       Executable
W        [2]       Writable
R        [1]       Readable
V        [0]       Valid

A PTE is a leaf if R == 1 || X == 1. A PTE is a pointer (non-leaf) if R == 0 && X == 0. V == 0 or (!R && W) → immediate page fault.

Two-Level Page Walk

mmuTranslate() is called on every fetch, load, and store when mmu.mode == MMU_MODE_SV32:

for (int level = 0; level < 2; level++) {
  u32 page_addr;
  if (level == 0) {
    // L0: root page directory, indexed by VPN[1]
    page_addr = mmu.ppn * 4096u
              + ((addr >> 22) & 0x3ff) * 4u;
  } else {
    // L1: L0 PTE's PPN, indexed by VPN[0]
    page_addr = (ppn0 | (ppn1 << 10)) * 4096u
              + ((addr >> 12) & 0x3ff) * 4u;
  }
  u32 pte = memGetWord(page_addr);
  ppn0 = (pte >> 10) & 0x3ff;
  ppn1 = (pte >> 20) & 0xfff;
 
  if (!V || (!R && W)) MMU_FAULT;   // invalid PTE
  if (R || X) break;                 // leaf found, stop walking
  else if (level == 1) MMU_FAULT;   // L1 non-leaf = fault
}
// Assemble physical address (super is set when the leaf was found at level 0)
u32 pa = addr & 0xfff;
pa |= super ? ((addr>>12)&0x3ff)<<12 : ppn0<<12;
pa |= ppn1 << 22;
return pa;

4 MiB superpages are supported: a leaf at level 0 uses VPN[0] as part of the offset. PPN[0] must be 0 for a valid superpage; otherwise a fault fires.
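
To make the walk testable outside the emulator, here is a self-contained sketch over a small byte vector. Sv32Sketch and its members are hypothetical, physical memory is assumed to start at address 0, faults are reported as an empty optional, and permission checks are left out.

```cpp
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical stand-alone model of the two-level Sv32 walk.
struct Sv32Sketch {
  std::vector<uint8_t> mem;  // physical memory, PA 0 upward
  uint32_t rootPpn = 0;      // SATP.PPN

  uint32_t readWord(uint32_t pa) const {
    if (static_cast<size_t>(pa) + 4 > mem.size()) return 0;  // reads as invalid PTE
    uint32_t v;
    std::memcpy(&v, &mem[pa], 4);
    return v;
  }
  void writeWord(uint32_t pa, uint32_t v) { std::memcpy(&mem[pa], &v, 4); }

  std::optional<uint32_t> translate(uint32_t va) const {
    uint32_t tablePa = rootPpn * 4096u;
    for (int level = 0; level < 2; level++) {
      uint32_t vpn = (level == 0) ? (va >> 22) & 0x3ff : (va >> 12) & 0x3ff;
      uint32_t pte = readWord(tablePa + vpn * 4u);
      bool v = pte & 1, r = pte & 2, w = pte & 4, x = pte & 8;
      if (!v || (!r && w)) return std::nullopt;      // invalid PTE
      uint32_t ppn0 = (pte >> 10) & 0x3ff;
      uint32_t ppn1 = (pte >> 20) & 0xfff;
      if (r || x) {                                  // leaf
        if (level == 0) {                            // 4 MiB superpage
          if (ppn0 != 0) return std::nullopt;        // misaligned superpage
          return (ppn1 << 22) | (va & 0x3fffff);
        }
        return (ppn1 << 22) | (ppn0 << 12) | (va & 0xfff);
      }
      if (level == 1) return std::nullopt;           // pointer at the last level
      tablePa = (ppn0 | (ppn1 << 10)) * 4096u;       // descend
    }
    return std::nullopt;
  }
};
```

With a root table at page 1 pointing to a second-level table at page 2, whose entry 1 maps physical page 5, the virtual address 0x00401234 translates to 0x5234.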

Permission Checks and Special Bits

After the walk, access permissions are gated on privilege level and access type:

  • MSTATUS.SUM allows S-mode to access U-mode pages (needed for kernel copy_to_user)
  • MSTATUS.MXR makes executable pages readable (simplifies kernel text mapping)
  • Hardware does not auto-set A or D bits - the OS must set them before a page is used, or accesses fault
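
The three rules above can be sketched as a single predicate on a leaf PTE. The names (Access, leafPermits) are hypothetical, only U-mode and S-mode are modelled, and the rule that S-mode never executes from U pages comes from the privileged spec rather than from the post.

```cpp
#include <cstdint>

enum class Access { Fetch, Load, Store };

constexpr uint32_t kPteR = 1u << 1, kPteW = 1u << 2, kPteX = 1u << 3,
                   kPteU = 1u << 4, kPteA = 1u << 6, kPteD = 1u << 7;

// privilege: 0 = U-mode, 1 = S-mode. Returns false when the access must fault.
bool leafPermits(uint32_t pte, Access acc, uint32_t privilege, bool sum, bool mxr) {
  bool userPage = pte & kPteU;
  if (privilege == 0 && !userPage) return false;            // U-mode needs U = 1
  if (privilege == 1 && userPage &&
      (acc == Access::Fetch || !sum)) return false;         // S-mode needs SUM, never fetches
  if (!(pte & kPteA)) return false;                         // A is not set automatically
  switch (acc) {
    case Access::Fetch: return pte & kPteX;
    case Access::Load:  return (pte & kPteR) || (mxr && (pte & kPteX));
    case Access::Store: return (pte & kPteW) && (pte & kPteD);  // D must be preset
  }
  return false;
}
```

An execute-only page (V, X, A set) is unreadable from S-mode until MXR is turned on, and a writable page with D clear faults on the first store.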

Page Faults

Three page fault causes propagate through the standard trap path:

Cause                  Code   Trigger
InstructionPageFault   12     Fetch translation failed
LoadPageFault          13     Data read translation failed
StorePageFault         15     Data write translation failed

All three set ret.trap.en and flow through handleIrqAndTrap(). If MEDELEG has the corresponding bit set, the kernel's page fault handler runs in S-mode.

nommu Fast Path

The rv32nommu Linux kernel never executes csrw satp, so mmu.mode stays MMU_MODE_OFF = 0 for the entire run:

// Fast exit for bare mode - always taken by nommu
if (mmu.mode == MMU_MODE_OFF)
  return addr;

The full Sv32 walk is compiled in but never reached during nommu Linux operation. The common case costs a single compare-and-branch per memory access.


Part 8 - Floating-Point Extensions (RV32F/D)

Register File and NaN Boxing

The FP register file is 32 × 64-bit entries, shared between F (single) and D (double) extensions:

u64 freg[32];  // reset: canonical qNaN-boxed

Single-precision values are NaN-boxed per the RISC-V spec §11.3 - the upper 32 bits are set to 0xFFFFFFFF on write, and validated on read:

// Write single-precision
cpu.freg[rd] = 0xFFFFFFFF00000000ULL | bits;
 
// Read single-precision - validate NaN-box
u64 v = cpu.freg[rs];
if ((v >> 32) != 0xFFFFFFFFu)
  return canonical_qNaN;  // upper half corrupted, return NaN
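
A complete round trip fits in two small functions. This is a sketch with hypothetical names (boxFloat, unboxFloat); it uses memcpy to move bits between float and uint32_t, and 0x7fc00000 as the canonical single-precision quiet NaN.

```cpp
#include <cstdint>
#include <cstring>

constexpr uint32_t kCanonicalNaN32 = 0x7fc00000u;  // canonical quiet NaN (single)

// Write path: place the 32-bit pattern under an all-ones upper half.
uint64_t boxFloat(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  return 0xFFFFFFFF00000000ull | bits;
}

// Read path: accept the low half only if the upper half is all ones.
float unboxFloat(uint64_t reg) {
  uint32_t bits = ((reg >> 32) == 0xFFFFFFFFu) ? static_cast<uint32_t>(reg)
                                               : kCanonicalNaN32;
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}
```

Boxing 1.0f gives 0xFFFFFFFF3F800000, and reading a register whose upper half is not all ones yields NaN regardless of the low 32 bits.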

Double-precision writes use the full 64-bit value with no boxing. FCSR (0x003) holds the 3-bit rounding mode frm (bits 7:5) and the 5-bit exception flags fflags (bits 4:0).

All rv32f and rv32d ISA tests pass. A hello_linux binary that calls printf with floats runs correctly - arithmetic that a soft-float build would route through helpers like __adddf3 and __floatsidf executes as native F/D extension instructions.


Part 9 - Linux Framebuffer Demo

The emulator exposes /dev/fb0 to the Linux guest via MMIO. A demo program in hello_linux/framebuff.c queries screen dimensions with ioctl(fd, FBIOGET_VSCREENINFO, &vinfo), renders a pattern into a pixel buffer, then writes it out with a single write(fd, g_buf, w * h * 4).

Ten render patterns are included:

Pattern                          Technique
SMPTE colour bars                Broadcast colour reference
HSV gradient / colour wheel      hsv2rgb() using fmodf / fabsf
Mandelbrot / Julia set           Fixed-point 4.12 arithmetic (int64 multiply)
Plasma                           Sine lookup table with interference
3-D wireframe cube               cosf / sinf rotation + float perspective divide
Sierpinski / rings / Lissajous   Procedural geometry

The graphics stack runs entirely as normal Linux userspace - no special emulator hooks. The floating-point patterns use native RV32F instructions executed by the emulator's F extension.


Part 10 - Building and Testing

Build Targets

make all      # builds rve (g++ -std=c++17 -O2, SDL2 + OpenGL)
make run      # GUI mode - SDL2 window + ImGui
make isas     # runs all 60 ISA compliance tests
make linuxn   # headless Linux boot (-n flag, raw terminal I/O)
make linux    # GUI Linux boot (-r flag, UART in ImGui console)

ISA Test Suite

The test suite covers 60 bare-metal ELF32 binaries organised by extension:

Group           Count   Tests
rv32ui-p-*      43      RV32I: add/sub/load/store/branch/jump/lui/auipc
rv32um-p-*      8       RV32M: mul/mulh/div/rem
rv32ua-p-*      9       RV32A: amoswap/amoadd/amoand/lrsc
rv32mi/si-p-*   -       Machine/Supervisor CSR tests

Each test runs in M-mode with no trap delegation. Pass: the test writes 0x55 to SYSCON at 0x11100000, which sets syscon_cmd = 0x5555 and halts the emulator cleanly. Fail: any illegal instruction trap or loop timeout before the SYSCON write. .dump disassembly files ship alongside each binary for cross-referencing a failing PC.


Summary - Three Key Insights

Format parsers and CSR pre-reads are unconditional. All six format structs are populated before any dispatch, and CSR pre-reads happen if the opcode looks like a CSR instruction. Dead computations are eliminated by the compiler. This keeps the dispatch code simple: flat masked switches with early returns.

Sv32 is ~20 lines. At most two memGetWord() calls plus permission gates. nommu Linux bypasses the entire thing in the first branch - one compare per access in the common case.

Linux boot needs exactly two things. A flat kernel image at mem[0] and the boot registers a0 = hart ID, a1 = DTB pointer. The DTB encodes the full machine description - memory, UART, CLINT. No firmware (OpenSBI/BBL) layer is needed; the kernel runs directly in M-mode from instruction one.