Writing a RISC-V emulator in AWK

A few months ago I saw a blogpost about implementing RISC-V with a GPU shader, so as to boot Linux inside of VRChat. Then one day the idea of a RISC-V emulator in AWK hit me, and when I got some free time between exams I decided to give it a go.

AWK is designed for processing text, like gathering statistics on web server log files, not systems programming, so it seemed like an amusing project to try and get working. This project was done solely with GNU AWK (gawk), primarily because it supports bitwise operations, which POSIX AWK lacks. Converting this work into purely POSIX AWK might well be possible.

It turns out much of implementing an emulator like this is rather boring rote-work: adding support for new instructions as they arise, and tracking down strange bugs in what you’ve already implemented. Rather than cover every issue I found and fixed throughout development, I’ll run through the key milestones, along with some of the most interesting or amusing bugs. The source code of my progress so far is on GitHub if you’re interested.

Before I get going I’d like to thank pi for their original work on rvc, which is much more impressive than this, and helped me realise I needed to investigate clint for interrupts (as well as being a good source to crib device trees from!). Also, despite (or maybe in accordance with) appearances, I’m no expert on any of these things; in particular anything kernel-related is based entirely on my reading of the source code, with a view to fixing whatever issue I had at the time, not understanding all of it. Corrections are welcome!

First Steps

Before we start thinking about executing any RISC-V machine code, we should probably make some machine code. The first piece of code the emulator successfully ran calculated the factorial of 5, a result that has always stuck in my head for some reason (120). Rather than set up cross-compilers locally, the code was compiled and disassembled with the online Compiler Explorer, assembled into hex by an online RISC-V assembler, and finally transformed into a binary file locally. Don’t worry, this convoluted process didn’t last long!

In order to actually load the machine code into the emulator’s memory, it was base64 encoded, so that AWK could process it as plain text without binary bytes like newlines splitting the input into odd records. The emulator is invoked on this base64 file, decodes it into the emulated memory, then runs the machine in AWK’s END block, so execution happens only after the whole file has been processed.
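
A minimal sketch of that structure, with illustrative names (run_machine stands in for the actual fetch/decode/execute loop):

    BEGIN {
        b64chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
        for (i = 0; i < 64; i++)
            b64val[substr(b64chars, i + 1, 1)] = i
        addr = 0                        # next load address in emulated memory
    }

    {   # each record is a line of base64: every 4 characters decode to 3 bytes
        for (i = 1; i + 3 <= length($0); i += 4) {
            bits = 0; n = 0
            for (j = 0; j < 4; j++) {
                c = substr($0, i + j, 1)
                if (c == "=") break     # padding: fewer than 3 bytes remain
                bits = or(lshift(bits, 6), b64val[c])
                n++
            }
            bits = lshift(bits, 6 * (4 - n))
            if (n >= 2) mem[addr++] = and(rshift(bits, 16), 255)
            if (n >= 3) mem[addr++] = and(rshift(bits, 8), 255)
            if (n == 4) mem[addr++] = and(bits, 255)
        }
    }

    END { run_machine() }               # only runs once the whole file is loaded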

Not many instructions needed to be implemented to get this to work, but a lot of groundwork needed to be laid: registers, memory, the skeleton of instruction decoding. A few of the RISC-V instruction fields are in the same position for every instruction, but there are several instruction formats to deal with, each encoding the immediate value (if any) differently.
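
For example, using gawk’s bitwise functions, the fixed fields and a couple of the immediate formats come out of a 32-bit instruction word like this (a sketch, not the emulator’s exact code):

    # fields that sit in the same place for (almost) every instruction
    opcode = and(insn, 0x7f)
    rd     = and(rshift(insn, 7), 0x1f)
    funct3 = and(rshift(insn, 12), 0x07)
    rs1    = and(rshift(insn, 15), 0x1f)
    rs2    = and(rshift(insn, 20), 0x1f)

    # I-type: the immediate is bits 31:20, sign-extended from 12 bits
    imm = rshift(insn, 20)
    if (and(imm, 0x800)) imm -= 0x1000

    # S-type: the same 12-bit immediate is scattered across bits 31:25 and 11:7
    imm = or(lshift(and(rshift(insn, 25), 0x7f), 5), and(rshift(insn, 7), 0x1f))
    if (and(imm, 0x800)) imm -= 0x1000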

After execution finished, the emulator dumped the value of x10 (the a0 register), which by convention holds the return value of a function. Sure enough, 5 factorial came out as 120. Always handy to check.

C Toolchain

I haven’t seen many production workflows that involve copying assembly from Compiler Explorer, so the next step was to set up a decent way of compiling C code for the emulator. I wrote a minimal crt.s that merely initialises the stack pointer, calls main, then does an EBREAK to stop execution. Then I stared at a linker script for far too long to try and get things at the right address in memory.
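
Mine ended up roughly this shape (the section and symbol names are illustrative; __stack_top in particular is a symbol I’d have the linker script provide):

        .section .text.start
        .globl _start
    _start:
        la   sp, __stack_top    # point the stack somewhere sensible
        call main
        ebreak                  # halts the emulator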

I’m still not 100% sure why the linker script even works—loading a raw ELF into memory is definitely not the usual way of starting a machine.

I transferred the factorial test program to a .c file locally and integrated it with the toolchain: a Makefile set up to compile it, link it with the crt, and base64 encode the result.

Planning for Linux

In order to celebrate the upcoming year of the Linux desktop, I decided my goal was to try and boot Linux on this emulator. According to the rvc blogpost, if I opted to emulate 64-bit RISC-V, I could use Linux without an MMU, which would save me from thinking about page tables.

We’d need serial output, so we could see kernel messages. I’m not sure serial input is even feasible in AWK, as there doesn’t seem to be any non-blocking IO. Interrupts are also required, both for scheduling and for system calls, so we’d need to implement those too.

The daunting task was implementing over 90 instructions correctly enough to boot Linux. In comparison to x86, though, I suppose I should count my blessings.

Serial and Bootloader

I decided to try writing a bootloader ready to transfer execution to a real Linux kernel. Writing a bootloader also gives an opportunity to get some kind of emulated serial working with something simpler than the full kernel.

The 8250 serial chip seemed simple and well-supported (including as an earlycon for Linux, which lets us get output before drivers are properly initialised). After reading through some example code I figured out which part of the data sheet actually described the registers I needed, and then it wasn’t too hard to write both a minimal driver for the bootloader and some code to emulate it: the emulated UART just always reports itself ready to transmit, and the entire machine pauses on every character write!
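
The emulation side is barely more code than that sentence. A sketch, with illustrative function names (the offsets are real 8250 ones: the transmit holding register at offset 0, the line status register at offset 5):

    # the UART lives at 0x1000000 (the address we hand to earlycon)
    function uart_read(addr) {
        if (addr == 0x1000000 + 5)      # line status register
            return 0x60                 # transmitter empty, holding register free
        return 0
    }

    function uart_write(addr, value) {
        if (addr == 0x1000000)          # transmit holding register
            printf "%c", value          # the whole machine pauses while this prints
    }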

All the bootloader does is print a small splash screen and jump to the Linux kernel, which is loaded at a higher address in RAM. To support this, the base64 format was extended with “seeking”, which moves the load address elsewhere, so the format becomes more like a sparse memory dump.
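
In the loader this is one extra rule ahead of the base64 decoding; the “@” syntax here is illustrative rather than the exact format:

    # a line like "@0x80000000" moves the load address; other lines stay base64
    /^@/ { addr = strtonum(substr($0, 2)); next }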

Linux originally spewed out corrupted text for its version line; this turned out to be an error in my implementation of the shift instructions, which caused memcpy to corrupt its data. But that wasn’t actually the next issue that needed resolving.

Floating Point Addresses

Before we got any output from Linux, there were issues with incorrect jump offsets, yet “emulating” the same machine code in Python produced the correct values. It seemed something was wrong with basic arithmetic within AWK.

I discovered gawk stores all numeric values as doubles, which can only represent integers exactly up to 2^53. This wasn’t an issue with any of my test programs, which resided at low addresses, but the kernel was loaded much higher in memory, and there just wasn’t enough precision to do the kind of offset sums the kernel needed accurately. Luckily, gawk has a --bignum option that forces it to use arbitrary-precision arithmetic for all numeric values. I’d have settled for just using 64-bit integers, but I couldn’t find an option for that.
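
The cliff edge is easy to demonstrate: past 2^53 a double’s 53-bit mantissa runs out, while -M (the short form of --bignum) keeps integer arithmetic exact:

    $ gawk 'BEGIN { x = 2^53; print (x + 1 == x) }'
    1
    $ gawk -M 'BEGIN { x = 2^53; print (x + 1 == x) }'
    0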

Memblock

We were seeing some promising early boot output from Linux, but then “memblock” failed to allocate any memory. I hadn’t heard of it before, but it appears to be a simple memory allocator used during the kernel’s early boot.

We weren’t passing a device tree to the kernel, so it had no idea where RAM was or how much of it there was. However, even after adding the device tree and the required memory nodes, memblock was still failing to allocate. Tracing the kernel’s execution (by calling printf in the emulator whenever the PC reached certain points) revealed that memblock_add (the function that adds a memory range to memblock’s set of memory) was never called, despite the device-tree scanning code running.
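
For reference, the memory node is only a few lines of device tree source. Something like this (assuming two-cell addresses and sizes, and matching the 16 MiB at 0x80000000 that shows up in the boot log later):

    memory@80000000 {
        device_type = "memory";
        reg = <0x0 0x80000000 0x0 0x1000000>;   /* 16 MiB of RAM */
    };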

This took a while to track down, and in the end it wasn’t even a bug in the CPU itself, but in the base64 loader. The regex matching base64 lines for loading didn’t include ‘=’, the base64 padding character. This meant the last line of base64 for the bootloader, the device tree, and the kernel wasn’t loaded into RAM. Nothing had needed the last few bytes of the kernel or bootloader, but crucial device-tree information was in the omitted portion, so the kernel never got to hear about the RAM.
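
The offending pattern looked something like this (reconstructed; the shape is what matters):

    # before: the padding on the final line stops it matching at all
    /^[A-Za-z0-9+\/]+$/  { decode_line($0) }

    # after: adding '=' to the class means padded lines load too
    /^[A-Za-z0-9+\/=]+$/ { decode_line($0) }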

With this bug fixed (a one-character change), the kernel would print many more debug messages, but then hang. Tracing execution again as before, I found it had finished the first part of its initialisation and was now expecting to do the rest on a kernel thread. Threads mean a scheduler, and a scheduler means interrupts, so that’s our next step.

Interrupts

For interrupts we need to turn our attention to the RISC-V privileged specification, which defines a number of CSRs (Control and Status Registers). The core of an interrupt is of course that the program counter gets set somewhere else, but when that happens we need to fill in some of these CSRs to give the interrupt handler context about where we were executing before, and what interrupt is firing.
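
Taking an interrupt then looks something like this in the emulator. A massively simplified sketch (the real thing also has to juggle the interrupt-enable bits in mstatus), with csr and pc being the emulator’s state:

    # machine-mode trap entry, simplified
    function take_interrupt(cause) {
        csr["mepc"]   = pc                         # where MRET should return to
        csr["mcause"] = or(lshift(1, 63), cause)   # top bit set marks an interrupt
                                                   # (exact thanks to --bignum)
        pc = csr["mtvec"]                          # the trap handler's address
    }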

The Linux scheduler wants to make use of a timer interrupt, which interrupts the system at some point in the future. The RISC-V specification defines the memory-mapped registers mtime and mtimecmp, the former incrementing at a set rate, and the latter setting the point at which we want the timer interrupt to fire. Whenever mtime is greater than or equal to mtimecmp, a timer interrupt is considered pending. By telling Linux the frequency of mtime, it can use this to schedule future events. I incremented mtime every instruction, and picked a frequency at random: we’re not aiming for any kind of precise timekeeping here.
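
In the main loop that amounts to a couple of lines; timer_enabled here stands in for the real check of the MTIE/MIE enable bits:

    # after every executed instruction
    mtime++
    if (mtime >= mtimecmp && timer_enabled)
        take_interrupt(7)               # mcause 7 is the machine timer interrupt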

The RISC-V specification isn’t specific about where mtime and mtimecmp live, just that they exist. For that, we turn to the CLINT (Core-Local Interruptor), as implemented by SiFive on their RISC-V boards. Linux already has a driver for it, so we emulate the CLINT’s memory-mapped registers at some address, tell Linux the base address in the device tree, and we’re good to go!
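 
SiFive’s layout puts mtimecmp at base+0x4000 and mtime at base+0xBFF8, and the device-tree node is small. Roughly this, where 3 and 7 are the machine software and machine timer interrupt numbers, and cpu0_intc is an assumed label for the CPU’s interrupt controller node:

    clint@2000000 {
        compatible = "sifive,clint0";
        reg = <0x0 0x2000000 0x0 0x10000>;
        interrupts-extended = <&cpu0_intc 3>, <&cpu0_intc 7>;
    };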

The first version of interrupts caused unhandled instructions to start cropping up in the Linux interrupt handler. By setting memory breakpoints I discovered the IRQ stack was growing so big that it reached the handler’s code section and destroyed the machine code. This took a while to figure out: the clue was that mscratch (a CSR intended as scratch space for interrupt handlers) was coming back from interrupts holding what looked like an address. Comments in the kernel’s interrupt handler indicated that mscratch was supposed to be 0 if execution was previously in the kernel, but it was never cleared after a return from an interrupt. Future interrupts saw a non-zero mscratch and assumed the interrupt came from user space. I never fully resolved why this led to stack bounds issues.

But why was mscratch coming back from an interrupt as non-zero? I can see the part of the interrupt handler that zeroes it! It took me too long to read the code around where mscratch is zeroed and discover that it only happens if the privilege level from before the interrupt wasn’t zero (meaning it was supervisor or machine mode). I hadn’t implemented privilege modes, but leaving the previous privilege mode in mstatus as zero made part of the kernel’s interrupt handler assume the interrupt came from user space, and that it should leave mscratch as it found it. However, because the start of the interrupt handler uses a different condition to determine whether the interrupt came from user space (whether mscratch is zero), mscratch had already been trodden on by the handler.
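
So the fix is to record the previous privilege level on every trap entry. In the emulator that’s one more line in the trap-entry sketch above, setting the two-bit MPP field (bits 12:11 of mstatus):

    # we only have machine mode, so MPP is always 3 (binary 11)
    csr["mstatus"] = or(csr["mstatus"], lshift(3, 11))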

By setting two bits in mstatus to indicate we came from machine mode, the interrupts worked perfectly. After some minor fixes to a few atomic instructions that weren’t atomic enough to stop the kernel getting stuck in semaphores, we are rewarded with our first kernel panic that doesn’t seem to be due to emulation issues:

Kernel Log
Linux version 5.18.0 () (riscv64-unknown-elf-gcc () 9.3.0, GNU ld () 2.34) #11 Wed Jun 15 09:29:32 BST 2022
Machine model: rawk
earlycon: uart8250 at MMIO 0x0000000001000000 (options '115200n8')
printk: bootconsole [uart8250] enabled
printk: debug: skip boot console de-registration.
Zone ranges:
  DMA32    [mem 0x0000000080000000-0x0000000080ffffff]
  Normal   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000080000000-0x0000000080ffffff]
Initmem setup node 0 [mem 0x0000000080000000-0x0000000080ffffff]
riscv: base ISA extensions aim
riscv: ELF capabilities aim
pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
pcpu-alloc: [0] 0
Built 1 zonelists, mobility grouping off.  Total pages: 4040
Kernel command line: loglevel=8 earlycon=uart8250,mmio,0x1000000,115200n8 keep_bootcon
Dentry cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
Inode-cache hash table entries: 1024 (order: 1, 8192 bytes, linear)
Sorting __ex_table...
mem auto-init: stack:off, heap alloc:off, heap free:off
Memory: 13052K/16384K available (1756K kernel code, 583K rwdata, 352K rodata, 131K init, 204K bss, 3332K reserved, 0K cma-reserved)
SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
riscv-intc: 64 local interrupts mapped
clint: clint@2000000: timer running at 1048576 Hz
clocksource: clint_clocksource: mask: 0xffffffffffffffff max_cycles: 0x1ef4687b1, max_idle_ns: 3526361616864 ns
sched_clock: 64 bits at 1048kHz, resolution 953ns, wraps every 2199023255348ns
Console: colour dummy device 80x25
printk: console [tty0] enabled
Calibrating delay loop (skipped), value calculated using timer frequency.. 2.09 BogoMIPS (lpj=4194)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
futex hash table entries: 256 (order: 0, 6144 bytes, linear)
clocksource: Switched to clocksource clint_clocksource
workingset: timestamp_bits=62 max_order=12 bucket_order=0
Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
List of all partitions:
No filesystem could mount root, tried:

Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
CPU: 0 PID: 1 Comm: swapper Not tainted 5.18.0 #11
Hardware name: rawk (DT)
Call Trace:
[<00000000800032f0>] walk_stackframe+0x0/0xc8
[<00000000801ae300>] panic+0x118/0x2e8
[<00000000801b9eb4>] mount_block_root+0x22c/0x230
[<00000000801ba1e0>] prepare_namespace+0x158/0x198
[<00000000801b2f40>] rest_init+0xa4/0xa8
[<00000000801b2f60>] kernel_init+0x1c/0x118
[<00000000801b2f40>] rest_init+0xa4/0xa8
[<0000000080001ff0>] ret_from_syscall_rejected+0x8/0xc
---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0) ]---

Performance

The next step would be preparing an initramfs, but getting to this kernel panic takes 12 minutes on my machine, and I’m not particularly thrilled by the prospect of debugging issues that only crop up after that long. I expected AWK to be slow, but I’m sure I can eke some more performance out of it somehow.

I tried passing --profile to gawk to get an idea of where time was being spent. Unfortunately, the emulator would fail straight away with an invalid instruction when profiling was enabled. To get it to boot, I had to access the emulated memory at the load address before starting execution. Even indexing into it without storing the value in a variable worked, which shouldn’t have any effect on program semantics, and seems to point to a gawk issue. When I got profiling working, I discovered it made the emulator reach the panic in only 5 minutes, more than 50% faster! I tried --no-optimize, which --profile also implies, to test whether this was an edge case where optimisations made things slower. It still took 12 minutes, so now I’m really confused.

I’m not running the latest version of gawk, so I’ll probably try that next, and if there are still strange performance disparities between profiling and not profiling, I guess I’ll grit my teeth and dig into the gawk source!