Rust on the MOS 6502: Beyond Fibonacci (18 September 2021)

Rust on the MOS 6502: Beyond Fibonacci

18 September 2021 (programming rust llvm c64 retro chip-8)

On August 15th, the following answer showed up on the Retrocomputing Stack Exchange site:

llvm-mos compiles a considerable subset of C++ to 6502 machine code out of the box. IIRC it's just missing "runtime support"; things like RTTI, exceptions, and the runtime layout of VTables. [...]

I came across this answer about a week later, so just in time for ICFP 2021 to start. I also had my Haskell Love talk coming up, and I really wanted to get the RetroClash page in shape by then, so that I can direct people there at the end of my talk.

But still! LLVM, and consequently, Rust, on a C64! Damn if that isn't the perfect nerd-sniping for me. So I looked around to find this 6502.org forum post that shows compiling a Rust implementation of the Fibonacci function and linking it with a small C main. I've reproduced their code here in a Git repo for easier fiddling.

So after ICFP, although other things kept piling up, including a GM-ing gig playing Trail of Cthulhu's excellent The Dance in the Blood and coordinating with the awesome David Thinnes on getting Arrow DECA support into clash-pong in time for Haskell Love, I carved out a weekend to dust off my Rust-on-AVR CHIP-8 implementation to try to get it running on the C64.

As a refresher, the idea behind the AVR CHIP-8 stuff was to write a no_std Rust crate that implements the CHIP-8 VM spec in an idiomatic way without worrying about the fact that we'll use a completely unstable and underdeveloped LLVM backend, and then worry about the AVR- and peripheral-specific concerns in separate code. In particular, chirp8-engine uses algebraic datatypes as the result of the instruction decoder, and relies on the Rust compiler to implement case-of-case optimization between the decoder and the interpreter. This architecture allowed using a "normal" SDL-based frontend for debugging the CHIP-8 implementation itself. The AVR LLVM experience itself was quite a rough one: most of my output was in the form of LLVM bug reports over the course of several months.

Now, standing in front of the MOS 6502 LLVM backend four years later, I knew I wouldn't be able to sink this much time into it. I just wanted to see how far I can get before hitting the inevitable wall of miscompilation or unimplemented features. I did a bunch of cleanup on chirp8-engine and dived head first into LLVM-MOS.

Feasibility study

Even though I grew up on the Commodore 64, I never really learned the details of accessing its hardware from low-level machine code. I used Commodore BASIC, then a BASIC extender called Graphics BASIC, before moving on to x86; and even there, I was always more interested in going "upwards" in abstraction. As a first step, I wanted to see what it would even look like to draw the CHIP-8 screen on the C64.

The CHIP-8 uses 1-bit graphics at 64⨯32 resolution. That is miniscule compared to the C64's 320⨯200 graphical resolution, but it compares nicely with its 40⨯25 text mode. So the idea is to use each quarter character as a CHIP-8 pixel; in other words, we will fit 2⨯2 CHIP-8 pixels onto each C64 character. This uses 32⨯16 characters, so still not quite full screen (there's 4 characters worth of horizontal and vertical border), but it doesn't look too bad and is very easy to implement by just creating a character set of all 16 possible 2⨯2 tiles.

So the first program validated this idea (and also kicked the tires on LLVM-MOS) by taking a CHIP-8 video memory dump from the startup screen of David Winter's game Hidden, and writing some C code that stores it in an array of 32 uint64_ts, generates the right character set, and draws it on the screen. I was worried about the handling of 64-bit variables by LLVM-MOS, but it compiled everything to 8-bit instructions like a champ, getting me to the following screen:

The Hello World of CHIP-8 emulation

Very promising! I switched the video bank, but didn't clear the new memory area, so there's garbage all over the unused area; moreover, there's an off-by-one bug in the character set generation, which is why the "b" of "by" looks wonky; but these are all problems with my code, so I can just fix them! There's stuff on screen on first try, compiled with Clang!

Integrating `chirp8-engine`

This is where the fun started: taking chirp8-engine and writing a C64 frontend for it, all in Rust. This worked remarkably well. I don't know what is the relevant difference between LLVM-MOS and LLVM-AVR, but doing anything with the latter was an uphill battle back in the days, and now, when I wanted to see where it is today, I ended up having to go back a full years worth of Rust releases to get anything that can even compile CHIRP-8; and it still gets miscompiled somewhere, resulting in strange bugs after the game's been running for a while on an AVR simulator. I plan to look into this some day, to figure out what's going on, but I didn't want to let anything detract me from the C64 project.

So anyway, the C64 frontend just writes the right 2⨯2 tile characters to the right memory address, uses the Kernal's keyboard buffer at 0x00c5 for keyboard input (using the 4⨯4 rectangle spanned by the 1 to V keys), and that's pretty much it. The chirp8-engine crate takes care of everything else.

After centering the virtual screen, and using a nicer color scheme, the moment of truth came: I hardcoded Hidden's program into an array, compiled the whole shebang with LLVM-MOS (again, it worked on first try), and booted it up:

CHIRP8-C64 in all its glory

Yep, that's Hidden running all right!

I can't stress enough how smooth that whole process was: zero to playable in one weekend.

Finishing touches

I really wanted to add the ability to load games from disk instead of hardcoding a single game. The challenge here, again, was me not knowing how to do, well, pretty much anything involving IO on the C64 beyond the BASIC LOAD command. Let's try loading the directory first!

The most useful resource I've found for this has been this Codebase 64 page. I'm sure a lot of people would think going to Codebase 64 is overkill for this, but what I've found is that they have very good straightforward implementations before getting into the hyper-optimized demoscene stuff. So I took that directory lister, and ported it first to C, then from C to Rust, changing it in the process to give programmatic access to the directory contents instead of just printing them.

Going from assembly to C was straightforward because I am using Clang from the LLVM-MOS project, so I can just add inline assembly for Kernal calls wherever needed. For example, we can load a filename into the Kernal's IO routines with the following C wrapper:

void k_setnam(const char *fname)
{
    uint16_t ptr = (uint16_t)(fname);

    __attribute__((leaf)) asm volatile(
        "jsr $ffbd"
        :
        : "a" ((uint8_t)(strlen(fname)))
        , "x" ((uint8_t)(ptr >> 0))
        , "y" (((uint8_t)(ptr >> 8)))
        :
        );
}

Unfortunately, we can't port this straight to Rust without going through the pain of building an LLVM-MOS-based Rust compiler. Instead, I decided to use normal Rust nightlies, compile to LLVM IR, and pass that to LLVM-MOS's Clang. But this means, as far as Rust is concerned, it is compiling to x86; and so the inline assembly can't address 6502 registers like X above. But that's OK: we can leave k_setnam and similar functions in C, and use Rust's FFI mechanism to import them.

Using these Kernal functions, I implemented a very basic file selector UI which then calls the Kernal's LOAD routine to load the CHIP-8 memory image into a reserved 4 kB area. That memory area is then passed to Rust as a pointer, and we make a slice from it on the other side:

/* main.c */
uint8_t mem[4 * 1024 - 0x200];
select_and_load_file(mem);
set_frame_irq(&irq);
run(mem, scr);

// machine.rs
#[no_mangle]
pub extern "C" fn run (mem: *mut u8, scr: *mut u8) {
    let mut c64 = C64{
        mem: unsafe { core::slice::from_raw_parts_mut(mem, RAMSIZE) },
        scr: scr,
        vmem: [0;32]
    };

    let mut cpu = CPU::new();
    loop {
        cpu.step(&mut c64);
    }
}

Why not do the same for scr? Interestingly, I have observed that going through a slice for screen manipulation is massively slower than using a raw pointer. I haven't looked into this, and it could just be that I am doing something wrong; for reference, this is the faster (but still quite slow...) version:

fn draw_row (scr: *mut u8, rows: &[u8], y: ScreenY) {
    let mut ptr = unsafe{ scr.offset((4 + (y / 2) as isize) * STRIDE + 4) };

    for i in 0..8 {
        let mut row1 = rows[(y as usize + 0) * 8 + (7 - i)];
        let mut row2 = rows[(y as usize + 1) * 8 + (7 - i)];

        for _ in 0..4 {
            // Take top two bits of row1 and row2
            let ch = (row1 >> 6) | ((row2 >> 6) << 2);
            row1 <<= 2;
            row2 <<= 2;
            unsafe { *ptr = ch; };
            ptr = unsafe{ ptr.offset(1) };
        }
    }
}

fn draw_row (scr: &mut [u8], rows: &[u8], y: ScreenY) {
    let mut ptr = (4 + (y / 2) as usize) * STRIDE + 4;

    for i in 0..8 {
        let mut row1 = rows[(y as usize + 0) * 8 + (7 - i)];
        let mut row2 = rows[(y as usize + 1) * 8 + (7 - i)];

        for _ in 0..4 {
            // Take top two bits of row1 and row2
            let ch = (row1 >> 6) | ((row2 >> 6) << 2);
            row1 <<= 2;
            row2 <<= 2;
            scr[ptr] = ch;
            ptr += 1
        }
    }
}

Evaluation

It is worth pointing out that the amazing thing about chirp8-c64 is not how well it works, but that it works at all. It is quite slow; perfectly playable for turn-based games like Hidden, but practically unplayable for real-time games. I suspect most of the slowdown is in updating the screen. If that's the case, maybe it could be improved dramatically by writing just a bit of hand-rolled assembly.

As a concrete example, let me show you clear_screen, first in Rust, and then the assembly output from LLVM-MOS. I'll be the first to admit that the Rust version isn't written the most efficient way (for example, we update the non-border parts of color_arr twice); but the hope, after all, is to be able to write nicely readable code like that and let the Rust compiler and the LLVM-MOS backend optimize it away.

#[no_mangle]
pub extern "C" fn clear_screen (scr: *mut u8) {
    let arr = unsafe { core::slice::from_raw_parts_mut(scr, STRIDE * 25) };

    for i in 0 .. STRIDE * 25 {
        arr[i] = 0x0f;
    }

    let color_arr = unsafe { core::slice::from_raw_parts_mut(0xd800 as *mut u8, STRIDE * 25) };
    for i in 0 .. STRIDE * 25 {
        color_arr[i] = 0x0b;
    }
    for y in 4 .. 4 + 16 {
        for x in 4 .. 4 + 32 {
            color_arr[y * STRIDE + x] = 0x07;
        }
    }
}

The generated assembly is really something else for what should mostly be just filling two continuous areas of 1000 bytes each:

clear_screen:
	lda	mos8(__rc6)
	pha
	lda	mos8(__rc7)
	pha
	lda	mos8(__rc8)
	pha
	lda	mos8(__rc9)
	pha
	lda	mos8(__rc10)
	sta	__clear_screen_sstk
	lda	mos8(__rc11)
	sta	__clear_screen_sstk+1
	lda	mos8(__rc12)
	sta	__clear_screen_sstk+2
	lda	mos8(__rc13)
	sta	__clear_screen_sstk+3
	lda	mos8(__rc14)
	sta	__clear_screen_sstk+4
	lda	mos8(__rc15)
	sta	__clear_screen_sstk+5
	lda	mos8(__rc16)
	sta	__clear_screen_sstk+6
	lda	mos8(__rc17)
	sta	__clear_screen_sstk+7
	lda	mos8(__rc18)
	sta	__clear_screen_sstk+8
	lda	mos8(__rc19)
	sta	__clear_screen_sstk+9
	lda	mos8(__rc20)
	sta	__clear_screen_sstk+10
	lda	mos8(__rc21)
	sta	__clear_screen_sstk+11
	lda	mos8(__rc22)
	sta	__clear_screen_sstk+12
	lda	mos8(__rc23)
	sta	__clear_screen_sstk+13
	lda	mos8(__rc24)
	sta	__clear_screen_sstk+14
	lda	mos8(__rc25)
	sta	__clear_screen_sstk+15
	lda	mos8(__rc26)
	sta	__clear_screen_sstk+16
	lda	mos8(__rc27)
	sta	__clear_screen_sstk+17
	lda	mos8(__rc28)
	sta	__clear_screen_sstk+18
	lda	mos8(__rc29)
	sta	__clear_screen_sstk+19
	lda	mos8(__rc30)
	sta	__clear_screen_sstk+20
	lda	mos8(__rc31)
	sta	__clear_screen_sstk+21
	lda	mos8(__rc32)
	sta	__clear_screen_sstk+22
	lda	mos8(__rc33)
	sta	__clear_screen_sstk+23
	lda	mos8(__rc34)
	sta	__clear_screen_sstk+24
	lda	mos8(__rc35)
	sta	__clear_screen_sstk+25
	lda	mos8(__rc36)
	sta	__clear_screen_sstk+26
	lda	mos8(__rc37)
	sta	__clear_screen_sstk+27
	lda	mos8(__rc38)
	sta	__clear_screen_sstk+28
	lda	mos8(__rc39)
	sta	__clear_screen_sstk+29
	ldx	#0
	lda	#-60
	stx	mos8(__rc2)
	sta	mos8(__rc3)
	lda	#-40
	stx	mos8(__rc6)
	sta	mos8(__rc7)
	ldx	#-92
	stx	mos8(__rc8)
	sta	mos8(__rc9)
	ldx	#-52
	stx	mos8(__rc10)
	sta	mos8(__rc11)
	ldx	#-12
	stx	mos8(__rc12)
	sta	mos8(__rc13)
	ldx	#28
	lda	#-39
	stx	mos8(__rc14)
	sta	mos8(__rc15)
	ldx	#68
	stx	mos8(__rc16)
	sta	mos8(__rc17)
	ldx	#108
	stx	mos8(__rc18)
	sta	mos8(__rc19)
	ldx	#-108
	stx	mos8(__rc20)
	sta	mos8(__rc21)
	ldx	#-68
	stx	mos8(__rc22)
	sta	mos8(__rc23)
	ldx	#-28
	stx	mos8(__rc24)
	sta	mos8(__rc25)
	ldx	#12
	lda	#-38
	stx	mos8(__rc26)
	sta	mos8(__rc27)
	ldx	#52
	stx	mos8(__rc28)
	sta	mos8(__rc29)
	ldx	#92
	stx	mos8(__rc30)
	sta	mos8(__rc31)
	ldx	#-124
	stx	mos8(__rc32)
	sta	mos8(__rc33)
	ldx	#-84
	stx	mos8(__rc34)
	sta	mos8(__rc35)
	ldx	#-44
	stx	mos8(__rc36)
	sta	mos8(__rc37)
	ldx	#-4
	stx	mos8(__rc38)
	sta	mos8(__rc39)
	lda	#3
	sta	mos8(__rc4)
	ldx	#-24
	lda	#15
	jsr	__memset
	lda	mos8(__rc6)
	sta	mos8(__rc2)
	lda	mos8(__rc7)
	sta	mos8(__rc3)
	lda	#3
	sta	mos8(__rc4)
	ldx	#-24
	lda	#11
	jsr	__memset
	lda	mos8(__rc8)
	sta	mos8(__rc2)
	lda	mos8(__rc9)
	sta	mos8(__rc3)
	lda	#0
	sta	mos8(__rc4)
	ldx	#32
	lda	#7
	jsr	__memset
	lda	mos8(__rc10)
	sta	mos8(__rc2)
	lda	mos8(__rc11)
	sta	mos8(__rc3)
	lda	#0
	sta	mos8(__rc4)
	ldx	#32
	lda	#7
	jsr	__memset
	lda	mos8(__rc12)
	sta	mos8(__rc2)
	lda	mos8(__rc13)
	sta	mos8(__rc3)
	lda	#0
	sta	mos8(__rc4)
	ldx	#32
	lda	#7
	jsr	__memset
	lda	mos8(__rc14)
	sta	mos8(__rc2)
	lda	mos8(__rc15)
	sta	mos8(__rc3)
	lda	#0
	sta	mos8(__rc4)
	ldx	#32
	lda	#7
	jsr	__memset
	lda	mos8(__rc16)
	sta	mos8(__rc2)
	lda	mos8(__rc17)
	sta	mos8(__rc3)
	lda	#0
	sta	mos8(__rc4)
	ldx	#32
	lda	#7
	jsr	__memset
	lda	mos8(__rc18)
	sta	mos8(__rc2)
	lda	mos8(__rc19)
	sta	mos8(__rc3)
	lda	#0
	sta	mos8(__rc4)
	ldx	#32
	lda	#7
	jsr	__memset
	lda	mos8(__rc20)
	sta	mos8(__rc2)
	lda	mos8(__rc21)
	sta	mos8(__rc3)
	lda	#0
	sta	mos8(__rc4)
	ldx	#32
	lda	#7
	jsr	__memset
	lda	mos8(__rc22)
	sta	mos8(__rc2)
	lda	mos8(__rc23)
	sta	mos8(__rc3)
	lda	#0
	sta	mos8(__rc4)
	ldx	#32
	lda	#7
	jsr	__memset
	lda	mos8(__rc24)
	sta	mos8(__rc2)
	lda	mos8(__rc25)
	sta	mos8(__rc3)
	lda	#0
	sta	mos8(__rc4)
	ldx	#32
	lda	#7
	jsr	__memset
	lda	mos8(__rc26)
	sta	mos8(__rc2)
	lda	mos8(__rc27)
	sta	mos8(__rc3)
	lda	#0
	sta	mos8(__rc4)
	ldx	#32
	lda	#7
	jsr	__memset
	lda	mos8(__rc28)
	sta	mos8(__rc2)
	lda	mos8(__rc29)
	sta	mos8(__rc3)
	lda	#0
	sta	mos8(__rc4)
	ldx	#32
	lda	#7
	jsr	__memset
	lda	mos8(__rc30)
	sta	mos8(__rc2)
	lda	mos8(__rc31)
	sta	mos8(__rc3)
	lda	#0
	sta	mos8(__rc4)
	ldx	#32
	lda	#7
	jsr	__memset
	lda	mos8(__rc32)
	sta	mos8(__rc2)
	lda	mos8(__rc33)
	sta	mos8(__rc3)
	lda	#0
	sta	mos8(__rc4)
	ldx	#32
	lda	#7
	jsr	__memset
	lda	mos8(__rc34)
	sta	mos8(__rc2)
	lda	mos8(__rc35)
	sta	mos8(__rc3)
	lda	#0
	sta	mos8(__rc4)
	ldx	#32
	lda	#7
	jsr	__memset
	lda	mos8(__rc36)
	sta	mos8(__rc2)
	lda	mos8(__rc37)
	sta	mos8(__rc3)
	lda	#0
	sta	mos8(__rc4)
	ldx	#32
	lda	#7
	jsr	__memset
	lda	mos8(__rc38)
	sta	mos8(__rc2)
	lda	mos8(__rc39)
	sta	mos8(__rc3)
	lda	#0
	sta	mos8(__rc4)
	ldx	#32
	lda	#7
	jsr	__memset
	lda	__clear_screen_sstk+29
	sta	mos8(__rc39)
	lda	__clear_screen_sstk+28
	sta	mos8(__rc38)
	lda	__clear_screen_sstk+27
	sta	mos8(__rc37)
	lda	__clear_screen_sstk+26
	sta	mos8(__rc36)
	lda	__clear_screen_sstk+25
	sta	mos8(__rc35)
	lda	__clear_screen_sstk+24
	sta	mos8(__rc34)
	lda	__clear_screen_sstk+23
	sta	mos8(__rc33)
	lda	__clear_screen_sstk+22
	sta	mos8(__rc32)
	lda	__clear_screen_sstk+21
	sta	mos8(__rc31)
	lda	__clear_screen_sstk+20
	sta	mos8(__rc30)
	lda	__clear_screen_sstk+19
	sta	mos8(__rc29)
	lda	__clear_screen_sstk+18
	sta	mos8(__rc28)
	lda	__clear_screen_sstk+17
	sta	mos8(__rc27)
	lda	__clear_screen_sstk+16
	sta	mos8(__rc26)
	lda	__clear_screen_sstk+15
	sta	mos8(__rc25)
	lda	__clear_screen_sstk+14
	sta	mos8(__rc24)
	lda	__clear_screen_sstk+13
	sta	mos8(__rc23)
	lda	__clear_screen_sstk+12
	sta	mos8(__rc22)
	lda	__clear_screen_sstk+11
	sta	mos8(__rc21)
	lda	__clear_screen_sstk+10
	sta	mos8(__rc20)
	lda	__clear_screen_sstk+9
	sta	mos8(__rc19)
	lda	__clear_screen_sstk+8
	sta	mos8(__rc18)
	lda	__clear_screen_sstk+7
	sta	mos8(__rc17)
	lda	__clear_screen_sstk+6
	sta	mos8(__rc16)
	lda	__clear_screen_sstk+5
	sta	mos8(__rc15)
	lda	__clear_screen_sstk+4
	sta	mos8(__rc14)
	lda	__clear_screen_sstk+3
	sta	mos8(__rc13)
	lda	__clear_screen_sstk+2
	sta	mos8(__rc12)
	lda	__clear_screen_sstk+1
	sta	mos8(__rc11)
	lda	__clear_screen_sstk
	sta	mos8(__rc10)
	pla
	sta	mos8(__rc9)
	pla
	sta	mos8(__rc8)
	pla
	sta	mos8(__rc7)
	pla
	sta	mos8(__rc6)
	rts

chirp8-c64 is also not a particularly fancy CHIP-8 implementation in terms of features. You can really tell I just wanted to get to the point where I can claim "it works". Sound isn't implemented; there should be some customization for the user to change the screen colors, load a new game, etc. The UI could also use some improvement, and there's a huge performance reserve in using a fastloader instead of the default Kernal routines.

In summary, Rust and LLVM-MOS proved to be a really solid duo for non-performance-intensive development on the Commodore 64!

Full code on GitHub

« Cheap and cheerful microcode compression

All posts

A tiny computer for Tiny BASIC »