All posts

A small benchmark for functional languages targeting web browsers

2 July 2022 (programming haskell idris javascript)

I had an idea for a retro-gaming project that will require a MOS 6502 emulator that runs smoothly in the browser and can be customized easily. Because I only need the most basic of functionality from the emulation (I don't need to support interrupts, timing accuracy, or even the notion of cycles), I thought I'd just quickly write one. This post is not about the actual retro-gaming project that prompted this, but instead, my experience with the performance of the generated code using various functional-for-web languages.

As I usually do in situations like this, I started with a Haskell implementation to serve as a kind of executable specification, to make sure my understanding of the details of various 6502 instructions is correct. This Haskell implementation is nothing fancy: the outside world is modelled as a class MonadIO m => MonadMachine m, and the CPU itself runs in MonadMachine m => ReaderT CPU m, using IORefs in the CPU record for registers.

The languages

Ironing out all the wrinkles took a whole day, but once it worked well enough, it was time for the next step: rewriting it in a language that can then target the browser. PureScript seemed like an obvious choice: it's used a lot in the real world so it should be mature enough, and with how simple my Haskell code is, PureScript's idiosyncracies compared to Haskell shouldn't really come into play beyond the syntax level. The one thing that annoyed me to no end was that numeric literals are not overloaded, so all Word8s in my code had to be manually fromIntegral'd; and, in an emulator of an eight-bit CPU, there's a ton of Word8 literals...

The second contender was Idris 2. I've had good experience with Idris 1 for the web when I wrote the ICFP Bingo web app, but that project was all about the DOM manipulation and no computation. I was curious what performance I can get from Idris 2's JavaScript backend.

And then I had to include Asterius, a GHC-based compiler emitting WebAssembly. Its GitHub page states it is "actively maintained by Tweag I/O", but it's actually in quite a rough shape: the documentation on how to build it is out of date, so the only way to try it is via a 20G Docker container...

Notably missing from this list is GHCJS. Unfortunately, I couldn't find an up-to-date version of it; it seems the project, or at least work on integrating with standard Haskell tools like Stack, has died off.

To compare performances, I load the same memory image into the various emulators, set the program counter to the same starting point, and run it for 4142 instructions until a certain target instruction is reached. To paper over the browser's JavaScript JIT engine etc., each test runs for 100 times first as a warm-up, then 100 times measured.

Beside the PureScript, Idris 2, and GHC/Asterius implementations, I have also added a fourth version to serve as the baseline: vanilla JavaScript. Of course, I tried to make it as close to the functional versions as possible; I hope what I wrote is close to what could reasonably be expected as the output of a compiler.

Performance results

The following numbers come from the collected implementations in this GitHub repo. The PureScript and Idris 2 versions have been improved based on ideas from the respective Discord channels. For PureScript, using the CPS-transformed version of Reader helped; and in the case of Idris 2, Stefan Höck's changes of arguments instead of ReaderT, and using PrimIO when looping over instructions, improved performance dramatically.

Implementation Generated code size (bytes) Average time of 4142 instructions (ms)
JavaScript 12,877 2.12
Idris 2 60,379 6.38
PureScript 151,536 137.03
GHC/Asterius 1,448,826 346.73

So Idris 2 comes out way ahead of the pack here: unless you're willing to program in JavaScript, it's by far your best bet both for tiny deployment size and superb performance. All that remains to improve is to compile monad transformer stacks better so that the original ReaderT code works as well as the version using implicit parameters

To run the benchmark yourself, checkout the GitHub repo, run make in the top-level directory, and then use a web browser to open _build/index.html and use the JavaScript console to run await measureAll().

Cheap and cheerful microcode compression

2 May 2022 (programming haskell clash fpga retro)

This post is about an optimization to the Intel 8080-compatible CPU that I describe in detail in my book Retrocomputing in Clash. It didn't really fit anywhere in the book, and it isn't as closely related to the FPGA design focus of the book, so I thought writing it as a blog post would be a good idea.

Retrocomputing with Clash

Just like the real 8080 from 1974, my Clash implementation is microcoded: the semantics of each machine code instruction of the Intel 8080 is described as a sequence of steps, each step being the machine code instruction of an even simpler, internal micro-CPU. Each of these micro-instruction steps are then executed in exactly one clock cycle each.

My 8080 doesn't faithfully replicate the hardware 8080's micro-CPU; in fact, it doesn't replicate it at all. It is a from-scratch design based on a black box understanding of the 8080's instruction set, and the main goal was to make it easy to understand, instead of making it efficient in terms of FPGA resource usage. Of course, since my micro-CPU is different, the micro-instructions have no one to one correspondence with the orignal Intel 8080, and so the microcode is completely different as well.

In the CPU, after fetching the machine code instruction byte, we look up the microcode for that byte, and then execute it cycle by cycle until we're done. This post is about how to store that microcode efficiently.

Continue reading »

Rust on the MOS 6502: Beyond Fibonacci

18 September 2021 (programming rust llvm c64 retro)

On August 15th, the following answer showed up on the Retrocomputing Stack Exchange site:

llvm-mos compiles a considerable subset of C++ to 6502 machine code out of the box. IIRC it's just missing "runtime support"; things like RTTI, exceptions, and the runtime layout of VTables. [...]

I came across this answer about a week later, so just in time for ICFP 2021 to start. I also had my Haskell Love talk coming up, and I really wanted to get the RetroClash page in shape by then, so that I can direct people there at the end of my talk.

But still! LLVM, and consequently, Rust, on a C64! Damn if that isn't the perfect nerd-sniping for me. So I looked around to find this 6502.org forum post that shows compiling a Rust implementation of the Fibonacci function and linking it with a small C main. I've reproduced their code here in a Git repo for easier fiddling.

So after ICFP, although other things kept piling up, including a GM-ing gig playing Trail of Cthulhu's excellent The Dance in the Blood and coordinating with the awesome David Thinnes on getting Arrow DECA support into clash-pong in time for Haskell Love, I carved out a weekend to dust off my Rust-on-AVR CHIP-8 implementation to try to get it running on the C64.

Full code on GitHub

Continue reading »

A tiny computer for Tiny BASIC

17 November 2020 (programming haskell clash fpga retro)

Just a quick post to peel back the curtain a bit on a Clash example I posted to Twitter. My tweet in question shows the following snippet, claiming it is "a complete Intel 8080-based FPGA computer that runs Tiny BASIC":

logicBoard
    :: (HiddenClockResetEnable dom)
    => Signal dom (Maybe (Unsigned 8)) -> Signal dom Bool -> Signal dom (Maybe (Unsigned 8))
logicBoard inByte outReady = outByte
  where
    CPUOut{..} = intel8080 CPUIn{..}

    interruptRequest = pure False

    (dataIn, outByte) = memoryMap _addrOut _dataOut $ do
        matchRight $ do
            mask 0x0000 $ romFromFile (SNat @0x0800) "_build/intel8080/image.bin"
            mask 0x0800 $ ram0 (SNat @0x0800)
            mask 0x1000 $ ram0 (SNat @0x1000)
        matchLeft $ do
            mask 0x10 $ port $ acia inByte outReady

I got almost a hundred likes on this tweet, which isn't too bad for a topic as niche as hardware design with Haskell and Clash. Obviously, the above tweet was meant as a brag, not as a detailed technical description; but given the traction it got, I thought I might as well try to expand a bit on it.

Continue reading »

A "very typed" container for representing microcode

15 September 2020 (programming haskell clash)

I've been thinking a bit about describing microcode lately. My motivation was the Intel 8080-compatible CPU I've been building for my upcoming Clash book. As with everything else for that book, the challenge is not in getting it to work — rather, it is in writing the code as close as possible to the way you would want to explain it to another person.

So in the context of a microprocessor as simple as the Intel 8080 and using synchronous RAM, I think of the microcode as a sequence of steps, where each step consists of an internal CPU state transition, and a memory read or write request. For example, the machine instruction 0x34 (mnemonic INR M) increments by one the byte pointed to by the register pair HL. In my core, the micro-architecture has an 8-bit value- and a 16-bit address-register; the latter can be used for memory addressing. To use something else for addressing, you need to load it into the address buffer first. So the steps to implement INR M are:

  1. Get value of HL register pair into the address buffer
  2. Indirect read via the address buffer into the value buffer
  3. Replace value buffer's contents with its increment
  4. Update the status register (flags like "was the latest value zero")
  5. Indirect write

However, memory access happens on the transition between cycles, so the final write will not be its own step; rather, it happens as the postamble of step 4. Similarly, the correct address will have to be put on the address pins in the preamble of step 2 for the load to work out:

  1. Get HL into address buffer
  2. Set address to address buffer's contents for reading
  3. Store read value from data-in into value buffer
  4. Increment value buffer
  5. Update status register
  6. Set address to address buffer's contents for writing

What makes this tricky is that on one hand, we want to describe preambles as part of their respective step, but of course for the implementation it is too late to process them when we get to that step. So I decided to write out the microcode as a sequence of triplets, corresponding to the preamble, the state transition, and the postamble, and then transform it into a format where preambles are attached to the previous step:

[ (Nothing,       Get2 rHL,          Nothing)
, (Just Indirect, ReadMem            Nothing)
, (Nothing,       ALU ADD Const0x01, Nothing)
, (Nothing,       UpdateFlags,       Just Indirect)
]

Here, Indirect addressing means setting the address pins from the address buffer (as opposed to, e.g. the program counter); if it is in the postamble (i.e. write) position, it also means the write-request pin should be asserted.

So this is what the microcode developer writes, but then we can transform it into a format that consists of a state transition paired with the addressing:

(Nothing,
[ (Get2 rHL,          Just (Left Indirect))
, (ReadMem            Nothing)
, (ALU ADD Const0x01, Nothing)
, (UpdateFlags,       Just (Right Indirect))
])

So we're done, without having done anything interesting enough to warrant a blog post.

Or are we?

Disallowing memory addressing conflicts

Note that in the format we can actually execute, the addressing at each step is either a Left read address, or a Right write address (or Nothing at all). But what if we had two subsequent micro-steps, where the first one has a write request in its postamble, and the second one has a read request in its preamble? We are describing a CPU more than 40 years old, it is to be connected to single-port RAM, so we can't do read and write at the same time. This constraint is correctly captured by the Maybe (Either Read Write) type of memory requests in the normalized form, but it is not enforced by our naïve [(Maybe Read, Transition, Maybe Write)] type for what the microcode developer writes.

So this is what I set out to solve: to give an API for writing microcode that has self-contained steps including the read addressing, but still statically disallows conflicting writes and reads from subsequent steps. We start by going full Richard Eisenberg and lifting the memory addressing directives to the type level using singletons. While we're at it, let's also turn on Haskell 98 mode:

{-# LANGUAGE DataKinds, PolyKinds, ConstraintKinds, GADTs, FlexibleContexts #-}
{-# LANGUAGE TypeOperators, TypeFamilies, TypeApplications, ScopedTypeVariables #-}
{-# LANGUAGE StandaloneDeriving, DeriveFunctor #-}

data Step (pre :: Maybe a) (post :: Maybe b) t where
    Step :: Sing pre -> t -> Sing post -> Step pre post t
deriving instance Functor (Step pre post)

The plan, then, is to do enough type-level magic to only allow neighbouring Steps if at most one of the first post- and the second preamble is a type-level Just index.

The operations we want to support on microcode fragments is cons-ing a new Step and appeding fragments. For the first one, we need to check that the postamble of the new Step is compatible with the first preamble of the existing fragment; for the latter, we need the same check between the last postamble of the first fragment and the first preamble of the second fragment. First, let's codify what "compatible" means here:

type family Combine (post :: Maybe b) (pre :: Maybe a) :: Maybe (Either a b) where
    Combine Nothing Nothing = Nothing
    Combine (Just post) Nothing = Just (Right post)
    Combine Nothing (Just pre) = Just (Left pre)

Importantly, there is no clause for Combine (Just post) (Just pre).

Getting dizzy with the thin air of the type level? Let's leave ourselves a thread leading back to the term level:

combine
    :: forall a b (post :: Maybe b) (pre :: Maybe a).
       (SingKind a, SingKind b, SingI (Combine post pre))
    => Sing post -> Sing pre -> Demote (KindOf (Combine post pre))
combine _ _ = demote @(Combine post pre)

(This post is not sponsored by Singpost BTW).

Cons- and append-able fragments

For the actual fragments, we can store them internally almost in the normalized format, i.e. as a term-level list of (Maybe a, [(t, Maybe (Either a b))]). Almost, but not quite, because the first a and the last b need to appear in the index, to be able to give a restricted type to cons and append. So instead of storing them in the list proper, we will store them as separate singletons:

data Ends a b
    = Empty
    | NonEmpty (Maybe a) (Maybe b)

data Amble (ends :: Ends a b) t where
    End :: Amble Empty t
    More
        :: forall (a0 :: Maybe a) (bn :: Maybe b) n t. ()
        => Sing a0
        -> [(t, Demote (Maybe (Either a b)))]
        -> t
        -> Sing bn
        -> Amble (NonEmpty a0 bn) t
deriving instance Functor (Amble ends)

Note that we need a special Empty index value for End instead of just NonEmpty Nothing Nothing, because starting with an empty Amble, the first cons needs to change both the front-end preamble and the back-end postamble, whereas later cons operators should only change the front-end.

type family Cons (b1 :: Maybe b) (ends :: Ends a b) where
    Cons b1 Empty = b1
    Cons b1 (NonEmpty a1 bn) = bn

We can now try writing the term-level cons. The first cons is easy, because there is no existing front-end to check compatibility with:

cons
    :: forall (a0 :: Maybe a) b1 (ends :: Ends a b) t. ()
    => Step a0 b1 t -> Amble ends t -> Amble (NonEmpty a0 (Cons b1 ends)) t
cons (Step a0 x b1) End = More a0 [] x b1
cons (Step a0 x b1) (More a1 xs xn bn) = More a0 ((x, _):xs) xn bn

We get into trouble when trying to fill in the hole in the cons to a non-empty Amble. And we should be, because nowhere in the type of cons so far have we ensured that b1 is compatible with the front-end of ends. We will have to use another type family for that, to pattern-match on Empty and NonEmpty ends:

type family CanCons (b1 :: Maybe b) (ends :: Ends a b) :: Constraint where
    CanCons b1 Empty = ()
    CanCons (b1 :: Maybe b) (NonEmpty a1 bn :: Ends a b) =
        (SingKind a, SingKind b, SingI (Combine b1 a1))

Unsurprisingly, the constraints needed to be able to cons are exactly what we need to fill the hole with the term-level value of combine b1 a1:

cons
    :: forall (a0 :: Maybe a) b1 (ends :: Ends a b) t. (CanCons b1 ends)
    => Step a0 b1 t -> Amble ends t -> Amble (NonEmpty a0 (Cons b1 ends)) t
cons (Step a0 x b1) End = More a0 [] x b1
cons (Step a0 x b1) (More a1 xs xn bn) = More a0 ((x, combine b1 a1):xs) xn bn

Now we are cooking with gas: we can re-use this idea to implement append by ensuring we CanCons the first fragment's backend to the second fragment:

type family CanAppend (ends1 :: Ends a b) (ends2 :: Ends a b) :: Constraint where
    CanAppend Empty ends2 = ()
    CanAppend (NonEmpty a1 bn) ends2 = CanCons bn ends2

type family Append (ends1 :: Ends a b) (ends2 :: Ends a b) where
    Append Empty ends2 = ends2
    Append ends1 Empty = ends1
    Append (NonEmpty a0 bn) (NonEmpty an bm) = NonEmpty a0 bm

append :: (CanAppend ends1 ends2) => Amble ends1 t -> Amble ends2 t -> Amble (Append ends1 ends2) t
append End ys = ys
append (More a0 xs xn bn) End = More a0 xs xn bn
append (More a0 xs xn bn) (More an ys ym bm) = More a0 (xs ++ [(xn, combine bn an)] ++ ys) ym bm

We finish off the implementation by writing the translation into the normalized format. Since the More constructor already contains almost-normalized form, we just need to take care to snoc the final element onto the result:

stepsOf
    :: forall (ends :: Ends a b) t. (SingKind a, SingKind b)
    => Amble ends t
    -> (Maybe (Demote a), [(t, Maybe (Demote (Either a b)))])
stepsOf End = (Nothing, [])
stepsOf (More a0 xs xn bn) = (fromSing a0, xs ++ [(xn, Right <$> fromSing bn)])

Putting a bow on it

What we have so far works, but there are a couple of straightforward improvements that would be a shame not to implement.

Nicer way to take steps

As written, you would have to use Step like this:

Step (sing @Nothing) UpdateFlags (sing @(Just Indirect))

All this singing noise would be more annoying than the Eurovision Song Contest, so I wanted to avoid it. The idea is to turn those Sing-typed arguments into just type-level arguments; then do some horrible RankNTypes magic to keep the parameter order. Prenex? What is that?

{-# LANGUAGE RankNTypes #-}
      
step :: forall pre. (SingI pre) => forall t. t -> forall post. (SingI post) => Step pre post t
step x = Step sing x sing

So now we will be able to write code like step @Nothing UpdateFlags @(Just Indirect) and get a Step type inferred that has the preamble and the postamble appearing in the indices.

Custom type error message

Suppose we make a mistake in our microcode, and accidentally want to write after one step and read before the next (using >:> for infix cons):

step @Nothing         (Get2 rHL)                       @(Just IncrPC) >:>
step @(Just Indirect) ReadMem                          @Nothing >:>
step @Nothing         (Compute Const01 ADD KeepC SetA) @Nothing >:>
step @Nothing         UpdateFlags                      @(Just Indirect) >:>
End

This is rejected by the type checker, of course; however, the error message is not as informative as it could be, as it faults the missing SingI instance for a stuck type family application:

• No instance for (SingI (Combine ('Just 'IncrPC) ('Just 'Indirect)))
arising from a use of ‘>:>

With GHC's custom type errors feature, we can add a fourth clause to our Combine type family. Unfortunately, this requires turning on UndecidableInstances for now:

{-# LANGUAGE UndecidableInstances #-}      
import GHC.TypeLits

type Conflict post pre =
    Text "Conflict between postamble" :$$: Text "  " :<>: ShowType post :$$:
    Text "and next preamble" :$$: Text "  " :<>: ShowType pre

type family Combine (post :: Maybe b) (pre :: Maybe a) :: Maybe (Either a b) where
    Combine Nothing Nothing = Nothing
    Combine (Just post) Nothing = Just (Right post)
    Combine Nothing (Just pre) = Just (Left pre)
    Combine (Just post) (Just pre) = TypeError (Conflict post pre)

With this, the error message changes to:

• Conflict between postamble 'IncrPC and next preamble 'Indirect

Much nicer!

Tracking fragment length

The final difference between what we have described here and the code I use for real is that in the real version, Amble also tracks its length in an index. This is needed because the CPU core is used not just for emulation, but also FPGA synthesis; and in real hardware, we can't just store lists of unbounded size in the microcode ROM. So instead, microcode is described as a length-indexed Amble n ends t, and then normalized into a Vec n instead of a list. Each instruction can be at most 10 steps long; everything is then ultimately normalized into a uniformly typed Vec 10 by padding it with "go start fetching next instruction" micro-ops.

The full implementation

Find the full code on GitHub, next to the rest of the Intel 8080 core.

Older posts: