Encsé writes in his blog that one of the reasons he created a tiny CPS-based Scheme interpreter was because he realized he hasn't done any fun side projects all year. So I thought I'd make an inventory of fun hacks I've done in 2015.
Some of the answers I wrote this year on Stack Overflow required me to learn just enough of something new to be able to answer the question:
Then, there were those where it turned out there was a bug to be found by scratching the surface of the question deep enough:
Then there were the answers that were just too much fun to write:
All in all, it seems this has been quite a productive year for me out of the office, even if you exclude the Stack Overflow part. I was actually surprised how long this list was while compiling it for this blog post. Maybe I should write a list like this every year from now...
In my quest to build more and more complicated computers on FPGAs armed with nothing but a crappy hobbist mindset and some hazy ideas of how Kansas Lava is supposed to work, I've reached another milestone: my first real computer.
That is, unlike the Brainfuck CPU that I designed myself, or the CHIP-8, which was originally a virtual machine spec (with all the implementation leeway that implies), this latest subject is a bona fide 8-bit home computer from the seventies: the Commodore PET.
The PET is a very simple machine compared to later Commodore models, which is why I thought it would make a good first step on a journey that I hope will one day culminate in implementing a Commodore 64. Its centerpiece is the MOS 6502 CPU (practically the same as the MOS 6510 used in the C=64), and there are only four other components: a text-only video signal generator and three IO interface chips (two PIA's and one VIA) for keyboard, Datasette and extension port communication. Just hooking up one of the PIAs is enough to get a minimal system working with keyboard input.
12 KBytes of PET ROM contain implementation of basic IO routines (the so-called "kernal"), the full-screen text editor, and Microsoft's BASIC interpreter. Then there's a separate character ROM (not addressable from the CPU) used by the video generator.
The 6502 microprocessor was a staple of the eight-bit home computer era of the late seventies and eighties. By today's standards, it is incredible to imagine what it must have been like to design it manually, drawing the layout with pencils on paper. On the other hand, if it was designed in such a low-tech way, I figured it shouldn't be too difficult to build something compatible using modern tools even by a hobbist like myself. And of course there are already dozens of home-built 6502 implementations out there, to varying degrees of compatibility.
The ultimate reference on the 6502 must be the Visual 6502 Project which I deliberately avoided consulting. I don't really see the educational value in copying the original 6502 design; so instead, I went with a more black-box approach by just looking at the opcode descriptions and interrupt model and working from that.
The first milestone I aimed for was to get enough of the CPU working that I can run the sample programs on 6502asm.com, which defines a tiny microcomputer-like architecture that doesn't have interrupts or any fancy video modes: you just have 32×32 pixels with a fixed 16-color palette mapped to main RAM from $0200, and a zero page-mapped register for keyboard input that you can do polling on. The Kansas Lava implementation is really simple and I plan to reuse it later if I do a similar project with the Z80.
My workflow was that I would use ca65 to assemble test programs, burn them into ROM, and run it in the Kansas Lava simulator for a couple thousand cycles; then render the video RAM into a GTK+ window. I would start with this program that does nothing but moves data around in memory (drawing the Commodore logo pixel by pixel), and basically I implemented the 6502 opcodes as I went along. After two days of work, I finally got it working:
Seeing this was an incredible feeling. The input was valid 6502 machine code, and my very own CPU managed to run it correctly for the approximately 40,000 cycles that it took to draw this image. There was no stopping at this point: I already had a working VGA frame buffer implementation from the CHIP-8, so next day I synthesized it and run it on real hardware, my venerable Papilio Pro:
As I added more and more opcodes and started running more and more complicated programs, things very quickly stopped working. My CPU was full of bugs, and figuring out what went wrong by looking at the simulation logs after running it for tens of thousands of cycles was very tedious.
And so, it was at this point that I started adding unit tests. The framework for writing tests exposes a monad where the available effects are making observations on the state of the system (CPU registers and contents of the memory) and executing instructions. This presents an API that allows writing tests in an imperative way:
php = do flags <- observe statusFlags sp <- observe regSP execute0 0x08 sp' <- observe regSP pushed < observe $ mem (stackAddr <$> sp) assertEq "Stack pointer is decremented" sp' (pred <$> sp) assertEq "Status is correctly pushed" pushed flags
A test like this is turned into a ROM image containing $08 at the address pointed to by the reset vector. The simulation is then run until the CPU enters the Fetch internal state for the second time (the first time is when it fetches the opcode under testing, i.e. the PHP ($08) instruction), and then the observations are evaluated by looking at the simulation output in the same cycles as the Fetch state. Of course, this means you shouldn't be able to write tests like the following:
impossiblyDynamicTest = do arg <- observe regX execute1 0x00 arg a' <- observe regA assertEq "A is updated" a' arg
This is ensured by observe returning values wrapped in an Obs type, and execute1 requiring unwrapped arguments:
observe :: Query a -> TestM (Obs a) execute1 :: Byte -> Byte -> TestM () assertEq :: (Eq a, Show a) => String -> Obs a -> Obs a -> TestM ()
To allow assertions over derived values, Obs is an applicative functor (in fact, it is the free applicative functor over the co-Yoneda functor of the primitive observations).
I think this approach has merit as a general framework for hardware simulator-based unit testing and I intend to extend it and maybe even carve it out into a separate library in the future.
Once I had a sufficiently working CPU, I started building the other pieces around it. I took the PET emulator from the VICE suite and commented out all the PIA and VIA code, replacing writes with nops and reads with hardcoded values, until I was able to boot it up with the stock ROM to get to the READY. prompt. Of course, since the PIA supplying the interrupt used for timing was removed by that point, I had no flashing cursor or keyboard input. All in all, the system got to a steady state in about 80,000 operations. (Since my implementation is not yet cyle-accurate, I had to switch to counting operations instead of cycles beyond this point. Every operation is at least as fast as on the real chip, so I hope by adding some wait cycles I'll be able to take care of this at some later point.)
After hooking up the same hardcoded values on the same addresses to the CPU, the next step was running the simulator and peeking at the video memory area ($8000..$8FFF on the PET), using the original fonts to render the screen. The initial version showed there might be someone home (sorry for crap quality on the screenshot):
By comparing detailed logs from running the emulator and the simulator, I was able to make observations like "the first 12,345 steps seem to be in agreement", which was a big boost to productivity, getting me, in short order, to this:
After fixing some more bugs in the arithmetic opcodes, I was finally rewarded by this sight:
While working on the CPU, I also started writing the character generator, on top of the VGA signal generator in the kansas-lava-papilio package that I originally made for the CHIP-8. This way, the VGA synchronization signals were abstracted away from me and I just had to take care of pumping out the actual pixels. This turned out to be tricker than I originally thought, since you have to time all read-aheads just right so that everything is at hand just in time for the next pixel. So before it finishes drawing the 8 pixels that make up a single row of a character, the next character index is loaded from RAM, and then the character ROM is consulted for the first row of the font image of the next indexed character. Initial versions were having some ghosting issues, or even more fun, full character transpositions (like showing the character from one line above in the first position of each line).
The Commodore PET diverts the vsync signal from the video generator to one of the PIA chips, which generates a CPU interrupt that can be acknowledged by reading from one of its memory-mapped registers. So the next obvious step was to implement this functionality to get the cursor blinking! This required more than just implementing a PIA, since I didn't even have interrupts in the CPU at that point.
But all that work was totally worth it:
The current version supports keyboard input from PS/2 keyboards (but not all keys are mapped yet), so for the first time since I started working on this more than a month ago, it can be used to write and run BASIC programs!
What you can't see on the video below is that there's still a bug somewhere that causes the classic 10 PRINT "FOO": 20 GOTO 10 program to terminate with an out of memory error after some time.
Apart from fixing these bugs, the big outstanding feature is to add Datasette support so that programs can be loaded and saved to virtual "casettes". For a first version, I'll just burn some extra ROM onto the FPGA containing the tape images and hook that up to the PIA controlling the casette player; but I guess the proper way to do this would be to use something like an SD card reader to get proper persistent, read-writable storage. Or maybe, alternatively, have some kind of serial-over-USB communication with a computer acting as the Datasette unit.
Someone posted to the Haskell subreddit this blogpost of Lennart where he goes step-by-step through implementing an evaluator and type checker for CoC. I don't know why this post from 2007 showed up on Reddit this week, but it's a very good post, pedagogically speaking. Go and read it.
In this post, I'd like to elaborate on the simply-typed lambda calculus part of his blogpost. His typechecker defines the following types for representing STLC types, terms, and environments:
data Type = Base | Arrow Type Type deriving (Eq, Show) type Sym = String data Expr = Var Sym | App Expr Expr | Lam Sym Type Expr deriving (Eq, Show)
The signature of the typechecker presented in his post is as follows:
type ErrorMsg = String type TC a = Either ErrorMsg a newtype Env = Env [(Sym, Type)] deriving (Show) tCheck :: Env -> Expr -> TC Type
My approach is to instead create a representation of terms of STLC in such a way that only well-scoped, well-typed terms can be represented. So let's turn on a couple of heavy-weight language extensions from GHC 7.8 (we'll see how each of them is used), and define a typed representation of STLC terms:
{-# LANGUAGE GADTs, StandaloneDeriving #-} {-# LANGUAGE DataKinds, KindSignatures, TypeFamilies, TypeOperators #-} {-# LANGUAGE ScopedTypeVariables #-} {-# LANGUAGE TemplateHaskell #-} -- sigh... import Data.Singletons.Prelude import Data.Singletons.TH import Data.Type.Equality -- | A (typed) variable is an index into a context of types data TVar (ts :: [Type]) (a :: Type) where Here :: TVar (t ': ts) t There :: TVar ts a -> TVar (t ': ts) a deriving instance Show (TVar ctx a) -- | Typed representation of STLC: well-scoped and well-typed by construction data TTerm (ctx :: [Type]) (a :: Type) where TConst :: TTerm ctx Base TVar :: TVar ctx a -> TTerm ctx a TLam :: TTerm (a ': ctx) b -> TTerm ctx (Arrow a b) TApp :: TTerm ctx (Arrow a b) -> TTerm ctx a -> TTerm ctx b deriving instance Show (TTerm ctx a)
The idea is to represent the context of a term as a list of types of variables in scope, and index into that list, de Bruijn-style, to refer to variables. This indexing operation maintains the necessary connection between the pointer and the type that it points to. Note the type of the TLam constructor, where we extend the context at the front for the inductive step.
To give a taste of how convenient it is to work with this representation programmatically, here's a total evaluator:
-- | Interpretation (semantics) of our types type family Interp (t :: Type) where Interp Base = () Interp (Arrow t1 t2) = Interp t1 -> Interp t2 -- | An environment gives the value of all variables in scope in a given context data Env (ts :: [Type]) where Nil :: Env '[] Cons :: Interp t -> Env ts -> Env (t ': ts) lookupVar :: TVar ts a -> Env ts -> Interp a lookupVar Here (Cons x _) = x lookupVar (There v) (Cons _ xs) = lookupVar v xs -- | Evaluate a term of STLC. This function is total! eval :: Env ctx -> TTerm ctx a -> Interp a eval env TConst = () eval env (TVar v) = lookupVar v env eval env (TLam lam) = \x -> eval (Cons x env) lam eval env (TApp f e) = eval env f $ eval env e
Of course, the problem is that this representation is not at all convenient for other purposes. For starters, it is certainly not how we would expect human beings to type in their programs.
My version of the typechecker is such that instead of giving the type of a term (when it is well-typed), it instead transforms the loose representation (Term) into the tight one (TTerm). A Term is well-scoped and well-typed (under some binders) iff there is a TTerm corresponding to it. Let's use singletons to store type information in existential positions:
$(genSingletons [''Type]) $(singDecideInstance ''Type) -- | Existential version of 'TTerm' data SomeTerm (ctx :: [Type]) where TheTerm :: Sing a -> TTerm ctx a -> SomeTerm ctx -- | Existential version of 'TVar' data SomeVar (ctx :: [Type]) where TheVar :: Sing a -> TVar ctx a -> SomeVar ctx -- | A typed binder of variable names data Binder (ctx :: [Type]) where BNil :: Binder '[] BCons :: Sym -> Sing t -> Binder ts -> Binder (t ': ts)
Armed with these definitions, we can finally define the type inferer. I would argue that it is no more complicated than Lennart's version. In fact, it has the exact same shape, with value-level equality tests replaced with Data.Type.Equality-based checks.
-- | Type inference for STLC infer :: Binder ctx -> Term -> Maybe (SomeTerm ctx) infer bs (Var v) = do TheVar t v' <- inferVar bs v return $ TheTerm t $ TVar v' infer bs (App f e) = do TheTerm (SArrow t0 t) f' <- infer bs f TheTerm t0' e' <- infer bs e Refl <- testEquality t0 t0' return $ TheTerm t $ TApp f' e' infer bs (Lam v ty e) = case toSing ty of SomeSing t0 -> do TheTerm t e' <- infer (BCons v t0 bs) e return $ TheTerm (SArrow t0 t) $ TLam e' inferVar :: Binder ctx -> Sym -> Maybe (SomeVar ctx) inferVar (BCons u t bs) v | v == u = return $ TheVar t Here | otherwise = do TheVar t' v' <- inferVar bs u return $ TheVar t' $ There v' inferVar _ _ = Nothing
Note that pattern matching on Refl in the App case brings in scope type equalities that are crucial to making infer well-typed.
Of course, because of the existential nature of SomeVar, we should provide a typechecker as well which is a much more convenient interface to work with:
-- | Typechecker for STLC check :: forall ctx a. (SingI a) => Binder ctx -> Term -> Maybe (TTerm ctx a) check bs e = do TheTerm t' e' <- infer bs e Refl <- testEquality t t' return e' where t = singByProxy (Proxy :: Proxy a) -- | Typechecker for closed terms of STLC check_ :: (SingI a) => Term -> Maybe (TTerm '[] a) check_ = check BNil
(The SingI a constraint is an unfortunate implementation detail; the kind of a is Type, which is closed, so GHC should be able to know there is always going to be a SingI a instance).
To review, we've written a typed embedding of STLC into Haskell, with a total evaluator and a typechecker, in about 110 lines of code.
If we were doing this in something more like Agda, one possible improvement would be to define a function untype :: TTerm ctx a -> Term and use that to give check basically a type of Binder ctx -> (e :: Term) -> Either ((e' :: TTerm ctx a) -> untype e' == e -> Void) (TTerm ctx a), i.e. to give a proof in the non-well-typed case as well.
I've been thinking lately about arrows in relation to applicative functors and monads. The difference between the latter two is easy to intuit (and I'll describe it via an example below), but I never managed to get the same level of understanding about arrows. There's a somewhat famous paper about this question, which has a very clear-cut diagram showing that applicatives embed into arrows and arrows embed into monads (and both containments are non-trivial), which I understood as meaning every arrow is an applicative functor, and every monad is an arrow.
At first glance, this makes sense, given the well-known result that monads are exactly equivalent to arrows that are also instances of ArrowApply, as witnessed by the Haskell types Kleisli and ArrowMonad. However, there's no immediately obvious reason why you couldn't also turn an applicative functor into an arrow, so how does that leave any room for arrows to be different from applicatives? (As an aside, the fact that applicative functors have kind ⋆ → ⋆ and arrows have kind ⋆ → ⋆ → ⋆ has been a huge complication for me in trying to compare them).
Now, finally, based on the helpful replies to that StackOverflow question and the associated Reddit thread, I am confident enough to answer my own question.
Tom Ellis suggested thinking about a concrete example involving file I/O, so let's compare three approaches to it using the three typeclasses. To make things simple, we will only care about two operations: reading a string from a file and writing a string to a file. Files are going to be identified by their file path:
type FilePath = String
Our first I/O interface is defined as follows:
data IOM ∷ ⋆ → ⋆ instance Monad IOM readFile ∷ FilePath → IOM String writeFile ∷ FilePath → String → IOM ()
Using this interface, we can for example copy a file from one path to another:
copy ∷ FilePath → FilePath → IOM () copy from to = readFile from >>= writeFile to
However, we can do much more than that: the choice of files we manipulate can depend on effects upstream. For example, the below function takes an index file which contains a filename, and copies it to the given target directory:
copyIndirect ∷ FilePath → FilePath → IOM () copyIndirect index target = do from ← readFile index copy from (target ⟨/⟩ to)
On the flip side, this means there is no way to know upfront the set of filenames that are going to be manipulated by a given value action ∷ IOM α. By "upfront", what I mean is the ability to write a pure function fileNames :: IOM α → [FilePath].
Of course, for non-IO-based monads (such as ones for which we have some kind of extractor function μ α → α), this distinction becomes a bit more fuzzy, but it still makes sense to think about trying to extract information without evaluating the effects of the monad (so for example, we could ask "what can we know about a Reader Γ α without having a value of type Γ at hand?").
The reason we can't really do static analysis in this sense on monads is because the function on the right-hand side of a bind is in the space of Haskell functions, and as such, is completely opaque.
So let's try restricting our interface to just an applicative functor.
data IOF ∷ ⋆ → ⋆ instance Applicative IOF readFile ∷ FilePath → IOF String writeFile ∷ FilePath → String → IOF ()
Since IOF is not a monad, there's no way to compose readFile and writeFile, so all we can do with this interface is to either read from a file and then postprocess its contents purely, or write to a file; but there's no way to write the contents of a file into another one.
How about changing the type of writeFile?
writeFile′ ∷ FilePath → IOF (String → ())
The main problem with this interface is that while it would allow writing something like
copy ∷ FilePath → FilePath → IOF () copy from to = writeFile′ to ⟨*⟩ readFile from
it leads to all kind of nasty problems because String → () is such a horrible model of writing a string to a file, since it breaks referential transparency. For example, what do you expect the contents of out.txt to be after running this program?
(λ write → [write "foo", write "bar", write "foo"]) ⟨$⟩ writeFile′ "out.txt"
First of all, let's get two arrow-based I/O interfaces out of the way that don't (in fact, can't) bring anything new to the table: Kleisli IOM and Applicarrow IOF.
The Kleisli-arrow of IOM, modulo currying, is:
readFile ∷ Kleisli IOM FilePath String writeFile ∷ Kleisli IOM (FilePath, String) ()
Since writeFile's input still contains both the filename and the contents, we can still write copyIndirect (using arrow notation for simplicity). Note how the ArrowApply instance of Kleisli IOM is not even used.
copyIndirect ∷ Kleisli IOM (FilePath, FilePath) () copyIndirect = proc (index, target) → do from ← readFile ↢ index s ← readFile ↢ from writeFile ↢ (to, s)
The Applicarrow of IOF would be:
readFile ∷ FilePath → Applicarrow IOF () String writeFile ∷ FilePath → String → Applicarrow IOF () ()
which of course still exhibits the same problem of being unable to compose readFile and writeFile.
Instead of transforming IOM or IOF into an arrow, what if we start from scratch, and try to create something in between, in terms of where we use Haskell functions and where we make an arrow? Take the following interface:
data IOA ∷ ⋆ → ⋆ → ⋆ instance Arrow IOA readFile ∷ FilePath → IOA () String writeFile ∷ FilePath → IOA String ()
Because writeFile takes the content from the input side of the arrow, we can still implement copy:
copy ∷ FilePath → FilePath → IOA () () copy from to = readFile from >>> writeFile to
However, the other argument of writeFile is a purely functional one, and so it can't depend on the output of e.g. readFile; so copyIndirect can't be implemented with this interface.
If we turn this argument around, this also means that while we can't know in advance what will end up being written to a file (before running the full IOA pipeline itself), but we can statically determine the set of filenames that will be modified.
Monads are opaque to static analysis, and applicative functors are poor at expressing dynamic-time data dependencies. It turns out arrows can provide a sweet spot between the two: by choosing the purely functional and the arrowized inputs carefully, it is possible to create an interface that allows for just the right interplay of dynamic behaviour and amenability to static analysis.
tl;dr: I've built a computer on a Xilinx FPGA using Kansas Lava, based on a virtual machine design from the mid seventies.
I would be lying if I said I always wanted to build my own computer. For a very long time, hardware didn't tickle my curiosity at all; and even today, I prefer staying away from dangling wires and soldering irons. I like my computing platforms to Just Work, and hardware problems are just a huge hassle. But then in 2010 some coworkers of mine started getting into electronics and I learned from them just enough to start hooking some ICs up on a breadbord, and it seemed like a very fun diversion from all the high-level, abstract, softwarey stuff. In some sense, it filled the same void in me that assembly programing would probably have had. But even back in the days on the Commodore 64, I was looking for more BASIC extensions instead of going downwards to assembly / machine code.
One thing led to another and I was creating a Brainfuck CPU on a Papilio FPGA. It was a very satisfying experience, plus, I got to learn about a completely new world (that of digital hardware design and hardware description languages). So ever since then, I had a voice in the back of my head saying I should go all in, and implement a proper full computer, with I/O and all. But jumping straight into replicating a real computer seemed like trying to run before I can walk.
I can't remember now how I bumped into the CHIP-8 platform, but when I read about it in detail, I realized this is something I, a self-taught HDL guy, could realistically implement. To recap, it's a virtual machine spec from the mid seventies indended to be implemented on home computers of the time. It is super low performance, meaning I could implement everything the most naïve way possible and still get something workable out of it: graphics is 64×32 black & white, RAM is short of 4KBytes, CPU speed is not really part of the spec, and the only timers provided run at 60 Hz.
I can do this!
The FPGA board that I targeted is a Papilio One 500K, which has 500K... somethings called "system gates", and about 40KByte of SRAM. The Arcade MegaWing provides a D-sub 15 VGA connector and two PS/2 ports, among some other stuff that I am not using for this project.
The home computer targeted by the original CHIP-8 spec only had a four-by-four keypad, so I am not 100% happy about using a PS/2 keyboard for the input. Maybe a later version could use a bespoke keypad connected to the unused pins of the LogicStart MegaWing. However, by using a PS/2 keyboard, I could not worry about the hardware side of things and just implement a decoder for the PS/2 protocol.
In terms of software, there are several games available for the CHIP-8. Maybe even donzens! But just for fun, after finishing the machine itself, I ended up writing my own game as well: a crappy port of the currently trending 2048 game.
I would say the CPU itself is a RISC architecture, in that it has 16 registers and not a whole lot of operations. But I'm not sure it would be called RISC back in its day.
The aforementioned 64×32 one-bit graphics is manipulated via a sprite interface: there's a special CPU opcode for xor-blitting an 8-pixel-wide, variable-height sprite onto any part of the screen. I went with the obvious way of implementing it as a 2048-bit frame buffer.
To validate my understanding, I first created a software emulator in Haskell. That unearthed a ton of edge cases and situations where I was not reading the spec with enough attention. Once I had the emulator working well enough to play the sample games I found, I enumerated the modules I'll need to implement.
Since I've already done the Brainfuck CPU, I didn't foresee too much difficulty in implementing the CPU proper (oh boy was I wrong). However, the world of peripherals was a new one to me.
I've never looked into VGA signal timings in detail before, and for some reason I just assumed that it's going to be just as complicated as PAL, about which I knew just enough to know that you have to generate all kinds of elaborate sync patterns. So actually reading the VGA spec was a relief, and I quickly came up with a scheme where my CHIP-8 computer would be running at 50 MHz, so that it can easily implement the 25 MHz pixel clock needed for 640×480@60 Hz. I went with this mode because it has several advantages:
The Papilio One itself has a clock of 32 MHz, and I first hoped that 32 Mhz and 25 Mhz is close enough that I can just generate a signal using the 32 MHz as the pixel clock. Turns out that's not quite how signal timings work.
Luckily, I've found out that the Papilio One also has something called a DCM which can divide and multiply the clock signal. I was able to go to 25 or 50 MHz easily with it. It's a bit of a black box component: I had to run a wizard in the Xilinx IDE to get a small binary file describing my DCM parameters, and then I integrated it into my build process by running another Xilinx program on it which spits out some magic bitstream.
The PS/2 protocol is a simple serial protocol with a slow (~10 KHz) clock and one parity bit per 8 data bits. Decoding it into a stream of bytes was a straightforward thing once I hunted down an actual PS/2 keyboard, since it turned out those USB to PS/2 dongles don't really work with regular USB keyboards; rather, the keyboard have to be made ready to "speak" PS/2 over the dongle. So I ended up getting a proper PS/2 keyboard from Mustafa Centre (thanks to Viki's sister Dori for picking it up for me); god only knows why they still had one on stock.
The Xilinx tool suite seems to be aimed at being used from the IDE. This has several drawbacks: version controlling is complicated because generated artifacts litter the source tree; the editor itself in the IDE is of course crap compared to Emacs; and most importantly, I didn't intend to manually write either VHDL or Verilog. As I've mentioned before in my post about the Brainfuck CPU, I've found both of these hardware description languages to be lacking in the same way mainstream imperative programming languages are: the tools you, the developer, is given to introduce your own abstractions are very poor. So I planned to use Kansas Lava, an HDL embedded in Haskell.
Now, the first thing to note about Kansas Lava is, as nice at is, the software package itself is bit-rotten. The latest version available on Hackage cannot even be compiled with GHC 7.6. While the change to fix that is trivial, after that, you can easily bump into bugs in the Lava compiler itself. But more on that later. Later versions available from GitHub are not even self-consisent between the various dependencies. I've put my patched-up version of Kansas Lava and some of the packages it dependens on on GitHub and I'm trying to get Andy Gill to allow me to upload the bundle as a 0.2.4.1 update to Hackage. Don says I should maybe say fuck it and create my own Singapore Lava fork, just to stay with the Lava tradition of a twisty little maze of incompatible forks, all alike.
However, when it all works, it is amazing. I was able to extend the library of Papilio-specific Lava code that I created for the Brainfuck project with reusable modules for the VGA signal generator and the PS/2 decoder in such a way that they should be very easy to be re-used for any other future projects. And it's all type-safe, so for example, the Papilio Arcade MegaWing VGA port is statically enforced to be 4+4+4 bits whereas the LogicStart MegaWing is 3+3+2.
But there was one bug that left a bitter aftertaste in my mouth. Once I had both the CPU and the VGA parts working and I started running some test programs on the thing, I noticed that the framebuffer blitting is exhibiting "or" behaviour instead of "xor". Running the same code in the Lava simulator and dumping the contents of the framebuffer, it showed the correct, "xor" behaviour. After weeks of frustration and a complete rework of the communication system between the CPU, the VGA signal generator and the frame buffer to add memory read bussing, I finally took a look at the Lava compiler's source code to find that it simulates the xor2 primitive as xor but compiles it to or. How do you not notice this bug? Has the Kansas Lava suite been used by anyone for everything at all in the history of ever???
The final result, after much trial and error, looking at the Lava simulator's output, and pouring over the code, is available here. I'm reasonably happy about the code base, except for the implementation of the CPU itself, which feels a bit spaghetti to me. Especially around the parts where it's waiting for results to come back from main RAM or the video framebuffer.
Below are some photographs and videos of it running two games: the memory puzzle game Hidden and the 2048 clone mentioned above. Unfortunately, the Tetris game from the screenshot above seems to have a bug in its input handling, in that it samples the keyboard at an unthrottled frequency; thus, even the shortest keypress sends the falling piece to the edge of the wall. I'll eventually need to disassemble the game and fix this.
The VGA signal generator is not as neat as it could be, because it doesn't do pre-fetching. This means by the time the current CHIP-8 pixel's value is read from the framebuffer, it is already too late, the first of the 8 real pixels for the given virtual CHIP-8 pixel should already have been drawn. This results in some ghosting. But since I know what's causing it, I will hopefully be able to fix this in some next version. But right now I'm too happy to have the whole thing just working.
... so I made a compiler to turn a Turing tarpit into a one instruction set computer so you can use the processor inside your processor to compute between cycles of computation.
There's this wonderful paper detailing the computation model discovered in the x86 memory management unit, or MMU for short. Basically, by setting up the page table right, you can arrange for a page fault handler to lead to another page fault. If you set things up juuuust right, you can get a whole cascade of page faults. You can then modulate this to implement a single instruction, Harvard-architecture, Turing-complete computer using Move-Decrement-BranchIfZero, or MOVDBZ for short, with the following semantics in pseudocode:
MOVDBZ (x ← y-1) THEN l ELSE l′ = do val ← read x if val == 0 then do write y 0 jump l else do write y (val-1) jump l′
I'm going to refer to x and y as the source and target registers below, but it's important to understand that these are not the real registers of the x86 processor that we are talking about here; rather, these are the registers of this virtual machine living inside the MMU. The term Harvard architecture means the instruction labels l and l′ come from a completely separate namespace as the register (c.f. von Neumann computer).
Another important thing is that from the real x86 processor's point of view, the whole computation takes up zero cycles. That is because the resolution of the page fault happens between (or, to be more precise, in the middle of) instructions, so the whole cascade will be executed by the MMU before the next x86 instruction runs.
The paper linked above has all the details; this is just a summary to motivate what follows. Readers of my blog already know Brainfuck is Turing-complete, since we managed to compile register machine code into Brainfuck. So let's continue that chain of reasoning (motivated by this Proggit comment), and prove the Turing-completeness of MovDBz by compiling Brainfuck into MovDBz.
First of all, I'm going to add two new instructions to our one-instruction machine; hopefully we can all agree they don't change the spirit of the challange:
Here's a straightforward interpreter for this dialect of MovDBz.
At initial glance, there seem to be two tricky aspects of Brainfuck with respect to MovDBz:
The first problem is solved by basically turning this:
increment cells[ptr]into this:
if ptr == 0 then increment cells[0]This latter construct is easy to implement if we nominate one of our MovDBz registers as a temporary register tmp:
scan(0): MOVDBZ (tmp ← ptr-1) THEN found(0) ELSE scan(1)where n is the maximum index of Brainfuck cells.
Note that this means the generated MovDBz program's size will be linear in the number of Brainfuck cells. The Brainfuck spec calls for 30,000 cells; when compiling to MovDBz, it is usually a good idea to use a lot less cells.
At first glance, it might seem impossible to implement incrementing when all we have is a single decrement operator. Modular arithmetic comes to the rescue: since x + 1 ≡ x - 255 (mod 256), we can do repeated decrements (255 times) to emulate incrementing. Of course, we also have to handle specially the case when the value gets to 0.
So with the help of a constant register C256 we keep clamped to 256, we can turn this:
increment cells[i]Into this:
start: MOVDBZ (tmp ← C256-1) THEN count ELSE countWe can implement incrementing the pointer (> in Brainfuck) similarily, by having another constant register storing the number of cells (i.e. n+1).
To prototype these ideas, I've written a compiler that uses the above techniques to turn Brainfuck code into MovDBz. The Haskell implementation is available on Github, but be aware that it is somewhat-horrible spaghetti code.
One nice aspect of it, though, is that the compiler generates semantic labels and registers (like Cell i for the register containing the i'th Brainfuck cell, or ScanFinished i (Inc l) for the label of the instruction that starts incrementing the Cell i and eventually goes to label l when finished). This can make the output almost readable... For example, here's the result of compiling -.>. with two Brainfuck cells:
λ> mapM_ print $ compileBF 1 [DecData, Output, IncPtr, Output] (Src 0,MOVDBZ C0 C0 (S (Scan 0 (Dec (Src 1)))) (S (Scan 0 (Dec (Src 1))))) (Src 1,MOVDBZ C0 C0 (S (Scan 0 (Print (Src 2)))) (S (Scan 0 (Print (Src 2))))) (Src 2,MOVDBZ C0 C0 (S (DoIncPtr (Src 3))) (S (DoIncPtr (Src 3)))) (Src 3,MOVDBZ C0 C0 (S (Scan 0 (Print (Src 4)))) (S (Scan 0 (Print (Src 4))))) (Src 4,HALT) (S (Scan 0 (Dec (Src 1))),MOVDBZ Ptr Tmp (S (ScanFinished 0 (Dec (Src 1)))) (S (Scan 1 (Dec (Src 1))))) (S (Scan 0 (Print (Src 2))),MOVDBZ Ptr Tmp (S (ScanFinished 0 (Print (Src 2)))) (S (Scan 1 (Print (Src 2))))) (S (Scan 0 (Print (Src 4))),MOVDBZ Ptr Tmp (S (ScanFinished 0 (Print (Src 4)))) (S (Scan 1 (Print (Src 4))))) (S (Scan 1 (Dec (Src 1))),MOVDBZ Tmp Tmp (S (ScanFinished 1 (Dec (Src 1)))) (S End)) (S (Scan 1 (Print (Src 2))),MOVDBZ Tmp Tmp (S (ScanFinished 1 (Print (Src 2)))) (S End)) (S (Scan 1 (Print (Src 4))),MOVDBZ Tmp Tmp (S (ScanFinished 1 (Print (Src 4)))) (S End)) (S (ScanFinished 0 (Dec (Src 1))),MOVDBZ C0 C0 (S (DecCell 0 (Src 1))) (S (DecCell 0 (Src 1)))) (S (ScanFinished 0 (Print (Src 2))),PRINT (Cell 0) (Src 2)) (S (ScanFinished 0 (Print (Src 4))),PRINT (Cell 0) (Src 4)) (S (ScanFinished 1 (Dec (Src 1))),MOVDBZ C0 C0 (S (DecCell 1 (Src 1))) (S (DecCell 1 (Src 1)))) (S (ScanFinished 1 (Print (Src 2))),PRINT (Cell 1) (Src 2)) (S (ScanFinished 1 (Print (Src 4))),PRINT (Cell 1) (Src 4)) (S (DecCell 0 (Src 1)),MOVDBZ (Cell 0) (Cell 0) (S (UnderflowCell 0 (Src 1))) (Src 1)) (S (DecCell 1 (Src 1)),MOVDBZ (Cell 1) (Cell 1) (S (UnderflowCell 1 (Src 1))) (Src 1)) (S (UnderflowCell 0 (Src 1)),MOVDBZ CMaxData (Cell 0) (S End) (Src 1)) (S (UnderflowCell 1 (Src 1)),MOVDBZ CMaxData (Cell 1) (S End) (Src 1)) (S (DoIncPtr (Src 3)),MOVDBZ CMaxAddr Tmp (S (DoIncPtrLoop (Src 3) IncPtrDecCounter)) (S (DoIncPtrLoop (Src 3) IncPtrDecCounter))) (S (DoIncPtrLoop (Src 3) IncPtrDecCounter),MOVDBZ Tmp Tmp (Src 3) (S (DoIncPtrLoop (Src 3) IncPtrDecPtr))) (S (DoIncPtrLoop (Src 3) IncPtrDecPtr),MOVDBZ Ptr Ptr (S (DoIncPtrLoop (Src 3) IncPtrUnderflowPtr)) (S (DoIncPtrLoop (Src 3) IncPtrDecCounter))) (S (DoIncPtrLoop (Src 3) IncPtrUnderflowPtr),MOVDBZ CMaxAddr Ptr (S (DoIncPtrLoop (Src 3) IncPtrDecCounter)) (S (DoIncPtrLoop (Src 3) IncPtrDecCounter))) (S End,HALT)
Which, after label and register layouting, comes out as this:
λ> mapM_ print $ layout 1 $compileBF 1 [DecData, Output, IncPtr, Output] (0,MOVDBZ 0 0 5 5) (1,MOVDBZ 0 0 6 6) (2,MOVDBZ 0 0 25 25) (3,MOVDBZ 0 0 7 7) (4,HALT) (5,MOVDBZ 3 4 11 8) (6,MOVDBZ 3 4 12 9) (7,MOVDBZ 3 4 13 10) (8,MOVDBZ 4 4 14 29) (9,MOVDBZ 4 4 15 29) (10,MOVDBZ 4 4 16 29) (11,MOVDBZ 0 0 17 17) (12,PRINT 5 2) (13,PRINT 5 4) (14,MOVDBZ 0 0 18 18) (15,PRINT 6 2) (16,PRINT 6 4) (17,MOVDBZ 1 4 19 19) (18,MOVDBZ 1 4 22 22) (19,MOVDBZ 4 4 1 20) (20,MOVDBZ 5 5 21 19) (21,MOVDBZ 1 5 19 19) (22,MOVDBZ 4 4 1 23) (23,MOVDBZ 6 6 24 22) (24,MOVDBZ 1 6 22 22) (25,MOVDBZ 2 4 26 26) (26,MOVDBZ 4 4 3 27) (27,MOVDBZ 3 3 28 26) (28,MOVDBZ 2 3 26 26) (29,HALT)
For me, the best way to understand something is usually to implement it myself. So when I first started getting serious about Haskell, I implemented (in Common Lisp) a simple lazy functional programming language that used graph rewriting for the interpretation. Then, years later, I mentioned this implementation in a Reddit thread about writing a Lisp interpreter in Haskell (maybe inspired by the well-known Haskell tutorial and this blogpost's namesake). I now decided to clean up that old code, throw out all the unnecessary cruft, and arrive at something easy to understand. The finished interpreter is less than 500 lines of Common Lisp (not including the type checker which will be the subject of my next post).
I should also add that of course we're not going to actually implement Haskell here. As we'll see, the language is a very much simplified model of pure, lazy functional languages. It's actually closer to GHC's Core than Haskell.
We're not going to bother with parsing: our input programs will be S-expressions. For example, instead of the following Haskell program:
data List a = Nil | Cons a (List a) map f Nil = Nil map f (Cons x xs) = Cons (f x) (map f xs)
we will write:
(defdata list (a) nil (cons a (list a))) (deffun map ((f nil) nil) ((f (cons x xs)) (cons (f x) (map f xs))))
As you can see, there is no syntactic distinction between variable names and constructor names — when translating these S-expressions, we'll take special care to always register constructor names before processing expressions.
Internally, we'll represent programs in an even more simplified way, by making all function applications explicit. Multi-parameter functions and constructors, of course, will be implemented via schönfinkeling. The syntax tree itself is represented using two class hierarchies: subclasses of expr for expressions, and subclasses of pattern for patterns:
Going from S-expressions to this object-oriented representation is straightforward:
The basic idea behind implementing a lazy functional language using graph rewriting is to represent terms as directed graphs, and reducing a function application is implemented by replacing the application node with the graph corresponding to the right-hand side of the function definition, of course with all the references to formal arguments instantiated to the actual arguments.
Let's look at a simple example first, with no sharing or recursion:
(defvar x (let ((primes (cons 2 (cons 3 (cons 5 nil))))) (map (+ (+ 1 1) primes))))
As you can see, the green boxes represent constructors (with rounded corners for primitive values), yellow ones are functions, and the small gray circles are function applications. There are no variables, since references are resolved when building this graph.
We can simplify this format by omitting application nodes and just pushing arguments under their function node (until it becomes saturated), like this:
Reducing this entails instantiating the following graph (from the second case of map), with f bound to (+ (+ 1 1), x bound to 2, and xs bound to (cons 3 (cons 5 nil)). Note the explicit application node: we won't know the arity of f until it's bound to something.
Resulting in
Note how the first argument to + is shared in the result. Also note how (+ (+ 1 1) 2) is not reduced by map, since it is lazy in its first argument.
Our second example shows recursively-defined terms:
(defvar nats (cons 0 (map (+ 1) nats)))
This is already in WHNF, but we can nevertheless reduce the tail of the list, to get this:
and so on.
Since we want to be able to replace function application subgraphs easily (when reducing those applications), the representation uses an extra level of indirection: gnodes contain the payload: the type of the node and pointers to its children grefs; and each gref contains a single reference to its content gnode. grefs can be shared, so when its content gnode is replaced, all the shared references are updated.
Earlier, we said variables are not present as such in the graph representation, since they are inlined (respecting sharing, of course). But we still have a var-gnode class defined above. The reason for that is simply to mark occurances of formal variables in function definitions, which will be filled in when reducing function applications.
But there's another, more compilcated problem with variables. Our language's let construct is mutually recursive, so we can't just build up the variables' graphs one by one:
(defvar main (let ((zig (cons 0 zag)) (zag (cons 1 zig))) zig))
Of course, not everything can be helped by this:
(defvar silly-list (let ((xs xs)) (cons 1 xs)))
So we'll add a gnode subclass for temporarily storing let-bound variable occurances, and replace them after the fact. Unfortunately, this also causes the code that translates from expr to gnode to become a little spaghetti-y. On the other hand, we don't implement lambda lifting, as it's not strictly essential — the programmer will have to do it by hand, by writing top-level functions instead.
There are three parts to making reductions work: first, a way to do pattern matching against function alternatives. Second, given a mapping from this match, instantiating a function definition by replacing var-gnodes. The third part is orchestrating all of this by taking care of choosing which nodes to reduce.
Pattern matching is a relatively straightforward matter: a variable pattern always succeeds (and binds the subgraph) and a constructor pattern can either succeed and recurse, fail, or (if the actual node is a function application) force reduction. This latter is done via raising a Lisp condition.
Once we have the bindings in the format returned by match-pattern, we can easily instantiate function bodies, we just have to maintain a mapping from old node to new to avoid diverging on cycles.
Actual reduction then becomes just a matter of putting these two modules together. Several functions are provided with varying granularity: reduce-graph tries direct reduction, reduce-graph* catches need-reduce conditions and recurses on those nodes, in effect making sure a single reduction step at the target site can be made; and reduce-to-whnf repeatedly uses reduce-graph* until the head is either a constructor, or a non-saturated function application. simplify-apps is not really a reduction step, it just removes superfluous apply-gnodes.
The rest of the code just keeps a registry of functions and constructors, and defines some primitive functions (note how we need the very simple bool ADT to be a built-in just so we have a return type for >=).
Given an S-expression containing an Alef program, we can transform it into a graph suitable for reduction by doing the following steps:
The nice plots in this blogpost were created by dumping the program's graph in GraphViz format, and running dot on the output. The visualization code is a straightforward traversal of the graph. Note how we store a stable name (generated using gensym) in each node, to ensure a correspondence in the generated GraphViz nodes between reduction steps. In the future, this could be used to e.g. animate the graph to show each reduction step.
We can use this like so:
(in-fresh-context (let ((g (parse-program '((defvar main (string-append "Hello, " "world!")))))) (simplify-apps g) (with-open-file (s "hello.dot" :direction :output :if-exists :supersede) (dot-from-graph g s)) (reduce-to-whnf g) (with-open-file (s "hello-reduced.dot" :direction :output :if-exists :supersede) (dot-from-graph g s))))
Which results in the very uninteresting graphs:
and
SLOCCount tells me the whole code presented in this blogpost is 471 lines of Common Lisp; you can check out the full source code on GitHub. It implements lazy semantics for a pure functional programming language with pattern matching on algebraic datatypes; of course, by changing the definition of reduce-function, we could easily make it strict.
In my next blog post, I'll be adding Hindley-Milner type inference / typechecking. Because of the type-erasure semantics of our language, we could implement our evaluator without any type system implementation, simply by assuming the input program to be well-typed. So all that we'll need is an extra typechecking step between parsing and graph building that either rejects or accepts an Alef program.
About two and a half years ago, a wave of interest in electronics swept across the Budapest office of Intentional. We desperately wanted to create something tangible from first principles. The idea we settled on was to design and eventually build a CPU that uses Brainfuck as its machine language.
Looking back, it really was a case of people with insufficient knowledge trying to use inappropriate tools. But damn if we didn't have fun during the process! After filling a couple of notebooks with sketches, we ended up with an initial design in Logisim. It had horrible timing problems, of course, since too much of it was asynchronous. Before ironing out all the wrinkles, though, I remember Maya pointing at one of the lines tangled up on the screen, and saying "You guys realise this single line will be 16 wires if we actually want to solder this together, right?" So basically we gave up on building the thing. Later on, Maya and Encsé went on to enroll to a bachelor's program in EE as a hobby; and I decided to stick to discrete logic, ordered a bunch of 7400 TTL's and some LEDs and seven-segment displays and started wiring together much simpler circuits on breadboards. I never got to soldering, not to mention getting access to anything that could produce PCB's, Then as I moved to Singapore, I left all my electronics stuff at home, and put the whole electronics thing on the backburner indefinitely.
Then, a couple months ago I discovered the Papilio FPGA platform, which has this collection of nice IO daughterboards (called "wings") that snap right into it, no soldering or even wiring required. I ordered one with the LogicStart IO board which features, among other, more advanced stuff, eight toggle switches and four seven-segment displays. Perfect for my baby steps into the world of FPGA's!
So what else could my Hello World project have had been, than the Brainfuck CPU.
We can use the Harvard architecture: since the Brainfuck language has no reflection capabilities, the program can be stored in ROM with no programmable access. Memory is implemented as a RAM of 32K 8-bit bytes. The CPU also has several internal registers:
Output is implemented by an 9-bit signal: 8 bits of data and an enable bit. When a . opcode is encountered, the CPU sets these 9 bits, and enters a special state until it receives an acknowledgment signal. Input is implemented similarily. On the actual board, the output signals are connected to the seven-segment display, the input signals are fed from the eight toggle switches, and the directional "mini-joystick" is used to acknowledge input/output.
Compared to a normal machine language, it's really just [ and ] that requires special handling. Everything else is just straightforward manipulation of either idx or RAM[idx] via incrementing/decrementing; or pushing data between RAM[idx] and the IO port. [ and ] are tricky because we need to search for their matching pairs, and pre-processing the Brainfuck program to attach pair addresses would be against the spirit of this (self-imposed) challange.
One solution would be to maintain a stack in a separate RAM, and push PC into it whenever a [ is encountered. In that case, ] is a simple matter of popping PC if RAM[idx] does not equal 0. However, here we've basically changed [/] from a while loop to a do while loop. So if RAM[idx] is 0 when we first enter the [, we have to scan the program forward to find its matching ].
For simplicity's sake, I decided not to worry about performance and skip the stack part, and just implement scanning in both directions. Scanning is where the DC register is used (explained here for [, but ] is similar): if the opcode is [ and RAM[idx] is 0, DC is set to 1, and the CPU enters a special skip-forward state from the next opcode. In this state, only [ and ] opcodes have any effect: they increment and decrement, respectively, the DC register. When DC gets to 0, we know we've found the matching ], and so we can go back to the regular fetch-execute cycle.
I originally planned to implement the whole thing in VHDL, and compile that using the Xilinx synthesizer tools (since the Papilio One board uses a Xilinx FPGA chip). However, I've found VHDL to be quite a horrible language from a software programmer's point of view. The whole point, for me, of going from physical chips to FPGA's was to enable using abstractions to manage complexity; so why settle for a poor language like VHDL? Fortunately, there's a whole family of Haskell-embedded DSLs for hardware description called Lava. Of these, Kansas Lava seemed the only one actively maintained, and it already had support for a Xilinx dev board; so adding support for the Papilio was straightforward (see my kansas-lava-papilio package).
The complete code for my Brainfuck CPU (including IO via the LogicStart daughterboard) is available on GitHub. There are quite some rough edges left to file off; I'd say the most pressing is adding the ability to synthesize the program ROM separately from the CPU definition.
This first video shows a simple countdown (actually, count-up) program: ,[>+.<-]. I had to record these slightly out-of-focus, otherwise the seven-segment LEDs were hard to read.
Next up is "Hello world!":
And the final one shows Maya's solution to the 9-digit problem. This one really shows how slow this naïve, stackless implementation is.
One class of problems solved by the software stack we have here at SCB is pricing structured financial products. This requires four components: some data structure describing the details of the actual transaction (e.g. the strike and the maturity date of an option), market data (e.g. spot prices), a pricer configuration (e.g. the pricing date and the currency in which the value is requested), and a computation method for the given type of transaction. This latter part is basically a function mapping the trade data, the market data, and a pricer configuration to a value:
-- These types will be used in the running example data Ccy data Date data Config = Config{baseCcy :: Ccy, pricingDate :: Date} -- The kinds of market observations we are interested in data MarketData α where FXSpot :: Ccy -> Ccy -> MarketData Double IRDiscount :: Ccy -> MarketData Double -- A market is the collection of market data data Market getMarketData ∷ Market -> MarketData α -> α -- We will define several versions of this datatype type Pricer₁ = Market → Config → Double -- Our running example will be a (discounting) pricer for payments data Payment = Payment{ccy ∷ Ccy, notional ∷ Double, date ∷ Date} discount ∷ Floating α ⇒ α → α → α → α discount τ df x = x * (1 + df) ** (-τ) pricer₁ ∷ Payment → Pricer₁ pricer₁ Payment{..} market Config{..} = discount (date - pricingDate) df notional′ where notional′ = notional * getMarketData market (FXSpot ccy baseCcy) df = getMarketData market (IRDiscount baseCcy)
Of course, the type of data needed from the market depends on the choice of valuation function, and in practice, it would be a really poor idea to assemble a kind of super-total market that contains everything from the Woolong vs. Latinum FX spot price to volatility of Buttfuckistani dirt just to valuate a payment in USD. One way around it is to have pricing happen in the IO monad, so that valuation functions can load market data (from external sources) as needed:
type Pricer₂ = (∀ α. MarketData a → IO α) → Config → IO Double
Of course, we don't want pricers to do arbitrary IO — for example, for the more complex pricers, we might want to run them on the grid; so let's hide the fact that market data is loaded in IO:
type Pricer₂′ = ∀ μ. Monad μ ⇒ (∀ α. MarketData α → μ α) → Config → μ Double pricer₂ ∷ Payment → Pricer₂′ pricer₂ Payment{..} loadMarketData Config{..} = do spot ← loadMarketData $ FXSpot ccy baseCcy let notional′ = notional * spot df ← loadMarketData $ IRDiscount baseCcy return $ discount (date - pricingDate) df notional′
However, this is still not perfect. It is much more efficient to bulk-fetch the market data; furthermore, risk calculations often require repeated pricings with slightly modified markets, where only the numbers change, not what is actually in the market. So you want to be able to assemble a market containing just the right data for a given pricer function for the given trade, and then regard the pricer as a pure function.
Our first idea might be to request market data explicitly, separate from the actual pricing algorithm:
data MarketKey = ∀ α. MarketKey (MarketData α) type Pricer₃ = Config → ([MarketKey], Market → Double) pricer₃ ∷ Payment → Pricer₃ pricer₃ Payment{..} Config{..} = (deps, price) where deps = [MarketKey $ FXSpot ccy ccyBase, MarketKey $ IRDiscount ccyBase] price mkt = discount (date - pricingDate) df notional′ where spot = getMarketData mkt $ FXSpot ccy ccyBase notional′ = notional * spot df = getMarketDate mkt $ IRDiscount ccyBase
But this is both cumbersome to use (you have to explicitly list your dependencies), and also unsafe (what if pricer₃ returned an empty dependency list? It would still be accepted by the type checker...)
Can we do better?
The idea is to write pricers using a type exposing only an applicative interface. The implementation of this type will allow us to determine, without doing any actual computation, the list of dependencies.
data P α instance Applicative P get ∷ MarketData α → P α runP ∷ ∀ μ. (Monad μ) ⇒ ([MarketKey] → μ Market) → P α → μ α dependencies ∷ P α → [MarketKey] -- To be used by the implementation of runP type Pricer₄ = Config → P Double pricer₄ ∷ Payment → Pricer₄ pricer₄ Payment{..} Config{..} = discount (date - pricingDate) ⟨$⟩ df ⟨*⟩ notional' where notional' = (notional *) ⟨$⟩ get (FXSpot ccy baseCcy) df = get $ IRDiscount baseCcy
Since get has type MarketData α → P α, and no monad interface is exposed for P, it is statically enforced that what keys you get from the market can only depend on the trade details, not on data previously retrieved from the market.
So how do we define P and implement Applicative, get, dependencies and runP?
The key insight is that the Functor and Applicative instances are not really supposed to do anything. So I came up with the concept of free applicative functors, which we can use with a datatype that statically tracks the requested MarketKeys while accumulating the pure post-processing functions. The module Control.Applicative.Free contains functions both for static analysis and evaluation (in some other applicative functor) of computations built using the Applicative interface; here, we use the Identity functor, since both collecting the results and running the pricer (once the market is loaded) are pure operations.
import Control.Applicative import Control.Applicative.Free import Control.Monad.Identity data MarketRequest a = forall b. MarketRequest (MarketData b) (b -> a) instance Functor MarketRequest where fmap f (MarketRequest req cont) = MarketRequest req (f . cont) type P = Free MarketRequest get ∷ MarketData α → P α get key = effect $ MarketRequest key id dependencies ∷ P α → [MarketKey] dependencies = runIdentity ∘ analyze (Identity ∘ collect) where collect :: MarketRequest a -> [MarketKey] collect (MarketRequest key _) = [request key] runP ∷ ∀ μ. (Monad μ) ⇒ ([MarketKey] → μ Market) → P α → μ α runP loadMarket pricer = do let deps = dependencies pricer mkt ← loadMarket deps return $ runIdentity ∘ eval (Identity ∘ step) $ pricer where step ∷ MarketRequest a → a step (MarketRequest key cont) = cont $ getMarketData mkt key
I think the above approach of using free applicatives can be useful in a lot of similar situations; I've already used it in another project at work beside the pricer API briefly outlined here.
I should also mention that the free arrows stuff came from the same thought process that led me to the solution presented here.
In my prevous post, I described a zipper-like representation for modular counters. That representation was well-suited for when your only operations are incrementing and decrementing. However, it'd be really cumbersome to do more complex arithmetics with it.
Then the next week, Peter Diviánszky told me about all the cool stuff that was presented at the latest Agda Implementors' Meeting, among them, the Quotient library. I still remember the time when, as a practice, I tried to implement integers in Agda without looking at the standard library. I came up with something much like the library representation, except mine had a separate constructor for zero (so I had +1+ n and -1- n). I really hated how I had to shift by one at least one of the cases to avoid ending up with two representations of zero. If only there was a way to tell Agda that those two representations actually mean the same thing...
Quotient types promise to do just that: you define a type where the constructors may be redundant so that there may be several values that should have the same semantics, and then you divide it with an equivalence relation so that the new type's values are the equivalence classes. See this example defining integers as pairs of natural numbers such that the pair (x, y) represents x-y.
I wanted to try doing the same for a proper modular integer type, by factoring integers with the equivalence relation x ∼ y ⇔ n ∣ ∣x-y∣. The point is, you take the integers, define this relation, then prove that it is indeed an equivalence (i.e. it is reflexive, symmetric and transitive), in other words, you create a setoid; then you use the Quotient type constructor to create your set-of-equivalence-classes type. After that's done, you can define functions over this quotient type by defining them over representations, and proving well-definedness, i.e. that the function maps equivalent representations to the same result.
This last step can be needlessly cumbersome when defining either non-unary functions or endomorphisms, so first of all I created a small library that makes it easier to define unary and binary operators over a quotient type. For example, to define a binary operator, all you need is a binary operator on the representation set, and a proof that the operator preserves the equivalence, thereby being agnostic to the choice of representant on both arguments:
lift₂ : (f : Op₂ A₀) → f Preserves₂ _≈_ ⟶ _≈_ ⟶ _≈_ → Op₂ (Quotient A)
So after writing loads and loads and loads of arithmetic proofs on divisibility of absolute values, like n ∣ ∣x∣ ∧ n ∣ ∣y∣ ⇒ n ∣ ∣x + y∣, I was finally ready to define modular addition:
Mod₀ : ℕ → Setoid _ _ Mod₀ n = {!!} Mod : ℕ → Set Mod n = Quotient (Mod₀ n) plus : ∀ {n} → Mod n → Mod n → Mod n plus {n} = lift₂ _+_ proof where proof : ∀ {x y t u} → n ∣ ∣x - y∣ → n ∣ ∣t - u∣ → n ∣ ∣(x + t) - (y + u)∣ proof = {!!}
Of course, the meat of the work was in actually defining Mod₀ and proof above. But after that, we can get back our old increment/decrement functions as very simple and straightforward definitions:
_+1 : ∀ {n} → Mod n → Mod n _+1 = plus [ + 1 ] _-1 : ∀ {n} → Mod n → Mod n _-1 = plus [ - (+ 1) ]
And proving that _+1 and _-1 are inverses of each other comes down to the very simple arithmetic proof (on vanilla integers!) that
pred-suc : ∀ x → ℤpred (ℤsuc x) ≡ x
Of course, much more properties need to be proven. The end goal of this project should be to prove that Mod n is a commutative ring; a much more ambitious project would be proving that Mod p is a field if p is prime. Unfortunately, on my machine Agda takes more than two minutes just to display the goal and context in the following hole:
plus-comm : ∀ {n} → (x y : Mod n) → plus x y ≡ plus y x plus-comm {n} x y = {!!}
so this is a problem I'll have to find a workaround for before going on. But at least I have my counters, so I can at least get back to my original goal and work on the register machine semantics. Expect the next post to be about that.
You can browse the full source code here, and track it on GitHub.
First of all, before I start on the actual blog post, let me put this in context. I rembember a couple of years ago when I developed an interest in functional programming languages, and Haskell in particular. There was a phase when I was able to use Haskell to solve problems in the small. I understood most of the basics of pure functional programming; then there were things I regarded as magic; and of course there was a lot of things I didn't even know that I didn't know about. But none of it did I grok.
I feel like I'm starting to get to the same level with Agda now. So this is going to be one of those "look at this cool thing I made" posts where the actual result is probably going to be trivial for actual experts of the field; but it's an important milestone for my own understanding of the subject.
I wanted to play around with simple but Turing-complete languages, and I started implementing an interpreter for a counter machine. More on that in a later post; this present post describes just the representation of register values. In the model that I implemented, values of registers are byte counters, meaning they have 256 different values, and two operations +1 and -1 that the inverses of each other. Incrementing/decrementing should roll over: 255 +1 = 0 and 0 -1 = 255.
My first approach was to just use the Fin type from the standard library. However, the structure of Fin is nothing like the structure imposed by +1 and -1, so while one can define these functions, proving properties like -1 ∘ +1 = id is unwieldy and the resulting proofs are not easy to reuse in other proofs.
So I eventually settlend on a zipper-like representation. The intuition behind it is to think of the possible values of Counter (suc n) as points on the discrete number line from 0 to n. You have a vector of numbers behind you and a vector of numbers in front of you; with the invariant that the length of the two vectors is always n. For example, if n=3, you can be at positions ([], [1, 2, 3]), ([1], [2, 3]), ([1, 2], [3]) and ([1, 2, 3], []). To increase the value, just move the leftmost item of the second vector to the end of the first one; rollover is handled by the simple syntactic rule (xs, []) ↦ ([], xs).
Of course, there is no point in actually storing the numbers, so we can use vectors of units instead; but why store those if we only care about their length?
So the eventual representation I came up with was:
data Counter : ℕ → Set where cut : (i j : ℕ) → Counter (suc i + j)
I was hoping that I could write +1 and -1 like this:
_+1 : ∀ {n} → Counter n → Counter n cut i zero +1 = cut zero i cut i (suc j) +1 = cut (suc i) j _-1 : ∀ {n} → Counter n → Counter n cut zero j -1 = cut j zero cut (suc i) j -1 = cut i (suc j)
But life with indexed types is not that simple: for example, in the first case, the left-hand side has, by definition, type Counter (suc i + 0) and the right hand Counter (suc 0 + i). So we also need to inject proofs that the types actually match (with the actual proofs p_{1} and p_{2} omitted here for brevity):
_+1 : ∀ {n} → Counter n → Counter n cut i zero +1 = subst Counter p_{1} (cut zero i) cut i (suc j) +1 = subst Counter p_{2} (cut (suc i) j) _-1 : ∀ {n} → Counter n → Counter n cut zero j -1 = subst Counter (sym p_{1}) (cut j zero) cut (suc i) j -1 = subst Counter (sym p_{2}) (cut i (suc j))
However, this leads to more problems further down the line: you can't get rid of that subst later on, thus forcing you to use heterogenous equality for the rest of your proofs. While I was able to prove the property
+1-1 : ∀ {n} → {k : Counter n} → k +1 -1 ≡ k
using heterogenous equality, it broke down on me further down the road when actually trying to use these counters in the semantics of my register machines.
So instead of storing the size of the counter in a type index, I used a type parameter. This requires carrying around an explicit proof that the sizes match up, but we needed those proofs for the indexed case in the subst calls anyway, and then invoke proof irrelevance in the proof of +1-1:
data Counter (n : ℕ) : Set where cut : (i j : ℕ) → (i+j+1=n : suc (i + j) ≡ n) → Counter n _+1 : ∀ {n} → Counter n → Counter n (cut i zero i+1=n) +1 = cut zero i p_{1} (cut i (suc j) i+j+2=n) +1 = cut (suc i) j p_{2} _-1 : ∀ {n} → Counter n → Counter n (cut zero j j+1=n) -1 = cut j zero p_{3} (cut (suc i) j i+j+2=n) -1 = cut i (suc j) p_{4} +1-1 : ∀ {n} → {k : Counter n} → k +1 -1 ≡ k +1-1 {k = cut i zero _} = cong (cut i zero) (proof-irrelevance _ _) +1-1 {k = cut i (suc j) _} = cong (cut i (suc j)) (proof-irrelevance _ _) -1+1 : ∀ {n} → {k : Counter n} → k -1 +1 ≡ k -1+1 {k = cut zero j _} = cong (cut zero j) (proof-irrelevance _ _) -1+1 {k = cut (suc i) j _} = cong (cut (suc i) j) (proof-irrelevance _ _)
With this approach, lifting these theorems to be about whole states, not just individual register values, is a breeze, e.g.:
+1-1Σ : ∀ {Σ x y} → (getVar y ∘ decVar x ∘ incVar x) Σ ≡ getVar y Σ +1-1Σ {x = x} {y = y} with toℕ x ≟ toℕ y ... | yes x=y = +1-1 ... | no x≠y = refl
But this is takes us to my actual application for these counters; and that will be the topic of a next post.
Here are the complete sources of the two counter implementations:
I couldn't find a step-by-step tutorial explaining the process of developing iOS applications in Haskell, so after finally getting "Hello World" to run on an iPhone, I decided to write this tutorial. I should also credit Lőry, who did the iOS side of the work.
The basic overview of what we're going to do in this tutorial is the following:
For this tutorial, we will simulate an intricate Haskell backend with the a simple function. For your real application, this is where you go all out with your Haskell-fu.
module Engine where import Data.Char (ord) engine :: String -> Either String [Int] engine s | length s < 10 = Left "String not long enough" | otherwise = Right $ map ord s
To interface our backend with the frontend developed in Objective-C (or C or C++ or...), we need to represent the input and output of our Haskell function in terms of simple C types. For the function engine, a straightforward API would be, in pseudo-C:
bool engine (in string s, out string error, out int[] result)
Of course, we have to use char *'s for strings, and pointers for out parameters and arrays, so our real API will be:
int engine (char* s, char* *error, int* *result, int *result_length)
with engine returning 0 on success (Right) and non-zero on failure (Left).
The Haskell FFI representation of this signature is:
foreign export ccall "engine" engineC :: CString -> Ptr CString -> Ptr (Ptr CInt) -> Ptr CInt -> IO CInt
The next step requires us to actually define engineC that does all the necessary marshalling. We simply evaluate engine and then set the appropriate out-parameters.
module Engine.FFI (engineC) where import Foreign import Foreign.C import Control.Monad (zipWithM_) foreign export ccall "engine" engineC :: CString -> Ptr CString -> Ptr (Ptr CInt) -> Ptr CInt -> IO CInt engineC s ptrErr ptrptrResult ptrLen = do s' <- peekCString s case engine s' of Left err -> do cErr <- newCString err poke ptrErr cErr return 1 Right result -> do pokeList ptrptrResult ptrLen $ map fromIntegral result return 0 pokeList :: Storable a => Ptr (Ptr a) -> Ptr CInt -> [a] -> IO (Ptr a) pokeList ptrptrList ptrLen xs = do let len = length xs ptr <- mallocBytes $ len * elemSize let ptrs = iterate (`plusPtr` elemSize) ptr zipWithM_ poke ptrs xs poke ptrptrList ptr poke ptrLen $ fromIntegral len return ptr where elemSize = sizeOf $ head xs
The next step is compiling our Haskell project into C, so that we can use Apple's SDK to compile that for the iPhone, and also call engine from other code, like the Objective-C parts that make up the frontend.
Unlike GHC, Jhc doesn't compile individual modules. Instead, it compiles every used definition (but only those) and the runtime into a single C source file. Although we are not going to run our Haskell program directly, and instead call to it from the frontend, Jhc still needs a main function in the source code. So let's create a Main module which we will compile with Jhc:
module Main where import Engine.FFI main :: IO () main = return ()
We can compile this module into a C file containing the code for engineC and everything else it uses (including imported packages):
jhc -fffi -fjgc --cross -mle32 -C EngineMain.hs -o EngineMain.jhc.cThe -fffi flag turns on FFI support and makes Jhc generate the engine function from the foreign export declaration and the definition of engineC. Note that there is no name clash between engine the C function (defined as engineC in Haskell-land) and engine the Haskell definition. I think in this particular example it is cleaner to use the same name for both.
The -fjgc flag generates GC code. Note that we will also need to enable the GC code in the next step, when compiling the C sources.
The --cross -mle32 flags are important because they instruct Jhc to target little-endian, 32-bit CPUs which is what the ARM is.
Everything up to this point can be done without Apple's SDK, and in fact you can run Jhc on any platform you wish. From here on, however, we will use the iOS SDK to compile to ARM.
To compile EngineMain.jhc.c, we first need to set some preprocessor macros:
You also need some important C compiler flags (you can ignore the warning settings if you'd like):
-std=gnu99 -falign-functions=4 -ffast-math -fno-strict-aliasing -marm -Wextra -Wall -Wno-unused-parameter-marm is very important because otherwise, GCC (or Clang) and Jhc step on each other's toes, leading to strange crashes seemingly out of nowhere.
You can use the standard SDK to create the frontend; I will not cover that here in detail. You also need to create a header file containing the signature of our exported function. The code generated by Jhc also contains initialization and finalization routines that need to be called before and after calling any functions defined in Haskell:
extern void hs_init (int *argc, char **argv[]); extern void hs_exit (void); extern int engine (char* s, char* *msgError, int* *result, int *len);
You also need to manage the memory returned by the backend. The call to mallocBytes in the marshalling code is compiled into vanilla malloc, so you can simply call free after you're done with it.
To make initialization and memory management easier, Lőry has created a sample XCode project that wraps the C API to use Objective-C types. You can find the tarball here.
This is an based on a chapter of the M.Sc. thesis I am writing at ELTE, supervised by Péter Diviánszky.
For my M.Sc. thesis, I've been working on writing a compositional type checker for Haskell 98. The basic idea is to extend Olaf Chitil's compositional type system with ad-hoc polymorphism, Haskell 98's major extension to the Hindley-Milner type system. In this post, I'm showing the motivation behind wanting to go compositional.
A property shared by both commonly-used algorithms for doing Hindley-Milner type inference, W and M, is that both W and M infer the type of composite expressions by inferring one subexpression (in some sense, the “first” one) and using its results in inferring the type of the “next” one. They are linear in the sense that partial results are threaded throughout the type inference.
The effect of linearity on type inference is that certain sub-expressions (those that are processed earlier) can have greater influence on the typing of other subexpressions. This is bad because it imposes a hierarchy on the subexpressions that is determined solely by the actual type checking algorithm, not by the type system; thus, it can lead to misleading error messages for the programmer.
For example, let's take the following definition of a Haskell function:
foo x = (toUpper x, not x)There are two ways to typecheck this definition using W: either we first typecheck toUpper x, using the context {x :: α}, resulting in the type equation α ~ Char, then checking not x with {x :: Char}, or do it the other way around, by first looking at not x, then as a result recursing into toUpper x with the context {x :: Bool}.
GHC, it seems, does the former, resulting in the following error message:
Couldn't match expected type `Bool' against inferred type `Char'Whereas Hugs 98 does the latter:
ERROR "test.hs":1 - Type error in applicationThe problem is that they are both misleading, because there is nothing wrong with either not x or toUpper x by itself. The problem only comes from trying to unify their respective views on the type of x.
A compositional type checker, in contrast, descends into toUpper x and not x using the same context, {x :: α}. The first one results in the typing (which is defined to be not just the type of an expression, but also a mapping of monomorphic variables to their types) {x :: Char} ⊢ Char, and the second one in {x :: Bool} ⊢ Bool. Only afterwards are these two typings tried to be unified.
This is better because it becomes meaningful to talk about the typing of a subexpression. For the example above, my work-in-progress compositional type checker can report errors with an (IMO) much more helpful message:
input/test.hs:1:8-25:Of course, the devil's in the details — but that's what my thesis will be about.
This is a continuation of my previous post on register machines vs. Brainfuck programs. We left off at Brainfuck's supposed Turing-completeness.
Now, the most straightforward way to prove Turing-completeness of a given language is to write a compiler that takes a program written in a language that is already known to be Turing-complete, and creates a program written in the language to be proved Turing-complete, that simulates the original program. So an obvious way to prove that Brainfuck is a Turing-complete language is to compile register machine programs into Brainfuck. This has the added advantage that a programmer having some experience in real-world assembly programming can easily write register machine programs, which can then be compiled into (horribly inefficient and over-complicated, as we'll see) Brainfuck programs.
Important note: Of course, to really prove, in a mathematical sense, that Brainfuck is Turing-complete, we would first have to define formal operational semantics for register machines and Brainfuck programs to be even able to argue about simulating one with the another. In this post, I will appeal to intuition instead.
So how does one simulate a register machine (RM for short) using Brainfuck? The first idea is that since a given RM program can only reference a finite number of variables, we can lay them out in the linear array of memory cells provided by the Brainfuck model. So we can assign e.g. cell number #0 to a, #1 to b and #2 to z, and any operation working on z first increments the pointer twice (i.e. >> in Brainfuck notation), then does something, then decrements the pointer twice (<<) to get it back to the initial state. So for the line
clr z,we can write
>>[-]<<
Similarly, we can compile
dec bto
>-<In fact, to make further work simpler, we can devise an intermediate language that has constructs similar to Brainfuck, but that uses named registers instead of a linear array. The language called Loop has the following statements:
Once all the registers are laid out in the linear memory, compiling this to Brainfuck is trivial.
As we've previously noted, the other major difference between RM and Brainfuck is that Brainfuck programs can't directly control their execution sequence. If your Brainfuck program contains "<++" at the next position, you can be 100% sure that the pointer will move left and then increment the cell twice, and there is nothing you can do about it. Contrast this with RM's jmp and jz instructions that can change the statement that gets executed next.
To reconcile this difference, the key idea is to start thinking about RM programs in a different way. Instead of a sequential list of instructions with possible jumps between, let's look at it as an n-ary branch switching on some special register called a Program Counter. So for the following program that adds a to b:
we can also imagine it as the following program, written in some unspecified pseudo-code:
pc ← 1 while pc ≠ 11 loop switch pc case 1: clr tmp pc ← 2 case 2: if a = 0 then pc ← 7 else pc ← 3 case 3: dec a pc ← 4 case 4: inc z pc ← 5 case 5: inc b pc ← 6 case 6: pc ← 2 case 7: if z = 0 then pc ← 11 else pc ← 8 case 8: dec z pc ← 9 case 9: inc a pc ← 10 case 10: pc ← 7 end loop
At first glance, we don't seem to be any closer to our goal, since now we have to implement if and switch in Loop. First, let's observe that it makes no difference if several values of the pc register are handled in a single iteration of the outermost loop. Using this observation, and getting rid of some superfluous partitioning of statements, the above can be rewritten as the following:
pc ← 1 while pc ≠ 11 loop if pc = 1 then clr tmp pc ← 2 if pc = 2 then if a = 0 then pc ← 7 else pc ← 3 if pc = 3 then dec a inc tmp inc b pc ← 2 if pc = 7 then if tmp = 0 then pc ← 11 else pc ← 8 if pc = 8 then dec tmp inc a pc ← 7 end loop
We've eliminated the need for switch, and all our if branches fall in one of the following two categories:
The first kind of test we can simulate by using not just one pc register, but one for each possible value of pc, taking values of 0 or 1. So we enforce the invariant that pc_{i} is 1 iff the virtual pc register equals i. Then we can use while loops for branching by testing for pc_{i} and then immediately after, decrementing it, thus ensuring that the loop runs at most once. The above program thus becomes:
dec pc_{11} inc pc_{1} while pc_{11} loop while pc_{1} loop dec pc_{1} clr tmp inc pc_{2} end loop while pc_{2} loop if a = 0 then inc pc_{7} else inc pc_{3} end loop ... while pc_{8} loop dec pc_{8} dec tmp inc a inc pc_{8} end loop end loop
Note the special handling of pc_{11} which gets decremented first, to -1, so that incrementing it later exits the main loop.
We are inching closer and closer to our destination – we just need a way to increment one of two pc registers based on the value of some other, non-pc register. Solving this requires some trickery because we can only use loops for testing if a given register is zero, but then we have to zero it out unless we want to get into an infinite loop. The solution is similar to what we do in our original adding example, by using a separate register as temporary storage. Suppose we want to translate the following piece of code:
if a = 0 then inc pc_{7} else inc pc_{3}
Using a temporary buffer, it is possible to run two loops that by the end preserve the register a's initial value, but allow us to change other registers in the process. We will use two special-purpose registers Z and NZ to signal if the value of a is zero or non-zero. First, we set up Z:
inc Z inc NZ while a loop dec a inc Buffer clr Z end loop while Buffer loop dec Buffer inc a end loop
By that point, a retained its original value, but Z is 1 iff a is zero at the start. So now we can discriminate between the two cases using yet more loops:
while Z loop dec Z inc pc_{7} dec NZ end loop while NZ loop dec NZ inc pc_{3} end loop
Note how the loop for Z decrements NZ, thereby preventing the other branch from running.
We've now arrived at a valid Loop program, which can readily be translated into a Brainfuck program. I've implemented an RM → Loop → Brainfuck compiler using the above scheme in my Brainfuck toolbox.
One surprising aspect of the above is that the resulting Brainfuck program, while hideously complicated and large, doesn't perform that bad. Maya was kind enough to write a register machine program solving the 9-digit problem (source here), and I compiled it into x86 assembly via the Brainfuck route, to compare it with his native Brainfuck solution. Let's look at program size first: the native one is 4,591 instructions long, and the one compiled from RM comes in at a whopping 480,466 instructions. However, both implementations showed runtime performance in the same order of magnitude.
Unfortunately, I don't have a corpus of algorithms implemented in both RM and Brainfuck lying around, so I can't do any real benchmarks. But compared to my initial expectations, the result of the 9-digit program is promising: I figured this whole RM → Brainfuck compiler scheme would turn out to be a strictly theoretical result, creating Brainfuck programs that are so slow to be completely impractical.
Epilogue: I wanted to write some Agda-checked proofs that the compiler actually generated equivalent programs. As it turned out, this is not so easy. I hope I'll have time to get back to this problem soon.
The Brainfuck programming language stays a somewhat current topic at the office ever since Maya's 9-digit program, so much so, that now we've even started designing our own Brainfuck computer using first principle logic gates only. But more on that later. Today and in my next post, I'm going to write about compiling register machines to Brainfuck.
The register machine is an abstract mathematical machine much like Turing machines or finite automata, and it can be shown to be Turing-complete. On the other hand, it models very closely how real-world computers operate. The program of a register machine (RM for short) is a labelled list of instructions, each operating on one of a finite number of memory cells called registers, holding values of natural numbers. The allowed instructions in the specific formulation that we'll be using here is:
Note that this set of operations is redundant in that both clr and mov can be easily implemented in terms of the others — they are included here only for clarity later on.
For example, to add the contents of register a to b, we can write the following program using a temporary register z:
The Haskell code for parsing register machine programs and an instantiator for a simple little macro language is available on GitHub.
Brainfuck is a joke programming language, one designed to be as much of a pain to use as possible, while also being powerful enough to solve real problems with it (languages like that are commonly called Turing tar-pits). The link above explains the intentionally minimalistic syntax and the semantics of the language in detail. The sketchy version is that there is a linear array of memory cells and a pointer moving on this array, and the operations can move the pointer and change the contents of the currently pointed cell:
You can find my parser, interpreter and x86 compiler for Brainfuck here.
In the next post, we will prove that Brainfuck is Turing-complete by translating register machine programs into Brainfuck. As you can see, there are two differences between RM and Brainfuck: one is that RM uses random access for the registers, whereas Brainfuck can access only one register at a time; the other, more major one is that you can't change the order of execution arbitrarily in Brainfuck. This is why we will have to use some trickery to come up with a translation scheme.
If you're feeling impatient, you can, of course, take a peek in the meantime at my RM → Brainfuck compiler in GitHub.
Another approach I wanted to explore to Encsé's 9-digit problem was creating a type-level solution in Haskell. First I'll describe what that would entail, and then present a solution to one of its sub-problems: testing special cases of divisibility in a more efficient way than the naïve implementation of repeating subtractions.
Basically, a type-level Haskell program is to regular Haskell programs what C++ template metaprograms are to C++ proper. Conrad Parker wrote a great introduction to type-level programming using functional dependencies in issue 6 of The Monad.Reader journal, titled Type-Level Instant Insanity. You encode values as types, and then construct an expression whose type, as inferred by the Haskell compiler, encodes the output of your computation. To actually implement the equivalent of a function, typeclasses are used.
For example, take the following (value-level) Haskell definition of length: (deliberately not using any builtin types, and using unary representation for simplicity):
module ValueLength where data Nat = Z | S Nat data List a = Nil | a ::: (List a) infixr 6 ::: length :: List a -> Nat length Nil = Z length (x ::: xs) = S (length xs)
(note that due to technical limitations, both : and :: are reserved names in Haskell, and so we have to use ::: for our cons)
To lift this definition to the level of types, we first need a way to represent actual numbers and lists as types. This is pretty straightforward using algebraic datatypes. Since we are ultimately only going to be interested in the type of expressions and not their value, no constructors are defined.
{-# LANGUAGE EmptyDataDecls, TypeOperators, FunctionalDependencies, MultiParamTypeClasses, UndecidableInstances #-} module TypeLength where data Z data S n data Nil data x ::: xs infixr 6 ::: – Continued below
Of course, System F doesn't allow for meta-types, only kinds; so we can't prohibit meaningless types like S Nil or Z ::: Z (contrast this with the declaration of the S constructor in the first example). Thus, type-level programming in Haskell has to be untyped.
Now comes the clever bit: lifting the definition of length. Basically, what we are looking for is a way to map types to types. Typeclasses with functional dependencies are one possibility of achieving this (type families are another one). First of all, each n-ary operator is rewritten into an n+1-ary relation between its operands and its result: instead of thinking about length Nil = 0, we say Nil and 0 are in the relation Length. We declare a class for each function, and one instance for each equation:
class Length xs n | xs -> n instance Length Nil Z
The xs -> n part of the class declaration is where the magic comes from. It means that if there are types xs, n_{1} and n_{2} such that there are instances of Length xs n_{1} and Length xs n_{2}, then the types n_{1} and n_{2} must be the same. To demonstrate why this is useful, consider the following function definition:
liftedLength :: Length xs n => xs -> n liftedLength = undefined
What is the type of liftedLength (undefined :: Nil)? The declaration of liftedLength says that the answer is whatever type n such that an instance of Length Nil n exists. But the functional dependency of Length says that there can be at most one such n. Consulting the list of instances above, we find instance Length Nil Z, and thus, n has to be the type Z. We can test this using the interpreter GHCi:
*TypeLength> :t liftedLength (undefined :: Nil) liftedLength (undefined :: Nil) :: Z
So far, so good. But what about the length of non-empty lists?
To translate the other equation of length, we need a way to tackle recursion. Viewing the recursive definition of length through our relational glasses, it is simply an inductive definition of the relation Length:
instance (Length xs n) => Length (x ::: xs) (S n)
Again, we can try this out with GHCi, by asking for the length of [0, 0]:
*TypeLength> :t liftedLength (undefined :: Z ::: Z ::: Nil) liftedLength (undefined :: Z ::: Z ::: Nil) :: S (S Z)
Now that the basics are covered, it should be clear what a type-level solution to the original 9-digit problem should entail. I've decided to directly translate Lőry's solution because it doesn't use any fancy infrastructure like list monads. Take a moment to look at its code because I'm going to explain what I had to change to write its type-level counterpart.
The first version, using unary arithmetics like in the introduction above, was, unsurprisingly, unusably slow (anything beyond the 3-digit sub-problem of the 9-digit problem was hopeless), so I looked around for something better. Péter Diviánszky pointed me to the type-level package, which, promising enough, features decimal arithmetics.
However, the direct transliteration of the value function that converts from a list of digits into its decimal value, and testing for divisibility using type-level's DivMod both proved to be highly inefficient. So instead, I decided to write a faster divisibility checker that exploits the fact that for the 9-digit problem, the divisor is always less than the base of the representation (since we're using base-10, and checking for divisibility by 2, ..., 9).
The basis of the algorithm is how you do division with pen & paper: you take the most significant digit, calculate its reminder, then glue this in front of the next digit, and so on. Here's the Haskell code for a very simple implementation of this idea:
q `divides` [] = True q `divides` (0:ds) = q `divides` ds q `divides` (d:ds) | d ≥ q = q `divides` ((d-q):ds) | otherwise = case ds of [] -> False (d':ds) -> q `divides` ((d * 10 + d'):ds)
The repeated subtraction part may seem scary, but we're only ever glueing digits less than q in front of the next digit, so at most 9 subtractions are needed to eliminate a digit.
To actually lift this algorithm to the world of types requires writing quite a bit of boilerplate code, because the most efficient way to branch on d ≥ q is to write out all the 9^{2} cases. Fortunately, my girlfriend just wrote a report on Template Haskell for a university assignment the other day, so I learned of its existence and was able to use it to generate most of the instances of my Divides class.
Using this special-case division algorithm, I was able to push the program to 7 digits, but solving the 9-digit problem proper is still out of reach, as GHC eats up all my memory before dying (but I'm curious if it will work, unchanged, on my 7-Gig box at work). But the 7-digit problem is still roughly 10,000 times larger than the 3-digit one, so the improvement is evident.
You can check out the full type-level 9-digit program (including the fast divisibility checker) on GitHub, at http://github.com/gergoerdi/typeprog.
After Lőri posted his C++ template meta-program solution to the 9-digit problem, Encsé asked me to help him understand it, being the first real-world metaprogram he's encountered. To help both him and myself, I rewrote, function-by-function, his code into Haskell.
This rewriting process was surprisingly straightforward, and the resulting code, while being in a very strict one-to-one correspondence with the original C++ code, was immensely more readable. So this, naturally, lead to the idea of a simple functional programming language that could be compiled into C++ compile-time metaprograms.
A couple of days later, I had a small accident and had to stay in bed for a week, which gave me plenty of time to throw together a proof-of-concept compiler from a Haskell-like language into C++ TMP.
And so MetaFun was born.
MetaFun is structured into three separate modules, one for parsing and typechecking the input language called Kiff (for Keep It Fun & Functional), one for unparsing C++ metaprograms, and the third one is, of course, the compiler itself. I plan to re-use the Kiff module whenever I come up with an idea for another new throwaway functional compiler.
My coworker Encsé posted a challange use the 9-digit number problem to demonstrate interesting programming techniques. So I sat down and wrote my first practical program using dependant types, in Agda. I've been playing around with Agda previously, but this seemed like a good opportunity to try to write an actual, self-contained, provably correct program with it.
I'm not going to get into details now about the resulting Agda code, because I'm planning to present it in detail in a later post. In its current form, my program is formally proven to produce only good nine-digit numbers; the only property that still needs proving is that it finds all of them.
But the sad surprise came when I tried to actually run it. It was unbearably slow to just get to all possible combinations for the first three digits. I'm talking about 12 minutes just to list them from 123 to 423 (at which point I killed the process). For comparison, the following Haskell program, which is an implementation of the same naïve algorithm, finds the (unique) solution in 4 milliseconds:
import Control.Monad.List fromDigits = foldr shift 0 where shift d s = 10 * s + d p `divides` q = q `mod` p == 0 encse :: [Int] encse = map fromDigits $ encse' 0 [] where encse' 9 ds = return ds encse' n ds = do d <- [1..9] let ds' = d:ds n' = n + 1 guard $ not (d `elem` ds) guard $ n' `divides` fromDigits ds' encse' n' ds'
The first intuition would be that the code generated by Agda is slow because in parallel to the actual computation, it is also evaluating all kinds of proofs. But the proofs exist only in the world of types, so they shouldn't matter once the program is compiled.
The real answer is that calculating in unary representation is slow. Very, very slow. Even slower than you'd imagine.
Of course, Agda uses Peano arithmetic because it's easy to argue about function properties in unary. But it comes with a huge cost in performance: operations have run-times proportional to their input itself, not its order of magnitude. Even converting an unary number to a human-readable string becomes an expensive operation, as the following Agda program demonstrates:
module ShowNat-Native where open import Data.Nat open import Data.Nat.Show open import Data.List open import IO import Data.Colist numbersBelow : ℕ → List ℕ numbersBelow zero = [] numbersBelow (suc n) = numbersBelow n ++ [ n ] main = run (mapM′ putStrLn (Data.Colist.fromList (Data.List.map show (numbersBelow 500))))
If you try out this program yourself (I've been using version 0.2 of the Agda standard library for these tests), you can see that the numbers appear increasingly slowly on the output. On my machine, it takes approximately five and a half minutes just to display the first 500 natural numbers.
In fear of becoming the laughingstock of Encsé for writing an elaborate, theoretically sound program that cannot be executed, I've explored pushing down operations into Haskell.
Of course, the more you push into Haskell, the less you can formally prove about your Agda program. For some functions, there's no real loss; for example, the signature of Data.Nat.Show.show is simply ℕ → String, which (apart from the requirement of finiteness) we can easily emulate in Haskell.
To give you an idea of the possible speedup, modifying the code above to use an implementation of show written in Haskell, the runtime now becomes 24 milliseconds:
module ShowNat-Foreign where open import Data.Nat open import Data.ForeignNat.Show open import Data.List open import IO import Data.Colist numbersBelow : ℕ → List ℕ numbersBelow zero = [] numbersBelow (suc n) = numbersBelow n ++ [ n ] main = run (mapM′ putStrLn∞ (Data.Colist.fromList (Data.List.map show (numbersBelow 500))))
The devil's in the details, of course: in this case, in the ForeignNat library. I've uploaded a browsable snapshot here, and there's a nice HTML version here. If you want to try it out, I suggest getting it from the Git repository using the following command:
git clone git://gergo.erdi.hu/agda/foreign-natIt includes not just Data.ForeignNat.Show used above, but also Data.ForeignNat.Divisibility, which calculates the remainder in Haskell, using Haskell's Integer type to speed things up. Of course, deciding divisibility also means that it needs to produce a proof of divisibility, so there's more plumbing required on the Agda side.
I haven't yet modified my 9-digit number searcher program to use these foreign implementations, but I'm looking forward to seeing if implementing these speedups finally permit running the damn program.
A héten Rinus Plasmeijer tartott vendégelőadást a generikus programozásról funkcionális nyelvekben, bemutatva a Clean azon nyelvi elemeit, amivel explicit szupportálja ezt. Encsé a lelkemre kötötte, hogy blogjam meg, miről is van szó.
Kiinduló "Hello World"-ként tekintsük az alábbi Haskell kódot, benne két, strukturális rekurzióval definiált ekvivalenciavizsgálatot listákra és fákra:
data List a ≔ Nil | Cons a (List a)Kellően messziről nézve a fenti két függvény láthatóan megegyezik. Az általános (mondhatnám, generikus) ötlet éppen az, hogy a strukturális rekurziót mint patternt nyelvi elemmé emeljük.
A mostani tárgyaláshoz vizsgáljuk azt a leegyszerűsített világot, ahol csak algebrai típusok léteznek: két típuskonstrukciós műveletünk az unió és a direktszorzat. Ekkor tetszőleges típus reprezentálható három művelet kifejezéseként: E az elágazás; P a (kételemű) pár; U (unit) pedig az egység, a struktúra levelei. Pl. Haskellben egyszerűen bevezethető három, a fenti műveleteket reprezentáló típus, és azokkal könnyedén leírható például a fenti két adatszerkezetünk:
data U ≔ UAz okosság ott van, hogy így az algebrai típusok általános leírását beemeltük a nyelvbe, és már le tudjuk írni a strukturális rekurziót, mint generikus egyenlőségvizsgálat-műveletet:
instance Eq U whereMármost könnyen mozoghatunk a valódi típus és a generikus reprezentáció között:
fromList :: List a → List' a... és már csak egy kis összedrótozás kell ahhoz, hogy megkapjuk az egyenlőségvizsgálat implementációját a két konkrét típusunkra:
instance (Eq a) ⇒ Eq (List a) whereA fentiek kifejtése után az előadás szólt még arról, hogy milyen segítséget ad a Clean a fentihez hasonló programozáshoz (nyilván a fenti List', Tree' típusokat és a konverziós műveleteket ilyenkor a fordító generálja), ilyenkor tehát a fenti példában csak az U, E és P típusokra kell megírnunk a további típusokra generikusan implementálandó függvényünket.
Sajnos ezekután az utolsó két előadás már csak az iTask nevű workflow-manager webes alkalmazásról szólt, ami bár érdekes dolgokat csinál, de valójában már semmi köze nem volt az eredeti témához (főleg mivel, az eredeti ígérettel ellentétben, végül nem tért ki az előadás arra, hogy az iTask belsejében hogyan és mik vannak generikusan implementálva).