Repository: JKTKops/Write-You-a-Haskell-2
Branch: master
Commit: 936068533cc3
Files: 15
Total size: 99.2 KB
Directory structure:
gitextract_rjrhl3fo/
├── .gitignore
├── 10/
│ └── auxiliary_data_structures_overview.md
├── 404.html
├── 7/
│ └── 7.5_additions_to_poly.md
├── 8/
│ └── design_of_protohaskell.md
├── 9/
│ ├── 9.1_lexing.md
│ └── 9.2_parsing.md
├── Contributing.md
├── Overview.md
├── README.md
├── Sources.md
├── _config.yml
├── _layouts/
│ └── default.html
├── _sass/
│ └── jekyll-theme-midnight.scss
└── table_of_contents.md
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
_site
.sass-cache
.jekyll-cache
.jekyll-metadata
vendor
================================================
FILE: 10/auxiliary_data_structures_overview.md
================================================
---
title: Chapter 10 Overview
previous: Parsing
next: Unique and UniqueMap
---
# Auxiliary Data Structures
Like any large project, a compiler needs to move around lots of data. Some of this data is in the form of abstract syntax, which we constructed last chapter. But there is other data too. There is state, which needs to be kept inside of each pass as the compiler works, and there are warnings and errors that need to be handed around somehow.
The modules in this chapter can be done in almost any order, and some of them are optional. Here's what we will do in this chapter:
1. `Unique`s, a way to "uniquely identify" different entities during compilation. The way we will do this is not the most efficient, but it's fairly easy to work with. Some form of `Unique` is required, though it doesn't have to be the same as what is covered here.
2. `UniqueMap`, a mapping from uniques to some other data type. This gives us very fast lookups in maps of entities. If your `Unique` type is just `Int` (or a `newtype` of `Int`), then use `Data.IntMap` as your backing structure instead of what is covered here. Abstracting this is useful in case a later refactor changes the representation of `Unique`s.
3. `Bag`, a set that can contain duplicates. These are also called `MultiSet`s; call yours whatever you like. This is a basic type that isn't really worth writing a section about - you can use GHC's implementation (be sure to include the copyright notice), depend on `Data.MultiSet` from the `multiset` package, or just use lists.
4. `CDoc`, a wrapper on `Doc` from the `prettyprinter` package. This will let us easily construct well-formatted messages, as well as change those messages based on compilation flags. This module is optional, especially if you don't plan on producing good error messages. Pretty printers for our data structures will use `CDoc`, but you can just use `Doc` instead. The `C` is for `Context`. (GHC calls it `SDoc`, but no one in the IRC channel seems to know why.)
5. `Messages`. This module contains all of the backing implementation of error and warning messages. If you don't plan on producing good error messages or any warnings, you can pretty much skip this module.
6. `Panic`. This module contains functions for crashing the compiler in various ways, with some nice error messages. We define `panic` for invariant violations and `sorry` for unimplemented features. This module is technically optional, but it's very small.
7. `FastString`, a wrapper on `Data.Text.Lazy` that caches each `FastString` as it is created, and returns the existing one instead of creating duplicates. Most importantly, this lets us attach `Unique`s to `FastString`s themselves, since each one contains a unique string. This module is technically optional, but I think it would be more work to skip it.
8. `IOEnv`, a monad which forms the basis of the other monads we will use in the frontend. This module is optional - you could use `ReaderT` over `IO` instead, or `State` and some way of handling failure. Since we're in IO anyway, I'll demonstrate using IO to get statefulness (like `StateT`) and the ability to fail (like `MaybeT`).
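To make item 2 concrete, here is a minimal sketch of a `UniqueMap` backed by `Data.IntMap`, assuming `Unique` is a `newtype` of `Int`. The names `emptyUM`, `insertUM`, and `lookupUM` are hypothetical and are not necessarily the API this chapter ends up with.

```haskell
import qualified Data.IntMap.Strict as IntMap

-- Assumption for this sketch: a Unique is just a wrapped Int.
newtype Unique = Unique Int
  deriving (Eq, Ord, Show)

-- The IntMap stays hidden behind the newtype, so a later change to the
-- representation of Unique only touches the functions in this module.
newtype UniqueMap a = UniqueMap (IntMap.IntMap a)
  deriving (Eq, Show)

emptyUM :: UniqueMap a
emptyUM = UniqueMap IntMap.empty

insertUM :: Unique -> a -> UniqueMap a -> UniqueMap a
insertUM (Unique u) x (UniqueMap m) = UniqueMap (IntMap.insert u x m)

lookupUM :: Unique -> UniqueMap a -> Maybe a
lookupUM (Unique u) (UniqueMap m) = IntMap.lookup u m
```

Callers only ever see `Unique` and `UniqueMap`, which is the abstraction benefit mentioned in the list.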
If you just want to build a smaller compiler that assumes its input is correct code, you can skip `Messages` and ignore it in future chapters. Replace throwing of errors with `panic` and don't produce warnings.
`CDoc` is also helpful for suppressing certain types of debugging output. If you skip it, most debugging will probably happen via dumps and ad-hoc traces.
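As a rough illustration of the `IOEnv` idea from item 8, the sketch below gets state from an `IORef` stored in the environment and failure from an `IOException`. Everything here (`IOEnv`'s exact shape, `getEnv`, `liftIO'`, `failM`, `tryM`, `tick`, `runCounter`) is an assumption made for the example, not the definition used later in the chapter.

```haskell
import Control.Exception (IOException, try)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Assumed shape: a reader over IO.
newtype IOEnv env a = IOEnv { runIOEnv :: env -> IO a }

instance Functor (IOEnv env) where
  fmap f (IOEnv g) = IOEnv (fmap f . g)

instance Applicative (IOEnv env) where
  pure x = IOEnv (\_ -> pure x)
  IOEnv f <*> IOEnv g = IOEnv (\env -> f env <*> g env)

instance Monad (IOEnv env) where
  IOEnv g >>= k = IOEnv (\env -> g env >>= \x -> runIOEnv (k x) env)

getEnv :: IOEnv env env
getEnv = IOEnv pure

liftIO' :: IO a -> IOEnv env a
liftIO' io = IOEnv (\_ -> io)

-- Failure (like MaybeT): throw an IOException ...
failM :: IOEnv env a
failM = liftIO' (ioError (userError "IOEnv failure"))

-- ... and recover it as Nothing.
tryM :: IOEnv env a -> IOEnv env (Maybe a)
tryM (IOEnv g) = IOEnv $ \env -> either (const Nothing) Just <$> tryIO (g env)
  where
    tryIO :: IO b -> IO (Either IOException b)
    tryIO = try

-- State (like StateT): mutate an IORef kept in the environment.
tick :: IOEnv (IORef Int) Int
tick = do
  ref <- getEnv
  liftIO' (modifyIORef' ref (+ 1))
  liftIO' (readIORef ref)

-- Run an action with a fresh counter; report the result and the final count.
runCounter :: IOEnv (IORef Int) a -> IO (Maybe a, Int)
runCounter m = do
  ref <- newIORef 0
  r <- runIOEnv (tryM m) ref
  n <- readIORef ref
  pure (r, n)
```

One consequence of this design is that state changes made before a failure are still visible afterwards, because the `IORef` lives outside the failing computation.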
================================================
FILE: 404.html
================================================
---
permalink: /404.html
layout: default
---
404
Page not found :(
The requested page could not be found.
================================================
FILE: 7/7.5_additions_to_poly.md
================================================
---
title: Additions to Poly
next: Design of ProtoHaskell
---
# Additions to Poly
NOTE: Following better understanding of later parts of the compiler, each phase's monad will instead manage a supply of `Unique`s inside its `IOEnv`'s environment.
OTHER NOTE: The `Outputable` module described here is fine for now, but is revamped in Chapter 10.
This page will be updated to match at a later date.
## The SupplyT Monad Transformer
As explained previously, a monad transformer is a way of adding new capabilities onto an existing monad, where of course monads are like "first-class actions." Frequently throughout compiling, we'll need to grab fresh names (and possibly numbers, etc). Rather than explicitly adding a component to the state layer of our monads, we can abstract this out into its own transformer layer.
Since we intend to use `SupplyT` inside other monads, we'll need to provide a _monad class_.
```haskell
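{-# LANGUAGE FunctionalDependencies #-}
module Control.Monad.Supply.Class
    ( MonadSupply(..)
    ) where

-- | Monads that can hand out values from a supply of type @s@.
class Monad m => MonadSupply s m | m -> s where
    supply      :: m s     -- ^ Take the next value from the supply.
    isExhausted :: m Bool  -- ^ Is the supply empty?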
```
The phrase `| m -> s` is called a "Functional Dependency", and means that `m` _determines_ `s`. This means that any particular monad `m` can only have one `MonadSupply` instance. This is fine for our purposes.
(Beginners with Functional Dependencies should note that this means `StateT Int (State String)` is fundamentally distinct from `State (Int, String)`, because the former doesn't have a `MonadState s` instance where `s` contains `Int`. `s` is forced to `String` by the functional dependency.)
`isExhausted :: MonadSupply s m => m Bool` is provided just in case. I suspect that every time we need this monad, we will be using an infinite supply, and we won't insert exhausted-checks when our supply is infinite.
Since we also plan on using other monads with `SupplyT`, we need to make sure that if `m` is a `MonadSupply`, then common transformers on top of `m` are still a `MonadSupply`. These instances are trivial:
```haskell
instance MonadSupply s m => MonadSupply s (ExceptT e m) where
    supply      = lift supply
    isExhausted = lift isExhausted
```
We provide instances for `ExceptT`, `StateT`, `RWST`, `ReaderT`, and `WriterT` (at least, the lazy variants).
Then we'll need an actual implementation:
```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE UndecidableInstances #-}
module Control.Monad.Supply where

import Control.Monad.Supply.Class
import Control.Monad.State
import Control.Monad.Identity

-- | A supply is a list of values kept in a 'StateT' layer.
newtype SupplyT s m a = SupplyT (StateT [s] m a)
    deriving ( Functor, Applicative, Monad
             , MonadTrans, MonadIO, MonadFix)

runSupplyT :: Monad m => SupplyT s m a -> [s] -> m a
runSupplyT (SupplyT m) initial = evalStateT m initial
```
Note that we _don't_ derive `MonadState [s]` for `SupplyT`. If we did, and we later put a `SupplyT` on top of a `State`, then attempting to call `supply` would break the functional dependency forced by `State`.
Because we don't derive `MonadState [s]`, we need to manually wrap `get`:
```haskell
getSupply :: Monad m => SupplyT s m [s]
getSupply = SupplyT get
```
And finally we provide the `MonadSupply` instance:
```haskell
instance Monad m => MonadSupply s (SupplyT s m) where
    supply      = supplyST
    isExhausted = isExhaustedST

-- | Take the head of the supply. Partial: errors if the supply is empty.
supplyST :: Monad m => SupplyT s m s
supplyST = SupplyT $ state $ \s -> (head s, tail s)

isExhaustedST :: Monad m => SupplyT s m Bool
isExhaustedST = SupplyT $ gets null
```
We'll also provide a couple of default supplies for common types. The most important one is a supply for names:
```haskell
defaultNameSupply :: [String]
defaultNameSupply = [1..] >>= flip replicateM ['a'..'z']
```
This results in the list `["a", "b", ..., "z", "aa", "ab", ...]`.
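To check that claim, the following self-contained sketch repeats the definition under the hypothetical name `nameSupply` and samples the points where the name length changes.

```haskell
import Control.Monad (replicateM)

-- Same definition as defaultNameSupply, renamed so the block stands alone.
nameSupply :: [String]
nameSupply = [1 ..] >>= flip replicateM ['a' .. 'z']

-- The first three names, then the first three names of length two.
-- There are 26 one-letter names, so the two-letter names start at index 26.
sample :: [String]
sample = take 3 nameSupply ++ take 3 (drop 26 nameSupply)
-- sample == ["a","b","c","aa","ab","ac"]
```

Because the list is lazy, only the prefix that is actually demanded gets built.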
The last detail is that the monad class instances go both ways. We've provided instances of `MonadSupply` when a common transformer is on top of a `MonadSupply`, but we could also have a `SupplyT` on top of a common transformer. So we also have to provide instances of the form:
```haskell
instance MonadError e m => MonadError e (SupplyT s m)
```
These instances can be tricky, and I won't copy them all here. The trick is to unwrap the `SupplyT` action by running it, use the instance from the underlying monad, and then wrap it all back up with
```haskell
SupplyT $ StateT $ \s -> ...
```
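As one possible shape for such an instance, here is a sketch of `MonadError` for `SupplyT`. It redefines a cut-down `SupplyT` so the block stands alone; `unSupplyT` and `recovered` are hypothetical helpers, and the choice to restart the handler from the supply as it was before the failing action is an assumption, not necessarily what the real module does.

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE UndecidableInstances #-}

import Control.Monad.Except (MonadError (..))
import Control.Monad.State (StateT (StateT), runStateT)
import Control.Monad.Trans.Class (lift)

-- Cut-down copy of SupplyT, enough for this example.
newtype SupplyT s m a = SupplyT (StateT [s] m a)
  deriving (Functor, Applicative, Monad)

unSupplyT :: SupplyT s m a -> StateT [s] m a
unSupplyT (SupplyT m) = m

instance MonadError e m => MonadError e (SupplyT s m) where
  -- Throwing just lifts the underlying monad's throwError.
  throwError = SupplyT . lift . throwError
  -- Catching unwraps both actions, runs them against the same incoming
  -- supply, and uses the underlying monad's catchError in between.
  catchError (SupplyT m) handler = SupplyT $ StateT $ \s ->
    runStateT m s `catchError` \e -> runStateT (unSupplyT (handler e)) s

-- Example in the Either String monad: the error is caught and replaced by 0.
recovered :: Either String (Int, [Int])
recovered = runStateT (unSupplyT action) [1, 2, 3]
  where
    action = throwError "boom" `catchError` \_ -> pure 0
-- recovered == Right (0, [1,2,3])
```

`UndecidableInstances` is needed because the error type `e` is only determined through the functional dependency on the inner monad `m`.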
## Minor Corrections to the Parser
This section can be skipped, since we'll be upgrading the parser pretty heavily soon.
The main issues with the existing parser are
- parseModule fails to parse let-expressions
- let rec expressions aren't recursive
- let bindings don't allow function sugar
The first problem can be attacked by splitting `exec` in `Interpreter.Main` into two functions, one for loading modules and one for executing toplevel expressions. This would be nice so that loading a module doesn't occasionally evaluate expressions and run them, especially since that doesn't fit the language grammar. We'll take this step later, while upgrading the interpreter. For now, we can provide a quick patch by changing
```haskell
decl = try letrecdecl <|> letdecl <|> val
```
to
```haskell
decl = try val <|> try letrecdecl <|> letdecl
```
To fix the second and third issues, we make a simple modification to `letin` (and also to `letrecin`, but as can be seen here I have combined them):
```haskell
letin :: Bool -> Parser Expr
letin isrec = do
    reserved "let"
    when isrec $ reserved "rec"
    x <- identifier
    args <- many identifier
    reservedOp "="
    e1 <- expr
    reserved "in"
    e2 <- expr
    let rhs = foldr Lam e1 args
    if isrec
        -- Fix expects a function of the binder itself, so abstract over x.
        then return $ Let x (Fix (Lam x rhs)) e2
        else return $ Let x rhs e2
```
## Correction to the Type Inferencer
As noted in some issues on the original Write You a Haskell GitHub page, let-polymorphism is incorrectly implemented. The problem with the original implementation is exemplified by
```haskell
Poly> \x -> let y = x + 1 in y
<<closure>> :: forall a. Int -> a
```
This type is clearly wrong; it should be `Int -> Int`. What happened is that we assigned the types `x :: a` and `x + 1 :: b` and emitted the constraints `a ~ Int` and `b ~ Int`. Then we generalized to `y :: forall b. b`, and in the body of the let we instantiated this scheme to `y :: c`.
When the constraints were finally solved, `b ~ Int` had no effect on the result, because no remaining expression had the type `b`.
To fix this, we need to solve the constraints on the rhs _before_ generalizing. This blurs the separation between constraint generation and solving slightly, but we can reuse the existing solver here and otherwise keep the two phases separate.
Change
```haskell
infer expr = case expr of
    ...
    Let x e1 e2 -> do
        env <- ask
        t1 <- infer e1
        let sc = generalize env t1
        t2 <- inEnv (x, sc) $ infer e2
        return t2
```
to
```haskell
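infer expr = case expr of
    ...
    Let x e1 e2 -> do
        env <- ask
        -- Collect the constraints produced while inferring the rhs ...
        (t0, cs) <- listen $ infer e1
        -- ... and solve them before generalizing.
        subst <- liftEither $ runSolve cs
        let t1 = apply subst t0
            sc = generalize env t1
        t2 <- inEnv (x, sc) $ infer e2
        return t2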
```
We use `listen` from `MonadWriter` to get the constraints generated during inference of `e1`, solve those constraints, and apply the solution to `t0`. Then we generalize as before.
## Final Notes
The `Outputable` class will be used to pretty print compiler output. The compiler will always produce "human-readable" code. The `Outputable` module exports the class and the entire `pretty` library, as well as a few extra helper functions that we'll define as needed.
To avoid orphan instances, I try to put `Outputable` instances for a type at the bottom of the file the type is defined in. In many cases, we'll want to use one type internally to represent, say, errors, and then in a different module we will translate it into another type with a more opaque representation of the information we care about (source code location of the problem, list of contributing factors, etc.) and give _that_ type the `Outputable` instance. This structure is easier to organize and results in well-localized error message generation.
================================================
FILE: 8/design_of_protohaskell.md
================================================
---
title: Design of ProtoHaskell
next: Lexing
previous: Additions to Poly
---
# Design of ProtoHaskell
We're mostly going to follow Diehl's original planning here, but I'm going to reiterate anyway and clean up some details which I may change. This page is a re-write of the original Chapter 8 page from Diehl.
The most relevant detail is that I will refer to the language we are implementing as `ProtoHaskell` through the entire process, including interpreting and generating code.
## Haskell: A Rich Language
At its core, Haskell is a surprisingly simple and elegant language. But the implementation of a full-powered compiler like GHC is not simple. Even though Haskell can be reduced to a small, expressive subset (called the _kernel_), Haskell itself has had much thought go into making the frontend language so expressive and powerful. Many of the details in doing so are going to require quite a bit of engineering work.
Consider this 'simple' Haskell example, and contrast with what we already have in `Poly`.
```haskell
filter :: (a -> Bool) -> [a] -> [a]
filter pred [] = []
filter pred (x:xs)
    | pred x    = x : filter pred xs
    | otherwise = filter pred xs
```
Consider all the things going on in this simple example.
- Lazy evaluation
- Higher order functions
- Parametric polymorphism
- Pattern matching (and defining a function with it)
- Guards
- User-provided type annotation
- List syntactic sugar
Of course Haskell also has custom datatypes, but as we'll see, `Bool` and `[]` have to be somewhat baked in to the compiler.
To handle all these things, we need a much more sophisticated design, and we'll need to track a lot more information about our program during compilation.
## Scope
This project is intended to be a toy language, so I don't plan on implementing all of Haskell 2010 (but who knows, I might!). However, we will definitely implement a sizable portion of Haskell 2010, enough to write real programs and implement all or most of the standard Prelude.
Things we will implement:
- Indentation sensitive grammar
- Pattern matching
- Algebraic data types
- `where` clauses
- Recursive functions/types
- Operator sections
- Implicit let-rec
- Syntax sugar for lists, tuples, and arrows
- Records
- User-defined operators
- Do-notation sugar
- Type annotations
- Monadic IO
- Typeclasses (single-param)
- Arithmetic primops
- Type synonyms
- List comprehensions
Things we might implement:
- Overloaded literals
- Optimization passes (and interface files)
- Newtypes
- Module namespaces
- Polymorphic recursion
- GADTSyntax (but not GADTs)
- Some GHC-specific extensions, like `TupleSections`
- Foreign Function Interface
Things we won't implement
- GADTs
- Defaulting rules
- Exception handling (`error` and `undefined` are guaranteed to crash immediately)
If possible, I'd like ProtoHaskell to eventually conform to the Haskell 98 standard. Either way, it will definitely belong to the "Haskell language family."
## Intermediate Forms
The basic structure of the compiler will follow that of GHC fairly closely.
```
Parse -> Rename -> Typecheck -> Desugar
      -> Ph2Core -> Simplify -> Core2STG -> STG2Java
```
It's important that we do typechecking _before_ desugaring, so that in the event of an error message we can easily show the user the code that _they wrote_. We could alternatively do this by passing along lots of annotations about how the code was desugared. However, some desugarings considerably complicate or add depth to the AST, so keeping track of these annotations properly would become very complicated. On the other hand, typechecker cases even for `PhExpr` are surprisingly simple (or at least, those for `HsExpr` in GHC are), so writing more cases here will cause fewer complications.
The `Simplify` phase contains most of the optimization work. We won't implement too many optimizations, but we'll develop the architecture for "plugging in" rewrites of a Core AST to the simplifier loop. Then the interested reader can write as many optimization passes as they would like.
For the interpreter, we'll intercept the output of the compiler after `Core` and interpret the `Core`. Depending on how this works, we may instead take the `Core` and produce `bytecode` - not Java bytecode, but our own bytecode form which we can interpret.
## Compiler Monad
The main driver of the compiler will be an `ExceptT` + `StateT` + `IO` stack. All of the compiler passes will hang off this monad.
```haskell
newtype CompilerM a = CompilerM
    { unCompilerM :: ExceptT Msg
                       (StateT CompilerState IO) a }

data CompilerState = CompilerState
    { fname     :: Maybe FilePath       -- Name of file being compiled
    , imports   :: [FilePath]           -- Filenames of imports
    , src       :: Maybe L.Text         -- Module source
    , ty_env    :: TyEnv.TyEnv          -- Type environment
    , ki_env    :: KiEnv.KiEnv          -- Kind environment
    , cls_env   :: ClsEnv.ClsEnv        -- Typeclass environment
    , c_ast     :: Maybe Core.CoreSyn   -- AST (after core)
    , flags     :: Flags.Flags          -- Compiler flags
    , d_env     :: DataEnv.DataEnv      -- Entity dictionary
    , cls_hrchy :: ClsEnv.ClsHeirarchy  -- Typeclass hierarchy
    }
```
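To show how such a stack behaves, here is a cut-down sketch with only two state fields and a placeholder `Msg` type. The deriving list and the helpers `runCompilerM`, `addImport`, and `failWith` are assumptions for illustration; the real driver may differ.

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Control.Monad.Except (ExceptT, MonadError, runExceptT, throwError)
import Control.Monad.IO.Class (MonadIO)
import Control.Monad.State (MonadState, StateT, modify, runStateT)

-- Placeholder message type for this sketch.
newtype Msg = Msg String
  deriving (Eq, Show)

-- Only two of the fields from the full CompilerState.
data CompilerState = CompilerState
  { fname   :: Maybe FilePath
  , imports :: [FilePath]
  } deriving (Eq, Show)

newtype CompilerM a = CompilerM
  { unCompilerM :: ExceptT Msg (StateT CompilerState IO) a }
  deriving ( Functor, Applicative, Monad, MonadIO
           , MonadError Msg, MonadState CompilerState )

-- Peel the layers off from the outside in: ExceptT first, then StateT.
runCompilerM :: CompilerM a -> CompilerState -> IO (Either Msg a, CompilerState)
runCompilerM m = runStateT (runExceptT (unCompilerM m))

addImport :: FilePath -> CompilerM ()
addImport path = modify (\st -> st { imports = path : imports st })

failWith :: String -> CompilerM a
failWith = throwError . Msg
```

Because `StateT` sits underneath `ExceptT`, the final `CompilerState` is returned even when a pass fails, so state written before the error is still available for diagnostics.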
Of course all of this is subject to change, because I'm learning as I go. In particular, the state doesn't contain the `PhSyn` AST. This is because we'll follow GHC's example and parameterize our AST by the type of identifiers in it. We'll manually thread the different AST type through the pipeline. The `CoreSyn` AST on the other hand will only ever have one type of identifier (the `Expr` type will be parameterized by the types of the binders, but it will always be parameterized by `CoreBndr` between compiler phases).
I won't necessarily follow the original chapter plan, but for the next several chapters we will be incrementally building a series of transformations.
```haskell
parsePhModl :: FilePath -> L.Text -> CompilerM (PhSyn.PhSyn RdrName)
rename :: PhSyn.PhSyn RdrName -> CompilerM (PhSyn.PhSyn Name)
typecheckPh :: PhSyn.PhSyn Name -> CompilerM (PhSyn.PhSyn Id)
desugar :: PhSyn.PhSyn Id -> CompilerM (PhSyn.PhSyn Id)
ph2Core :: PhSyn.PhSyn Id -> CompilerM CoreSyn.CoreSyn
simplify :: CompilerM () -- Now the AST is stored in the compiler state
tidyCore :: CompilerM ()
prepCore :: CompilerM ()
core2STG :: CompilerM StgSyn.StgSyn
stg2Java :: StgSyn.StgSyn -> CompilerM Java.Syn
```
At the end, we simply pretty-print the Java AST that we've built, along with our runtime system.
For the interpreter, I'm _not worried about performance_. I plan to simply recompile the source of everything in the interpreter environment at each command. When interpreting, we'll intercept the code _before_ the simplifier and execute that. But we'll provide commands to go further into the pipeline and show the results.
After we have all these transformations ready, the compiler itself becomes a straightforward chain of all these transformations.
```haskell
compileModule :: Flags.Flags -> (IFaceSyn, Java.Syn)
compileModule = runCompilerM $
    parsePhModl
    >=> rename
    >=> typecheckPh
    >=> desugar
    >=> ph2Core
    >> do simplify
          tidyCore
          iFace <- core2iFace
          prepCore
          stg <- core2STG
          javaSyn <- stg2Java stg
          return (iFace, javaSyn)
```
Expect this pipeline to change somewhat depending on how `IO` ends up being woven through it. For example, some amount of the compiler state might be stored in `IORef`s instead.
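The chaining itself can be tried out in miniature. In the sketch below, `Either String` stands in for `CompilerM`, and the three passes are toy stand-ins (hypothetical names, unrelated to the real passes) whose only purpose is to show how `>=>` threads each pass's output into the next and stops at the first failure.

```haskell
import Control.Monad ((>=>))

-- A pass either fails with a message or produces the next intermediate form.
type Pass a b = a -> Either String b

-- Toy "parser": split the source into tokens.
parsePass :: Pass String [String]
parsePass source
  | null (words source) = Left "parse error: empty module"
  | otherwise           = Right (words source)

-- Toy "renamer": pair every token with a number.
renamePass :: Pass [String] [(String, Int)]
renamePass toks = Right (zip toks [0 ..])

-- Toy "checker": reject modules with more than three names.
checkPass :: Pass [(String, Int)] Int
checkPass names
  | length names > 3 = Left "check error: too many names"
  | otherwise        = Right (length names)

-- Kleisli composition chains the passes left to right.
pipeline :: Pass String Int
pipeline = parsePass >=> renamePass >=> checkPass
```

For instance, `pipeline "f x"` yields `Right 2`, while `pipeline ""` stops in the first pass with `Left "parse error: empty module"`.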
## Engineering Overview
Lots of implementation details have already been discussed, but here I'll flesh out the rest.
Copied straight from the original WYAH:
## REPL
It is important to have an interactive shell to be able to interactively explore the compilation steps and intermediate forms for arbitrary expressions. GHCi does this very well, and nearly every intermediate form is inspectable. We will endeavor to recreate this experience with our toy language.
If the ProtoHaskell compiler is invoked either in GHCi or as a standalone executable, you will see a similar interactive shell.
Command line conventions will follow GHCi's naming conventions. There will be a strong emphasis on building debugging systems on top of our architecture so that when subtle bugs creep up you will have the tools to diagnose the internal state of the type system and detect flaws in the implementation.
Command | Action
--- | ---
:browse | Browse the type signatures for loaded modules
:load \<file\> | Load a program from file
:reload | Reload the active file
:core | Show the core (pre-simpl) of an expression
:modules | Show all loaded module names
:source | Show the source of an expression
:type | Show the type of an expression
:kind | Show the kind of a type
:set \<flag\> | Set a flag
:unset \<flag\> | Unset a flag
:constraints | Dump the typing constraints for an expression
:quit | Exit interpreter
For example:
```haskell
> :type plus
plus :: forall a. Num a => a -> a -> a
> :core id
id :: forall a. a -> a
id = \(ds1 : a) -> ds1
> :core compose
compose :: forall c d e. (d -> e) -> (c -> d) -> c -> e
compose = \(ds1 : d -> e)
           (ds2 : c -> d)
           (ds3 : c) ->
            (ds1 (ds2 ds3))
```
The flags we use also resemble GHC's and allow dumping out the pretty printed form of each of the intermediate transformation passes.
- `-ddump-parsed`
- `-ddump-rn`
- `-ddump-desugar`
- `-ddump-infer`
- `-ddump-types`
- `-ddump-core`
- `-ddump-simpl`
- `-ddump-stg`
- `-ddump-java`
- `-ddump-to-file`
When compiling normally, these flags will just result in the corresponding intermediate form(s) getting dumped to `protohaskellcompiler.dump` (or to a specified file, with `-ddump-to-file` on). When interpreting, any flag after `-ddump-core` will cause the compiler to go further than it normally would during interactive sessions. It'll go as far down the pipeline as it needs to in order to dump the requested form.
We won't dump to `stdout` since, as mentioned above, the rather not-smart interpreter will recompile everything entered via the current session on every command.
We'll implement the REPL with `repline`.
## Parser
We'll use Alex for lexing and then a normal Parsec parser, with a custom user-state extension for indentation-sensitive parsing.
We _won't_ add operator context sensitivity to the parser. We'll do this the "right way," by parsing all operators with a default precedence and associativity. After parsing, we collect the "true" fixity information and correct the AST. GHC follows this pattern (and this is why errors about orphan fixity declarations don't go with errors about operators not being in scope).
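One way to picture the correction step is the sketch below. It assumes the parser hands over an operator chain flat, as a first operand plus a list of `(operator, operand)` pairs, and rebuilds the tree by precedence climbing once fixities are known. The `Expr` type, the `fixityOf` table, and `resolve` are all hypothetical; GHC's renamer re-associates already-built trees instead, and a real pass would also report an error for chained non-associative operators, which this sketch treats like left-associative ones.

```haskell
data Assoc = L | R | N
  deriving (Eq, Show)

data Expr
  = Num Int
  | BinOp String Expr Expr
  deriving (Eq, Show)

-- Hypothetical fixity table. Unknown operators get infixl 9,
-- which is Haskell's default fixity.
fixityOf :: String -> (Assoc, Int)
fixityOf op = case op of
  "+" -> (L, 6)
  "-" -> (L, 6)
  "*" -> (L, 7)
  "^" -> (R, 8)
  _   -> (L, 9)

-- Rebuild e0 op1 e1 op2 e2 ... into a tree that respects the fixities.
resolve :: (String -> (Assoc, Int)) -> Expr -> [(String, Expr)] -> Expr
resolve fixity e0 rest = fst (go 0 e0 rest)
  where
    go :: Int -> Expr -> [(String, Expr)] -> (Expr, [(String, Expr)])
    go minPrec lhs ((op, rhs) : more)
      | prec >= minPrec =
          -- A right-associative operator lets an equal-precedence operator
          -- nest to its right; otherwise only tighter operators may nest.
          let nextMin = if assoc == R then prec else prec + 1
              (rhs', more') = go nextMin rhs more
          in  go minPrec (BinOp op lhs rhs') more'
      where
        (assoc, prec) = fixity op
    go _ lhs more = (lhs, more)
```

With this table, `resolve fixityOf (Num 1) [("+", Num 2), ("*", Num 3)]` groups the multiplication first, giving `BinOp "+" (Num 1) (BinOp "*" (Num 2) (Num 3))`.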
## Renamer
After parsing, we will traverse the AST and transform each user-named variable from
```haskell
data RdrName = RdrName L.Text SrcPos
```
to
```haskell
data Name = Name L.Text Unique SrcPos
type Unique = (Pass, Int)
```
The `Unique`s will be distinct inside each compiler pass. In GHC, a more advanced (and faster) method of generating `Uniq`s is used that generates mere `Int`s, and the same `Uniq` is never generated twice. We won't concern ourselves with this, as it's much trickier to implement and uses the FFI. It's more performant - but I'm not concerned with performance. An interested reader is, as always, welcome to improve on this.
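A very small sketch of the idea follows, assuming a counter kept in a `State` monad. It uses `String` instead of `L.Text`, drops `SrcPos`, and ignores scoping entirely (every occurrence of the same text gets the same `Unique`), so it only illustrates how uniques are handed out; `RenamerPass`, `renameOne`, and `renameAll` are hypothetical names.

```haskell
import Control.Monad.State (State, evalState, get, put)
import qualified Data.Map.Strict as Map

-- One constructor per compiler pass; only the renamer is needed here.
data Pass = RenamerPass
  deriving (Eq, Ord, Show)

type Unique = (Pass, Int)

newtype RdrName = RdrName String
  deriving (Eq, Show)

data Name = Name String Unique
  deriving (Eq, Show)

-- Names seen so far, plus the next unused counter value.
type RnState = (Map.Map String Unique, Int)

renameOne :: RdrName -> State RnState Name
renameOne (RdrName txt) = do
  (seen, next) <- get
  case Map.lookup txt seen of
    Just u  -> pure (Name txt u)          -- reuse the existing Unique
    Nothing -> do
      let u = (RenamerPass, next)         -- mint a fresh Unique
      put (Map.insert txt u seen, next + 1)
      pure (Name txt u)

renameAll :: [RdrName] -> [Name]
renameAll rdrs = evalState (mapM renameOne rdrs) (Map.empty, 0)
```

Tagging each `Unique` with its `Pass` is what lets every pass restart its counter at zero without colliding with uniques from other passes.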
## Datatypes
User defined data declarations need to be handled and added to the typing context so that their use throughout the program logic can be typechecked.
```haskell
data Bool = False | True
data Maybe a = Nothing | Just a
data T1 f a = T1 (f a)
```
Each constructor definition will also introduce several constructor functions into the Core representation of the module. Record types will also be supported and will expand out into selectors for each of the various fields.
## Type Checking
Type inference and type checking will both happen here. Once the whole program is typechecked, we can transform the identifier type of the AST to its final form.
```haskell
data Id = Id L.Text Unique Type SrcPos
```
## Desugaring
There are lots of things to desugar out of the `PhSyn` form, including list forms, and type constructors like `(,)` and `(->)`. But by far the most important is _pattern matching_. The implementation of pattern matching is remarkably subtle, and allows for nested matches and incomplete patterns in the front end language. But these can generate very complex _splitting trees_ of case expressions that need to be expanded. This is one of the things that I don't know how to do yet, and is one of the things I'm most interested in learning via this project.
We will implement the following syntactic sugar translations:
- `if/then/else` statements into `case` expressions
- pattern guards into `case` expressions (but probably not the language extension `PatternGuards`)
- Do-notation for monads
- list sugar
- tuple sugar
- operator sections
- string literals
- numeric literals
Unlike Diehl's original plans, I do plan to implement overloaded literals eventually, but not in the "first pass." Overloaded literals means that we replace numeric literals with calls to functions from the `Num` and `Fractional` classes.
```haskell
-- Frontend
42 :: Num a => a
3.14 :: Fractional a => a
-- Desugared
fromInteger (42 :: Integer)
fromRational (3.14 :: Rational)
```
## Core
The Core language is the result of translation of the frontend language into an explicitly typed form. Like GHC, we will use a System-F variant, but unlike GHC we will use "vanilla" System-F. GHC has included a couple of extensions to System-F in Core which it uses to implement several fancy features. In fact, GHC's Core is more accurately called _System-FC_. The interested reader is welcome to implement these extensions on top of the work done here.
The Core language is one of the most defining features of GHC Haskell - the compilation into a statically typed intermediate language. It is a well-engineered detail of GHC's design and it has informed much of how Haskell has evolved. Simon Peyton Jones says "if you can translate it into Core, then [an extension] is really just some form of syntactic sugar. But if it would require an extension to Core, then we have to think a lot more carefully."
```haskell
data Expr b  -- b is the type of binders
    = Var Id
    | Lit Literal
    | App (Expr b) (Arg b)
    | Lam b (Expr b)
    | Let (Bind b) (Expr b)
    | Case (Expr b) b Type [Alt b]
    | Type Type

-- A general Expr should never be a Type, but an Arg can be
type Arg b = Expr b

-- Case split alternative.
type Alt b = (AltCon, [b], Expr b)

data AltCon
    = DataAlt DataCon
    | LitAlt Literal
    | DEFAULT
    deriving Eq

data Bind b = NonRec b (Expr b)
            | Rec [(b, (Expr b))]

type CoreExpr = Expr CoreBndr
type CoreBndr = Var
```
These definitions are taken directly from GHC - in `compiler/coreSyn/CoreSyn.hs`. However I've removed the features of GHC Core that GHC uses to implement fancy things like coercions. We'll implement newtypes (something that GHC uses coercions for) on a _second_ pass over the project, and we'll do it by effectively replacing newtype wrapping and unwrapping with `id`. Then we'll let it get optimized away. As always, an interested reader is encouraged to implement coercions if they want.
Since Core is covered in explicit types, implementing an internal type checker will be trivial. We'll provide a flag to run this type checker on generated core after desugaring and optimizations. SPJ calls this a "crucial sanity check."
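To illustrate why explicit types make this check easy, here is a sketch of a checker for a much smaller language than Core: a monomorphic lambda calculus with annotated binders. The `Type`, `Expr`, and `lint` definitions are hypothetical and leave out everything System-F adds (type abstraction and application, `Case`, `Let`).

```haskell
data Type
  = TInt
  | TFun Type Type
  deriving (Eq, Show)

data Expr
  = Var String
  | Lit Int
  | Lam String Type Expr   -- every binder carries its type
  | App Expr Expr
  deriving (Eq, Show)

type Env = [(String, Type)]

-- No inference and no unification: every type is either written down
-- or computed directly from the subexpressions.
lint :: Env -> Expr -> Either String Type
lint env (Var x) =
  maybe (Left ("unbound variable: " ++ x)) Right (lookup x env)
lint _ (Lit _) = Right TInt
lint env (Lam x t body) = TFun t <$> lint ((x, t) : env) body
lint env (App f a) = do
  tf <- lint env f
  ta <- lint env a
  case tf of
    TFun dom cod
      | dom == ta -> Right cod
      | otherwise -> Left ("argument mismatch: expected " ++ show dom
                           ++ ", got " ++ show ta)
    _ -> Left ("not a function: " ++ show tf)
```

For example, `lint [] (App (Lam "x" TInt (Var "x")) (Lit 1))` returns `Right TInt`, while applying a literal as if it were a function is rejected.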
## Type Classes
We'll implement (only single-parameter) typeclasses with the usual _dictionary passing_ translation. The logic isn't terribly complicated, but can be very verbose and requires lots of bookkeeping about the global typeclass hierarchy.
Even a simple typeclass can generate some very elaborate definitions.
```haskell
-- Frontend
class Num a where
    plus :: a -> a -> a

instance Num Int where
    plus = plusInt

plusInt :: Int -> Int -> Int
plusInt (I# a) (I# b) = I# (plusInt# a b)

-- Core
plusInt :: Int -> Int -> Int
plusInt = \(ds1 : Int)
           (ds2 : Int) ->
    case ds1 of {
      I# ds8 ->
        case ds2 of {
          I# ds9 ->
            case (plusInt# ds8 ds9) of {
              __DEFAULT {ds5} -> (I# ds5)
            }
        }
    }

dplus :: forall a. DNum a -> a -> a -> a
dplus = \(tpl : DNum a) ->
    case tpl of {
      DNum a -> a
    }

plus :: forall a. Num a => a -> a -> a
plus = \($dNum_a : DNum a)
        (ds1 : a)
        (ds2 : a) ->
    (dplus $dNum_a ds1 ds2)
```
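The same translation can be written by hand in ordinary Haskell, which may make the Core output above easier to read. In this sketch the class becomes a record of methods, the instance becomes a value of that record, and the constraint becomes an extra argument. `DNum`, `dNumInt`, and `double` are hypothetical names chosen to mirror the Core listing.

```haskell
-- The class `Num a` becomes a dictionary type with one field per method.
newtype DNum a = DNum { dplus :: a -> a -> a }

-- The instance `Num Int` becomes a dictionary value.
dNumInt :: DNum Int
dNumInt = DNum { dplus = (+) }

-- The method takes the dictionary as an explicit first argument.
plus :: DNum a -> a -> a -> a
plus dict x y = dplus dict x y

-- A function that would be `double :: Num a => a -> a` in the frontend
-- simply passes its dictionary along.
double :: DNum a -> a -> a
double dict x = plus dict x x
```

At a call site the compiler picks the dictionary from the type, so `double 21` at type `Int` would become `double dNumInt 21`.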
We'll subject type classes to the normal restrictions (that is, no `FlexibleInstances`, `FlexibleContexts`, or `UndecidableInstances`):
- Paterson condition
- Coverage condition
- Bounded context stack
## Frontend
The Frontend language for ProtoHaskell is a fairly large language, consisting of many different types. Let's walk through the different constructions. The frontend syntax is split across several datatypes.
- `Decls` - Declarations syntax
- `Expr` - Expressions syntax
- `Lit` - Literal syntax
- `Pat` - Pattern syntax
- `Types` - Type syntax
- `Binds` - Binders
At the top is the named _Module_ and all toplevel declarations contained therein. The first revision of the compiler has a very simple module structure, which we will extend later with imports and public interfaces.
```haskell
data PhSyn = Module Name [Decl]  -- ^ module T where { .. }
    deriving (Eq, Show)
```
Declarations or `Decl` objects are any construct that can appear at the toplevel of a module. These are namely function, datatype, typeclass, and operator definitions.
```haskell
data Decl
    = FunDecl BindGroup                    -- ^ f x = x + 1
    | TypeDecl Type                        -- ^ f :: Int -> Int
    | DataDecl Constr [Name] [ConDecl]     -- ^ data T where { ... }
    | ClassDecl [Pred] Name [Name] [Decl]  -- ^ class (P) => T where { ... }
    | InstDecl [Pred] Name Type [Decl]     -- ^ instance (P) => T where { ... }
    | FixityDecl FixitySpec                -- ^ infixl 1 {..}
    deriving (Eq, Show)
```
A binding group is a single line of definition for a function declaration. For instance the following function has two binding groups.
```haskell
factorial :: Int -> Int
-- Group #1
factorial 0 = 1
-- Group #2
factorial n = n * factorial (n - 1)
```
One of the primary roles of the desugarer is to merge these disjoint binding groups into a single splitting tree of case statements under a single binding group.
```haskell
data BindGroup = BindGroup
    { matchName  :: Name
    , matchPats  :: [Match]
    , matchType  :: Maybe Type
    , matchWhere :: [[Decl]]
    } deriving (Eq, Show)
```
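For the `factorial` example, the effect of merging the binding groups can be written out by hand. The sketch below shows roughly what the single-equation form looks like; the names `factorialSugared` and `factorialDesugared` are only there so both versions can live in one module, and the actual desugarer output will be Core rather than source Haskell.

```haskell
-- What the user writes: two equations, one per pattern.
factorialSugared :: Int -> Int
factorialSugared 0 = 1
factorialSugared n = n * factorialSugared (n - 1)

-- After merging: one equation whose body is a case expression
-- with one alternative per original equation, in the same order.
factorialDesugared :: Int -> Int
factorialDesugared = \arg -> case arg of
  0 -> 1
  n -> n * factorialDesugared (n - 1)
```

Both definitions compute the same function; the order of the alternatives matters because the first matching one wins.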
The expression or `Expr` type is the core AST type that we will deal with and transform most frequently. This is effectively a simple untyped lambda calculus with let statements, pattern matching, literals, type annotations, if/then/else statements and do-notation.
```haskell
data Expr
    = App Expr Expr       -- ^ a b
    | Var Name            -- ^ x
    | Lam Name Expr       -- ^ \\x . y
    | Lit Literal         -- ^ 1, 'a'
    | Let Name Expr Expr  -- ^ let x = y in x
    | If Expr Expr Expr   -- ^ if x then tr else fl
    | Case Expr [Match]   -- ^ case x of { p -> e; ... }
    | Ann Expr Type       -- ^ ( x : Int )
    | Do [Stmt]           -- ^ do { ... }
    | Fail                -- ^ pattern match fail
    deriving (Eq, Show)
```
Inside of case statements there is a distinct pattern matching syntax. It is used both at the toplevel, for function declarations, and inside of case statements.
```haskell
data Match = Match
    { matchPat   :: [Pattern]
    , matchBody  :: Expr
    , matchGuard :: [Guard]
    } deriving (Eq, Show)

data Pattern
    = PVar Name              -- ^ x
    | PCon Constr [Pattern]  -- ^ C x y
    | PLit Literal           -- ^ 3
    | PWild                  -- ^ _
    deriving (Eq, Show)
```
The do-notation syntax is written in terms of three constructions: one for monadic binds, one for monadic statements, and one for `let`.
```haskell
data Stmt
  = Generator Pattern Expr -- ^ pat <- exp
  | Let Pattern Expr       -- ^ let pat = exp
  | Body Expr              -- ^ exp
  deriving (Eq, Show)
```
Literals are the atomic wired-in types that the compiler has knowledge of and will desugar into the appropriate builtin datatypes (and later, to appropriate overloaded function calls).
```haskell
data Literal
  = LitInt Int       -- ^ 1
  | LitChar Char     -- ^ 'a'
  | LitString [Char] -- ^ "foo"#
  deriving (Eq, Ord, Show)
```
For data declarations we have two categories of constructor declarations that can appear in the body, regular constructors and record declarations. We will add support for `GADTSyntax` after the first revision.
```haskell
-- Regular Syntax
data Person = Person String Int

-- GADTSyntax
data Person where
  Person :: String -> Int -> Person

-- Record Syntax
data Person = Person { name :: String, age :: Int }

data ConDecl
  = ConDecl Constr Type                -- ^ T :: a -> T a
  | RecDecl Constr [(Name, Type)] Type -- ^ T :: { label :: a } -> T a
  deriving (Eq, Show, Ord)
```
Fixity declarations are simply a binding between the operator symbol and the fixity information.
```haskell
data FixitySpec = FixitySpec
  { fixityFix  :: Fixity
  , fixityName :: String
  } deriving (Eq, Show)

data Assoc = L | R | N
  deriving (Eq,Ord,Show)

data Fixity = Infix Assoc Int
  deriving (Eq,Ord,Show)
```
Diehl's original Chapter 8 provides many examples for the use of these types, which I will also copy here eventually.
## Traversals
We'll quite frequently need to run over parts of the AST and replace certain patterns with other patterns. We want to automate this process and abstract it over any of the monads we may be working in.
```haskell
traverseAstM :: Monad m => (Expr -> m Expr) -> Expr -> m Expr
traverseAstM f e = f e >>= \e' -> case e' of
  App a b   -> App <$> traverseAstM f a <*> traverseAstM f b
  Var a     -> return e'
  Lam a b   -> Lam <$> pure a <*> traverseAstM f b
  Lit n     -> return e'
  Let n a b -> Let <$> pure n <*> traverseAstM f a <*> traverseAstM f b
  If a b c  -> If <$> traverseAstM f a <*> traverseAstM f b <*> traverseAstM f c
  Case a xs -> Case <$> traverseAstM f a <*> traverse (descendCase f) xs
  Ann a t   -> Ann <$> traverseAstM f a <*> pure t
  Do _      -> return e'  -- statements inside do-blocks are not traversed here
  Fail      -> return e'

descendCase :: Monad m => (Expr -> m Expr) -> Match -> m Match
descendCase f match = case match of
  -- Match has three fields (patterns, body, guards); only the body is rewritten
  Match ps a gs -> Match <$> pure ps <*> traverseAstM f a <*> pure gs
```
The case of pure expression rewrites corresponds to the Identity monad.
```haskell
traverseAst :: (Expr -> Expr) -> Expr -> Expr
traverseAst f e = runIdentity $ traverseAstM (return . f) e
```
This framework will let us do nice things like compose AST rewrites using the fish (`>=>`) operator.
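To make the composition idea concrete, here is a small self-contained sketch. It uses a cut-down `Expr` and a traversal called `rewriteM` that has the same shape as `traverseAstM`; the two rewrites (`renameX` and `checkLit`) are made up for this example and are not part of the compiler.

```haskell
import Control.Monad ((>=>))

-- A cut-down expression type, just enough for the demonstration.
data Expr
  = Var String
  | Lit Int
  | App Expr Expr
  | Lam String Expr
  deriving (Eq, Show)

-- Same shape as traverseAstM: apply f at this node, then descend.
rewriteM :: Monad m => (Expr -> m Expr) -> Expr -> m Expr
rewriteM f e = f e >>= \e' -> case e' of
  App a b -> App <$> rewriteM f a <*> rewriteM f b
  Lam x b -> Lam x <$> rewriteM f b
  _       -> return e'

-- Hypothetical rewrite 1: rename every use of the variable "x" to "y".
renameX :: Expr -> Either String Expr
renameX (Var "x") = Right (Var "y")
renameX e         = Right e

-- Hypothetical rewrite 2: reject negative literals with an error.
checkLit :: Expr -> Either String Expr
checkLit (Lit n) | n < 0 = Left ("negative literal: " ++ show n)
checkLit e               = Right e

-- Two full passes over the tree, chained with the fish operator.
pipeline :: Expr -> Either String Expr
pipeline = rewriteM renameX >=> rewriteM checkLit
```

Because both passes run in `Either String`, a failure in either one aborts the whole pipeline, and `>=>` takes care of the plumbing.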
## Closing Remarks
This is just planning, and we're already at the point where I've likely made a mistake or bad design decision that will cause me headaches later. If you know better than me (and if you know much of anything about writing a functional compiler, that's you!) please don't hesitate to file an issue on this repo or shoot me a message on Reddit at u/JKTKops.
================================================
FILE: 9/9.1_lexing.md
================================================
---
title: Lexing
previous: Design of ProtoHaskell
next: Parsing
---
# Lexing
// TODO: context-sensitive lexing for `MagicHash`
// `lex` will need to take compiler flags.
Just like with our inference system in Poly, there are two systems playing together during parsing. The first system takes the input string and transforms it into a list of meaningful _units_ called `Tokens`. Tokens are things like `TokenLParen` and `TokenId String`. This system is called the _lexer_. The other system is the _parser_, which consumes tokens to construct a _parse tree_. In many cases, the parse tree and the abstract syntax tree of a language are not completely identical. This will be the case for us.
Rather than continuing to do lexing with Parsec, we'll use a real lexer generator called `Alex`. Then we'll use the token stream from `Alex` as the input to a `Parsec` parser. Separating these concerns will simplify using Parsec later.
My main concern with Alex is that it doesn't support `Data.Text`. I could wrap it myself but that seems to be a fairly involved effort. So instead we'll let Alex work on an input `String` and then output `Data.Text.Lazy.Text` everywhere. I suspect that this will become a performance bottleneck - feel free to try supporting `Data.Text` with Alex!
Credit for this lexing strategy (and much of the lexer source) belongs to Simon Marlow, who also maintains Alex (and GHC!). The lexer is based on the Haskell '98 lexer presented in the `/examples` folder of the Alex source, plus several fixes.
## Alex
`Alex` is a powerful lexer generator. With Alex, if we can describe the patterns that various tokens fit into, then Alex can split our text up into those tokens. This task is harder than it seems at first glance. Consider this Haskell example.
```haskell
main = let letresult = 5 in print letresult
```
When we encounter the string `"let"`, we should identify it as a reserved word. But when we encounter `"letresult"`, which has the same prefix, we need to identify it as a variable identifier (`VarId`). There are much more difficult cases to handle correctly; Alex handles this complexity for us.
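The underlying idea is longest match: read the whole identifier first, and only then decide whether it is a reserved word. The following hand-written sketch shows that idea on a tiny scale. It is not how Alex works internally, and `Tok`, `reservedWords`, and `tokenize` are names invented for this example.

```haskell
import Data.Char (isAlphaNum, isLower, isSpace)

data Tok
  = Reserved String   -- a reserved word such as "let"
  | VarId String      -- an ordinary lowercase identifier
  | Other Char        -- anything else, one character at a time
  deriving (Eq, Show)

-- A small subset of the reserved words, enough for the demonstration.
reservedWords :: [String]
reservedWords = ["let", "in", "case", "of"]

tokenize :: String -> [Tok]
tokenize s = case dropWhile isSpace s of
  [] -> []
  s'@(c:cs)
    | isLower c ->
        -- Take the longest run of identifier characters, then classify it.
        let (word, rest) = span isIdChar s'
        in classify word : tokenize rest
    | otherwise -> Other c : tokenize cs
  where
    isIdChar ch = isAlphaNum ch || ch == '_' || ch == '\''
    classify w
      | w `elem` reservedWords = Reserved w
      | otherwise              = VarId w
```

With this, `letresult` is read as one word before the reserved-word check happens, so it never gets mistaken for `let` followed by `result`.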
To use Alex, we'll need to provide a _specification file_, `Lexer.x`, in our source tree. Alex spec files contain a mix of Alex-specific syntax and Haskell source code. Haskell source code is contained inside curly braces. We'll start with a module declaration.
```haskell
-- Lexer.x
{
module Compiler.Parser.Lexer (Token(..), Lexeme, lex) where
import Prelude hiding (lex)
import Data.Char (isAlphaNum, chr, ord)
import Data.Text.Lazy (Text)
import qualified Data.Text.Lazy as T
import Utils.Outputable
}
```
Alex will copy this code directly into `Lexer.hs`, and the imports will be available to us in Haskell code blocks later in the source file.
Next we need to tell Alex what mode to run in. Alex has a `basic` mode, which will eventually produce a function `alexScanTokens :: String -> [Token]`. However, it will be more powerful (and easier to extend in the future, if we need) to use the `monad` mode. This makes Alex run inside the `Alex` monad, which is essentially a custom `StateT AlexState (Except String)`. This way if Alex runs into an error, it won't crash the program, which is _very_ important for our interpreter.
```
%wrapper "monad"
```
Now we can start specifying _macros_. Macros can be either _character sets_ or _regular expressions_. Our rules will be represented by regular expressions, and we can expand macros into those. A character set lets us assign a name to a set of characters. Character sets are prefixed with `$`.
```
$whitechar = [\t\n\r\v\f\ ]
$special = [\(\)\,\;\[\]\{\}]
```
The square brackets denote a union, so the character set `[A-Z a-z]` contains every character that is in the set `A-Z` or in the set `a-z`. The character sets that we're interested in can be copied almost directly out of the Haskell Report, so I won't copy all this boilerplate here.
Next are the _regexes_. A regex is prefixed with `@`. Alex's regular expressions aren't exactly regular expressions, but they're really close. Take a look at the user guide for specifics.
```
@reservedid = case|class|data|default|...
@reservedop = ".." | ":" | "::" | "=" |...
```
Notice that we reserve the `default` keyword. Even though we might not implement it, it's a good idea to reserve all keywords that we might want to implement in the future. This way we can throw an error rather than accepting a program that is not future-proof.
At the lexer level, we shouldn't try to separate things by type. Trying to lex an expression with type `Bool` differently than an expression of type `Int` would effectively require implementing a type inference engine in our lexer. The same goes for our parser. But at the lexing level we _can_ separate some things into namespaces, as described in the Haskell Report. Variable names _must_ start with a lowercase letter; module, type, data constructor, and typeclass names _must_ start with an uppercase letter; variable symbols _can't_ start with `:`; and constructor symbols _must_ start with `:`. Our lexer can separate these cases for us, so we'll create distinct patterns for them.
```
@varid = $small $idchar*
@conid = $large $idchar*
@varsym = $symbol $symchar*
@consym = \: $symchar*
```
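Since the namespace is decided entirely by the first character, the rule can be sketched as a plain Haskell function. This is only an illustration of the rule; the `NameSpace` type and `nameSpace` function are invented here, and the symbol list is the ASCII symbol set from the Haskell Report.

```haskell
import Data.Char (isLower, isUpper)

data NameSpace = VarIdNS | ConIdNS | VarSymNS | ConSymNS
  deriving (Eq, Show)

-- ASCII symbol characters from the Haskell Report, excluding ':'.
symbolChars :: String
symbolChars = "!#$%&*+./<=>?@\\^|-~"

-- Decide the namespace of a name by looking only at its first character.
nameSpace :: String -> Maybe NameSpace
nameSpace []    = Nothing
nameSpace (c:_)
  | isLower c || c == '_' = Just VarIdNS   -- variables
  | isUpper c             = Just ConIdNS   -- constructors, types, classes, modules
  | c == ':'              = Just ConSymNS  -- constructor operators
  | c `elem` symbolChars  = Just VarSymNS  -- ordinary operators
  | otherwise             = Nothing
```

The Alex macros above encode the same split declaratively, which is why each namespace gets its own pattern.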
We'll also accept numbers according to the Haskell Report spec, which means we need to support octal, hexadecimal, and exponent notation (for floats).
```
@decimal = $digit+
@octal = $octit+
@hexadecimal = $hexit+
@exponent = [eE] [\-\+]? @decimal
```
We also need to provide a regex for `@string`, which is a bit involved. Our `@string` pattern should only match the legal characters of a string, but _not the string itself_. That is, it shouldn't match the start or end quotes. We'll put the quotes into the actual lexing rule.
Now that we have our patterns, we can provide lexing rules. Alex looks for the start of rules by looking for the pattern `:-`. It's common to include the name of the lexing scheme as well, so we put `Haskell :-`.
A rule in Alex starts with a _start code_, enclosed in angle brackets. This lets you tell Alex to only match certain rules in certain circumstances. We're not concerned with this, so we'll always use `0`.
To start, let's tell Alex to ignore whitespace (we'll parse it later using the source positions) and line comments.
```
<0> $white+ { skip }
<0> "--"\-*[^$symchar].* { skip }
```
The Haskell source in braces here is a _continuation_. More on this in a little bit.
ProtoHaskell allows _nested block comments_, which makes it easier to comment out a region of code that itself contains a block comment. This is rather complicated compared to the rest of the lexing logic, so we'll write it for Alex. We need to tell Alex to trigger on `{-` and the rule continuation will handle skipping the comment.
```
"{-" { nested_comment }
```
The _continuations_ in braces are functions that Alex will call once it has a match. `skip` is provided by Alex and simply ignores the match. We can use our own functions as continuations, as seen above; we need to write `nested_comment`. The type of the continuation depends on which wrapper is being used. In our case, we need `AlexInput -> Int -> Alex Token`. Below our rules, we can specify arbitrary Haskell source (in braces), which Alex will copy into the generated lexer. The `lex` function that we exported from the module belongs here, for example.
There's a catch with our tokenizing, though. When we parse, we're going to need position information to support indentation sensitivity. So rather than just creating tokens, we'll create what we'll (slightly inaccurately) refer to as a `Lexeme`, which will store a position as well as the token.
Let's define some data types that we need and a function to make our `Lexeme`s.
```haskell
{
-- L position token source
data Lexeme = L AlexPosn TokenType String deriving Eq

data TokenType
  = TokInteger
  | TokFloat
  | TokChar
  | TokString
  | TokSpecial
  | TokReservedId
  | TokReservedOp
  | TokVarId
  | TokQualVarId
  | TokConId
  | TokQualConId
  | TokVarSym
  | TokQualVarSym
  | TokConSym
  | TokQualConSym
  | TokEOF
  deriving (Eq, Show)

-- | Our continuation, for example, @mkL TokVarSym@
mkL :: TokenType -> AlexInput -> Int -> Alex Lexeme
mkL tok (p,_,_,str) len = return $ L p tok (take len str)
}
```
We'll want to export `Lexeme` and `TokenType` too, for our parser.
Now that we have our continuation function, we can start specifying rules that match tokens. With the macros we've defined, this is pretty straightforward.
```
<0> $special { mkL TokSpecial }
<0> @reservedid { mkL TokReservedId }
<0> (@conid \.)+ @varid { mkL TokQualVarId }
<0> (@conid \.)+ @conid { mkL TokQualConId }
<0> @varid { mkL TokVarId }
<0> @conid { mkL TokConId }
<0> @reservedop { mkL TokReservedOp }
<0> (@conid \.)+ @varsym { mkL TokQualVarSym }
<0> (@conid \.)+ @consym { mkL TokQualConSym }
<0> @varsym { mkL TokVarSym }
<0> @consym { mkL TokConSym }
<0> @decimal
| 0[oO] @octal
| 0[xX] @hexadecimal { mkL TokInteger }
<0> @decimal \. @decimal @exponent?
| @decimal @exponent { mkL TokFloat }
<0> \' ($graphic # [\'\\] | " " | @escape) \'
{ mkL TokChar }
<0> \" @string* \" { mkL TokString }
```
Handling nested comments is a bit of a mess of special case handling (for example, not breaking on `--}`). Note that in ASCII, `45` is `-`, `123` is `{`, and `125` is `}`.
`alexMonadScan` scans for one lexeme in the `Alex` monad. `alexGetByte` gets the next byte of the given `AlexInput`.
```haskell
nested_comment :: AlexInput -> Int -> Alex Lexeme
nested_comment _ _ = do
    input <- alexGetInput
    go 1 input
  where
    go 0 input = alexSetInput input >> alexMonadScan
    go n input = do
        case alexGetByte input of
            Nothing -> err input
            Just (c, input) -> do
                case chr (fromIntegral c) of
                    '-' -> let temp = input
                           in case alexGetByte input of
                                Nothing -> err input
                                Just (125, input) -> go (n-1) input
                                Just (45, input)  -> go n temp
                                Just (c, input)   -> go n input
                    '\123' -> case alexGetByte input of
                        Nothing -> err input
                        Just (c, input)
                            | c == fromIntegral (ord '-') -> go (n+1) input
                        Just (c, input) -> go n input
                    c -> go n input

    err input = alexSetInput input >> lexError "error in nested comment"

lexError :: String -> Alex a
lexError s = do
    (p,c,_,input) <- alexGetInput
    alexError $ showPosn p ++ ": " ++ s ++
        (if not $ null input
            then " before " ++ show (head input)
            else " at end of file")

showPosn (AlexPn _ line col) = show line ++ ':' : show col
```
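The byte-level code above can be hard to follow, so here is the same depth-counting idea as a pure function over a `String`. This is a simplified sketch, not the lexer's code: `skipNested` is a name invented for this example, it receives the text just after an opening `{-`, and it returns whatever follows the matching `-}`.

```haskell
-- Skip a (possibly nested) block comment whose opening "{-" was already consumed.
-- Returns the remaining input after the matching "-}", or an error message.
skipNested :: String -> Either String String
skipNested = go (1 :: Int)
  where
    go 0 rest             = Right rest                  -- all comments closed
    go _ []               = Left "error in nested comment at end of file"
    go n ('-':'}':rest)   = go (n - 1) rest             -- close one level
    go n ('{':'-':rest)   = go (n + 1) rest             -- open a nested level
    go n (_:rest)         = go n rest                   -- ordinary character
```

Note how `--}` is handled: the first `-` is followed by another `-`, so only one character is dropped, and the remaining `-}` then closes the comment. The Alex version gets the same effect by restarting from `temp`.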
Alex also wants a function to call when it finds the end of input. We'll want a lexeme with a `TokEOF`, and no source. For the position information, anything we put there would be inaccurate, since the source contains no characters that could have a position. We will never inspect the position information for this lexeme, so we can safely leave it `undefined`.
```
alexEOF = return $ L undefined TokEOF ""
```
Finally, we provide `lex`.
```haskell
lex :: String -> Either String [Lexeme]
lex str = runAlex str alexLex

alexLex :: Alex [Lexeme]
alexLex = do lexeme@(L _ tok _) <- alexMonadScan
             if tok == TokEOF
                 then return [lexeme]
                 else (lexeme:) <$> alexLex
```
Now we have a more powerful lexer that we can use for parsing.
## Telling Alex About Filenames
This lexer is definitely powerful, but we're about to run into a problem. If we were committed to only compiling single-file programs, we would never need to know filenames. However, eventually, we're going to support multiple-module compilation. This means we're going to need to know the filename being lexed, in order to attach complete location information to it. Alex doesn't know this on its own, so we'll need to provide it. We can do this with a custom _user state_. To start, we need to change our `wrapper` from `%wrapper "monad"` to `%wrapper "monadUserState"`. The Alex-generated code is now identical to before, with two differences. It now has references to two identifiers, `AlexUserState`, and `alexInitUserState`, which we'll need to define.
The only state we care about is the filename, so `type AlexUserState = String` will suffice. We'll initialize this to `""`; `alexInitUserState = ""`.
Recall that the `Alex` monad is effectively a `StateT AlexState (Except String)` monad. To properly initialize our filename, we'll make `lex` take an extra filename parameter and then pass it in so we can read it later.
```haskell
alexInitFilename :: String -> Alex ()
alexInitFilename fname = Alex $ \s -> Right (s { alex_ust = fname }, ())

-- Utility
alexGetFilename :: Alex String
alexGetFilename = Alex $ \s -> Right (s, alex_ust s)

lex :: String -> String -> Either String [Lexeme]
-- 'init' drops the trailing TokEOF lexeme that alexLex always produces
lex fname input = runAlex input $ alexInitFilename fname >> init <$> alexLex
```
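To see why writing `Alex $ \s -> Right (...)` works, it can help to look at a toy monad of the same shape: a function from a state to either an error or a new state plus a result. The sketch below is a stand-in written for this explanation, not Alex's generated code; `Lex`, `setFilename`, `getFilename`, `failLex`, and `evalLex` are all invented names, and the state is just the filename.

```haskell
-- Toy model of a "state plus errors" monad, shaped like the Alex monad.
newtype Lex s a = Lex { runLex :: s -> Either String (s, a) }

instance Functor (Lex s) where
  fmap f (Lex g) = Lex $ \s -> case g s of
    Left err      -> Left err
    Right (s', a) -> Right (s', f a)

instance Applicative (Lex s) where
  pure a = Lex $ \s -> Right (s, a)
  Lex f <*> Lex g = Lex $ \s -> case f s of
    Left err      -> Left err
    Right (s', h) -> case g s' of
      Left err       -> Left err
      Right (s'', a) -> Right (s'', h a)

instance Monad (Lex s) where
  Lex g >>= k = Lex $ \s -> case g s of
    Left err      -> Left err            -- errors short-circuit
    Right (s', a) -> runLex (k a) s'     -- state is threaded through

-- Analogue of alexInitFilename: overwrite the user state.
setFilename :: String -> Lex String ()
setFilename fname = Lex $ \_ -> Right (fname, ())

-- Analogue of alexGetFilename: read the user state without changing it.
getFilename :: Lex String String
getFilename = Lex $ \s -> Right (s, s)

-- Analogue of alexError.
failLex :: String -> Lex s a
failLex msg = Lex $ \_ -> Left msg

-- Run an action from an initial state and keep only the result.
evalLex :: Lex s a -> s -> Either String a
evalLex m s = snd <$> runLex m s
```

For instance, `evalLex (setFilename "Main.hs" >> getFilename) ""` yields `Right "Main.hs"`, while any `failLex` in the chain turns the whole result into a `Left`.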
## A Better Token Type
While we're certainly better off with our `Lexeme`s than we would be with pure tokens, they don't quite have all the information we want. Perhaps most importantly, we only have the _start_ location of each token, but when attaching these locations to the AST, we want to know _end_ locations too.
We can also make an optimization. Once `mkL` knows which _type_ of token we're making, we can sometimes dispatch on the source to get a more specific token. That is, we can change `L pos TokSpecial "("` to the much easier to handle `L pos TokLParen`. To beef up our lexer like this, we'll need some new types and a stronger `mkL` function.
To start, our current `TokenType` constructor names should really contain the word `Type`, to disambiguate from the actual `Token` type that we'll export to go with `Lexeme`. So let's change the `Tok` prefix to `TokType`. Then we'll need our complete type of tokens:
```haskell
data Token
     -- ^ Special Characters
     = TokLParen
     | TokRParen
     | TokComma
     | TokSemicolon
     | TokLBracket
     | TokRBracket
     | TokLBrace
     | TokRBrace
     | TokBackquote

     -- ^ Literals
     | TokLitInteger Text
     | TokLitFloat Text
     | TokLitChar Text
     | TokLitString Text

     -- ^ Reserved Words
     | TokCase   | TokClass   | TokData    | TokDefault  | TokDeriving
     | TokDo     | TokElse    | TokIf      | TokImport   | TokIn
     | TokInfix  | TokInfixL  | TokInfixR  | TokInstance | TokLet
     | TokModule | TokNewtype | TokOf      | TokThen     | TokType
     | TokWhere

     -- ^ Reserved Operators
     | TokTwoDots -- ".."
     | TokColon | TokDoubleColon | TokEqual  | TokLambda
     | TokBar   | TokLArrow      | TokRArrow | TokAt
     | TokTilde | TokPredArrow

     -- ^ Other
     | TokVarId Text
     | TokQualVarId Text Text
     | TokConId Text
     | TokQualConId Text Text
     | TokVarSym Text
     | TokQualVarSym Text Text
     | TokConSym Text
     | TokQualConSym Text Text
     | TokEOF
     deriving Eq

-- Show source
instance Show Token where
    ...
```
Tokens with `Text` fields store the _literal_ source code which denotes the token. This means the `Text` in a `TokLitString` still has the escape characters and gaps. We'll cheat a bit and use Parsec's `TokenParser` to parse out our literals from the `Text` fields later, since Parsec's parsers are already designed to parse numbers, chars, and Strings according to Haskell rules.
We also want to store better position information. To this end, we'll add a module, `Compiler/BasicTypes/SrcLoc.hs`. Here we'll define `SrcLoc` and `SrcSpan`. Each of these is either `Real` or `Unhelpful`. A `Real` location is attached to anything that appears literally in the source. We'll attach `Unhelpful` locations to code generated by the compiler.
```haskell
data RealSrcLoc = SrcLoc !Text !Int !Int

data SrcLoc = RealSrcLoc !RealSrcLoc
            | UnhelpfulLoc !Text

data RealSrcSpan = SrcSpan { srcSpanFile :: !Text
                           , startLine, startCol, endLine, endCol :: !Int
                           }

data SrcSpan = RealSrcSpan !RealSrcSpan
             | UnhelpfulSpan !Text
```
We want to separate `Real` and `Unhelpful` locations at the type level, because most of our functions for working with locations will be unsafe with `Unhelpful` locations. We also don't need laziness with these types, so we make them entirely strict to avoid the overhead.
While we're at it, we can define a type `data Located e = Located SrcSpan e` for attaching locations to things. Our `Lexeme` type becomes `type Lexeme = Located Token`.
Then we let `mkL` dispatch on the `TokenType` of its first argument.
```haskell
mkL :: TokenType -> AlexInput -> Int -> Alex Lexeme
mkL toktype (alexStartPos,_,_,str) len = do
    fname <- alexGetFilename
    alexEndPos <- alexGetPos
    let AlexPn _ startLine startCol = alexStartPos
        AlexPn _ endLine endCol = alexEndPos
        startPos = mkSrcLoc fname startLine startCol
        endPos = mkSrcLoc fname endLine endCol
        srcSpan = mkSrcSpan startPos endPos
        src = take len str
        cont = case toktype of
            TokTypeInteger -> mkTok1 TokLitInteger
            TokTypeFloat   -> mkTok1 TokLitFloat
            TokTypeChar    -> mkTok1 TokLitChar
            ...
    return $ cont srcSpan src

-- The token constructors take Text, so mkTok1 packs the String source itself
mkTok1 :: (Text -> Token) -> SrcSpan -> String -> Located Token
```
Actually making the cases is tedious, so it is left as an exercise (see the source). Take care in `TokQualVarSym` - `F..` lexes as `TokQualVarSym "F" "."`, and not as `[TokConId "F", TokTwoDots]`. Similarly, `F.<.>` lexes as `TokQualVarSym "F" "<.>"`.
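One way to get the qualified cases right is to peel module components off the front of the source, stopping as soon as what follows is no longer `ConId.` with something after the dot. The sketch below shows that idea; `splitQualified` is a hypothetical helper written for this explanation (it is not taken from the accompanying source) and it assumes its input has already been matched by one of the qualified-name rules.

```haskell
import Data.Char (isAlphaNum, isUpper)
import Data.List (intercalate)

-- Split the source of a qualified name into (module qualifier, entity name).
splitQualified :: String -> (String, String)
splitQualified = go []
  where
    -- Peel off "ConId." only while something is left after the dot.
    go quals s@(c:_)
      | isUpper c
      , (seg, '.':rest@(_:_)) <- span isIdChar s
      = go (seg : quals) rest
    -- Otherwise everything that remains is the name itself.
    go quals s = (intercalate "." (reverse quals), s)

    isIdChar ch = isAlphaNum ch || ch == '_' || ch == '\''
```

On `F..` this yields `("F", ".")` and on `F.<.>` it yields `("F", "<.>")`, matching the behaviour described above, while `Data.Map.lookup` splits into `("Data.Map", "lookup")`.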
## Compiling with Alex
`Cabal` (and by extension, `Stack`) is aware of Alex. By adding `alex` to the `build-tools` field of your config file (or `build-tool-depends: alex:alex` in newer Cabal versions), the build system will automatically invoke Alex on your specification file. If you prefer to compile manually, then simply run `$ alex Lexer.x` and Alex will put a `Lexer.hs` file in the same folder.
## Closing Remarks
I would like to start including a `test` tree in the full sources for these chapters, which will make them slower to produce. I'll try and put out text for a chapter and its `src` tree first, and correct errors if testing finds them.
================================================
FILE: 9/9.2_parsing.md
================================================
---
title: Parsing
previous: Lexing
next: Chapter 10 Overview
---
# Parsing
Unlike the lexer, our parser produces a real AST. That means that it depends on the _infrastructure_ for representing the AST and for representing names in the source. The AST is split across declarations, which can be function declarations (which in turn contain expressions), type signatures, fixity declarations, data type declarations (which can contain type signatures), etc. This turns into a _lot_ of backing infrastructure before we can really start parsing.
## The Infrastructure
Let's dive into the infrastructure that we need to represent all of these types. First off, we have the topmost node of the AST.
```haskell
data PhModule a = Module
    { modName  :: Maybe (Located Text)
    , modDecls :: [LPhDecl a]
    }
```
A module contains a list of declarations, which are either import statements, type/fixity signatures, data declarations, class declarations, instance declarations, function declarations, or pattern bindings. We'll leave off import statements for now.
The type of declarations is given (for now) by
```haskell
type LPhDecl a = Located (PhDecl a)
data PhDecl id
    = Binding (PhBind id)
    | Signature (Sig id)
    | DataDecl id [id] [ConDecl id]
    | ClassDecl [Pred id]          -- Superclasses
                id                 -- Class name
                id                 -- type variable (The a in 'class Eq a where')
                (LPhLocalBinds id) -- Class signatures and default implementations
    | InstDecl [Pred id]           -- Context
               id                  -- Class name
               (PhType id)         -- The type becoming an instance
               (LPhLocalBinds id)  -- class function implementations
```
Constructor declarations are simply a `conid` followed by a list of types.
```haskell
data ConDecl id
    = ConDecl id [PhType id]
```
We'll extend the language with record syntax later.
A `MatchGroup` is a list of match alternatives along with a context of where the match appears. After parsing, there may be several `MatchGroup`s with one alternative each for any given function name. One of the Renamer's jobs is to unite all the `MatchGroup`s for a single function into one.
```haskell
data MatchGroup id
    = MG { mgAlts :: [LMatch id]
         , mgCtxt :: MatchContext
         }

type LMatch id = Located (Match id)
data Match id
    = Match { matchPats  :: [LPat id]
            , rhs        :: LRHS id
            , localBinds :: LPhLocalBinds id
            }

data MatchContext
    = FunCtxt
    | CaseCtxt
    | LamCtxt
    | LetCtxt
```
The local bindings in a `Match` are the `where` clause, if any. Each match can have its own `where` clause, which scopes over the entire right hand side, including guards.
```haskell
type LRHS id = Located (RHS id)
data RHS id
    = Unguarded (LPhExpr id)
    | Guarded [LGuard id]

type LGuard id = Located (Guard id)
data Guard id = Guard (LPhExpr id) (LPhExpr id)
```
If you wanted to implement `PatternGuards`, the type for `Guard` above would need quite a bit of work.
Patterns have a relatively straightforward syntax. Patterns are either a simple variable, a constructor followed by a list of patterns, an "as-pattern" like `xs@(x:_)`, a literal pattern, or a wildcard. We'll also extend the constructor case with sugar for tuples and lists.
```haskell
type LPat id = Located (Pat id)
data Pat id
    = PVar id
    | PCon id [Pat id]
    | PAs id (Pat id)
    | PLit PhLit
    | PWild
    | PTuple [Pat id]
    | PList [Pat id]
```
The `Sig` type mentioned above can be either a `TypeSig` or a `FixitySig`. Both of these signatures can simultaneously bind to several entities.
```haskell
type LSig id = Located (Sig id)
data Sig id
    = TypeSig [id] (LPhType id)
    | FixitySig Assoc Int [id]

data Assoc = Infix | InfixL | InfixR
```
Types are also relatively straightforward. We can have type variables, qualified types (`Eq a => a -> a -> Bool`), and type application. This is enough to describe the entire type system (and in fact is pretty close to the Core type system we'll use later), but we additionally want to extend it with explicit types for the syntax-sugar types that are built-in.
```haskell
type LPhType id = Located (PhType id)
data PhType id
    = PhVarTy id
    | PhBuiltInTyCon BuiltInTyCon
    | PhQualTy [Pred id] (LPhType id)
    | PhAppTy (LPhType id) (LPhType id)
    | PhFunTy (LPhType id) (LPhType id)
    | PhListTy (LPhType id)
    | PhTupleTy [LPhType id]

data BuiltInTyCon
    = UnitTyCon
    | ListTyCon
    | FunTyCon
    | TupleTyCon Int
```
The `BuiltInTyCon` type is a stand-in until we have an internal representation of these types laid out. Later on, we'll either have the parser produce the internal representation instead of the `BuiltInTyCon` type, or we can have the renamer do a replacement.
The type `Pred id` represents an assertion that a type "is in" a typeclass.
```haskell
data Pred id = IsIn id (PhType id)
```
Finally, that leaves the syntax of expressions themselves. Haskell allows a wide range of expressions, most of which will be desugared into a smaller subset called the _kernel_. The kernel is then easily translated into `Core`.
```haskell
type LPhExpr id = Located (PhExpr id)
data PhExpr id
    = PhVar id
    | PhLit PhLit
    | PhLam (MatchGroup id)
    | PhApp (LPhExpr id) (LPhExpr id)
    | OpApp (LPhExpr id) -- left operand
            (LPhExpr id) -- operator, should be PhVar
            (LPhExpr id) -- right operand
    | NegApp (LPhExpr id) -- syntactic negation
    | PhCase (LPhExpr id) (MatchGroup id)
    | PhIf (LPhExpr id) (LPhExpr id) (LPhExpr id)
    | PhLet (LPhLocalBinds id) (LPhExpr id)
    | PhDo [LStmt id]
    | ExplicitTuple [LPhTupArg id]
    | ExplicitList [LPhExpr id]
    | ArithSeq (ArithSeqInfo id)
    | Typed (LPhType id) (LPhExpr id)

data PhLit
    = LitInt Int
    | LitFloat Double
    | LitChar Char
    | LitString Text
```
We leave some room for future expansion, to avoid being (completely) overwhelmed. Notably, operator sections and list comprehensions are missing, but will be fairly easily included later.
Local bindings can appear in many places. Bindings can be accompanied by `Sig`s, and can bind functions or patterns.
```haskell
type LPhLocalBinds id = Located (PhLocalBinds id)
type LPhBind id = Located (PhBind id)

data PhLocalBinds id = LocalBinds [LPhBind id] [LSig id]

data PhBind id
    = FunBind id (MatchGroup id)
    | PatBind (LPat id) (LRHS id)
```
Note that pattern bindings can appear at the top level - while using this is rare, the following is a valid Haskell module (up to undeclared identifiers).
```haskell
module Example where

Right value = runExcept (return someVal)

x | someBool  = 0
  | otherwise = 1
```
The first pattern binds `value` in the toplevel of the module, while the second binds `x`. If no guard is `True` in a pattern binding, then it is an error when `x` is evaluated. If `x` is never evaluated, then no error should be raised. We can safely translate bindings such as these into a sequence of `if/then/else` expressions during desugaring.
```haskell
x = if someBool then 0 else if otherwise then 1
    else error "Non-exhaustive guards in pattern binding for `x'"
```
Inside `do` blocks, we can have three types of statements.
```haskell
type LStmt id = Located (Stmt id)
data Stmt id
    = SExpr (LPhExpr id)
    | SGenerator (LPat id) (LPhExpr id)
    | SLet (LPhLocalBinds id)
```
This is a toy language, so I'm not sure how involved I want the later work to be. In an ideal world, I'd like to implement the `MonadFail` proposal from the get-go, which would raise a compile-time error if the pattern in a generator statement is not a `PVar` pattern and the `Monad` is not an instance of `MonadFail`. This adds work to the type checker and to the desugarer, as well as adding to the list of entities that need to be wired in.
We've left open the opportunity to support `TupleSections` easily.
```haskell
type LPhTupArg id = LPhExpr id
```
To support `TupleSections`, one could replace this with a sum type, where one case is a present argument and the other is a missing argument.
Finally we have arithmetic sequences. These are straightforward.
```haskell
data ArithSeqInfo id
    = From       (LPhExpr id)
    | FromThen   (LPhExpr id)
                 (LPhExpr id)
    | FromTo     (LPhExpr id)
                 (LPhExpr id)
    | FromThenTo (LPhExpr id)
                 (LPhExpr id)
                 (LPhExpr id)
```
Between these types we have the frontend syntax covered. It remains to figure out how to parse it.
# The Parser
(NB. I may switch to `MegaParsec` when version 8 is released, partly because of the better potential for nice, custom error messages, and partly because of the nicer *default* error messages. Setting up `MegaParsec` to work with custom token streams is much more involved than setting up `Parsec`.)
There are two ways to parse a layout-sensitive grammar. The first way is to translate the token stream from a layout-sensitive form to a layout-*in*sensitive form. For Haskell, this can be done by inserting `{`, `}`, and `;` tokens into the correct places. Unfortunately, by throwing away indentation information in this fashion, we make parse errors drastically worse. GHC takes this approach.
The other way is to make the parser layout-sensitive. This is the approach we will take.
A number of corner cases in the Haskell grammar make the layout-sensitive parsing fairly involved. As a particularly nasty example:
```haskell
let (x, y) = ("te\
\st", 5) in x
```
This is nasty to parse because it is valid, but the comma on the second line *shouldn't be tested for indentation*. In fact any multi-line string causes the rest of that line to ignore indentation rules. This is very difficult to convey to a parser.
Thankfully, we don't have to leave this issue to the parser to resolve. The Haskell Report recommends introducing a token representing the indentation at the first token of every line, provided that that token is preceded only by whitespace. We can add this capability to our lexer. Since our parser will know the column of the token, it's not important to include the indentation amount.
We could weave this into `Alex` itself, but it's easier to just add this information in a separate pass.
```haskell
insertIndentationToks :: [Lexeme] -> [Lexeme]
insertIndentationToks [] = []
insertIndentationToks (l@(Located srcSpan _) : ls) =
    (noLoc TokIndent) : go (l : ls)
  where go [] = []
        go [l] = [l]
        go (l1@(Located s1 _) : l2@(Located s2 _) : ls) =
            -- Test if token l2 is the first on its line, including the end of token l1
            if (unsafeLocLine $ srcSpanEnd s1) < (unsafeLocLine $ srcSpanStart s2)
            -- If it is, insert indent token
            then l1 : noLoc TokIndent : go (l2 : ls)
            else l1 : go (l2 : ls)
```
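To check the behaviour without the rest of the lexer, the same pass can be sketched over a toy token type. Everything here is simplified for illustration: `Tok` is just a triple of start line, end line, and source text, and `Out` stands in for the real `Lexeme` stream with its `TokIndent` token.

```haskell
-- Toy token: (start line, end line, source text).
type Tok = (Int, Int, String)

-- Toy output stream: either an indentation marker or an ordinary token.
data Out = Indent | Plain String
  deriving (Eq, Show)

insertIndents :: [Tok] -> [Out]
insertIndents []   = []
insertIndents toks = Indent : go toks   -- the first token always starts a line
  where
    go ((_, end1, s1) : rest@((start2, _, _) : _))
      -- The next token starts on a later line than this one ends on.
      | end1 < start2 = Plain s1 : Indent : go rest
      | otherwise     = Plain s1 : go rest
    go rest = [Plain s | (_, _, s) <- rest]   -- zero or one token left
```

Comparing against the end line of the previous token is what handles the multi-line string case: a string spanning lines 1 and 2 followed by a comma on line 2 produces no indentation marker before the comma.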
Note the indentation token at the start. This will enforce that the indentation of the first token (if it's not the keyword `module`, in which case we'll just ignore the indentation) dictates the indentation of the rest of the file.
This will also make it easier to parse other aspects of indentation sensitivity; we can simply make our primitive `satisfy` parser check for indentation tokens and, if it sees one, guard the indentation of the next token against the indentation of the current layout context, if any. So let's set up our parser type.
```haskell
type Parser a = Parsec [Lexeme] ParseState a

data ParseState = ParseState
    { compFlags      :: Flags
    , indentOrd      :: Ordering
    , layoutContexts :: [LayoutContext]
    , endOfPrevToken :: SrcLoc
    }

data LayoutContext
    = Explicit
    | Implicit Int

initParseState flags = ParseState flags EQ [] noSrcLoc
```
Parser combinators are *context-free*. This means that a production will act the same no matter where it appears in the grammar. However, the Haskell grammar is *context-sensitive*. That is, some tokens could be interpreted as different parts of the parse tree based on indentation alone, regardless of the rule that accepts them. To handle this, the parser needs to be stateful. By using state to track the context, we can check the context to make correct decisions. Here's the breakdown.
- `compFlags` will be used to make decisions about parsing in the presence of compiler flags. For example, `TupleSections` makes `(a, b,)` legal anywhere that `\c -> (a, b, c)` is legal.
- `indentOrd` tells the indentation guard what the relative ordering between the reference indentation and the next token needs to be.
- `layoutContexts` is the major player. By tracking the reference indentations in a stack, whenever we run into a parse error at the end of a block, we can simply pop a layout context and try again.
- `endOfPrevToken` will be used to implement the `locate` combinator, which takes a parser and wraps the result in `Located`.
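The layout stack can be sketched on its own, without any parser around it. This is a pure model under stated assumptions: `LayoutStack`, `pushContext`, `popContext`, and `currentContext` are hypothetical names for this example. The chapter's real helpers (`pushLayoutContext`, `popLayoutContext`, `currentLayoutContext`) run inside the `Parser` monad and are not shown in this writeup.

```haskell
data LayoutContext = Explicit | Implicit Int
  deriving (Eq, Show)

-- | Only the layout-related part of the parser state is modelled here.
newtype LayoutStack = LayoutStack [LayoutContext]
  deriving (Eq, Show)

-- | Entering a block pushes its context onto the stack.
pushContext :: LayoutContext -> LayoutStack -> LayoutStack
pushContext ctx (LayoutStack ctxs) = LayoutStack (ctx : ctxs)

-- | Leaving a block pops the innermost context (a no-op on an empty stack).
popContext :: LayoutStack -> LayoutStack
popContext (LayoutStack [])         = LayoutStack []
popContext (LayoutStack (_ : ctxs)) = LayoutStack ctxs

-- | The innermost context, if we are inside any block at all.
currentContext :: LayoutStack -> Maybe LayoutContext
currentContext (LayoutStack [])        = Nothing
currentContext (LayoutStack (ctx : _)) = Just ctx

-- | A block whose reference column is 5 containing one at column 13.
nested :: LayoutStack
nested = pushContext (Implicit 13) (pushContext (Implicit 5) (LayoutStack []))
```

Here `currentContext nested` is `Just (Implicit 13)`, and after one `popContext` it becomes `Just (Implicit 5)`. That is the "pop a layout context and try again" step: a token that does not fit the inner block gets re-checked against the enclosing one.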
Then we can implement a primitive `satisfy` parser that handles indentation guards, and build everything else on top of that.
```haskell
satisfy :: (Token -> Bool) -> Parser Lexeme
satisfy p = try $ guardIndentation *> satisfyNoIndentGuard p <* setIndentOrdGT
  where setIndentOrdGT = modify $ \s -> s { indentOrd = GT }

satisfyNoIndentGuard :: (Token -> Bool) -> Parser Lexeme
satisfyNoIndentGuard p = do
    lexeme@(Located pos _) <- Parsec.tokenPrim
                              prettyShowToken
                              posFromTok
                              testTok
    modify $ \s -> s { endOfPrevToken = mkSrcPos $ srcSpanEnd pos }
    return lexeme

guardIndentation :: Parser ()
guardIndentation = do
    check <- optionMaybe $ satisfyNoIndentGuard (== TokIndent)
    ord <- gets indentOrd
    when (isJust check || ord == EQ) $ do
        mr <- currentLayoutContext
        case mr of
            Nothing -> return ()
            Just Explicit -> return ()
            Just (Implicit r) -> do
                c <- sourceColumn <$> getPosition
                -- Parenthesized because ($) and Parsec's (<?>) both have
                -- precedence 0 and cannot be mixed in one expression.
                when (c `compare` r /= ord) $
                    (unexpected "indentation"
                        Parsec.<?> "indentation of " ++ show r ++
                                   " (got " ++ show c ++ ")")
```
(**Idiom**: the operators `(*>) :: Applicative f => f a -> f b -> f b` and `(<*) :: Applicative f => f a -> f b -> f a` can be used to pick out a particular result from a sequence of applicative (or monadic) actions. We also have `between l r x = l *> x <* r`, but I prefer to use `between` when `l` and `r` are symmetric, unlike here.)
The `try` in `satisfy` is necessary, because `guardIndentation` might consume a `TokIndent`. In Parsec, token consumption affects behavior. Normally, consuming a token is 1) irreversible and 2) resets "expected" error messages. We may need to check the indentation of a token multiple times if the token could belong to several different implicit layouts. For example:
```haskell
let x = do y <- foo
           return $ bar y
    a = x
in a
```
We'll end up testing the indentation of `a` against the reference for the `do` block, and when that fails we'll need to test it *again* against the indentation of the `let` block. The `try` combinator turns off both effects of token consumption. If the parser fails and consumes input, then we get both backtracking and good error messages! `try` is dangerous for performance when it can cause nested "backtracking trees", but this simple one-shot case won't cause a big complication even if used inside another `try` wrapper.
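The effect of `try` on backtracking is easy to see on a tiny character-level example. This is a sketch that assumes the `parsec` package is available and uses plain `String` input rather than our `Lexeme` stream. The names `noBacktrack`, `withBacktrack`, and `succeeds` are made up for the example.

```haskell
import Text.Parsec (parse, string, try, (<|>))
import Text.Parsec.String (Parser)

-- | Without try: on "lambda", string "let" consumes the 'l' before failing,
-- so Parsec commits to this branch and never attempts the alternative.
noBacktrack :: Parser String
noBacktrack = string "let" <|> string "lambda"

-- | With try: a failure inside string "let" is treated as if no input had
-- been consumed, so the alternative is attempted from the same position.
withBacktrack :: Parser String
withBacktrack = try (string "let") <|> string "lambda"

-- | True when the parser accepts the input.
succeeds :: Parser String -> String -> Bool
succeeds p input = either (const False) (const True) (parse p "<example>" input)
```

With these definitions, `succeeds noBacktrack "lambda"` is `False` while `succeeds withBacktrack "lambda"` is `True`. Our `satisfy` is in the same situation: `guardIndentation` may consume a `TokIndent` before the guard fails, and the surrounding `try` is what lets an enclosing layout context have another go at the same token.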
The check when `ord == EQ` is also necessary, otherwise we fail to (cleanly) reject programs like `let x = 1 y = 2 in x + y`, which should be written as
```haskell
let x = 1
    y = 2 in x + y
```
We'll provide a way to set `indentOrd` to `EQ`:
```haskell
align :: Parser ()
align = modify $ \s -> s { indentOrd = EQ }
```
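The comparison at the heart of the guard can be modelled as a pure function. The following is only a sketch of the decision, not of the parser: `indentationOk` is a hypothetical name, and it leaves out the condition about when the guard runs at all (after an indentation token, or when `indentOrd` is `EQ`).

```haskell
data LayoutContext = Explicit | Implicit Int

-- | Does a token at column c pass the guard, given the required ordering
-- and the current layout context? Outside any block, or inside an explicit
-- (brace-delimited) block, indentation is never checked.
indentationOk :: Ordering -> Maybe LayoutContext -> Int -> Bool
indentationOk _   Nothing             _ = True
indentationOk _   (Just Explicit)     _ = True
indentationOk ord (Just (Implicit r)) c = c `compare` r == ord

-- | An implicit block whose reference column is 5.
ctx :: Maybe LayoutContext
ctx = Just (Implicit 5)
```

After `align`, the required ordering is `EQ`, so a new statement must start exactly at the reference column: `indentationOk EQ ctx 5` holds but `indentationOk EQ ctx 9` does not. Once a token has been accepted, `satisfy` resets the ordering to `GT`, so continuation lines must be indented further: `indentationOk GT ctx 9` holds but `indentationOk GT ctx 5` does not.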
We can also replace `Parsec`'s `label` combinator with a more useful, layout-sensitive one.
```haskell
label :: Parser a -> String -> Parser a
label p lbl = do
    mctx <- currentLayoutContext
    case mctx of
        Nothing           -> Parsec.label p lbl
        Just Explicit     -> Parsec.label p lbl
        Just (Implicit n) -> labelWithIndentInfo p lbl n
  where
    labelWithIndentInfo p lbl n = do
        ord <- gets indentOrd
        let ordPiece = case ord of
                EQ -> show n
                GT -> "greater than " ++ show n
                LT -> "less than " ++ show n
            indentPiece = "at indentation"
        Parsec.label p $ unwords [lbl, indentPiece, ordPiece]

(<?>) = label
```
Then we can start building basic parsers in terms of `satisfy`.
```haskell
token :: Token -> Parser Lexeme
token t = satisfy (== t) <?> prettyShowToken

oneOf, noneOf :: [Token] -> Parser Lexeme
oneOf ts = satisfy (`elem` ts)
noneOf ts = satisfy (`notElem` ts)

anyToken :: Parser Lexeme
anyToken = satisfy (const True)

reserved :: String -> Parser Lexeme
reserved word = satisfy (== reservedIdToTok word)

reservedOp :: String -> Parser Lexeme
reservedOp op = satisfy (== reservedOpToTok op)

parens, braces, brackets, backticks :: Parser a -> Parser a
parens = between (token TokLParen) (token TokRParen)
...

comma :: Parser ()
comma = void $ token TokComma

commaSep :: Parser a -> Parser [a]
commaSep p = p `sepBy` comma

semicolon :: Parser ()
semicolon = void $ token TokSemicolon

-- | Separates Haskell expressions by arbitrary numbers of semicolons
stmtSep :: Parser a -> Parser [a]
stmtSep p = many semicolon >> p `sepEndBy` many1 semicolon

-- | Separates at least 1 Haskell expression by arbitrary numbers of semicolons
stmtSep1 :: Parser a -> Parser [a]
stmtSep1 p = many semicolon >> p `sepEndBy1` many1 semicolon
```
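To get a feel for what `stmtSep` and `stmtSep1` accept, here is a character-level sketch of the same two combinators. It assumes the `parsec` package and works on `String` input instead of our `Lexeme` stream; `word` and `run` are helper names invented for the example, and `semicolon` here matches the character `';'` rather than `TokSemicolon`.

```haskell
import Text.Parsec (char, eof, letter, many, many1, parse, sepEndBy, sepEndBy1)
import Text.Parsec.String (Parser)

semicolon :: Parser ()
semicolon = () <$ char ';'

-- | Zero or more items, with any number of semicolons before, between,
-- and after them.
stmtSep :: Parser a -> Parser [a]
stmtSep p = many semicolon >> p `sepEndBy` many1 semicolon

-- | Same as stmtSep, but at least one item is required.
stmtSep1 :: Parser a -> Parser [a]
stmtSep1 p = many semicolon >> p `sepEndBy1` many1 semicolon

-- | A stand-in for a statement: one or more letters.
word :: Parser String
word = many1 letter

-- | Run a parser over the whole input, returning Nothing on any error.
run :: Parser a -> String -> Maybe a
run p = either (const Nothing) Just . parse (p <* eof) "<example>"
```

For instance, `run (stmtSep word) ";;a;b;;c;"` gives `Just ["a","b","c"]`. On the empty string, `stmtSep word` succeeds with `Just []` while `stmtSep1 word` fails with `Nothing`. That difference is what matters when these combinators are placed under `many` in the block parsers.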
With these (and some more) tools in tow, we can define our context-sensitive combinators. Then to parse the grammar we'll just defer to these and otherwise forget about layout!
```haskell
locate :: Parser a -> Parser (Located a)
locate p = do
    startPos <- getPosition
    let srcName = sourceName startPos
        startLine = sourceLine startPos
        startCol = sourceColumn startPos
        startLoc = mkSrcLoc (T.pack srcName) startLine startCol
    res <- p
    endPos <- gets endOfPrevToken
    return $ Located (mkSrcSpan startLoc endPos) res

openExplicit :: Parser ()
openExplicit = token TokLBrace >> pushLayoutContext Explicit

closeExplicit :: Parser ()
closeExplicit = token TokRBrace >> popLayoutContext

openImplicit :: Parser ()
openImplicit = do
    c <- sourceColumn <$> getPosition
    pushLayoutContext $ Implicit c

closeImplicit :: Parser ()
closeImplicit = popLayoutContext

block :: Parser a -> Parser [a]
block p = explicitBlock <|> implicitBlock
  where
    explicitBlock = between openExplicit closeExplicit
                    $ stmtSep p
    implicitBlock = between openImplicit closeImplicit
                    $ concat <$> many (align >> stmtSep1 p) <|> return []

block1 :: Parser a -> Parser [a]
block1 p = explicitBlock1 <|> implicitBlock1
  where
    explicitBlock1 = between openExplicit closeExplicit
                     $ stmtSep1 p
    implicitBlock1 = between openImplicit closeImplicit
                     $ concat <$> many1 (align >> stmtSep1 p)
```
In the definition of `block`, the last line looks pretty strange. Why write `concat <$> many (align >> stmtSep1 p) <|> return []` instead of merely `concat <$> many (align >> stmtSep p)`? The problem is that `align >> stmtSep p` can succeed without consuming any tokens, because `stmtSep` accepts the empty production. This would cause `many` to hang, but Parsec actually notices that this has happened and instead raises an error. So we have to represent a possibly empty implicit block as "Either an implicit block with something in it, or an empty implicit block." Also notice that, for a similar reason, `block1` needs `many1 (align >> stmtSep1 p)`. If we simply used `many`, then `implicitBlock1` would accept an empty block.
The structure of the parser itself follows pretty directly from the grammar in the [Haskell 2010 report](https://www.haskell.org/onlinereport/haskell2010/haskellch10.html#x17-18000010.5), so rather than go into it in detail here, I suggest reading the source. The key differences are that we won't (yet) support pattern guards or tuple sections.
## [Full Source](https://github.com/JKTKops/ProtoHaskell/tree/465adc2c7758755994eb8b9c6ec0385d961f1cb9)
Note that immediately after releasing this chapter, I defined the `FastString` type. `OccNames` now contain a `FastString`, and a `FastString` contains the `Text`.
## Closing
This is probably all I'm going to say about the parser unless you guys want more.
There are also some basic tests in the `/test/Compiler/Parser/testcases` directory. I'll be using (mostly) "golden testing" for this project, similar to GHC. The test cases and their expected output are kept in files in the test directory, discovered by the testing engine (see `test/Test.hs`) and run against the expected output. I'm using `tasty-golden` for this. I also have a few small `tasty-hunit` tests. I think as the project progresses, there will continue to be a mix of both. It's staying simple for now, and it will be upgraded as we progress!
================================================
FILE: Contributing.md
================================================
# Contributing
Since a compiler is such a large project, it's virtually guaranteed that I will make mistakes and bad design decisions. I may also write some inefficient code, or code which could be clearer.
Also, I'm still just a college student, and as such it's very likely that I may need to spend a lot of time on my school work and not have as much time to work on ProtoHaskell as I'd like.
So maybe you'd like to help by fixing a mistake I've made. Maybe you just had an improvement on some code that I wrote. You might have some big-picture ideas for the design or some smaller, detailed ideas for a particular section. You might even want to help with major contributions!
If you belong to the latter (that is, you want to be a significant contributor to this project) then your best bet to contact me is probably via Reddit at u/JKTKops. I'm also responsive to emails, but I'll see a Reddit PM faster (this probably isn't a good thing - oh well).
Otherwise, please open an issue (or pull request, if you've already fixed it!) describing
1) What issue you've found (if any) and/or the idea you have (if any)
2) Propose a fix for the issue if you have one (but please open an issue anyway if you don't!)
3) If the above affects existing code, please try and identify which files (including markdown writeups containing copies of the source code) will be affected. You don't have to fix (or even find) these, but tackling this as issues arise will help make sure that the writeup and the code stay synced. It will also help ensure that the code stays synced _across chapters_.
4) If you haven't done (3), then expect some time to pass while I work on the issue or PR to identify everything that needs to be updated as a result of the changes.
If you fix a subtle issue, or otherwise have code in a PR that looks funky or confusing, _please_ comment it. Since this is a tutorial, it's very likely that some relatively newer Haskellers may look to this repository and project for examples, and comments will be critical. If the comment is long, or the same comment appears in multiple places, follow GHC's example by making a block comment somewhere in a relevant file that begins with `Note: [description]` and then a newline. Reference this note from the relevant places with `-- See Note: [description]`. If you are referencing a Note which is in a different file, then reference the note with `See Note: [description] in path/to/file/from/src`. For example,
```
See Note: [Deriving MonadState [s]] in Control/Monad/Supply.hs
```
This makes it easy to find the correct note by using search functionality to look for the bracketed description, which shouldn't appear anywhere in the actual source.
That's it! I hope that this will be able to accommodate those who have simply noticed an error that I should fix, as well as those who wish to help fix such errors and/or make more significant contributions.
================================================
FILE: Overview.md
================================================
The goal of this project is to try and, in some sense, complete Stephen Diehl’s Write You a Haskell. I don’t know if my continuation will have the same level of detail as Diehl’s original work. I plan on having my own project grow linearly as this progresses, but I will attempt to save the work into a separate GitHub repository at key milestones as I progress. This is a learning experience for me too - code won’t be perfect, and I may make large mistakes. I’m following algorithms presented by papers where possible, and GHC itself otherwise.
I don’t plan on targeting C or llvm. I plan on targeting Java. Partly this is because I’ve written garbage collectors before and don’t feel the need to do it again. UPDATE: As of returning to this project, I am now considering targeting C instead, for two reasons. It makes the discussion of the runtime system more interesting, and should make it considerably easier to add an NCG in the future. Considerations will continue. I may start by targeting Java and then have a section later about adding a C backend. A Java backend will (probably) not have an FFI that can understand subclasses. See Eta if you want that.
We’ll be beginning after chapter 7, with the Poly language as a starting point. We’ll be building off of plans laid out in chapter 8, and using some of the parsing ideas presented in chapter 9. The parser will be an Alex lexer + Parsec parser, purely because I’m more comfortable with Parsec than with Happy. Feel free to use Happy instead, as an exercise.
Finally, before beginning, a few words which Diehl omits, presumably since the project never reached a point where they would matter. A compiler, even for a toy language, can become a monolithic piece of software, and there will be lots of code, in lots of files, many of which have similar names (PhExpr, CoreExpr, STGExpr, all containing a type called Expr). Therefore it is critically important that we remain organized and make good use of qualified imports. I think I have a minority opinion that GHC’s use of unqualified module names can make dependencies hard to find and follow for little gain. I’ve rearranged the modules from Chapter 7 as follows:
| - Compiler
| - Parser
| - Lexer.hs
\ - Parser.hs
| - PhSyn
| - PhExpr.hs # The abstract syntax of
ProtoHaskell expressions
\ - PhType.hs # The abstract representation of
ProtoHaskell types
| - TypeCheck # Everything related to type-checking
Eventually includes Core
\ - TcInfer.hs
\ - Types # Wide-spread type utilities, like TyEnv
\ - TyEnv.hs
| - Control
\ - Monad # Custom monadic logic will live here
If it is needed in many places
| - Supply
| - Class.hs
\ - Supply.hs # Monad Transformer for monads which store
a supply of some type. Thin wrapper
around StateT, and will usually be used
for names.
| - Interpreter
| - Eval.hs
\ - Main.hs
\ - Utils
\ - Outputable.hs # Re-exports most of the pretty
library and an Outputable class
This may seem like overkill now, but the organization will be nice later, even if several folders never grow beyond 2 to 4 modules.
By default, I have the following extensions enabled:
I won’t include LANGUAGE pragmas for these extensions. Note that FlexibleContexts is not on (or implied by) this list. I may occasionally comment on their usage, since I suspect this document will turn out to be attractive towards relatively new Haskellers.
For the same reason as the above, I will try to provide very brief overviews of Haskell idioms the first time they appear, usually a sentence or less.
Finally, it is not my goal that the ProtoHaskell Compiler should be able to compile itself. Control.Monad.Supply.Class already uses UndecidableInstances and I would prefer ease of coding over restricting the subset of GHC-Haskell that I use.
[Table of Contents](table_of_contents)
The goal of this project is to try and, in some sense, complete Stephen Diehl's Write You a Haskell. I don't know if my continuation will have the same level of detail as Diehl's original work. I plan on having my own project grow linearly as this progresses, but I will attempt to save the work into a separate GitHub repository at key milestones as I progress. This is a learning experience for me too - code won't be perfect, and I may make large mistakes. I'm following algorithms presented by papers where possible, and GHC itself otherwise.
I don't plan on targeting `C` or `llvm`. I plan on targeting `Java`. Partly this is because I've written garbage collectors before and don't feel the need to do it again. **UPDATE**: As of returning to this project, I am now considering targeting `C` instead, for two reasons. It makes the discussion of the runtime system more interesting, and should make it considerably easier to add an NCG in the future. Considerations will continue. I may start by targeting `Java` and then have a section later about adding a `C` backend. A `Java` backend will (probably) not have an FFI that can understand subclasses. See Eta if you want that.
We'll be beginning after chapter 7, with the `Poly` language as a starting point. We'll be building off of plans laid out in chapter 8, and using some of the parsing ideas presented in chapter 9. The parser will be an Alex lexer + Parsec parser, purely because I'm more comfortable with Parsec than with Happy. Feel free to use Happy instead, as an exercise.
Finally, before beginning, a few words which Diehl omits, presumably since the project never reached a point where they would matter. A compiler, even for a toy language, can become a monolithic piece of software, and there will be lots of code, in lots of files, many of which have similar names (`PhExpr`, `CoreExpr`, `STGExpr`, all containing a type called `Expr`). Therefore it is _critically important_ that we remain organized and make good use of qualified imports. I think I have a minority opinion that GHC's use of unqualified module names can make dependencies hard to find and follow for little gain. I've rearranged the modules from Chapter 7 as follows:
```
| - Compiler
| - Parser
| - Lexer.hs
\ - Parser.hs
| - PhSyn
| - PhExpr.hs # The abstract syntax of
ProtoHaskell expressions
\ - PhType.hs # The abstract representation of
ProtoHaskell types
| - TypeCheck # Everything related to type-checking
Eventually includes Core
\ - TcInfer.hs
\ - Types # Wide-spread type utilities, like TyEnv
\ - TyEnv.hs
| - Control
\ - Monad # Custom monadic logic will live here
If it is needed in many places
| - Supply
| - Class.hs
\ - Supply.hs # Monad Transformer for monads which store
a supply of some type. Thin wrapper
around StateT, and will usually be used
for names.
| - Interpreter
| - Eval.hs
\ - Main.hs
\ - Utils
\ - Outputable.hs # Re-exports most of the pretty
library and an Outputable class
```
This may seem like overkill now, but the organization will be nice later, even if several folders never grow beyond 2 to 4 modules.
By default, I have the following extensions enabled:
```
ApplicativeDo
BangPatterns
FlexibleInstances
FunctionalDependencies
GeneralizedNewtypeDeriving
LambdaCase
MultiParamTypeClasses
NamedFieldPuns
OverloadedStrings
PatternGuards
TupleSections
ViewPatterns
```
I won't include `LANGUAGE` pragmas for these extensions. Note that `FlexibleContexts` is _not_ on (or implied by) this list. I may occasionally comment on their usage, since I suspect this document will turn out to be attractive towards relatively new Haskellers.
For the same reason as the above, I will try to provide very brief overviews of Haskell idioms the first time they appear, usually a sentence or less.
Finally, it is _not_ my goal that the ProtoHaskell Compiler should be able to compile itself. `Control.Monad.Supply.Class` already uses `UndecidableInstances` and I would prefer ease of coding over restricting the subset of GHC-Haskell that I use.
================================================
FILE: Sources.md
================================================
# Sources
Like any complicated project, ProtoHaskell is built on the shoulders of giants. Here I'll keep a running list of papers and websites that I've read while working on this project. I'll try to cite them out of here when one is referenced from a chapter. Please feel free to add missing citations!
In no particular order:
[1] Stephen Diehl. Write You a Haskell. Retrieved from [http://dev.stephendiehl.com/fun/](http://dev.stephendiehl.com/fun/)
[2] Simon L. Peyton Jones. 1992. Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine. _Journal of Functional Programming_ 2, 2 (1992), 127–202. [Online](https://www.microsoft.com/en-us/research/publication/implementing-lazy-functional-languages-on-stock-hardware-the-spineless-tagless-g-machine/).
[3] [GHC wiki](https://gitlab.haskell.org/ghc/ghc/wikis/home)
[4] [GHC Source Tree](https://gitlab.haskell.org/ghc/ghc/tree/master). Many ideas and some implementations are taken and adapted from GHC sources. Header comments on such modules will indicate when this is the case.
================================================
FILE: _config.yml
================================================
# Welcome to Jekyll!
#
# This config file is meant for settings that affect your whole blog, values
# which you are expected to set up once and rarely edit after that. If you find
# yourself editing this file very often, consider using Jekyll's data files
# feature for the data you need to update frequently.
#
# For technical reasons, this file is *NOT* reloaded automatically when you use
# 'bundle exec jekyll serve'. If you change this file, please restart the server process.
#
# If you need help with YAML syntax, here are some quick references for you:
# https://learn-the-web.algonquindesign.ca/topics/markdown-yaml-cheat-sheet/#yaml
# https://learnxinyminutes.com/docs/yaml/
#
# Site settings
# These are used to personalize your new site. If you look in the HTML files,
# you will see them accessed via {{ site.title }}, {{ site.email }}, and so on.
# You can create any custom variable you would like, and they will be accessible
# in the templates via {{ site.myvariable }}.
title: Write You a Haskell 2
email: zerglingk9012@gmail.com
description: >- # this means to ignore newlines until "baseurl:"
A continuation of Stephen Diehl's Write You a Haskell
baseurl: "/Write-You-a-Haskell-2" # the subpath of your site, e.g. /blog
url: "jktkops.github.io" # the base hostname & protocol for your site, e.g. http://example.com
github_username: jktkops
# Build settings
theme: jekyll-theme-midnight
plugins:
- jekyll-feed
# Exclude from processing.
# The following items will not be processed, by default.
# Any item listed under the `exclude:` key here will be automatically added to
# the internal "default list".
#
# Excluded items can be processed by explicitly listing the directories or
# their entries' file path in the `include:` list.
#
# exclude:
# - .sass-cache/
# - .jekyll-cache/
# - gemfiles/
# - Gemfile
# - Gemfile.lock
# - node_modules/
# - vendor/bundle/
# - vendor/cache/
# - vendor/gems/
# - vendor/ruby/
================================================
FILE: _layouts/default.html
================================================
{% seo %}