Exercise 33: Parsers
Imagine you're given a huge list of numbers and you have to enter them into a spreadsheet. At first, this huge list is just a raw stream of digits separated by spaces. Your brain automatically breaks the stream of digits at the spaces and creates numbers. That's your brain acting like a scanner. You then take each number and enter it into rows and columns that have meaning. Your brain is acting like a parser by taking the flat stream of numbers (tokens) and turning them into a 2-dimensional grid of more meaningful rows and columns. The rules you follow for which numbers go into which rows and which columns are your "grammar," and a parser's job is to enforce the grammar, just like you would with a spreadsheet.
Let's look at the example Puny Python code from Exercise 32 one more time and discuss parsers from three different viewing angles:
    def hello(x, y):
        print(x + y)

    hello(10, 20)
When you look at this code what do you see? I see a tree, similar to the BSTree or TSTree we created earlier. Do you see the tree? Let's start at the very beginning of this file and learn how to go from the characters to a tree.
First, when we load a .py file it is just a stream of "characters"--actually bytes, but Python uses Unicode, so characters will have to do. These characters are in a flat line with zero structure to them, and the job of our scanner is to add the first level of meaning. The scanner does this by using regular expressions to extract meaning from the stream of characters, creating a list of tokens. We have gone from a list of characters to a list of tokens, but look at the def hello(x, y): function. That is a function with a block of code inside it, which means some form of "containment" or "thing inside a thing" structure.
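For example, scanning just the first line of the snippet might produce a token list like this (the token names here are one possible choice, not the only one):

    [('DEF', 'def'), ('NAME', 'hello'), ('LPAREN', '('),
     ('NAME', 'x'), ('COMMA', ','), ('NAME', 'y'),
     ('RPAREN', ')'), ('COLON', ':')]

Notice that the tokens are still a flat list: nothing in them says that x and y live inside the parentheses.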
A really easy way to represent containment is with a tree. We could use a table like your spreadsheet, but it's not as easy to work with as a tree. Look at the hello(x, y) part next. We have a NAME(hello) token, but we want to grab the contents of the (...) part and know that it is inside the parentheses. Again, we can use a tree, and we "nest" the x, y part inside the (...) part as a child or branch of the tree. Eventually we would have a tree starting at the root of this Python code, and each block, print, function definition, and function call would be a branch from the root, each with children of its own, and so on.
Why would we do this? We need to know the structure of the Python code based on its grammar so that we can analyze it later. If we don't convert the linear list of tokens into a tree structure, then we have no idea where the boundaries of functions, blocks, branches, or expressions are. We'd have to determine those boundaries on the fly in a "straight line" way, which isn't very easy to do reliably. Plenty of early terrible languages were straight-line languages like that, and we now know they don't have to be. We can use a parser to build the tree structure.
The job of a parser is to take the list of tokens from the scanner and translate it into a more meaningful tree of grammar. You can think of a parser as applying another regular expression to the token stream: the scanner's regex boxes up chunks of characters into tokens, and the parser's "regex" puts those tokens inside boxes that have boxes inside them, and so on, until the tokens are no longer linear.
The parser also adds meaning to these boxes. It would simply drop the () parenthesis tokens and just create a special parameters list for a possible Function class. It would drop colons, useless spaces, commas, and any other tokens that don't actually add meaning, and translate the rest into nested structures that are easier to process. The end result might look like this fake tree for the above sample code:
    * root
      * Function
        - name = hello
        - parameters = x, y
        - code:
          * Call
            - name = print
            - parameters =
              * Expression
                - Add
                  - a = x
                  - b = y
      * Call
        - name = hello
        - parameters = 10, 20
Study how I went from the above sample Python code to this fictional tree that represents the code before you move on. It's not difficult to understand, but it's key that you be able to look at the Python code and figure out the tree structure.
Recursive Descent Parsing
There are a few established ways to create a parser for this kind of grammar, but the simplest to start with is called a Recursive Descent Parser (RDP). I actually taught you this topic in my Learn Python The Hard Way, Exercise 49. You created a simple RDP parser to handle your little game language, and you didn't even know it. In this exercise I'll give a more formal description of how to write an RDP parser, then let you attempt it for our little snippet of Python above.
An RDP uses mutually recursive function calls that implement the tree structure of a given grammar. The code for an RDP parser looks like the actual grammar you're processing, and RDP parsers are simple to write as long as you follow some rules. The two disadvantages of an RDP parser are that it might not be very efficient, and that you usually write it by hand, so it will have more errors than a generated parser. There are also some theoretical limitations on what an RDP parser can parse, but since you write them by hand, you can usually work around many of those limitations.
To write an RDP parser you need to process your scanner tokens using three main operations (a minimal sketch in code follows this list):
- peek
  Return the next token if it could match, but don't take it off the stream.
- match
  Match the next token, taking it off the stream.
- skip
  Skip the next token, as it's not needed, taking it off the stream.
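To make these concrete, here is a minimal sketch of the three operations, assuming tokens are plain (TYPE, VALUE) tuples in a list rather than a full Scanner class:

    def peek(tokens):
        """Return the type of the next token, or None if the stream is
        empty, without taking it off the stream."""
        return tokens[0][0] if tokens else None

    def match(tokens, expected):
        """Take the next token off the stream, confirming it has the
        expected type, and return its value."""
        token_type, value = tokens.pop(0)
        assert token_type == expected, f"Expected {expected}, got {token_type}"
        return value

    def skip(tokens):
        """Discard the next token, since it adds no meaning."""
        tokens.pop(0)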
You'll notice these are the three operations I told you to create for your Scanner in Exercise 32, and this is why: you need them to write an RDP parser.
You use these three functions to write grammar parsing functions that grab tokens from the scanner. A short example from this exercise is parsing this simple function:
    def function_definition(tokens):
        skip(tokens)  # discard def
        name = match(tokens, 'NAME')
        match(tokens, 'LPAREN')
        params = parameters(tokens)
        match(tokens, 'RPAREN')
        match(tokens, 'COLON')
        return {'type': 'FUNCDEF', 'name': name, 'params': params}
You can see I'm simply taking the tokens and using match and skip to process them. You'll also notice I have a parameters function, which is the "Recursive" part of "Recursive Descent Parser": function_definition calls parameters when it needs to parse the parameters to a function.
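Here's what a possible parameters function might look like, sketched against the params rule you'll see in the BNF below; it relies on an expression function (defined later for my demo parser) to handle each individual parameter:

    def parameters(tokens):
        """params = expression *(COMMA expression)"""
        params = [expression(tokens)]
        while peek(tokens) == 'COMMA':
            skip(tokens)  # discard the comma separator
            params.append(expression(tokens))
        return params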
BNF Grammars
It's a little tricky to write an RDP parser from scratch without some form of specification of the grammar. Do you remember when I asked you to convert a regex into an FSM? It was hard, right? It required a lot more code than just the few characters in the regex. When you write the RDP parser for this exercise you'll be doing a similar thing, so it helps to have a language that's "regex for grammars".
The most common "regex for grammars" is called Backus–Naur Form (BNF), named after its creators John Backus and Peter Naur. A BNF describes the tokens required and how those tokens repeat to form the grammar for a language. BNF also uses the same symbols as regular expressions, so *, +, and ? have similar meanings.
For this exercise I'm going to use the IETF Augmented BNF syntax at https://tools.ietf.org/html/rfc5234 to specify the grammar for the Puny Python snippet above. ABNF operators are mostly the same as in regex, except that, for some odd reason, repetition goes before the thing to repeat, not after. Other than that, read the specification and try to figure out what the following means:
    root = *(funccall / funcdef)
    funcdef = DEF name LPAREN params RPAREN COLON body
    funccall = name LPAREN params RPAREN
    params = expression *(COMMA expression)
    expression = name / plus / integer
    plus = expression PLUS expression
    PLUS = "+"
    LPAREN = "("
    RPAREN = ")"
    COLON = ":"
    COMMA = ","
    DEF = "def"
Let's look at only the funcdef line and compare it to the function_definition Python code we have above, matching each part:
- funcdef =
  This we replicate with def function_definition(tokens), and it starts this part of our grammar.
- DEF
  This is specified in the grammar as DEF = "def", and in the Python code I just discard it with skip(tokens).
- name
  I need this, so I match it with name = match(tokens, 'NAME'). I use the convention of CAPITALS to indicate things in the BNF that I will most likely skip.
- LPAREN
  I've already consumed the def, and now I want to enforce a (, so I match it but ignore the result with match(tokens, 'LPAREN'). It's "required but skipped".
- params
  In the BNF I define params as a new "grammar production" or "grammar rule". That means my Python code needs a new function, and inside function_definition I call that function with params = parameters(tokens). Later I define the parameters function to process the comma-separated parameters of a function.
- RPAREN
  I discard this but still require it, again with match(tokens, 'RPAREN').
- COLON
  And once again, discard the match: match(tokens, 'COLON').
- body
  I'm actually skipping the body here because Python's indented syntax is a little too hard for this simple example code. You will need to handle this problem in the exercise, unless you feel like going for it now.
That is basically how you read an ABNF specification and systematically translate it into code. You start at the root, implement each grammar production as a function, and leave the scanner to handle the simple tokens (which I denote with CAPITAL letters).
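For example, the funccall production translates just as mechanically as funcdef did. A sketch in the same style as function_definition above:

    def function_call(tokens):
        """funccall = name LPAREN params RPAREN"""
        name = match(tokens, 'NAME')
        match(tokens, 'LPAREN')
        params = parameters(tokens)
        match(tokens, 'RPAREN')
        return {'type': 'FUNCCALL', 'name': name, 'params': params}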
Quick Demo Hack Parser
Here's my quick hack RDP parser that you can use as a base for your more formal and clean parser:
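As a rough guide, the remaining productions of such a hack parser, building on the function_definition, parameters, and function_call sketches above (the INTEGER token name is my assumption), might look like this:

    def root(tokens):
        """root = *(funccall / funcdef)"""
        results = []
        while peek(tokens):  # assumes peek returns None at end of stream
            if peek(tokens) == 'DEF':
                results.append(function_definition(tokens))
            else:
                results.append(function_call(tokens))
        return results

    def expression(tokens):
        """expression = name / plus / integer"""
        if peek(tokens) == 'NAME':
            left = {'type': 'NAME', 'value': match(tokens, 'NAME')}
        elif peek(tokens) == 'INTEGER':
            left = {'type': 'INTEGER', 'value': match(tokens, 'INTEGER')}
        else:
            assert False, f"Syntax error: unexpected {peek(tokens)}"
        # plus = expression PLUS expression is left-recursive, which an
        # RDP cannot follow literally, so parse the left side first and
        # only then look for a PLUS--one of those hand-made workarounds
        if peek(tokens) == 'PLUS':
            skip(tokens)  # discard the +
            return {'type': 'PLUS', 'left': left, 'right': expression(tokens)}
        return left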
You'll notice I'm using a scanner module I wrote that has my match, peek, skip, and scan functions. I use from scanner import * only to make this example more understandable. You should use your Scanner class.
You'll notice that I put the ABNF for this little parser in the documentation comments of each function. This helped me write each parsing function and could be used for error reporting later. You should study this parser, possibly even using it as a master copy, before you attempt the Exercise Challenge.
Exercise Challenge
Your next challenge is to combine your Scanner class with a freshly written Parser class that you can subclass to re-implement my simple parser here. Your base Parser class should be able to do the following (a possible skeleton follows this list):
- Take a Scanner object and execute itself. You can assume any default function is the start of a grammar.
- Have error handling code that is better than my simple assert usage here.
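A possible skeleton for that base class, with naming choices that are mine rather than a fixed API, might be:

    class ParserError(Exception):
        """Raised with a helpful message instead of a bare assert."""
        pass

    class Parser(object):
        """Drives a Scanner through a grammar. Subclasses implement the
        grammar productions, starting with root."""

        def __init__(self, scanner):
            self.scanner = scanner

        def parse(self):
            """Execute the parser against the scanner's tokens, treating
            root as the default start of the grammar."""
            return self.root()

        def root(self):
            raise NotImplementedError("Subclasses supply the grammar's root.")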
You should then implement PunyPythonPython, which can parse just this tiny little Python language and do the following:
- Instead of just producing a list of dicts, you should create classes for each grammar production result. These classes then become objects inside a list.
- The classes only need to store the tokens that are parsed, but be prepared to do more later.
- You only have to parse this little language, but you should attempt to solve the "Python indent" problem. You might have to change the Scanner to be smart about matching INDENT whitespace only at the beginning of a line and ignoring it everywhere else. You will also need to keep track of how much you've indented, and record an indent of 0 so you can "dedent" the blocks (a sketch of one approach follows this list).
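Here is a sketch of one classic approach to the indent problem: keep a stack of indent widths, emit an INDENT token when a line starts deeper than the top of the stack, and emit DEDENT tokens while it's shallower. All the names here are my assumptions:

    import re

    LEADING_WS = re.compile(r"^[ \t]+")  # whitespace at the start of a line only

    def indent_tokens(line, stack):
        """Compare a line's leading whitespace against the indent stack
        (which starts as [0]) and yield INDENT/DEDENT tokens."""
        found = LEADING_WS.match(line)
        width = len(found.group(0)) if found else 0
        if width > stack[-1]:
            stack.append(width)
            yield ('INDENT', width)
        while width < stack[-1]:
            stack.pop()
            yield ('DEDENT', width)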
An extensive test suite would involve handing more samples of this tiny little Python to the parser, but for now just get this one little file to parse. Try to have good coverage in the tests and exploit as many errors as you can.
Study Drills
This exercise is huge, so just get it done. Take your time and chip away at it a little at a time. I highly recommend studying my tiny sample here until you've fully figured it out, and also printing out the tokens being processed at key points.
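One cheap way to do that printing is a tracing wrapper you can temporarily swap in for match while you study the flow; it's a debugging hack, not part of the parser proper:

    def traced_match(tokens, expected):
        """Print the next few tokens before matching, so you can watch
        the parser walk the token stream."""
        print(f"match {expected!r}, next tokens: {tokens[:3]}")
        return match(tokens, expected)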
Further Study
Check out the SLY Parser Generator by David Beazley for a way to have a computer figure out your parser and scanner (aka lexer) for you. Feel free to attempt to replicate this exercise with SLY for comparison.