This book is new. If you'd like to go through it, then join the Learn Code Forum to get help while you attempt it. I'm also looking for feedback on the difficulty of the projects. If you don't like public forums, then email firstname.lastname@example.org to talk privately.
Exercise 34: Analyzers
You now have a parser that should be producing a tree of grammar production objects. I'll call this your "parse tree," and it means you can analyze the whole program by starting at the top of the parse tree and then "walking" it until you've visited every node. You've done something like this when you were learning about the BSTree and TSTree data structures. You started at the top and visited each node, and the order you visited them in (depth-first, breadth-first, in-order, etc.) determined how the nodes were processed. Your parse tree has the same capability, and your next step in writing the little Python interpreter is to walk the tree and analyze it.
The Analyzer's job is to find semantic mistakes in your grammar and to fix or add information the next stage needs. Semantic mistakes are errors that, while grammatically correct, don't make sense as a Python program: anything from a variable that hasn't been defined yet to code that is valid grammar but complete nonsense. Some language grammars are so loose that the Analyzer has to do a lot of work to fix up the parse tree. Other languages are so easy to parse and process that they don't need an Analyzer step at all.
To write an analyzer you'll need a way to visit each node in your parse tree, analyze it for errors, and fix any missing information. There are three general ways you can do this:
- You create an analyzer that knows how to update each grammar production. It will walk the parse tree in a similar way as your Parser did, with a function for each type of production, but its job is to alter, update, and check the productions.
- You change your grammar productions so they know how to analyze their own state. Then your analyzer is simply an engine that walks the parse tree and calls each production's analyze() method. With this style you'll need some state that gets passed to each grammar production, which should be a third class.
- You create a separate set of classes that implement the final analyzed tree you can hand to an interpreter. In many ways you'd mirror the parser's grammar productions with a set of new classes that take a global state, a grammar production, and configure their __init__ so that they're the analyzed result.
I recommend either #2 or #3 for your Exercise Challenge today.
The "Visitor Pattern" is a very common technique in object-oriented languages where you create classes that know what they should do when "visited." This keeps the code for processing a class inside that class itself. The advantage is that you don't need large if-statements checking types to decide what to do with each object. Instead, you just create a class similar to this:
```python
class Foo(object):
    def visit(self, data):
        # do stuff to self for foo
        pass
```
Once you have this class (and visit can be called anything), you then just go through a list and call it:
```python
for action in list_of_actions:
    action.visit(data)
```
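To make that concrete, here's a tiny runnable demonstration of the pattern. Dog and Cat are made-up classes for illustration only, not part of Puny Python:

```python
class Dog(object):
    def visit(self, data):
        data.append("woof")

class Cat(object):
    def visit(self, data):
        data.append("meow")

# No type checks needed: each class knows what to do when it is visited.
sounds = []
for action in [Dog(), Cat(), Dog()]:
    action.visit(sounds)

print(sounds)  # ['woof', 'meow', 'woof']
```

Notice the loop never asks what kind of object it has; it just calls visit() and lets each class handle itself.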
You'll use this pattern for both the #2 and #3 styles of analyzer; the only difference is the following:
- If you decide that your grammar productions will also be the analysis results, then your analyze() function (that's our visit()) simply stores that data in the production class or in a state that's given to it.
- If you decide that your grammar productions will produce another set of classes for the interpreter (see Exercise 35), then each call to analyze will return a new object that you put into a list for later, or attach as children to the current object.
I'm going to cover the first situation where your grammar productions are also your analyzer results. This works for our simple little Puny Python script, and you should follow along with this style. If you want to try the other design later then you can.
A Short Puny Python Analyzer
You should stop here if you want to attempt to implement a visitor pattern for your grammar productions on your own. I'm going to give a fairly complete but simple example that is full of spoilers.
The concept behind a visitor pattern seems bizarre, but it makes total sense. Each grammar production knows what it should be doing at different stages, so you might as well keep the code for that stage near the data it needs. To demonstrate this I've written a small dummy PunyPyAnalyzer that simply prints out the parse using the visitor pattern. I'm only doing one sample grammar production so you can understand how this is done. I don't want to give you too many clues.
The first thing I do is define a Production class that all of my grammar productions will inherit from.
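A minimal sketch of that base class, using the names from the surrounding text (the method body is my guess):

```python
class Production(object):

    def analyze(self, world):
        """Each subclass overrides this to analyze itself using the world's state."""
        raise NotImplementedError("subclasses must implement analyze()")
```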
This has my initial analyze() method and takes the PunyPyWorld we'll use later. The first example grammar production is a FuncCall production:
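Here's a guess at the shape of that production, assuming a simple Production base class like the one described above; the attribute names are mine:

```python
class Production(object):
    # simplified stand-in for the base class so this snippet runs on its own
    def analyze(self, world):
        raise NotImplementedError()

class FuncCall(Production):

    def __init__(self, name, params):
        self.name = name      # the function's name, e.g. "hello"
        self.params = params  # a Parameters production for the arguments

    def analyze(self, world):
        print("function call:", self.name)
        # hand off to the Parameters production, which visits each expression
        self.params.analyze(world)
```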
Function calls have a name and a params, which is a Parameters production class for the function call's parameters. Look at the analyze() method and you'll see the first visitor function. When you get to the PunyPyAnalyzer you'll see how this is run, but notice this function then calls param.analyze(world) on each of this function's parameters:
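A sketch of Parameters consistent with that description, again assuming the Production base class:

```python
class Production(object):
    # simplified stand-in for the base class so this snippet runs on its own
    def analyze(self, world):
        raise NotImplementedError()

class Parameters(Production):

    def __init__(self, expressions):
        self.expressions = expressions  # list of Expr productions

    def analyze(self, world):
        # visit every expression that makes up the parameter list
        for expr in self.expressions:
            expr.analyze(world)
```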
That leads to the Parameters class, which contains each of the expressions that make up the parameters to the function. The Parameters.analyze simply goes through its list of expressions, of which we have two:
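Those expression classes might be sketched like this; every detail beyond the class names is my guess, and the analyze() methods just print their contents:

```python
class Production(object):
    # simplified stand-in for the base class so this snippet runs on its own
    def analyze(self, world):
        raise NotImplementedError()

class Expr(Production):
    """Base class for all expression productions."""

class IntExpr(Expr):

    def __init__(self, integer):
        self.integer = integer

    def analyze(self, world):
        print("int:", self.integer)

class AddExpr(Expr):

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def analyze(self, world):
        print("add:")
        self.left.analyze(world)
        self.right.analyze(world)
```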
In this example I'm only adding two numbers, but I create a base Expr class and then an IntExpr and AddExpr class to do it. Each of these simply have analyze() methods that print out their contents.
With that we have the classes for our parse trees, and we can do some analysis. The first thing we need is a world that can keep track of variable definitions, functions, and other things our Production.analyze() methods need.
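A world that tracks variables and functions could be as small as this (the attribute names are my assumption):

```python
class PunyPyWorld(object):

    def __init__(self, variables):
        self.variables = variables  # one flat table of all variables
        self.functions = {}         # functions found during analysis, for the interpreter
```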
When any Production.analyze() method is called, a PunyPyWorld object is passed to it so the analyze() method knows the state of the world. It can update variables, look for functions, and do anything else needed in the world.
We then need a PunyPyAnalyzer that can take a parse tree and the world and make all the grammar productions run:
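The analyzer engine needs almost no code: it just walks the top of the tree and lets each production recurse from there. A sketch, assuming productions with an analyze(world) method:

```python
class PunyPyAnalyzer(object):

    def __init__(self, parse_tree, world):
        self.parse_tree = parse_tree  # list of top-level productions
        self.world = world

    def analyze(self):
        # the whole engine: visit each top-level node, which recurses from there
        for node in self.parse_tree:
            node.analyze(self.world)
```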
This is easy enough to then set up for a simple call to the function hello(10 + 20):
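To tie everything together, here's an end-to-end runnable reconstruction of the whole example. Every class body is my guess at the original listings; only the class names come from the text:

```python
class Production(object):
    def analyze(self, world):
        raise NotImplementedError()

class FuncCall(Production):
    def __init__(self, name, params):
        self.name = name
        self.params = params

    def analyze(self, world):
        print("function call:", self.name)
        self.params.analyze(world)

class Parameters(Production):
    def __init__(self, expressions):
        self.expressions = expressions

    def analyze(self, world):
        for expr in self.expressions:
            expr.analyze(world)

class Expr(Production):
    """Base class for expressions."""

class IntExpr(Expr):
    def __init__(self, integer):
        self.integer = integer

    def analyze(self, world):
        print("int:", self.integer)

class AddExpr(Expr):
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def analyze(self, world):
        print("add:")
        self.left.analyze(world)
        self.right.analyze(world)

class PunyPyWorld(object):
    def __init__(self, variables):
        self.variables = variables
        self.functions = {}

class PunyPyAnalyzer(object):
    def __init__(self, parse_tree, world):
        self.parse_tree = parse_tree
        self.world = world

    def analyze(self):
        for node in self.parse_tree:
            node.analyze(self.world)

variables = {}
world = PunyPyWorld(variables)
# a hand-built parse tree for: hello(10 + 20)
script = [
    FuncCall("hello",
             Parameters([AddExpr(IntExpr(10), IntExpr(20))]))
]
analyzer = PunyPyAnalyzer(script, world)
analyzer.analyze()
```

The script variable starts out as a list because a program is a sequence of top-level productions, and the analyzer simply walks that list.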
Be sure you understand how I structured that script variable. Notice how there's a list as the first thing?
Parser versus Analyzer
In this example I'm assuming the PunyPyParser has converted the NUMBER tokens to integers. In other languages you might leave them as raw tokens and have the PunyPyAnalyzer do the conversion. It all depends on where you want the errors to happen, and where you can do the most useful analysis. If you put the work in the parser, then you can report formatting errors right away. If you put it in the analyzer, then you can report errors that use the entire parsed file for context.
The point of all these analyze() methods is not to just print things out but instead to change the internal state of each Production subclass so that the interpreter can run it like a script. Your job in this exercise is to take your grammar production classes (which might be different from mine) and make them analyze.
Feel free to steal my starting point. You can take my analyzer and my world if you need, but you should have attempted to write your own first. You should also compare your production classes from Exercise 33 to mine. Are yours better? Can they support this design? Change them if they can't.
Your analyzer will need to do a few things for the interpreter to work correctly:
- Keep track of variable definitions. In a real language this would require some very complicated nested tables, but for Puny Python just assume there's one giant table (a TSTree or dict) that all variables are in. This means the x and y parameters to the hello(x, y) function are actually global variables.
- Keep track of where functions are so you can run them later. Our Puny Python is going to have just simple functions you can run, but when the Interpreter runs it needs to "jump" to them and execute them. The best way to do that is to keep them around for later.
- Check for any errors you can think of, such as using a variable that's missing. This is tricky because a language like Python does more of its error checking in the Interpreter stage. You should decide what errors are possible during analysis and implement them. For example, what happens if I try to use a variable that hasn't been defined yet?
- If you've implemented Python INDENT syntax correctly, then your FuncCall productions should have their attached code. The Interpreter is going to need that to run it, so make sure there's a way to get to it.
- This exercise is fairly hard already, but how would you create a better way to store variables that implements at least one more level of scope? Remember "scope" is the idea that the x, y in hello(x, y) do not impact an x or y variable you define outside the hello function.
- Implement assignment in your Scanner, Parser, and Analyzer. That means I should be able to do x = 10 + 14 and you can handle it.
Research the difference between "expression-based" and "statement-based" programming languages. The short version is that some languages have only expressions, so everything has some kind of return value associated with it. Other languages have both expressions, which have values, and statements, which don't, so assigning a variable to a statement should fail. Which kind of language is Python?