Warning

This book is new. If you'd like to go through it, then join the Learn Code Forum to get help while you attempt it. I'm also looking for feedback on the difficulty of the projects. If you don't like public forums, then email help@learncodethehardway.org to talk privately.

Exercise 32: Scanners

My first book very casually covered scanners in Learn Python The Hard Way, Exercise 48, but now we're going to get more formal. I'll explain the concept behind scanning text, how that relates to regular expressions, and how you can create a little scanner for a tiny piece of Python code.

Let's use the following Python code as an example to start the discussion:

def hello(x, y):
    print(x + y)

hello(10, 20)

You've been training in Python for a while, so your brain most likely can read this code quickly, but do you really understand what it is? When I (or someone else) taught you Python I had you memorize all the "symbols". The def and () characters are each symbols, but there needs to be a way for Python to process these that is reliable and consistent. Python also needs to be able to read hello and understand it's a "name" for something, and then later know the difference between def hello(x, y) and hello(10, 20). How does it do this?

The first step to doing this is scanning the text looking for "tokens". At the scanning phase a language like Python doesn't first care what's a symbol (def) and what's a name (hello). It will simply try to convert the input language into patterns of text called "tokens". It does this by applying a sequence of regular expressions that "match" each of the possible inputs that Python understands. You'll remember from Exercise 31 that a regular expression is a way to tell Python what sequences of characters to match or accept. All the Python interpreter does is use many regular expressions to match every token it understands.

If you look at the code above, you might be able to write a set of regular expressions to process it. You'd need a simple regex for def that's just "def". You'd need more for the ()+:, characters. You'd then be left with how to handle print, hello, 10, and 20.

Once you've identified all the symbols in the code sample above you need to name them. You can't just refer to them by their regex, since it's inefficient to look up and also confusing. Later you'll learn that giving each symbol its own name (or number) simplifies parsing, but for now, let's devise some names for these regex patterns. I could say def is simply DEF, then ()+:, can be LPAREN RPAREN PLUS COLON COMMA. After that I can call the regex for words like hello and print simply NAME. By doing this I'm coming up with a way to convert the stream of raw text someone enters into a stream of single number (or named) tokens to use in later stages.

Python is also tricky because it needs a leading whitespace regular expression to handle the indenting and dedenting of the code blocks. For now, let's just use a fairly dumb one of ^\s+ and then pretend that this also captures how many spaces were used at the beginning of the line.

Eventually you'd have a set of regular expressions that can handle the preceding code and it might look something like this:

Regex Token
def DEF
[a-zA-Z_][a-zA-Z0-9_]* NAME
[0-9]+ INTEGER
\( LPAREN
\) RPAREN
\+ PLUS
: COLON
, COMMA
^\s+ INDENT

The job of a scanner is to take these regular expressions and use them to break the input text into a stream of identified symbols. If I do that to the code example I could produce this:

DEF NAME(hello) LPAREN NAME(x) COMMA NAME(y) RPAREN COLON
INDENT(4) NAME(print) LPAREN NAME(x) PLUS NAME(y) RPAREN
NAME(hello) RPAREN INTEGER(10) COMMA INTEGER(20) RPAREN

Study this transformation, matching each line of this scanner output, and compare it to the Python code above using the regular expressions in the table. You'll see that this is simply taking the input text and matching each regex to the token name and then saving any information needed like hello or the number 10.

Puny Python Scanner

I've written a very small little Python script that demonstrates scanning this tiny little Python language:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
import re

code = [
"def hello(x, y):",
"    print(x + y)",
"hello(10, 20)",
]

TOKENS = [
(re.compile(r"^def"),                    "DEF"),
(re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*"), "NAME"),
(re.compile(r"^[0-9]+"),                 "INTEGER"),
(re.compile(r"^\("),                     "LPAREN"),
(re.compile(r"^\)"),                     "RPAREN"),
(re.compile(r"^\+"),                     "PLUS"),
(re.compile(r"^:"),                      "COLON"),
(re.compile(r"^,"),                      "COMMA"),
(re.compile(r"^\s+"),                    "INDENT"),
]

def match(i, line):
    start = line[i:]
    for regex, token in TOKENS:
        match = regex.match(start)
        if match:
            begin, end = match.span()
            return token, start[:end], end
    return None, start, None

script = []

for line in code:
    i = 0
    while i < len(line):
        token, string, end = match(i, line)
        assert token, "Failed to match line %s" % string
        if token:
            i += end
            script.append((token, string, i, end))

print(script)

When you run this script you get a list of tuples that are TOKEN, the matched string, beginning, and end like this:

[('DEF', 'def', 3, 3), ('INDENT', ' ', 4, 1), ('NAME', 'hello', 9, 5),
('LPAREN', '(', 10, 1), ('NAME', 'x', 11, 1), ('COMMA', ',', 12, 1),
('INDENT', ' ', 13, 1), ('NAME', 'y', 14, 1), ('RPAREN', ')', 15, 1),
('COLON', ':', 16, 1), ('INDENT', '    ', 4, 4), ('NAME', 'print', 9, 5),
('LPAREN', '(', 10, 1), ('NAME', 'x', 11, 1), ('INDENT', ' ', 12, 1),
('PLUS', '+', 13, 1), ('INDENT', ' ', 14, 1), ('NAME', 'y', 15, 1),
('RPAREN', ')', 16, 1), ('NAME', 'hello', 5, 5), ('LPAREN', '(', 6, 1),
('INTEGER', '10', 8, 2), ('COMMA', ',', 9, 1), ('INDENT', ' ', 10, 1),
('INTEGER', '20', 12, 2), ('RPAREN', ')', 13, 1)]

This code is definitely not the fastest or most accurate scanner you can create. This is a simple script to demonstrate the basics of how a scanner works. For doing real scanning work you would use a tool to generate a scanner that's much more efficient. I cover that in the Further Study section.

Exercise Challenge

Your job is to take this sample scanner code and turn it into a generic Scanner class to use later. The goal of this Scanner class is to take an input file, scan it into this list of tokens, and then allow you to pull the tokens out in order. The API should have the following capabilities:

__init__
Takes a similar list of tuples (without the re.compile) and configures the scanner.
scan
Takes a string and runs the scan on it, creating a list of tokens for later. You should keep this string around for people to access later.
match
Given a list of possible tokens, return the first one that matches the first token in the list and removes it.
peek
Given a list of possible tokens, returns which ones could work with match but does not remove it from the list.
push
Push a token back on the token stream so that a later peek or match will return it.

You should also create a generic Token class that replaces the tuple I'm using. It should be able to track the token found, the matched string, the beginning, and the end of where it matched in the original string.

Study Drills

  1. Install the library pytest-cov and use it to measure the coverage of your automated tests.
  2. Use the results of pytest-cov to improve your automated tests.

Further Study

The better way to create a scanner is to exploit the following three facts about regular expressions:

  1. Regular expressions are finite state machines.
  2. You can combine small finite state machines into larger more complex finite state machines accurately.
  3. A combined finite state machine that matches many small regexes will operate the same as each of the regular expressions on their own and be more efficient.

There are many tools that use this fact to accept a scanner definition, turn each little regex into FSMs, and then combine them to produce one fast piece of code that can reliably match all the tokens. The advantage of this is you can feed these generated scanners individual characters in a rolling fashion and have them identify the tokens rapidly. This is preferred to the way I'm doing it here where I chunk up the string and try a sequence of regex in order until I find one.

Research how scanner generators work and compare them to what you've written.

Pre-order Learn More Python The Hard Way

When you pre-order Learn More Python The Hard Way, you'll receive the Python 3 Edition as it's being created. All files are DRM free and you can download them to your computer for offline viewing. Digital Download Only! You do not get a physical book.

$29.00

Buy the Online Course

Or, you can read Learn More Python the Hard Way for free right here, video lectures not included.

Other Buying Options

Other buying options coming soon.

No content available for this exercise. You can view all available downloads at your customer account page.