Warning
This book is new. If you'd like to go through it, then join the Learn Code Forum to get help while you attempt it. I'm also looking for feedback on the difficulty of the projects. If you don't like public forums, then email help@learncodethehardway.org to talk privately.
Exercise 31: Regular Expressions
A regular expression (regex) is a succinct way to encode how a sequence of characters should be matched in a string. Regular expressions are normally thought of as "scary" but, as you know, anything wrapped in fear is usually just taught wrong. In reality, regular expressions are a set of about eight symbols that tell a computer how to match a pattern. Used simply, they are easy to understand. Where people run into trouble is trying to use incredibly complex regular expressions where an actual parser would be better. Once you understand these eight symbols and the limitations of regular expressions, you'll see they aren't scary at all.
I'm going to have you do some more memorization to prime your brain for the discussion. The important symbols to memorize are:
- ^
- Anchor beginning of the string. This will match only if the match starts right at the beginning.
- $
- Anchor end of the string. This will match only if it goes to the end.
- .
- Any one char. Accept any single character input.
- ?
- Optional previous. The previous part of the regex is optional, so A? means an optional "A" character.
- *
- 0 or more previous (any number of times). Take the previous part of the regex and accept it repeatedly or skip over it. A* will accept "AAAAAAA" and will also accept "BQEFT", since matching zero A characters is allowed.
- +
- 1 or more previous (at least once). Same as * but it only accepts if the string has 1 or more of those characters. A+ will accept "AAAAAAA" but not "BQEFT".
- [X-Y]
- Class (range) of chars from X-Y. Accepts any one of the characters in the range from X to Y. Using [A-Z] matches any capital English letter. There are backslash shortcuts for many common character ranges, such as \d for digits, that you can use instead of this.
- ()
- Capture this part of the regular expression for later. Many regular expression libraries are used to also replace, extract, or alter text. A capture will take the part of the regex inside the (), and save it for later use. Many libraries then let you reference these captures. If you did ([A-Z]+) that would capture 1 or more capital English letters.
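To make the list above concrete, here is a minimal sketch that exercises each symbol once with the Python re module. The patterns and sample strings are made up for this illustration and are not part of the exercise; the check helper is a hypothetical name.

```python
import re

# Each entry: (pattern, a string that should match, a string that should not).
# All patterns and strings here are invented for illustration.
EXAMPLES = [
    (r"^A",      "ABC",  "BAC"),   # ^ anchor beginning
    (r"C$",      "ABC",  "ACB"),   # $ anchor end
    (r"^A.C$",   "AxC",  "AC"),    # . any one char
    (r"^AB?C$",  "AC",   "ABBC"),  # ? optional previous
    (r"^A*$",    "",     "AAB"),   # * 0 or more previous
    (r"^A+$",    "AAAA", ""),      # + 1 or more previous
    (r"^[A-Z]$", "Q",    "q"),     # [X-Y] class (range) of chars
]

def check(pattern, text):
    """Return True if pattern is found somewhere in text."""
    return re.search(pattern, text) is not None

for pattern, good, bad in EXAMPLES:
    print(pattern, check(pattern, good), check(pattern, bad))

# () capture: save the part inside the parentheses for later use.
m = re.search(r"id=([A-Z]+)", "id=XYZ")
print(m.group(1))
```

Each line of output should show True for the first string and False for the second. The capture at the end pulls out only the capital letters after "id=".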
The Python re library lists many more symbols, but most of them are modifiers of these eight or extra features not commonly found in other regular expression libraries. You'll start by creating flash cards for these eight, focusing on the bold phrases (anchor end, optional previous) so you can recall them quickly and explain what they do.
Once you've memorized these symbols, take the following regular expressions, translate them to English, and use the Python re library to try the listed strings or any other strings you can think of.
- ".*BC?$"
- helloBC, helloB, helloA, helloBCX
- "[A-Za-z][0-9]+"
- A1232344, abc1234, 12345, b493034
- "^[0-9]?a*b?.$"
- 0aaaax, aaab9, 9x, 88aabb, 9zzzz
- "*-*"
- "-------***", "--", "****", "--"
- "A+B+C+[xyz]*"
- AAAABBCCCCCCxyxyz, ABBBBCCCxxxx, ABABABxxxx
Once you've translated them, use the Python re module to try them out in the shell like this:
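Here is a minimal sketch of one way to do that, written as a small script rather than a shell session. The try_regex helper is a hypothetical name, not something from the re module. It also assumes you want re.match, which only matches starting at the beginning of the string.

```python
import re

def try_regex(pattern, strings):
    """Map each string to True/False for whether re.match accepts it.

    Returns None if Python rejects the pattern itself.
    """
    try:
        compiled = re.compile(pattern)
    except re.error:
        return None
    return {s: compiled.match(s) is not None for s in strings}

print(try_regex(r".*BC?$", ["helloBC", "helloB", "helloA", "helloBCX"]))
print(try_regex(r"[A-Za-z][0-9]+", ["A1232344", "abc1234", "12345", "b493034"]))

# Python's re refuses a pattern that begins with *, because there is
# nothing before it to repeat, so the helper returns None here.
print(try_regex("*-*", ["--", "****"]))
```

Notice that "abc1234" fails the second regex with re.match, because the character right after the first letter is not a digit. re.search would find "c1234" inside it, so pick the function that fits what you mean by "match."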
You'll get an AttributeError mentioning 'NoneType' on any that do not match if you call a method such as .span() on the result, because the re.match function returns None when the string doesn't match your regex.
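A short sketch of that failure, and of the usual way around it, might look like this. It reuses the last regex and two of its strings from the list above; checking for None first is just one common approach.

```python
import re

# After "AB" the regex needs a C but finds an A, so there is no match.
m = re.match(r"A+B+C+[xyz]*", "ABABABxxxx")
print(m)

try:
    m.group()  # calling a method on None raises AttributeError
except AttributeError as err:
    print("AttributeError:", err)

# Checking for None before using the result avoids the exception.
m = re.match(r"A+B+C+[xyz]*", "ABBBBCCCxxxx")
if m is not None:
    print(m.group())
```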
Exercise Challenge
The challenge is to attempt to use your FSM module to implement one simple regular expression that does at least three of these operations. This will be a difficult challenge, but use the Python re library to help you plan and test your implementation of this regular expression. Then, once you know how to do this, never do it again. Life is too short to do things computers are already good at doing.
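If you get stuck on where to begin, here is a rough sketch of the idea using a plain dictionary as the transition table. It does not use your FSM module, the regex ^A+B*C?$ is a made-up example, and the state names are arbitrary, so treat it as a starting point rather than the solution.

```python
import re

# Transition table for the made-up regex ^A+B*C?$.
# Each state maps an input character to the next state.
TRANSITIONS = {
    "START": {"A": "IN_A"},                             # need at least one A
    "IN_A":  {"A": "IN_A", "B": "IN_B", "C": "END"},    # A+ then B* or C?
    "IN_B":  {"B": "IN_B", "C": "END"},                 # B* then C?
    "END":   {},                                        # nothing allowed after C
}
ACCEPTING = {"IN_A", "IN_B", "END"}

def fsm_match(text):
    """Return True if the whole text is accepted by the state machine."""
    state = "START"
    for char in text:
        state = TRANSITIONS[state].get(char)
        if state is None:  # no transition for this character
            return False
    return state in ACCEPTING

# Use the re library as the reference to test against.
for sample in ["A", "AAB", "AABBC", "AC", "B", "ACC", "ABA", ""]:
    expected = re.match(r"^A+B*C?$", sample) is not None
    print(repr(sample), fsm_match(sample), expected)
```

The two columns of True/False should agree on every line, which is the same kind of check you can run against your own implementation.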
Study Drills
- Expand your flash cards to include every possible symbol in the Python re library documentation.
- If you ever want to match a * char, then you can escape it with \*. Most of the other symbols have this too.
- Make sure you know how to use re.ASCII, which restricts shortcuts such as \w and \d to ASCII characters, because some parsing requirements need it.
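The last two drills can be tried out with a few lines like these. This is a small sketch with made-up sample strings. It assumes str patterns on Python 3, where \w is Unicode-aware unless you pass re.ASCII.

```python
import re

# \* matches a literal star instead of meaning "0 or more previous".
print(re.match(r"\*+-+", "***---") is not None)

# re.escape adds the backslashes for you.
escaped = re.escape("*-*")
print(escaped, re.match(escaped, "*-*") is not None)

# Without re.ASCII, \w accepts the accented letter; with it, \w stops early.
word = "caf\u00e9"  # "café" written with an escape to avoid encoding surprises
print(re.match(r"\w+", word).group())
print(re.match(r"\w+", word, re.ASCII).group())
```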
Further Study
Look at the third-party regex library, which is better if you need fuller Unicode support.