Warning

This book is new. If you'd like to go through it, then join the Learn Code Forum to get help while you attempt it. I'm also looking for feedback on the difficulty of the projects. If you don't like public forums, then email help@learncodethehardway.org to talk privately.

Exercise 22: Suffix Arrays

I'd like to tell you a story about suffix arrays. I was interviewing at a company in Seattle during a time I was curious about how to most efficiently create a diff on an executable binary. My research brought me to the algorithms Suffix Array and Suffix Tree. The Suffix Array is simply where you sort all the suffixes of a string into a sorted list. The Suffix Tree is similar but done more like a BSTree than a list. These algorithms were fairly simple and had fast performance once you did the sorting operation. The problem they solved was finding the longest common substring between two strings (or in this case, lists of bytes).

You can make a suffix array in Python really easily:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
>>> magic = "abracadabra"
>>> magic_sa = []
>>> for i in range(0, len(magic)):
...     magic_sa.append(magic[i:])
...
>>> magic_sa
['abracadabra', 'bracadabra', 'racadabra', 'acadabra',
 'cadabra', 'adabra', 'dabra', 'abra', 'bra', 'ra', 'a']
>>> magic_sa = sorted(magic_sa)
>>> magic_sa
['a', 'abra', 'abracadabra', 'acadabra', 'adabra', 'bra',
 'bracadabra', 'cadabra', 'dabra', 'ra', 'racadabra']
>>>

As you can see, I simply took the suffixes of the string in order, and then sorted the list. But, what does this do for me? Once I have this list, then I can find any suffix I want by simply doing a binary search on this list. This example is very crude, but in real code you can do this very quickly, and you can keep track of all the original indexes so you can then reference the suffixes' original locations. It is very fast compared to other search algorithms and very useful in things like DNA analysis.

Back to the interview in Seattle. I'm in this cold room being interviewed by C++ programmers for a Java job. As you can tell this isn't a very fun interview, and I definitely don't think I'll get the job. I hadn't written any C++ in many years and the job was for Java, of which I was an expert at the time. In walks the next interviewer and he asks me, "How would you find a substring in a string."

Great! I've been studying the hell out of this problem in my free time. I'll nail it! I jump up and walk up to the whiteboard and explain to the guy how to make a suffix tree, how it improves search performance, how a modified heap sort makes it faster, how a suffix tree works, why it's better than a ternary search tree, and how to do it in C. I figure, if I can show how to write one in C then that'll demonstrate I'm not just a Java dude with no hard core metal background.

The guy is stunned. Like I opened a bag of fresh durian in the interview room. He looks at the board, and stammers, "Uh uh I was kind of looking for something about the Boyer-Moore search algorithm? Do you know that?" I screw up my face a little and say, "Yeah but, like 10 years ago." He shakes his head, grabs his stuff, and gets up saying, "Alright, well I'll let everyone know what I think."

A few minutes later in walks the next interviewer. He looks up at the white board, chuckles and scoffs at me, then asks me another C++ template meta programming question that I couldn't answer. I did not get the job.

Exercise Challenge

In this exercise you're going to take my little Python session and create your own suffix array search class. This class will take a string, carve it into a list of suffixes, and then allow the following operations on it:

find_shortest
Find the shortest substring that has this beginning. In the example above, if I search for "abra" then it should return "abra", not "abracadabra".
find_longest
Find the longest substring that has this beginning. If I search for "abra" you return "abracadabra".
find_all
Find all substrings that start with this beginning. This means "abra" returns both "abra" and "abracadabra".

You'll want to have a good automated test for this, plus have some performance measurements. We'll be using those in later exercises. Once you are done you'll want to do the Study Drills to complete this exercise.

Study Drills

  1. Once you have your tests working, rewrite this to use your BSTree to do the suffix sorting and searching. You can also use the value of each BSTreeNode to keep track of where this substring exists in the original string. You can then keep the original string around.
  2. How does the BStree change your code for the different searching operations? Does it make it simpler or harder?

Further Study

Definitely research suffix arrays and their applications. They are incredibly useful but not too well known by most programmers.

Pre-order Learn More Python The Hard Way

When you pre-order Learn More Python The Hard Way, you'll receive the Python 3 Edition as it's being created. All files are DRM free and you can download them to your computer for offline viewing. Digital Download Only! You do not get a physical book.

$29.00

Buy the Online Course

Or, you can read Learn More Python the Hard Way for free right here, video lectures not included.

Other Buying Options

Other buying options coming soon.

No content available for this exercise. You can view all available downloads at your customer account page.