This book is new. If you'd like to go through it, then join the Learn Code Forum to get help while you attempt it. I'm also looking for feedback on the difficulty of the projects. If you don't like public forums, then email firstname.lastname@example.org to talk privately.
Exercise 24: Fast URL Search
We will end the section on data structures and algorithms with a performance measurement challenge that applies your data structures to an actual problem. I've written a few web servers in my time, and one problem that constantly comes up is matching URL paths to "actions". You'll find this problem in every web framework, web server, and anything that has to "route" information based on a key that is hierarchical. When your web server receives the URL /do/this/stuff/, it has to figure out if each part is possibly attached to some kind of action or configuration. If you configured a web application at /do/, then what should your web server do with /this/stuff/? Should it consider that a failure or pass it to the web application? What if there's a directory at /do/this/? And, how do you detect bad URLs quickly so you're not processing giant requests that don't exist?
This kind of hierarchical search comes up frequently enough that it's a good final test of your ability to apply algorithms and data structures to problems as well as a test of your ability to do performance analysis.
First, be sure you understand what a URL is and how they are used. If you don't, then I suggest you take the time to go write one small Flask application that has some complex routing in it. It's this routing that you'll be implementing.
Next, you are expected to do the following:
Create a simple base URLRouter class that you'll subclass for all your implementations. You should be able to do the following to this UURLRouter:
- Add a new URL with an associated object.
- Get an exact match of a URL. A search for /do/this/stuff/ returns only what's at exactly that.
- Get the best match for a URL. A search for /do/this/stuff/ will match /do/ if that's the only match there is.
- Get all the objects that start with this URL.
- Get the shortest URL matching object. A search for /do/this/stuff/ would return /do/ but not /do/this/.
- Get the longest URL matching object. A search /do/this/stuff/ would return /do/this/ but not /do/.
Create and test a subclass of the URLRouter using the TSTree as this will be the easiest to do. Be sure to test the following:
- Randomized URLs and paths of different lengths in both the TSTree and what you search for
- Finding only partial paths in the different situations
- Completely non-existent paths
- Really long paths that exist and don't exist
Once you have this subclass working and tested, generalize your test so you can run it on all the implementations you're going to make.
Then, attempt implementations that use DoubleLinkedList, BSTree, Dictionary, and Python's dict. Make sure your generalized test works for all of these.
Once you have that, start analyzing the performance of each of these implementations for the different operations.
The goal is to see how fast a TSTree is compared to other data structures. It might beat most of them, but maybe the Python dict will win most of the time because it is optimized for Python. You may even want to make a bet as to which data structure will have the best performance for each operation.
- I left out the SuffixArray because it is similar to the TSTree, but to use it for this you'll have to add the same operations. Do that and then see how SuffixArray compares.
- Research how your favorite web server or web framework does this. You'll find not many people working with URLs know what a ternary search tree is despite how useful it is for this common operation.
I highly recommend the book "The Algorithm Design Manual" by Steven S. Skiena if you want to dive deeper into algorithms and data structures. His book uses C, so you may need to read Learn C The Hard Way first to be able to go through it. Other than that it's a very good book as it covers the theory, but also the practical aspects, of implementing and analyzing the performance of algorithms and data structures.