Exercise 18: Measuring Performance
In this exercise you're going to learn to use several tools to analyze the performance of the data structures and algorithms you've created. To keep this introduction focused and small we'll look at the performance of the sorting.py algorithms from Exercise 16, and then in the video I'll analyze the performance of all the data structures we've done so far.
Performance analysis and tuning is one of my favorite activities in computer programming. I'm that guy who will sit with a ball of tangled yarn while I watch TV and just pick it apart until it's all nice and orderly. I love teasing apart complicated mysteries, and code performance is one of the best complicated mysteries. There are also nice, useful tools for analyzing the performance of code, which makes it much nicer than debugging by comparison.
When you're coding don't try to implement performance improvements unless they're obvious. I much prefer to keep the initial version of my code very simple and naive so that I can ensure it works correctly. Then once it's working well, but maybe slow, I break out my profiling tools and start to look for ways to make it faster without reducing the stability. This last part is key because many programmers feel it's alright to reduce the stability and safety of their code if that makes the code faster.
In this exercise we'll be covering many different tools that are useful for Python and some general strategies for improving the performance of any code. The tools we'll use are:

- timeit
- cProfile
- profile
Make sure you install any that need to be installed before you continue. Then get your copy of the sorting.py and test_sorting.py files so we can apply each of these tools to those algorithms.
timeit
The timeit module isn't very useful. All it does is take a string of Python code and run it with some timing. You can't pass it function references, .py files, or anything other than a string. We can test how long the test_bubble_sort function takes by writing this at the end of test_sorting.py:
if __name__ == '__main__':
    import timeit
    print(timeit.timeit("test_bubble_sort()",
          setup="from __main__ import test_bubble_sort"))
It also doesn't produce useful measurements or any information on why something might be slow. We need a way to measure how long pieces of code take to run, and this is simply too clunky to be useful.
cProfile and profile
The next two tools are much more useful for measuring the performance of your code. I recommend using cProfile to analyze your code's running time, and save profile for when you need more flexibility in your analysis. To run cProfile against your test, change the bottom of the test_sorting.py file to simply run the test functions:
if __name__ == '__main__':
    test_bubble_sort()
    test_merge_sort()
And change the max_numbers to around 800 or a big enough number that you can measure an effect. Once you do that then run cProfile on your code:
$ python -m cProfile -s cumtime test_sorting.py | grep sorting.py
I'm using | grep sorting.py just to narrow the output to the files I care about, but drop that part of the command to see the full output. The results I get on my fairly fast computer for 800 numbers are:
  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       1    0.000    0.000    0.145    0.145 test_sorting.py:1(<module>)
       1    0.000    0.000    0.128    0.128 test_sorting.py:25(test_bubble_sort)
       1    0.125    0.125    0.125    0.125 sorting.py:6(bubble_sort)
       1    0.000    0.000    0.009    0.009 sorting.py:1(<module>)
       1    0.000    0.000    0.008    0.008 test_sorting.py:33(test_merge_sort)
       2    0.001    0.000    0.006    0.003 test_sorting.py:7(random_list)
       1    0.000    0.000    0.005    0.005 sorting.py:37(merge_sort)
  1599/1    0.001    0.000    0.005    0.005 sorting.py:47(merge_node)
7500/799    0.004    0.000    0.004    0.000 sorting.py:72(merge)
     799    0.001    0.000    0.001    0.000 sorting.py:27(count)
       2    0.000    0.000    0.000    0.000 test_sorting.py:14(is_sorted)
I've added the headers back to the top so you can see what this output means. Each header means the following:

- ncalls: Number of calls to this function.
- tottime: Total execution time.
- percall: Total time per call to the function.
- cumtime: Cumulative time for this function.
- percall: Cumulative time per call.
- filename:lineno(function): The file name, line number, and function involved.
Those header names are also the options available for the -s parameter. We can then do a quick analysis of this output:
- bubble_sort is called once, but merge_node is called 1599 times, and merge even more at 7500 calls. This is because merge_node and merge are recursive, so they'll produce a large number of calls sorting a random list of 800 elements.
- Even though bubble_sort isn't called nearly as much as merge or merge_node it is a lot slower. This fits with the performance expectations of the two algorithms. The worst case for merge sort is O(n log n), but for bubble sort it's O(n^2). If you have 800 elements then 800 log 800 is about 5347, while 800^2 is 640000! These numbers don't necessarily translate into exact seconds these algorithms run, but they do translate into a relative comparison.
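You can check that arithmetic yourself with a quick back-of-the-envelope calculation (using the natural log, which is what gives the figure above):

```python
import math

n = 800

# merge sort's worst case grows like n log n
n_log_n = n * math.log(n)

# bubble sort's worst case grows like n^2
n_squared = n ** 2

print(int(n_log_n))          # about 5347
print(n_squared)             # 640000
print(n_squared / n_log_n)   # bubble sort does roughly 120x the work
```

Again, these aren't seconds, just a relative sense of how much more work bubble sort is doing.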
- The count function is called 799 times, which is most likely a huge waste. Our implementation of the DoubleLinkedList doesn't track the count of elements and instead has to run through the list every single time you want to know a count. We use that same method in the count function here, and that leads to 799 runs through the whole list for 800 elements. Change the max_numbers to 600 or 500 to see a pattern here. Notice how count is run n-1 times in our implementation? That means we go through almost all 800 elements.
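To see why that repeated count is so expensive, here's a minimal sketch using a simplified singly linked list (not the book's DoubleLinkedList) that tallies how many nodes get visited when an O(n) count runs once per element:

```python
class Node:
    """A minimal linked-list node, just for this demonstration."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

visits = 0

def count(head):
    """Count elements by walking the whole list -- O(n) on every call."""
    global visits
    n = 0
    while head:
        visits += 1
        n += 1
        head = head.next
    return n

# build a 100-element list
head = None
for i in range(100):
    head = Node(i, head)

# calling count once per element visits 100 * 100 = 10,000 nodes
for _ in range(100):
    count(head)

print(visits)  # 10000
```

A hundred elements already costs ten thousand node visits; at 800 elements it's 640,000, which is where that quadratic pattern comes from.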
Next let's look at how the dllist.py numbers impact this performance too:
$ python -m cProfile -s cumtime test_sorting.py | grep dllist.py
 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1200    0.000    0.000    0.001    0.000 dllist.py:66(shift)
   1200    0.001    0.000    0.001    0.000 dllist.py:19(push)
   1200    0.000    0.000    0.000    0.000 dllist.py:3(__init__)
      1    0.000    0.000    0.000    0.000 dllist.py:1(<module>)
      1    0.000    0.000    0.000    0.000 dllist.py:1(DoubleLinkedListNode)
      2    0.000    0.000    0.000    0.000 dllist.py:15(__init__)
      1    0.000    0.000    0.000    0.000 dllist.py:13(DoubleLinkedList)
Again I've added the headers back so you can see what's going on. In this you can see that the dllist.py functions don't impact performance that much compared to the merge, merge_node, and count functions. This is important because most programmers would rush to optimize the DoubleLinkedList data structure, when there are bigger gains to be made by improving the merge_sort implementation and completely ditching bubble_sort. Always start where you can get the most improvement for the least effort.
Analyzing performance is simply a matter of finding out what is slow and then trying to determine why it's slow. It's similar to debugging, except you are trying your best not to change the behavior of the code. When you're done, the code should work exactly the same but just do it faster. There are times when fixing the performance also finds bugs, but it's best to not attempt complete redesigns when you're trying to speed it up. One thing at a time.
Another important thing to have before you start analyzing performance is some metric of what is needed of the software. Fast is generally good, but without a target you'll end up proposing possible solutions that are completely unnecessary. If your system is performing at 50 requests/second and you really only need 100 requests/second, then there's no point proposing a complete rewrite in Haskell to get 200. This process is all about "most money saved for the least effort possible," and you need some kind of measurement to target.
You can get most of these measurements from your operations staff, and they should have good graphs showing CPU usage, requests/second being served, frame rates, whatever they feel is important or that customers say is important. You can then work with them to devise tests that demonstrate the slowness that needs to be targeted so you can improve the code to reach the goals they need. You may be able to squeeze more performance out of the system, saving money. You may attempt it and come to the conclusion that it's just a hard problem requiring more CPU resources. With a metric to target, you'll have an idea of when to give up or that you've done enough.
The simplest process you can use to analyze performance is the following:
- Run a performance analyzer on the code as I've done here using your tests. The more information you can get, the better. See the Further Study section for additional tools that are free. Ask around for other tools people may be using to analyze the speed of the system.
- Identify the slowest and smallest pieces of code. Don't go to a giant function and try to analyze it. Many times these functions are slow because they are using a bunch of other slow functions. In finding the slowest and smallest first you're more likely to get the most gains for the least effort.
- Do a code review of these slow pieces of code and anything they touch, looking for possible reasons the code is slow. Do you have loops inside loops? Calling a function too often? Look for simple things that are possible to change before investigating complex techniques like caching.
- Once you have a list of all the slowest smallest functions and simple changes to make them faster, look for patterns. Are you doing this anywhere else you can't see?
- Finally, look for possible big improvements you can make if there are no simple changes left to make to small functions. Maybe it truly is time for a complete rewrite? Don't do this until you've at least attempted the simple fixes first.
- Keep a list of all the things you tried and all the performance gains you made. If you don't, then you'll constantly come back to functions you already worked on and waste effort.
As you work this process the idea of "slowest and smallest" will change. You'll fix a dozen 10 line functions and make them faster, which means now you can look at that 100 line function that's now the slowest. Once you have that 100 line function running quicker, you can look at the larger groups of functions being run and come up with strategies to speed them up.
Finally, the best way to speed anything up is to not do it at all. If you're doing multiple checks of the same condition, find a way to avoid doing it more than once. If you're calculating the same column in a database repeatedly, do it once. If you call a function in a tight loop, but the data rarely changes, check out memoization or pre-calculate a table. In many cases you can trade storage for computation by simply calculating things ahead of time and storing them once.
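As an illustration of trading storage for computation, here's a hedged sketch using Python's functools.lru_cache to memoize a function that gets called repeatedly with the same argument (the expensive_calc function is made up for this example):

```python
from functools import lru_cache

call_count = 0

@lru_cache(maxsize=None)
def expensive_calc(n):
    """Hypothetical expensive function; the cache makes repeat calls free."""
    global call_count
    call_count += 1
    return sum(i * i for i in range(n))

# called 1,000 times in a "tight loop", but only computed once
for _ in range(1000):
    expensive_calc(500)

print(call_count)  # 1
```

The cost is memory: every distinct argument's result stays cached, which is exactly the storage-for-computation trade described above.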
In the next exercise we'll actually go through improving the performance of these algorithms using this process.
Your challenge for this exercise is to apply what I did with the bubble_sort and merge_sort to all the data structures and algorithms you've created so far. You aren't expected to improve them yet but to just take notes and analyze the performance while developing tests that demonstrate performance problems. Resist the temptation to fix anything right now because we'll be improving the performance in Exercise 19.
- Run these profiling tools on all of your code so far and analyze the performance.
- Compare your results to the theoretical results for the algorithms and data structures.
- Try to write pathological tests that cause the data structures to break down. You may need to feed them large amounts of data, but use the profiling information to make sure you're doing it right.
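As one example of a pathological test, a reverse-sorted list forces a naive bubble sort into its worst case. This sketch uses a plain Python list rather than the book's DoubleLinkedList version, and counts swaps instead of wall-clock time so the result is deterministic:

```python
def bubble_sort(numbers):
    """Naive in-place bubble sort on a plain list; returns the swap count."""
    swaps = 0
    n = len(numbers)
    for i in range(n):
        for j in range(n - i - 1):
            if numbers[j] > numbers[j + 1]:
                numbers[j], numbers[j + 1] = numbers[j + 1], numbers[j]
                swaps += 1
    return swaps

n = 500
worst = list(range(n, 0, -1))   # reverse sorted: every comparison swaps
best = list(range(n))           # already sorted: zero swaps

worst_swaps = bubble_sort(worst)
best_swaps = bubble_sort(best)

print(worst_swaps)  # 124750, which is n * (n - 1) / 2
print(best_swaps)   # 0
```

The same idea applies to your other structures: figure out what input shape triggers the theoretical worst case, then confirm it with the profiler.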
- Look at line_profiler as another performance measurement tool. Its advantage is you can measure only the functions you care about, but the disadvantage is you have to change your source to do it.
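If you try line_profiler, the usual pattern is to mark the functions you care about with its @profile decorator and run the script under kernprof (kernprof -l -v yourscript.py). The decorator only exists when kernprof injects it, so a no-op fallback keeps the file runnable on its own. A sketch, with a made-up merge function standing in for whatever you want to measure:

```python
try:
    profile  # injected into builtins by kernprof at runtime
except NameError:
    def profile(func):
        """No-op fallback so the module still runs without kernprof."""
        return func

@profile
def merge(left, right):
    """Example function to measure line by line (stand-in, not sorting.py)."""
    result = []
    while left and right:
        if left[0] <= right[0]:
            result.append(left.pop(0))
        else:
            result.append(right.pop(0))
    return result + left + right

print(merge([1, 3, 5], [2, 4]))  # [1, 2, 3, 4, 5]
```

Run normally you get no profiling; run under kernprof you get per-line timings for just the decorated functions, which is the source change the bullet above warns about.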
- pyprof2calltree and KCacheGrind are more advanced tools, but honestly they only work well on Linux. In the video I demonstrate using them under Linux.