Performance Engineering

This past term, I've been spending most of my time on MIT's Performance Engineering class. It's a project based class, and since your code can always be faster, people often spend around 40 hours a week!

To get an idea of the areas the class covers, I'm linking my study sheet for the second exam, which I just had today. However, I learn the most from the projects. Here are some thoughts on concepts applied and lessons learned from each project (each of which I coded with a different partner):

Project 1: Bit Rotation

Final writeup here

The first project was mostly about bit hacks. The instructors gave us a reference implementation, that rotated the bit array by k bits by shifting the whole array one position at a time. This was very slow - running on AWS, the job didn't even finish because of the 20-second timeout! However, we were able to cut the time down to 44 microseconds (median of 5 runtimes on AWS).

The first trick is to change the algorithm. Instead of shifting the whole array to move 1 bit, we implemented the simple and fairly efficient algorithm for bit rotation found in the book Programming Pearls. (The algorithm rotates using 3 reversal operations. Say we want to rotate our subarray by k bits, and let R(x) be a reversal function for an array x. Furthermore, let the subarray be in the form ab, where a is the k-bits long left part and b is the right part. The algorithm transforms ab to R(a)R(b) to R(R(a)R(b)) = ba, which is our desired result.)

The main trick is to perform operations 64 bits at a time, instead of 1 bit at a time. The subarray the algorithm should rotate is not necessarily 64-bit-aligned or a multiple of 64, so we had to tweak our simple 3-reversal algorithm to handle this. This involved roughly reversing the minimum-length 64-bit-aligned subarray, then shifting the subarray and restoring the bits on either end. The details are in the writeup linked above.

Project 2: Collision Detection Using Quadtrees

For project 2, I learned about a new data structure: quadtrees. They are widely used for coding non-laggy intersection detection for games. For correct code, using Valgrind to detect memory leaks was a good check. To find hotspots in our code, we used gprof and perf record.

To get our code under the cutoff time of 2 seconds, we used arrays instead of linked lists to store the lines at each quadtree node. This change decreased our time because we were now sequentially accessing lines in memory instead of jumping around to random places in memory to read in each linked list node.

The other major change was adding bounding box fields for each line. Using gprof, we realized we were spending a lot of time in intersection detection, so we wanted to avoid calling it when possible. Checking if bounding boxes overlapped was a quick way to do this. If two bounding boxes didn't overlap, two lines didn't overlap, so we didn't have to call intersection detection on those two lines. The lesson here was using fast, rough code to filter out calls to slow, precise code.  

Tuning parameters, like the depth of the quadtree or the maximum number of nodes stored at each node, was also an easy way to gain speedup without doing much work. 

Ultimately, we were under 1.6 seconds total on the four test cases. 

Project 3: Serial Dynamic Memory Allocator

Final writeup here

This project was probably my favorite out of the 3. It was open ended, so we could search for papers and websites for inspiration for strategies.

In the end, we used a combination of free list binning, coalescing, splitting, handling the wilderness block differently, and multiple small-block allocation. Details on these strategies can be found in the writeup linked above.

Again, tuning parameters and using perf record were very useful in speeding up our code. We also used a lot of bit hacks to read and write to the header and the footer of our block metadata, as well as quickly check to see whether a block was in a free list or not.  Note for the future: __builtin_ctzll is tricky to safely use. It had undefined behavior for a parameter of 0 on our AWS machines, but not our Athena machines.

On a set of test traces, we were able to get a performance index of 94.7 +/- 0.3 (this score is calculated relative to the default C allocator - more details in the project description linked above.) Our utilization averaged 90.49% and our throughput was mostly 100% on our traces, except for 1 trace. (The traces were meant to stress test different situations of allocing and freeing blocks of memory. Some traces were a lot more difficult for us to improve performance on than others.)

All the writeups and files linked in this post can also be found on my website!