Base64 and Binary Line Reader
Base64 is a standard for encoding binary data as ASCII characters so that it can travel over text-based channels (such as email/SMTP and HTTP/MIME). People also often use Base64 as a storage format for sensor and computer log data: the Base64 alphabet itself contains no line break characters, so a line break can safely mark the boundary between records.
To decode Base64 data in Java, a common approach is to read the encoded records line by line with BufferedReader's readLine method and then decode each one. The problem is that readLine returns a String object, which is fairly expensive when there are billions of records: every line is converted from bytes to characters and allocated as a new object. Since the original data is binary, the String objects are unnecessary. The JDK has no class that reads raw bytes line by line, so I implemented ByteLineReader to fill the gap. Although the idea sounds simple, managing the buffers was a little tricky. On my 2-year-old MacBook Pro, my implementation reaches 70 MB/s, which is enough to saturate my HDD's bandwidth.
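To give a feel for the buffer management, here is a minimal sketch of a byte-oriented line reader. It is not my actual ByteLineReader: the class name ByteLineReaderSketch, its buffer()/offset() accessors, and the grow-by-doubling policy are all assumptions made for illustration. The idea is to scan an internal byte buffer for '\n', slide any partial line to the front before refilling, and grow the buffer only when a single line does not fit.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch of a byte line reader; names and policies are assumptions. */
class ByteLineReaderSketch {
    private final InputStream in;
    private byte[] buf;
    private int lineStart = 0;   // offset of the most recently returned line
    private int next = 0;        // offset of the first unconsumed byte
    private int limit = 0;       // end of valid data in buf
    private boolean eof = false;

    ByteLineReaderSketch(InputStream in, int bufferSize) {
        this.in = in;
        this.buf = new byte[Math.max(1, bufferSize)];
    }

    /** Returns the length of the next line (terminator excluded), or -1 at end of stream. */
    int readLine() throws IOException {
        int scan = next; // where to resume searching for '\n'
        while (true) {
            for (; scan < limit; scan++) {
                if (buf[scan] == '\n') {
                    lineStart = next;
                    int len = scan - next;
                    next = scan + 1;
                    if (len > 0 && buf[scan - 1] == '\r') len--; // tolerate CRLF
                    return len;
                }
            }
            if (eof) {
                if (next == limit) return -1;  // nothing left
                lineStart = next;              // last line has no terminator
                int len = limit - next;
                next = limit;
                return len;
            }
            scan -= fill(); // compaction shifts the data, so shift the scan position too
        }
    }

    /** Makes room, reads more bytes, and returns how far existing data was shifted left. */
    private int fill() throws IOException {
        int shift = next;
        if (shift > 0) {
            // Slide the partial line to the front of the buffer.
            System.arraycopy(buf, next, buf, 0, limit - next);
            limit -= shift;
            next = 0;
        } else if (limit == buf.length) {
            // One line fills the whole buffer: grow it.
            buf = Arrays.copyOf(buf, buf.length * 2);
        }
        int n = in.read(buf, limit, buf.length - limit);
        if (n < 0) eof = true; else limit += n;
        return shift;
    }

    /** Backing array of the last line; valid only until the next readLine() call. */
    byte[] buffer() { return buf; }

    /** Offset of the last line inside buffer(). */
    int offset() { return lineStart; }

    public static void main(String[] args) throws IOException {
        byte[] data = "aGVsbG8=\nd29ybGQ=\n".getBytes(StandardCharsets.US_ASCII);
        ByteLineReaderSketch reader =
                new ByteLineReaderSketch(new ByteArrayInputStream(data), 8192);
        int len;
        while ((len = reader.readLine()) >= 0) {
            // A String is built here only for display; a decoder would take
            // (buffer, offset, len) directly and skip the allocation.
            System.out.println(new String(reader.buffer(), reader.offset(), len,
                    StandardCharsets.US_ASCII));
        }
    }
}
```

The caller gets a window into the reader's own buffer rather than a fresh array per line, which is where the savings over readLine come from. The price is that the window is invalidated by the next readLine() call, so each record has to be decoded before moving on.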
I found two Base64 decoder implementations to test my code: one from Robert Harder and the other from Mikael Grev (MiG). (At the time of writing, the JDK had no public Base64 utility class; java.util.Base64 was added later, in Java 8.) To benchmark these two decoders, I wrote DataGen code to generate test data: 1M lines each of 1 KB, 4 KB, and 8 KB records.
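A generator along these lines can be sketched as follows. This is not the original DataGen code: the class name DataGenSketch, the generate signature, and the seed parameter are hypothetical. It assumes Java 8 or later for java.util.Base64, and it assumes the record size refers to the raw bytes before encoding, which the description above does not settle.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Base64;
import java.util.Random;

/** Hypothetical test data generator; not the original DataGen code. */
class DataGenSketch {
    /** Writes `lines` records, one per line, each the Base64 form of `recordBytes` random bytes. */
    static void generate(OutputStream out, int lines, int recordBytes, long seed)
            throws IOException {
        Random random = new Random(seed);             // fixed seed gives reproducible files
        Base64.Encoder encoder = Base64.getEncoder(); // basic encoder adds no line breaks
        byte[] record = new byte[recordBytes];
        for (int i = 0; i < lines; i++) {
            random.nextBytes(record);
            out.write(encoder.encode(record));
            out.write('\n');                          // record boundary
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // Small in-memory run; a real benchmark would wrap a FileOutputStream in a
        // BufferedOutputStream and ask for 1,000,000 lines.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        generate(sink, 3, 1024, 42L);
        System.out.println(sink.size() + " bytes written");
    }
}
```

Using the basic encoder matters here: the MIME variant inserts line breaks inside long records, which would break the one-record-per-line layout.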
I ran each test five times and discarded the first run of each (an outlier caused by HotSpot JVM warm-up). I used R to analyze the data. The bar plot shows the average of the remaining four runtimes. For the smaller records, MiG's decoder is faster, but for the 8 KB records it is slightly slower. I did not test MiG's faster decoder method because it makes some assumptions about the encoded bytes.
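I did the analysis in R, but the warm-up handling is simple enough to sketch in Java as well. The class and method names below (WarmupBenchSketch, timeRuns, meanAfterWarmup) are hypothetical, and the workload is a placeholder rather than a real decoder run.

```java
import java.util.Arrays;

/** Hypothetical timing helper: drop the warm-up run(s), average the rest. */
class WarmupBenchSketch {
    /** Runs `task` `runs` times and returns each wall-clock duration in milliseconds. */
    static double[] timeRuns(Runnable task, int runs) {
        double[] millis = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1e6;
        }
        return millis;
    }

    /** Averages the measurements left after skipping the first `warmupRuns` entries. */
    static double meanAfterWarmup(double[] millis, int warmupRuns) {
        return Arrays.stream(millis)
                .skip(warmupRuns)
                .average()
                .orElseThrow(() -> new IllegalArgumentException("no runs left after warm-up"));
    }

    public static void main(String[] args) {
        long[] sink = new long[1]; // keeps the loop result observable
        // Placeholder workload; a real test would decode one generated file.
        Runnable task = () -> {
            long sum = 0;
            for (int i = 0; i < 5_000_000; i++) sum += i;
            sink[0] = sum;
        };
        double[] millis = timeRuns(task, 5);
        System.out.println("all runs (ms): " + Arrays.toString(millis));
        System.out.println("mean of last four (ms): " + meanAfterWarmup(millis, 1));
    }
}
```

Keeping the raw per-run timings and dropping the first one at analysis time makes it easy to check afterwards that the discarded run really was the outlier.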
It was really fun to write the ByteLineReader, integrate it with the Base64 decoders, and use R to analyze the data (without letting the HotSpot warm-up pollute the actual analysis)!