Parallel Summing Without Threads

Calculating the sum of an array of numbers is a simple task:

      sum = 0;
      for each number in the array:
          sum += number;
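
For reference, here is a minimal C sketch of this naive version (the function and variable names are my own, purely for illustration); this is the baseline referred to as "the naive version" below:

      #include <stddef.h>

      /* Naive sequential sum: one accumulator, one addition per iteration. */
      long sum_naive(const int *a, size_t n) {
          long sum = 0;
          for (size_t i = 0; i < n; i++)
              sum += a[i];
          return sum;
      }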

However, summing is an important primitive in databases and other applications, and it needs to be sped up when the array grows huge, say to 100M+ elements. Using threads is probably the first impulse, since this is such an easily parallelizable problem. But for a small array, such as one containing 10 numbers, multi-threading makes a summing function slower than the naive version because of the overhead of creating and coordinating the threads.

So, how can a summing function be implemented so that it is always faster than the naive version?

To answer that question, let's first understand how the machine works.

Memory Hierarchy

The memory hierarchy runs from slow, large storage (disk) at the bottom up to the small, fast registers next to the CPU. According to the book Computer Architecture, main memory access is up to 1,000,000 times faster than disk access, cache access is up to 500 times faster than main memory access, and register access is up to 100 times faster than cache access.

When an array of numbers is resident in main memory (perhaps read from data files on disk), the numbers have to be loaded into the cache and then into registers before the CPU can add them. The key issue is preventing the CPU from sitting idle while it waits for that data; this waiting is called "stalling."

As it turns out, the CPU loads data from memory in small batches called cache lines, typically 32 or 64 bytes at a time. Also, all modern CPUs are superscalar: they can execute several independent instructions in the same cycle.

Given this knowledge, we can leverage this intrinsic parallelism without explicitly using threads!

      sub_sum1 = 0; ... sub_sumX = 0;
      for every X numbers in the array:
          sub_sum1 += number1;
          ...
          sub_sumX += numberX;
      sum = sub_sum1 + ... + sub_sumX;

This technique is also called loop unrolling.
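
For example, a 4-way unrolled sum could look like this in C (a minimal sketch; the function and variable names are illustrative):

      #include <stddef.h>

      /* 4-way unrolled sum: four independent accumulators let a
         superscalar CPU keep several additions in flight at once. */
      long sum_unroll4(const int *a, size_t n) {
          long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
          size_t i = 0;
          for (; i + 4 <= n; i += 4) {
              s0 += a[i];
              s1 += a[i + 1];
              s2 += a[i + 2];
              s3 += a[i + 3];
          }
          /* Handle leftover elements when n is not a multiple of 4. */
          for (; i < n; i++)
              s0 += a[i];
          return s0 + s1 + s2 + s3;
      }

The four accumulators have no data dependency on one another, so their additions can overlap inside the CPU instead of forming a single serial chain.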

I implemented this in both C and Java.

For C, compile with optimizations enabled (-O2 or -O3) to get good code generation.
For Java, I also tried running it with both java -server and java -client from the terminal.

The fastest was the 4-way unrolled version; it was almost 1.2x as fast as the 2-way unrolled version.
For this particular code, Java on HotSpot was a little faster than C (after warming up, of course).

Here are the numbers...

Loop unrolling roughly halves the running time for non-trivially sized arrays, and it is fast for small arrays as well.

Multi-threading and loop unrolling combined can fully utilize the machine's resources.
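
As a rough sketch of that combination (my own illustration using POSIX threads; the thread count and chunking strategy are assumptions), each thread can run the unrolled loop over its own slice of the array, and the partial sums are combined at the end:

      #include <pthread.h>
      #include <stddef.h>

      #define NUM_THREADS 4  /* assumed thread count; tune for the machine */

      struct chunk { const int *a; size_t n; long sum; };

      /* Thread body: run the unrolled sum over one slice of the array. */
      static void *sum_chunk(void *arg) {
          struct chunk *c = arg;
          c->sum = sum_unroll4(c->a, c->n);  /* the unrolled sketch above */
          return NULL;
      }

      long sum_parallel(const int *a, size_t n) {
          pthread_t tid[NUM_THREADS];
          struct chunk chunks[NUM_THREADS];
          size_t per = n / NUM_THREADS;
          long total = 0;

          for (int t = 0; t < NUM_THREADS; t++) {
              chunks[t].a = a + (size_t)t * per;
              /* The last thread picks up any remainder. */
              chunks[t].n = (t == NUM_THREADS - 1) ? n - (size_t)t * per : per;
              pthread_create(&tid[t], NULL, sum_chunk, &chunks[t]);
          }
          for (int t = 0; t < NUM_THREADS; t++) {
              pthread_join(tid[t], NULL);
              total += chunks[t].sum;
          }
          return total;
      }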
