## Tuesday, April 15, 2008

### Parallelizing Algorithms

The growth of a single CPU has been limited by physical factors such as clock rate, generated heat, power ... etc. Current trend is moving to multi-core system, ie: multiple CPU within a chip, multiple CPU within a machine, or just a cluster of machines connected to a high speed network.

However, most traditional algorithms are developed in a sequential way (which is easier to design and analyze). Without redesigning the algorithm in a parallelized form, they are not ready to run on multiple CPUs. Recently, Google's Map/Reduce model has gained momentum to become the de facto approach to handle high volume processing using large number of low-cost commodity hardware. In the Opensource community, Hadoop is a Java clone of Google's Map/Reduce model, and there are a couple of Ruby clone as well. Since then, parallelizing traditionally sequential algorithm to run on a multi-CPU network has been drawing a lot of attention in the software community.

Model

A sequential algorithm contains a number of "steps" ordered by the sequence of execution. Parallelizing such an algorithm means trying to run these steps "simultaneously" on multiple CPUs, and hopefully can speed up the whole process of execution.

Lets define T(p) to be the time it takes to execute the algorithm in p CPUs.
So, T(1) is the time takes to execute on a single CPU.
Obviously, T(p) >= T(1) / p.

When T(p) == T(1) / p, we say it has linear speedup. Unfortunately, linear speedup is usually not possible when p increase beyond a certain number, due to "sequential dependency" and "coordination overhead".

Sequential Dependency
StepA and StepB cannot be executed simultaneously if there is a sequential dependency between them. Sequential dependency means one step cannot be started before the other step has completed, which happens if
• StepB reads some data that StepA writes
• StepA reads some data that StepB writes
• StepA and StepB write to same data
Let T(infinity) be the execution time given infinite number of CPUs. Due to sequential dependency, at some point throwing in more CPUs won't help. If we use a DAG to represent dependency, T(infinity) is the time take to execute the longest path within the DAG.

T(p) >= max(T(1)/p, T(infinity))

Even steps can be execute in parallel, there are certain processing overhead such as
• Data need to be transfered to the corresponding CPU before processing can take place
• Schedule the CPU for execution and keep track of their corresponding work load
• Monitor the completion of all parallel tasks and move forward to next steps
We need to make sure the coordination overhead does not offset the gain in parallelizing the execution. That means we cannot break the steps into too fine-grain, we need to control the granularity of the steps at the right level.

Design Goal

Given T(p) >= max(T(1)/p, T(infinity)), there is no benefit to increase p beyond T(1)/T(infinity), which is called parallelism.

• Let O-1(n) be the time complexity of the parallel algorithm when there is one CPU
• Let O-infinity(n) be the time complexity of the parallel algorithm when there is infinite CPUs

• Our goal is to design the parallel algorithm to maximize parallelism: O-1(n) / O-infinity(n).

If we can do this, we can throw more CPUs to help when n increases.

Recall master method
T(n) = a.T(n/b) + f(n)

``````case 1:  if  f(n) << n ** log(a, base=b)
T(n) = O(n ** log(a, base=b))

case 2:  if  f(n) ~ n ** log(a, base=b)
T(n) = O((lg(n) ** k+1) * (n ** log(a, base=b)))

case 3:  if  f(n) >> n ** log(a, base=b)
T(n) = O(f(n))``````

Lets walk through an example ... of adding two arrays of size n.

Sequential Algorithm:
``````def sum(a, b)
for i in 0 .. a.size
c[i] = a[i] + b[i]
return c
``````
This is of O(n) complexity

Parallel Algorithm:
``````def sum(a, b, start, end)
if start == end
c[start] = a[start] + b[start]
return

mid = start + (end - start) / 2

spawn sum(a, b, start, mid)
spawn sum(a, b, mid, end)``````

For a single CPU, the algorithm will be ...

T(n) = 2.T(n/2) + O(1)
This is case 1, and so it is O(n)

For infinite number of CPU, the algorithm will be ...
T(n) = T(n/2) + O(1)
This is case 2, k = 0, so it is O(lg(n))

So the parallelism = O(n / lg(n))

In other words, we can improve the performance from the sequential algorithm O(n) to the parallel algorithm O(n/p) by throwing in p CPUs. And the growth of p is limited by n/lg(n)