However, most traditional algorithms are developed in a sequential way (which is easier to design and analyze). Without redesigning the algorithm in a parallelized form, they are not ready to run on multiple CPUs. Recently, Google's Map/Reduce model has gained momentum to become the de facto approach to handle high volume processing using large number of low-cost commodity hardware. In the Opensource community, Hadoop is a Java clone of Google's Map/Reduce model, and there are a couple of Ruby clone as well. Since then, parallelizing traditionally sequential algorithm to run on a multi-CPU network has been drawing a lot of attention in the software community.
A sequential algorithm contains a number of "steps" ordered by the sequence of execution. Parallelizing such an algorithm means trying to run these steps "simultaneously" on multiple CPUs, and hopefully can speed up the whole process of execution.
Lets define T(p) to be the time it takes to execute the algorithm in p CPUs.
So, T(1) is the time takes to execute on a single CPU.
Obviously, T(p) >= T(1) / p.
When T(p) == T(1) / p, we say it has linear speedup. Unfortunately, linear speedup is usually not possible when p increase beyond a certain number, due to "sequential dependency" and "coordination overhead".
StepA and StepB cannot be executed simultaneously if there is a sequential dependency between them. Sequential dependency means one step cannot be started before the other step has completed, which happens if
- StepB reads some data that StepA writes
- StepA reads some data that StepB writes
- StepA and StepB write to same data
T(p) >= max(T(1)/p, T(infinity))
Even steps can be execute in parallel, there are certain processing overhead such as
- Data need to be transfered to the corresponding CPU before processing can take place
- Schedule the CPU for execution and keep track of their corresponding work load
- Monitor the completion of all parallel tasks and move forward to next steps
Given T(p) >= max(T(1)/p, T(infinity)), there is no benefit to increase p beyond T(1)/T(infinity), which is called parallelism.
Our goal is to design the parallel algorithm to maximize parallelism: O-1(n) / O-infinity(n).
If we can do this, we can throw more CPUs to help when n increases.
Recall master method
T(n) = a.T(n/b) + f(n)
case 1: if f(n) << n ** log(a, base=b) T(n) = O(n ** log(a, base=b)) case 2: if f(n) ~ n ** log(a, base=b) T(n) = O((lg(n) ** k+1) * (n ** log(a, base=b))) case 3: if f(n) >> n ** log(a, base=b) T(n) = O(f(n))
Lets walk through an example ... of adding two arrays of size n.
This is of O(n) complexity
def sum(a, b) for i in 0 .. a.size c[i] = a[i] + b[i] return c
def sum(a, b, start, end) if start == end c[start] = a[start] + b[start] return mid = start + (end - start) / 2 spawn sum(a, b, start, mid) spawn sum(a, b, mid, end)
For a single CPU, the algorithm will be ...
T(n) = 2.T(n/2) + O(1)
This is case 1, and so it is O(n)
For infinite number of CPU, the algorithm will be ...
T(n) = T(n/2) + O(1)
This is case 2, k = 0, so it is O(lg(n))
So the parallelism = O(n / lg(n))
In other words, we can improve the performance from the sequential algorithm O(n) to the parallel algorithm O(n/p) by throwing in p CPUs. And the growth of p is limited by n/lg(n)