Sunday, November 16, 2008

Design for parallelism

There has been a lot of interest around parallel computing recently. One of the main reasons is that we all know Moore's law (which promised to double the CPU power on a single chip every 18 months) has reached its limit. We can no longer expect the speed of a single CPU to go much further. Instead of attempting to advance the clock rate of a CPU, many chip manufacturers have shifted their development focus to multi-core machines.

On the other hand, highly scalable systems based on large pools of inexpensive commodity hardware have demonstrated significant success. Google has published the Map/Reduce model, which is their underlying computing infrastructure, and there are open source implementations like Apache Hadoop. All of these provide a very rich framework for implementing massively parallel systems.

However, most software algorithms that we are using today are sequential in nature. We need to refactor them in order to fit into the parallel computing architecture.

How do we do that?

There are two different approaches to restructuring a sequential algorithm into a parallel one: “functional decomposition” is typically used to deal with complex logic flow, while “map/reduce” is used to deal with algorithms that have a large volume of input data and simple logic flow.


Functional Decomposition

This model attempts to break down the sequential algorithm into multiple “work units” from a functionality perspective and to see if different work units can be executed in parallel. The whole analysis and design will typically go through the following steps.

Decomposition

The purpose of this step is to identify the function boundary of each work unit, which is the basic unit of execution that runs sequentially on a single machine.
  • Analyze the processing steps from a functionality boundary perspective. Break the whole process down into a sequence of work units, where each work unit represents a focused function.
  • At this stage, we typically break down to the finest level of granularity so that we have more flexibility in the design stage to maximize the degree of parallelism (see the example after this list).
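As a running illustration (the log-analysis pipeline and the function names below are made up for this post, not taken from any particular system), a job might decompose into fine-grained work units like these, each a plain function with explicit inputs and outputs so that its boundary, and later its read/write sets, are clear:

    # Hypothetical decomposition of a log-analysis job into four work units.
    def parse_logs(raw_lines):           # work unit A
        return [line.split() for line in raw_lines]

    def count_errors(records):           # work unit B
        return sum(1 for r in records if r and r[0] == "ERROR")

    def count_warnings(records):         # work unit C
        return sum(1 for r in records if r and r[0] == "WARN")

    def build_report(errors, warnings):  # work unit D
        return {"errors": errors, "warnings": warnings}

    def run_sequentially(raw_lines):     # the original sequential flow
        records = parse_logs(raw_lines)
        return build_report(count_errors(records), count_warnings(records))

Note that count_errors and count_warnings touch disjoint data; the dependency analysis below is exactly what will expose this kind of opportunity for parallelism.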
Dependency analysis

After we break down the whole process into the finest grain of work units, we analyze the sequential dependencies between different work units.

Let's say workUnitB follows workUnitA in the sequential version of the algorithm, and R(B) and W(B) represent the read set and write set of work unit B. Then workUnitB is directly dependent on workUnitA if any of the following conditions is true:
  • W(B) and W(A) overlap
  • R(B) and W(A) overlap
  • W(B) and R(A) overlap
If we represent each work unit as a node and each “directly dependent” relationship as an arc, we will end up with a DAG (directed acyclic graph). The DAG gives us a good picture of the maximum parallelism that we can obtain. The critical path of the DAG provides a lower bound on the total execution time.
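To make the test concrete, here is a minimal sketch that applies the three overlap conditions to hypothetical read/write sets (continuing the log-analysis example above) and derives the DAG:

    # Read/write sets per work unit; the names are illustrative only.
    work_units = {
        "A": {"read": {"logs"},               "write": {"records"}},
        "B": {"read": {"records"},            "write": {"errors"}},
        "C": {"read": {"records"},            "write": {"warnings"}},
        "D": {"read": {"errors", "warnings"}, "write": {"report"}},
    }
    order = ["A", "B", "C", "D"]  # order in the sequential algorithm

    def directly_depends(b, a):
        """True if work unit b directly depends on an earlier work unit a."""
        return bool(work_units[b]["write"] & work_units[a]["write"]
                    or work_units[b]["read"] & work_units[a]["write"]
                    or work_units[b]["write"] & work_units[a]["read"])

    # Build the DAG: an arc a -> b for every direct dependency.
    dag = {u: [] for u in order}
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            if directly_depends(b, a):
                dag[a].append(b)

    print(dag)  # {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}

B and C depend only on A, so they can run in parallel, and the critical path is A -> B (or C) -> D.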



Analyzing communication overhead

However, as data needs to be fed from an upstream work unit to its downstream work units, communication is not free: it consumes bandwidth and adds latency. In fact, parallelism introduces communication and coordination overhead. The purpose of this step is to understand the communication cost associated with data flowing between work units.

Depending on the chosen framework technology, the communication mechanism can be one of the following …
  • TCP point to point: Persistent TCP connections are maintained between different machines and are used to pass data between the work units residing on them.
  • Multicast pub/sub: Downstream work units subscribe their interests to upstream work units, which use a multicast mechanism to deliver data. The implementation of multicast can be based on IP multicast or epidemic message spreading over an overlay network.
  • Queue: Upstream work units put their results into a queue, which is polled by downstream work units. FIFO semantics are provided (see the sketch after this list).
  • DFS: Upstream work units put their results into a distributed file system, which is consumed by downstream work units. Unlike a queue, the communicating work units need to synchronize their access to the DFS themselves.
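As a toy illustration of the queue mechanism, the sketch below uses Python's in-process queue.Queue as a stand-in for a distributed queue; a real deployment would place a message broker or similar between the machines:

    import queue
    import threading

    q = queue.Queue()   # FIFO channel from upstream to downstream
    DONE = object()     # sentinel marking the end of the stream

    def upstream():
        for record in ["rec1", "rec2", "rec3"]:  # made-up results
            q.put(record)
        q.put(DONE)

    def downstream():
        while (item := q.get()) is not DONE:
            print("processing", item)

    producer = threading.Thread(target=upstream)
    consumer = threading.Thread(target=downstream)
    producer.start(); consumer.start()
    producer.join(); consumer.join()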

Aggregating work units

The purpose of this step is to regroup the work units into coarser granularity to reduce communication overhead.

For example, if workUnitA is feeding a large amount of data into workUnitB, then both work units should be put on the same machine to reduce the network bandwidth consumption. When there are multiple work units residing on the same machine, they can be further aggregated into a larger unit. This aggregation reduces the number of nodes in the dependency graph and hence makes the scheduling more straightforward.


Another DAG is produced at the end of this step, where each node represents a work aggregate.
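One possible aggregation heuristic (my own sketch, not a prescribed algorithm) is to greedily merge the pair of work units joined by the heaviest data flow until every remaining cross-machine edge is below some threshold; the traffic estimates here are hypothetical:

    # edge -> estimated data volume flowing along it (hypothetical, in MB)
    traffic = {("A", "B"): 500, ("A", "C"): 480, ("B", "D"): 5, ("C", "D"): 5}
    THRESHOLD = 100  # flows above this are too expensive to ship over the network

    groups = {u: {u} for u in "ABCD"}  # each unit starts in its own aggregate

    def group_of(u):
        return next(g for g in groups.values() if u in g)

    # Visit edges from heaviest to lightest, merging across heavy edges.
    for (a, b), volume in sorted(traffic.items(), key=lambda kv: -kv[1]):
        ga, gb = group_of(a), group_of(b)
        if ga is not gb and volume > THRESHOLD:
            ga |= gb                 # merge the two aggregates in place
            for u in gb:
                groups[u] = ga

    print({frozenset(g) for g in groups.values()})
    # {frozenset({'A', 'B', 'C'}), frozenset({'D'})}

Here A, B and C end up co-located because of their heavy mutual traffic, while D can live anywhere, since only small results flow into it.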


Schedule execution

The work aggregates eventually need to be executed on machines in the network. It is the responsibility of the scheduler to ship the jobs to available processors and to synchronize their execution.

A node (in the DAG) is ready for execution when all the preceding nodes have completed. There is also a pool of idle processors. A simple-minded scheduler will assign a ready-to-execute node to a randomly picked processor from the idle pool. After the processor finishes executing a node, it will report back to the scheduler, which will update the DAG and the idle processor pool. The cycle repeats.
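A minimal sketch of this simple-minded scheduler, with processor execution simulated in-process (the DAG and processor names carry over from, and are as hypothetical as, the earlier examples), might look like this:

    import random

    dag = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
    indegree = {n: 0 for n in dag}
    for n, succs in dag.items():
        for s in succs:
            indegree[s] += 1

    idle = {"proc1", "proc2"}                    # pool of idle processors
    ready = [n for n in dag if indegree[n] == 0]
    running = {}                                 # processor -> node

    while ready or running:
        # assign ready nodes to randomly picked idle processors
        while ready and idle:
            node, proc = ready.pop(), random.choice(sorted(idle))
            idle.discard(proc)
            running[proc] = node
        # simulate one processor reporting completion back to the scheduler
        proc, node = running.popitem()
        print(proc, "finished", node)
        idle.add(proc)
        for succ in dag[node]:                   # completion may unblock successors
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)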

A more sophisticated scheduler will consider more factors, such as the network bandwidth between processors, the estimated execution time of each node … etc., in order to produce an optimal schedule in which network bandwidth consumption is minimized.


Map Reduce

For data-intensive applications, a large amount of data needs to be processed within a single work unit even though the DAG itself is simple. In this model, just running different work units in parallel is not sufficient; the execution within a work unit also needs to be parallelized and run across multiple machines.

The design methodology is different here. Instead of focusing on the flow between work units, we need to focus on the input data pattern of a single work unit. The Map/Reduce model is a common choice to handle this scenario. The analysis and design will typically go through the following steps.
  1. Identify the repetition in the input data and determine the basic unit of an input record. ie: input
  2. Identify the selection criteria for each input record. ie: select() function
  3. For each input record, determine how many entries should be emitted and how the emitted entries should be grouped and processed together. ie: handle_map(), key(), value() functions
  4. Determine the aggregation logic for the grouped entries. ie: handle_reduce() function
  5. Identify the selection criteria for each aggregated result. ie: having() function
If we use a Map/Reduce framework such as Hadoop, we can structure the map() and reduce() functions as follows:
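(A sketch in Python-style pseudocode rather than Hadoop's actual Java API; select(), handle_map(), key(), value(), handle_reduce() and having() are the application-specific hooks named in the steps above.)

    # The two framework entry points; the names deliberately mirror the steps.
    def map(input_record, emit):
        # step 2: filter out input records we are not interested in
        if not select(input_record):
            return
        # step 3: emit zero or more (key, value) entries per input record;
        # the framework groups all entries that share the same key
        for entry in handle_map(input_record):
            emit(key(entry), value(entry))

    def reduce(k, grouped_values, emit):
        # step 4: aggregate all values that were grouped under the same key
        result = handle_reduce(k, grouped_values)
        # step 5: filter out aggregated results that do not qualify
        if having(result):
            emit(k, result)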


Conclusion

By following a systematic methodology to transform a sequential application into a parallel one, we can take advantage of the parallelism to make the application more scalable.
