The main advantage of the CA is that it enforces a parallel and local update: the state at time step t+1 depends only on information from time step t, and only from neighboring cells. (To be completely correct, one would also have to consider our sub-time-steps.) This means that domain decomposition for parallelization is straightforward (see below): one can communicate the boundaries for time step t, then locally on each CPU perform the update from t to t+1, and then exchange boundary information again. It would even be possible to overlap communication and computation, although experiments with this did not show any systematic improvement on the machines we tried (Sparc5 workstations coupled via 10 Mbit Ethernet).
Rickert, Wagner, and Gawron [,] implemented a parallelized traffic simulation for the German Autobahn network on an SGI Power Challenger that ran several times faster than real time. A summary of how the traffic CA has been used in simulation models can be found, for example, in [].
As stated above, the inherent structure of a traffic microsimulation favors a domain decomposition as the general approach to parallelization:
In order to distribute the graph across CPUs, one approach is orthogonal recursive bisection. In our case, since we cut in the middle of links, the first step is to accumulate computational loads at the nodes: each node gets a weight corresponding to the computational load of all of its attached half-links. Nodes are located at their geographical coordinates. Then a vertical straight line is sought such that roughly half of the computational load is on its right and the other half on its left. The larger of the two pieces is then picked and cut again, this time by a horizontal line. This is done recursively until there are as many tiles as CPUs, see Fig. 6. It is immediately clear that under normal circumstances this is most efficient when the number of CPUs is a power of two. As one consequence, the tiles can directly exchange boundary information containing all data necessary for the evaluation of the CA rule sets, so that communication is only local, between neighboring tiles.
Another option is to use the METIS library for graph partitioning []. That library considerably reduces the number of boundary edges, as shown in Fig. 7. An example of the resulting tiling can be seen in Fig. 8; the algorithm now picks up, for example, the fact that cutting along the rivers is advantageous.
Such an investigation also allows one to compute the theoretical efficiency based on the graph partitioning alone. Efficiency is optimal if each CPU gets exactly the same computational load. However, since the weights of the entities that we distribute are given, load imbalances are unavoidable, and they become larger with more CPUs. We define this theoretical efficiency due to the graph partitioning as

e = ( Σ_i L_i ) / ( p · max_i L_i ) ,

where L_i is the computational load assigned to CPU i and p is the number of CPUs; that is, the average load per CPU divided by the largest load, since the most heavily loaded CPU dominates the run time.
The simulation uses a parallel update with a global time-step. However, all CPUs are synchronized only after a simulation sequence of approximately 10-20 time-steps; in between, there is only an implicit synchronization through the exchange of local boundaries. Strictly speaking, the global synchronization is not necessary at all; it is in the code only to simplify on-line interaction with the simulation (not discussed in this paper).
The actual microsimulation was implemented by deriving C++ classes from the base classes provided by the Parallel Toolbox. The underlying communication library has interfaces to both PVM and MPI. A description of the toolbox is beyond the scope of this paper; more information can be found in [].