Micro-simulation parallelization

The main advantage of the CA is that it forces the design of a parallel and local update, that is, the state at time step t+1 depends only on information from time step t, and only from neighboring cells. (To be completely correct, one would have to consider our sub-time-steps.) This means that domain decomposition for parallelization is straightforward (see below), since one can communicate the boundaries for time step t, then locally on each CPU perform the update from t to t+1, and then exchange boundary information again. It would even be possible to overlap communication and computation, although experiments with this did not show any systematic improvement on the machines we tried it on (Sparc5s coupled via 10 Mbit Ethernet).
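
As an illustration of this communicate-then-update pattern, the following minimal sketch (assuming MPI, a single-lane ring road, and a toy CA rule; not the actual simulation code) exchanges the boundary cells of time step t and then performs the local update to t+1:

\begin{verbatim}
// Minimal sketch of the communicate-then-update pattern (illustration only):
// MPI, one-lane ring road, 1-D domain decomposition with one ghost cell on
// either side of each CPU's slice of the road.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1000;                       // CA cells owned by this CPU
    std::vector<int> cell(N + 2, 0);          // index 0 and N+1 are ghost cells
    if (rank == 0) cell[1] = 1;               // place a single vehicle

    int left  = (rank - 1 + size) % size;     // ring topology of tiles
    int right = (rank + 1) % size;

    for (int t = 0; t < 3600; ++t) {
        // 1) communicate the boundaries of time step t
        MPI_Sendrecv(&cell[N], 1, MPI_INT, right, 0,
                     &cell[0], 1, MPI_INT, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&cell[1],     1, MPI_INT, left,  1,
                     &cell[N + 1], 1, MPI_INT, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // 2) local update from t to t+1 (toy rule: advance if next cell empty)
        std::vector<int> next(cell);
        for (int i = 0; i <= N; ++i) {
            if (cell[i] == 1 && cell[i + 1] == 0) {
                if (i >= 1)     next[i] = 0;      // vehicle leaves a local cell
                if (i + 1 <= N) next[i + 1] = 1;  // vehicle enters a local cell
            }
        }
        cell.swap(next);
    }
    MPI_Finalize();
    return 0;
}
\end{verbatim}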

Rickert, Wagner and Gawron [,] implemented a parallelized traffic simulation for the German Autobahn network that ran several times faster than real time on an SGI Power Challenger. A summary of how the traffic CA has been used in simulation models can be found, for example, in [].

As stated above, the inherent structure of a traffic microsimulation favors a domain decomposition as the general approach to parallelization:

We decided to place the boundaries between CPUs in the middle of links rather than at nodes. This separates the traffic complexity at the nodes from the complexity caused by the parallelization and makes optimization easier.

In order to distribute the graph across CPUs, one approach is orthogonal recursive bisection. In our case, since we cut in the middle of links, the first step is to accumulate computational loads at the nodes: each node gets a weight corresponding to the computational load of all of its attached half-links. Nodes are located at their geographical coordinates. Then, a vertical straight line is searched for such that roughly half of the computational load is on its right and the other half on its left. The larger of the two pieces is then picked and cut again, this time by a horizontal line. This is done recursively until as many tiles are obtained as there are CPUs available, see Fig. 6. It is immediately clear that under normal circumstances this will be most efficient for a number of CPUs that is a power of two. As one consequence, the tiles can directly exchange boundary information containing all data necessary for the evaluation of the CA rule sets, so that communication takes place only locally between neighboring tiles.
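
A simplified sketch of this bisection procedure (an illustration, not the production code) is given below; nodes carry the accumulated half-link loads, and the load placed on each side of a cut is chosen proportionally to the number of tiles requested there:

\begin{verbatim}
// Simplified orthogonal recursive bisection of weighted, geographically
// located nodes into a given number of tiles (illustration only).
#include <algorithm>
#include <cstdio>
#include <vector>

struct Node { double x, y, load; int tile; };

// Assign nodes[first..last) to `ntiles` tiles, starting at id `firstTile`.
void orb(std::vector<Node>& nodes, int first, int last,
         int ntiles, int firstTile, bool vertical) {
    if (ntiles <= 1 || last - first <= 1) {
        for (int i = first; i < last; ++i) nodes[i].tile = firstTile;
        return;
    }
    // Sort along the cut axis: x for a vertical cut line, y for a horizontal one.
    std::sort(nodes.begin() + first, nodes.begin() + last,
              [vertical](const Node& a, const Node& b) {
                  return vertical ? a.x < b.x : a.y < b.y;
              });
    double total = 0.0;
    for (int i = first; i < last; ++i) total += nodes[i].load;
    // Give each side a share of the load proportional to its tile count.
    int leftTiles = ntiles / 2;
    double target = total * leftTiles / ntiles;
    int cut = first + 1;
    double acc = nodes[first].load;
    while (cut < last - 1 && acc + nodes[cut].load <= target) acc += nodes[cut++].load;
    // Recurse on both pieces with the cut orientation flipped.
    orb(nodes, first, cut, leftTiles, firstTile, !vertical);
    orb(nodes, cut, last, ntiles - leftTiles, firstTile + leftTiles, !vertical);
}

int main() {
    std::vector<Node> nodes = {{0,0,1,0},{1,0,2,0},{2,0,1,0},
                               {0,1,3,0},{1,1,1,0},{2,1,2,0}};
    orb(nodes, 0, (int)nodes.size(), 4, 0, /*vertical=*/true);
    for (const Node& n : nodes)
        std::printf("node (%.0f,%.0f), load %.0f -> tile %d\n", n.x, n.y, n.load, n.tile);
    return 0;
}
\end{verbatim}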

Another option is to use the METIS library for graph partitioning []. That library considerably reduces the number of boundary edges, as shown in Fig. 7. An example of the resulting tiling can be seen in Fig. 8; for example, the algorithm now picks up the fact that it is advantageous to cut along the rivers.
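
The paper does not show the corresponding library call; a minimal sketch against the current METIS 5.x C API (the work described here used an earlier METIS release, so treat the exact call as an illustration) could look as follows, with a toy four-node graph whose vertex weights stand in for the accumulated half-link loads:

\begin{verbatim}
// Hedged sketch of graph partitioning with METIS 5.x (illustration only).
#include <metis.h>
#include <cstdio>
#include <vector>

int main() {
    // Toy graph in CSR form: 4 nodes in a ring, vertex weights = hypothetical
    // computational load of the attached half-links.
    std::vector<idx_t> xadj   = {0, 2, 4, 6, 8};
    std::vector<idx_t> adjncy = {1, 3, 0, 2, 1, 3, 0, 2};
    std::vector<idx_t> vwgt   = {5, 3, 4, 2};

    idx_t nvtxs = 4, ncon = 1, nparts = 2, objval;
    std::vector<idx_t> part(nvtxs);

    // k-way partitioning; METIS minimizes the number of cut (split) edges.
    METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                        vwgt.data(), nullptr, nullptr, &nparts,
                        nullptr, nullptr, nullptr, &objval, part.data());

    std::printf("edge cut: %d\n", (int)objval);
    for (idx_t i = 0; i < nvtxs; ++i)
        std::printf("node %d -> partition %d\n", (int)i, (int)part[i]);
    return 0;
}
\end{verbatim}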

Such an investigation also makes it possible to compute the theoretical efficiency based on the graph partitioning alone. Efficiency is optimal if each CPU gets exactly the same computational load. However, since the weights of the entities that we distribute are given, load imbalances are unavoidable, and they become larger with more CPUs. We define this theoretical efficiency due to the graph partitioning as

\begin{displaymath}
  \mathit{eff} := \frac{\hbox{load on optimal partition}}{\hbox{load on largest partition}} \; ,
\end{displaymath} (1)

where the load on the optimal partition is just the total load divided by the number of CPUs. We then calculated this number for actual partitionings of our Portland networks with 20024 links and with 200000 links, see Fig. 9. In short, the result means that, according to this measure, our 20024-link network would still run efficiently on 128 CPUs, and our 200000-link network would run efficiently on up to 1024 CPUs.
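
For concreteness, Eq. (1) amounts to the following small calculation; the per-tile load numbers used here are hypothetical:

\begin{verbatim}
// Theoretical efficiency of a partitioning according to Eq. (1):
// (ideal load per CPU) / (load of the most heavily loaded tile).
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

double theoreticalEfficiency(const std::vector<double>& tileLoads) {
    double total   = std::accumulate(tileLoads.begin(), tileLoads.end(), 0.0);
    double optimal = total / tileLoads.size();                        // perfectly balanced load
    double largest = *std::max_element(tileLoads.begin(), tileLoads.end());
    return optimal / largest;                                         // eff <= 1
}

int main() {
    // Hypothetical loads of four tiles; the more CPUs, the harder it is
    // to keep these numbers equal, so eff decreases.
    std::vector<double> loads = {102.0, 98.0, 110.0, 95.0};
    std::printf("eff = %.3f\n", theoreticalEfficiency(loads));
    return 0;
}
\end{verbatim}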

The simulation uses a parallel update with a global time-step. However, synchronization of all CPUs is only performed after a simulation sequence comprising approximately 10-20 time-steps. In between, there is only an implicit synchronization through the exchange of local boundaries. Strictly speaking, the global synchronization is not necessary at all; it is included to simplify on-line interaction with the simulation (not discussed in this paper).
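
The following sketch illustrates this update/synchronization pattern, again assuming MPI; the sequence length and the helper functions are illustrative stand-ins, not the actual code:

\begin{verbatim}
// Implicit neighbor synchronization every step, explicit global
// synchronization only once per simulation sequence (illustration only).
#include <mpi.h>

// Stubs standing in for the per-tile work; see the boundary-exchange
// sketch given earlier in this section.
static void exchangeBoundaries() { /* neighbor-to-neighbor Sendrecv calls */ }
static void localUpdate()        { /* CA update from t to t+1 on this tile */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    const int totalSteps = 3600, sequenceLength = 15;   // illustrative numbers
    for (int t = 0; t < totalSteps; ++t) {
        exchangeBoundaries();             // implicit, neighbor-only synchronization
        localUpdate();
        if ((t + 1) % sequenceLength == 0)
            MPI_Barrier(MPI_COMM_WORLD);  // global sync once per sequence
    }
    MPI_Finalize();
    return 0;
}
\end{verbatim}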

The actual implementation of the microsimulation was done by defining descendant C++ classes of the base classes provided in the Parallel Toolbox. The underlying communication library has interfaces for both PVM and MPI. A description of the toolbox is beyond the scope of this paper; more information can be found in [].


  
Figure 6: Orthogonal bi-section
\includegraphics[width=\hsize]{ob-plot-gpl.eps}


  
Figure 7: Number of split edges as a function of the number of CPUs. Use of the METIS library results in considerably fewer split edges than orthogonal bisection. The theoretical scaling for orthogonal bisection is $N_{split} \sim \sqrt{p}$, where p is the number of CPUs. Note that for $p \to N_{links}$, $N_{split}$ needs to be the same for both graph partitioning methods.
\includegraphics[width=\textwidth]{splitedges-gpl.eps}


  
Figure 8: Partitioning by METIS. Compare to Fig. 6.
\includegraphics[width=\hsize]{metis-plot-gpl.eps}


  
Figure 9: Top: Theoretical efficiency for Portland network with 20024 edges. Bottom: Theoretical efficiency for Portland network with 200000 edges. ``OB'' refers to orthogonal bisection. ``METIS (k-way)'' refers to an option in the METIS library.
\includegraphics[width=\textwidth]{efficiency-e2-gpl.eps} \includegraphics[width=\textwidth]{efficiency-allstr-gpl.eps}

