PARALLEL COMPUTING OF THE QUEUE MODEL

Figure 2: (a) The separation of flow capacity from intersection dynamics. (b) Test suite results for intersection dynamics. The curves show the number of discharging vehicles from two incoming links, as explained in Sec. 4.2. (c) Decomposition of the Switzerland street network; each color corresponds to a different processor. (d) Communication between nodes at the domain boundaries: node N1 needs to communicate with N2 and N3. Since N2 and N3 are on the same processor, they do not need to establish communication between themselves. (e), (f) show the real time ratio (top set of points) and the speed-up (bottom set of points) for the 6-9 Scenario on single- and dual-CPU machines, using Ethernet or Myrinet. The x-axis gives the number of CPUs, regardless of whether they were provided by 1-CPU or 2-CPU machines. The solid lines give the predictions according to the computing time $T_{cmp}$ and the latency term for Ethernet. As one can see, real time ratio and speed-up are related by a simple vertical shift in the logarithmic plot.

Parallel Implementation

As discussed above, the parallel target architecture for our transportation micro-simulation is a PC cluster, and the suitable approach for this architecture is domain decomposition, i.e. to decompose the traffic network graph into several pieces and to give each piece to a different CPU. Information exchange between CPUs is achieved via messages.

Next, one needs to decide where to split the graph, and how to achieve the message exchange. Both questions can only be answered with respect to a particular traffic model, which is why they were not discussed previously. Nevertheless, lessons learned here can be used for other models.

In general one wants to split as far away from the intersection as possible. This implies that one should split links in the middle, as for example TRANSIMS in fact does (19). However, for the queue model ``middle of the link'' does not make sense since there is no real representation of space. Consequently, one can split a link either at its downstream or at its upstream end. The downstream end is undesirable because vehicles driving towards an intersection are more influenced by the intersection than vehicles driving away from it. For that reason, in the queue simulation we split links right after the intersection, i.e. at their upstream end (Fig. 2(d)).

With respect to message passing, this implies that a CPU that ``owns'' a split link reports, via a message, the number of empty spaces to the CPU which ``owns'' the intersection from which vehicles can enter the link. After this, the intersection update can be done in parallel for all intersections. Next, a CPU that ``owns'' an intersection reports, via a message, the vehicles that have moved to the CPUs which ``own'' the outgoing links. See Fig. 1(c) for pseudo-code of how this is implemented using message passing. In fact, Algorithm B and Algorithm C together give the whole pseudo-code for the queue model traffic logic. For efficiency reasons, all messages to the same CPU at the same time should be merged into a single message in order to incur the latency overhead only once.
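To make the two communication phases concrete, the following is a minimal sketch of one boundary exchange per time step, assuming mpi4py for the message passing; the neighbor list and the payload contents are hypothetical placeholders, not the actual implementation. The point is that each phase sends exactly one merged message per neighboring CPU, so the latency is paid only once per neighbor and phase.

# Minimal sketch of the two-phase boundary exchange (hypothetical names,
# dummy payloads).  Run e.g. with:  mpirun -np 2 python exchange_sketch.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Hypothetical neighborhood: here, every other rank; a real run would
# derive this from the domain decomposition (the split links).
neighbors = [r for r in range(size) if r != rank]

def exchange(payload_by_rank):
    """Send one merged message per neighbor and receive one from each."""
    requests = [comm.isend(payload_by_rank.get(nb, []), dest=nb)
                for nb in neighbors]
    incoming = {nb: comm.recv(source=nb) for nb in neighbors}
    MPI.Request.Waitall(requests)
    return incoming

# Phase 1: CPUs owning split links report the number of empty spaces on
# those links to the CPU owning the feeding intersection (dummy payload).
empty_spaces = exchange({nb: [("link-42", 3)] for nb in neighbors})

# ... the intersection update now runs independently on every CPU ...

# Phase 2: CPUs owning intersections report the vehicles that moved onto
# outgoing split links owned by other CPUs (dummy payload).
moved_vehicles = exchange({nb: ["vehicle-7"] for nb in neighbors})

if rank == 0:
    print("spaces received:", empty_spaces)
    print("vehicles received:", moved_vehicles)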


Performance Issues

Once the precise implementation of the parallelization is known, it is possible to predict the parallel performance. This has been done in detail in Ref. (19). Since that reference refers to TRANSIMS rather than to the queue model, and since we also have results for Myrinet (www.myri.com), the most important aspects are repeated here, with special focus on the specific issues encountered in the current situation.

The execution time of a parallel program is defined as the total time elapsed from the time the first processor starts execution to the time the last processor completes the execution. During execution, on a PC cluster, each processor is either computing or communicating. Therefore,

\begin{displaymath}
T(p) = T_{cmp}(p) + T_{cmm}(p) ,
\end{displaymath} (1)

where $T$ is the execution time, $p$ is the number of processors, $T_{cmp}$ is the computation time and $T_{cmm}$ is the communication time.

For traffic simulation, the time required for the computation, $T_{cmp}$, can be roughly estimated as the run time of the computation on a single CPU divided by the number of processors. Thus,

\begin{displaymath}
T_{cmp}(p) \approx {T_{cmp}(1) \over p} ,
\end{displaymath} (2)

where $p$ is the number of CPUs. More exact formulas would also contain overhead effects and the effects of unequal domain sizes (``load balancing'').

In deviation from Ref. (19), we now assume that the time for communication is well approximated by the latency only; this is in fact a good approximation for our system. Latency is incurred for each message that is sent. Since it is possible to pack all messages to the same processor into one message, one obtains

\begin{displaymath}
T_{lt} = N_{nb} \, t_{lt} ,
\end{displaymath}

where $N_{nb}$ is the number of neighboring processors (i.e. processors which are reached via common split links), and $t_{lt}$ is the latency per message.

The number of neighbors is zero at $p=1$, and goes, for contiguous domains, to an average of six for $p \to \infty$. An interpolating formula is

\begin{displaymath}
N_{nb}(p) = 2 \, (3 \sqrt{p} - 1) \, (\sqrt{p} - 1) / p .
\end{displaymath} (3)
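One can check that this interpolation indeed reproduces both limits:

\begin{displaymath}
N_{nb}(1) = 2 \, (3 - 1) \, (1 - 1) / 1 = 0 , \qquad
N_{nb}(p) \approx 2 \, (3 \sqrt{p}) \, (\sqrt{p}) / p = 6 \quad \mbox{for } p \to \infty .
\end{displaymath}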

Latency of 100 Mbit Ethernet cards is about 0.5 ms; each processor sends messages twice per time step to all neighbors, resulting in $\sim 12$ latency contributions or $6~ms$ per time step. In other words, the cluster can maximally do $1000/6=167$ time steps per second; in practice we find 130, see Sec. 5.3. If the time step of a simulation is one second, then this is also the maximum real time ratio of the parallel simulation, i.e. the number which says how much faster than reality the computer runs. Note that the limiting value does not depend on the problem size or on the speed of the algorithm; it is a limiting number for any parallel computation of a 2-dimensional system on a Beowulf cluster using Ethernet LAN.

The only way this number can be improved under the assumptions that we made is to use faster communication hardware. Gbit Ethernet hardware is faster, but standard driver implementations give away that advantage (37). In contrast, Myrinet (see www.myri.com) is a communication technology specifically designed for this situation. Interestingly, as we will see later, it will be possible to recoup the cost for a Myrinet network by being able to work with a smaller cluster.


Computational Performance

The parallel queue model is used as the traffic micro-simulation module within a project for a microscopic, activity-based simulation of all of Switzerland. In this paper, we only report computational performance results; validation results with respect to field measurements are reported elsewhere (38). Also note that our queue model is fully calibrated and validated in terms of free-speed link travel times and link flow capacities, since these numbers are taken from the input files and implemented exactly. The same would be true for link storage capacities if those numbers were available. It was described earlier (Sec. 4.2) how intersection priorities are modeled and how the implementation was verified.

The following performance numbers refer to the morning rush hour in a road network with 10564 nodes and 28622 links. The demand contains all car trips in Switzerland which start between 6am and 9am, resulting in 991471 trips. The largest number of vehicles simultaneously in the simulation is 162464, reached at 8am.

Our main performance measure is the Real Time Ratio (RTR) for this set-up. The RTR results for Ethernet and for Myrinet are shown in Fig. 2(e); the speed-up for the same simulations is given in Fig. 2(f). The most important result with respect to the RTR is that the limiting computational speed of about 150 is confirmed: the simulation using 100 Mbit Ethernet indeed saturates at that RTR. In contrast, when Myrinet is used as the communication hardware, the RTR does not saturate at this level. With respect to speed-up, the Ethernet value saturates at about 16 with 64 CPUs; with Myrinet, the highest value is a speed-up of 27, reached on 64 CPUs. Results are reported using one or both CPUs of dual-CPU machines; in general, for a fixed number of CPUs, using only one CPU per node is faster. We were not able to test whether completely switching off the second CPU would change the results.

As one can see, the RTR is transformed into the speed-up by a simple vertical shift in the logarithmic plot. The differences would show up if one changed the scenario size: in the RTR plot, saturation for Ethernet would still happen at an RTR of about 130, but the curve would be shifted to the left or to the right for smaller or larger scenarios, respectively. In the speed-up plot, in contrast, the level of saturation would depend on the scenario size, while the part of the plot for small numbers of CPUs would not change. Note that in both cases the results depend on the scenario size; it is impossible to make parallel performance predictions without knowledge of the scenario size.
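This vertical shift follows directly from the definitions: with $\Delta t$ the duration of one simulated time step (here one second), the real time ratio is $RTR(p) = \Delta t / T(p)$ and the speed-up is $S(p) = T(1)/T(p)$, so that

\begin{displaymath}
S(p) = {T(1) \over T(p)} = {RTR(p) \over RTR(1)}
\qquad \mbox{and thus} \qquad
\log S(p) = \log RTR(p) - \log RTR(1) ,
\end{displaymath}

i.e. a constant vertical offset in any logarithmic plot.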

Because of the latency problem mentioned several times in this paper, in our view the RTR is more important than the speed-up. In fact, we could reach better speed-up values by using a slower simulation, since then the communication overhead would be smaller relative to the computation. This has, however, not been our goal.

It is interesting to compare two different hardware configurations:

64 single-CPU machines using 100 Mbit LAN: real time ratio 115; cost approx $64 \times \$2 k = \$128 k$ for the machines plus approx $\$20 k$ for a full-bandwidth switch, resulting in $\$148 k$ overall.

32 dual-CPU machines using Myrinet: real time ratio 193; cost approx $32 \times \$4.5 k = \$144 k$, Myrinet included.

That is, the Myrinet set-up is not only faster but, somewhat unexpectedly, also cheaper. A Myrinet set-up has the additional advantage that scenarios smaller than the one discussed will run even faster, whereas on the Ethernet cluster smaller scenarios will run at the same computational speed as large ones.
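Relating cost to achieved real time ratio makes this explicit: the Ethernet configuration costs about $\$148 k / 115 \approx \$1.3 k$ per unit of RTR, whereas the Myrinet configuration costs about $\$144 k / 193 \approx \$0.75 k$ per unit of RTR, i.e. roughly 40% less per unit of delivered performance.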

