
Execution Time

The execution time of a parallel program is defined as the total time elapsed from the time the first processor starts execution to the time the last processor completes the execution. During execution, a processor is either computing or communicating. Therefore,


\begin{displaymath}
T(p) = T_{cmp}(p) + T_{cmm}(p) \ , \qquad (2)
\end{displaymath}

where $T$ is the execution time, $p$ is the number of processors, $T_{cmp}$ is the computation time, and $T_{cmm}$ is the communication time.

The computation time $T_{cmp}$ can be estimated roughly in terms of the serial execution time (the run time of the algorithm on a single CPU) and the number of processors. Thus,


\begin{displaymath}
T_{cmp}(p) = {T_1 \over p} \cdot \Big( 1 + f_{ovr}(p) + f_{dmn}(p) \Big) \ , \qquad (3)
\end{displaymath}

where $T_1$ is the serial execution time and $p$ is the number of CPUs; $f_{ovr}$ includes overhead effects (for example, split links need to be administered by both CPUs); $f_{dmn} = 1/e_{dmn} - 1$ includes the effect of unequal domain sizes, as given by Equation 1 in the graph partitioning section.
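As a quick numerical illustration (ours, not from the paper), Eq. (3) can be sketched in a few lines; the function name and the example values are our own:

```python
def t_cmp(t1, p, f_ovr=0.0, f_dmn=0.0):
    # Eq. (3): serial time divided over p CPUs, inflated by the
    # overhead factor f_ovr and the domain-imbalance factor f_dmn.
    return t1 / p * (1.0 + f_ovr + f_dmn)

# With both overhead factors zero, scaling is ideal:
# t_cmp(16.0, 8) -> 2.0
```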

Time for communication typically has two contributions: latency and bandwidth. Latency is the time necessary to initiate the communication; it is therefore independent of the message size. Bandwidth describes the number of bytes that can be communicated per second. The time for one message is thus

\begin{displaymath}T_{msg} = T_{lt} + {S_{msg} \over b} \ , \end{displaymath}

where $T_{lt}$ is the latency, $S_{msg}$ is the message size, and $b$ is the bandwidth.
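A minimal sketch of this message-time model (our own helper; the example numbers are illustrative, not measurements):

```python
def t_msg(s_msg, t_lt, b):
    # Time for one message: fixed start-up latency plus transfer time
    # (message size divided by bandwidth). Units: bits, seconds, bits/sec.
    return t_lt + s_msg / b

# A 10 Mbit message over a 100 Mbit/sec link with 1 ms latency:
# 0.001 + 10e6/100e6 = 0.101 seconds.
```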

However, for many of today's computer architectures, bandwidth is given by at least two contributions: node bandwidth, and network bandwidth. Node bandwidth is the bandwidth of the connection from the CPU to the network. If two computers communicate with each other, this is the maximum bandwidth they can reach. For that reason, this is sometimes also called the ``point-to-point'' bandwidth.

The network bandwidth is given by the technology and topology of the network. Typical technologies are 10Mbit Ethernet, 100Mbit Ethernet, FDDI, etc. Typical topologies are bus topologies, switched topologies, two-dimensional topologies (e.g. grid/torus), hypercube topologies, etc. A traditional Local Area Network (LAN) uses 10Mbit Ethernet, and it has a shared bus topology. In a shared bus topology, all communication goes over the same medium; that is, if several pairs of computers communicate with each other, they have to share the bandwidth.

For example, in our 100Mbit FDDI network (i.e. a network bandwidth of $b_{net} = 100$ Mbit/sec) at Los Alamos National Laboratory, we found node bandwidths of about $b_{nd} = 40$ Mbit/sec. That means that two pairs of computers could communicate at full node bandwidth, i.e. using 80 of the 100 Mbit/sec, while three or more pairs were limited by the network bandwidth. For example, five pairs of computers could get at most $100/5 = 20$ Mbit/sec each.
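This sharing behavior can be sketched with a hypothetical helper, using the $b_{nd} = 40$ Mbit/sec and $b_{net} = 100$ Mbit/sec values from the text:

```python
def pair_bandwidth(n_pairs, b_nd=40e6, b_net=100e6):
    # Each communicating pair is limited by its node bandwidth until the
    # shared medium saturates; beyond that, the network bandwidth is
    # divided equally among the pairs.
    return min(b_nd, b_net / n_pairs)

# Two pairs run at full node bandwidth (40 Mbit/sec each, 80 of 100 total);
# five pairs are network-limited to 100/5 = 20 Mbit/sec each.
```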

A switched topology is similar to a bus topology, except that the network bandwidth is given by the backplane of the switch. Often, the backplane bandwidth is high enough to have all nodes communicate with each other at full node bandwidth, and for practical purposes one can thus neglect the network bandwidth effect for switched networks.

If computers become massively parallel, switches with enough backplane bandwidth become too expensive. As a compromise, such supercomputers usually use a communications topology where communication to ``nearby'' nodes can be done at full node bandwidth, whereas global communication suffers some performance degradation. Since we partition our traffic simulations in a way that communication is local, we can assume that we do communication with full node bandwidth on a supercomputer. That is, on a parallel supercomputer, we can neglect the contribution coming from the bnet-term. This assumes, however, that the allocation of street network partitions to computational nodes is done in some intelligent way which maintains locality.

As a result of this discussion, we assume that the communication time per time step is


\begin{displaymath}
T_{cmm}(p) = N_{sub} \cdot \Big( n_{nb}(p) \, T_{lt} + {N_{spl}(p) \over p} \, {S_{bnd} \over b_{nd}} + N_{spl}(p) \, {S_{bnd} \over b_{net}} \Big) \ , \qquad (4)
\end{displaymath}

where $N_{sub}$ is the number of sub-time-steps. Since we do two boundary exchanges per time step, $N_{sub} = 2$ for the 1999 TRANSIMS micro-simulation implementation.

$n_{nb}$ is the number of neighbor domains each CPU talks to. All information which goes to the same CPU is collected and sent as a single message, thus incurring the latency only once per neighbor domain. For $p = 1$, $n_{nb}$ is zero since there is no other domain to communicate with. For $p = 2$, it is one. For $p \to \infty$, and assuming that domains are always connected, Euler's theorem for planar graphs says that the average number of neighbors cannot exceed six. Based on a simple geometric argument, we use


\begin{displaymath}n_{nb}(p) = 2 \, (3 \sqrt{p} - 1) \, (\sqrt{p} - 1) / p \ , \end{displaymath}

which correctly yields $n_{nb}(1) = 0$ and $n_{nb} \to 6$ for $p \to \infty$. Note that the METIS library for graph partitioning does not necessarily generate connected partitions, which makes this potentially more complicated.
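As a check (our own sketch, not from the paper), the neighbor formula can be evaluated directly:

```python
import math

def n_nb(p):
    # Average number of neighbor domains per CPU from the geometric
    # argument above: 2 * (3*sqrt(p) - 1) * (sqrt(p) - 1) / p.
    return 2.0 * (3.0 * math.sqrt(p) - 1.0) * (math.sqrt(p) - 1.0) / p

# n_nb(1) = 0 (no other domain to talk to); as p grows, n_nb approaches 6.
```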

$T_{lt}$ is the latency (or start-up time) of each message. Values of $T_{lt}$ between 0.5 and 2 milliseconds are typical for PVM on a LAN. Next are the terms that describe our two bandwidth effects. $N_{spl}(p)$ is the number of split links in the whole simulation. Accordingly, $N_{spl}(p)/p$ is the number of split links per computational node. $S_{bnd}$ is the size of the message per split link. $b_{nd}$ and $b_{net}$ are the node and network bandwidths, as discussed above.

In consequence, the combined time for one time step is


\begin{displaymath}
T(p) = {T_1 \over p} \Big( 1 + f_{ovr}(p) + f_{dmn}(p) \Big) + N_{sub} \cdot \left( n_{nb}(p) \, T_{lt} + {N_{spl}(p) \over p} \, {S_{bnd} \over b_{nd}} + N_{spl}(p) \, {S_{bnd} \over b_{net}} \right) \ .
\end{displaymath}

According to what we have discussed above, for $p \to \infty$ the number of neighbors scales as $n_{nb} \sim const$ and the number of split links in the simulation scales as $N_{spl} \sim \sqrt{p}$. In consequence, for $f_{ovr}$ and $f_{dmn}$ small enough, we have: the computation term $T_1/p$ and the node bandwidth term (which scales as $1/\sqrt{p}$) vanish, the latency term approaches the constant $6 \, N_{sub} \, T_{lt}$, and the network bandwidth term grows as $\sqrt{p}$.

Thus, in a shared topology, adding CPUs eventually increases the simulation time, i.e. makes the simulation slower. In a non-shared topology, adding CPUs eventually does not make the simulation any faster, but at least it is not detrimental to computational speed. For $p \to \infty$, the dominant term in a shared topology is the network bandwidth term; in a non-shared topology it is the latency term.
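To make this asymptotic behavior concrete, the combined model can be evaluated numerically. This is our own illustration: the parameter values ($T_1$, $T_{lt}$, $S_{bnd}$, the split-link prefactor `c_spl`) are assumptions rather than measurements from the paper, and the overhead factors $f_{ovr}$, $f_{dmn}$ are neglected:

```python
import math

def n_nb(p):
    # Average number of neighbor domains (zero for a single CPU).
    return 2.0 * (3.0 * math.sqrt(p) - 1.0) * (math.sqrt(p) - 1.0) / p

def t_step(p, t1=16.0, n_sub=2, t_lt=1e-3,
           s_bnd=800.0, b_nd=40e6, b_net=100e6,
           c_spl=100.0, shared=False):
    # Time for one time step, combining Eqs. (3) and (4).
    # N_spl(p) is modeled as c_spl * sqrt(p); f_ovr and f_dmn are neglected.
    n_spl = c_spl * math.sqrt(p)
    t_cmp = t1 / p
    t_cmm = n_sub * (n_nb(p) * t_lt + (n_spl / p) * s_bnd / b_nd)
    if shared:
        # The network bandwidth term only matters on a shared medium.
        t_cmm += n_sub * n_spl * s_bnd / b_net
    return t_cmp + t_cmm

def rtr(p, dt=1.0, **kw):
    # Real time ratio: simulated seconds per wall-clock second.
    return dt / t_step(p, **kw)
```

With these (assumed) numbers, the non-shared time approaches the latency plateau $N_{sub} \cdot 6 \, T_{lt}$ for large $p$, while the shared time keeps growing with $\sqrt{p}$, matching the discussion above.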

The curves in Fig. 6 are results from this prediction for a switched 100 Mbit Ethernet LAN; dots and triangles show actual performance results [6]. The top graph shows the time for one time step, i.e. $T(p)$, and the individual contributions to this value. One can clearly see that for more than 64 CPUs, the dominant time contribution comes from the latency. The bottom graph shows the real time ratio (RTR)

\begin{displaymath}
rtr(p) := {\Delta t \over T(p)} = {1 \, {\rm sec} \over T(p)} \ ,
\end{displaymath}

which says how much faster than reality the simulation is running. $\Delta t$ is the duration of a simulation time step, which is 1 sec in TRANSIMS-1999. This figure shows that even something as relatively mundane as a combination of regular Pentium CPUs using switched 100Mbit Ethernet technology is quite capable of reaching good computational speeds. For example, with 16 CPUs the simulation runs 40 times faster than real time; the simulation of a 24 hour time period would thus take 0.6 hours. These numbers refer to the Portland 200000 links network. Included in the plot (black dots) are measurements with a compute cluster that corresponds to this architecture. The triangles with lower performance for the same number of CPUs come from using dual instead of single CPUs on the computational nodes. Note that the curve levels out at about forty times faster than real time, no matter what the number of CPUs. As one can see in the top figure, the reason is the latency term, which eventually consumes nearly all the time for a time step. This is one of the important elements where parallel supercomputers differ: For example, the Cray T3D has a latency under PVM that is more than a factor of ten lower [2].


  
Figure: 100 Mbit switched Ethernet LAN. Top: From [6]. Individual time contributions. Bottom: Corresponding real time ratios. The black dots refer to actually measured performance when using one CPU per cluster node; the crosses refer to actually measured performance when using dual CPUs per node (the y-axis still denotes the number of CPUs used). The thick curve is the prediction according to the model. The thin lines show the individual time contributions to the thick curve.
\resizebox*{0.8\textwidth}{0.4\textheight}{\includegraphics{cars-time-gpl.eps}}
\resizebox*{0.8\textwidth}{0.4\textheight}{\includegraphics{cars-rtr-gpl.eps}}


Fig. 7 shows some predicted real time ratios for other computing architectures. For simplicity, we assume that all of them except for one special case explained below use the same 500MHz Pentium compute nodes. The difference is in the networks: We assume 10Mbit non-switched, 10Mbit switched, 1Gbit non-switched, and 1Gbit switched. The curves for 100Mbit are in between and were left out for clarity; values for switched 100Mbit Ethernet were already in Fig. 6. One clearly sees that for this problem and with today's computers, it is nearly impossible to reach any speed-up on a 10Mbit Ethernet, even when switched. Gbit Ethernet is somewhat more efficient than 100Mbit Ethernet for small numbers of CPUs, but for larger numbers of CPUs, switched Gbit Ethernet saturates at exactly the same computational speed as the switched 100Mbit Ethernet. This is due to the fact that we assume that latency remains the same - after all, there was no improvement in latency when moving from 10 to 100Mbit Ethernet. FDDI is supposedly even worse [2].


  
Figure: From [6]. Predictions of real time ratio for other computer configurations. Top: With Portland 20024 links network. Bottom: With Portland 200000 links network. Note that for the switched configurations and for the supercomputer, the saturating real time ratio is the same for both network sizes, but it is reached with different numbers of CPUs. This behavior is typical for parallel computers: They are particularly good at running larger and larger problems within the same computing time. All curves in both graphs are predictions from our model. We have some performance measurements for the ASCI machine, but since they were done with an older and slower version of the code, they are omitted in order to avoid confusion.
\resizebox*{0.8\textwidth}{0.37\textheight}{\includegraphics{other-rtr-gpl.eps}}
\resizebox*{0.8\textwidth}{0.37\textheight}{\includegraphics{ten-rtr-gpl.eps}}


The thick line in Fig. 7 corresponds to the ASCI Blue Mountain parallel supercomputer at Los Alamos National Laboratory. On a per-CPU basis, this machine is slower than a 500 MHz Pentium. The higher bandwidth and in particular the lower latency make it possible to use higher numbers of CPUs efficiently, and in fact one should be able to reach a real time ratio of 128 according to this plot. By then, however, the granularity effect of the unequal domains (Fig. 3) would have set in, limiting the computational speed probably to about 100 times real time with 128 CPUs. We actually have some speed measurements on that machine for up to 96 CPUs, but with a considerably slower code from summer 1998. We omit those values from the plot in order to avoid confusion.

Fig. 7 (bottom) shows predictions for the higher fidelity Portland 200000links network with the same computer architectures. The assumption was that the time for one time step, i.e. T1 of Eq. (3), increases by a factor of eight due to the increased load. This has not been verified yet. However, the general message does not depend on the particular details: When problems become larger, then larger numbers of CPUs become more efficient. Note that we again saturate, with the switched Ethernet architecture, at 80 times faster than real time, but this time we need about 64 CPUs with switched Gbit Ethernet in order to get 40 times faster than real time -- for the smaller Portland 200000 links network with switched Gbit Ethernet we would need 8 of the same CPUs to reach the same real time ratio. In short and somewhat simplified: As long as we have enough CPUs, we can micro-simulate road networks of arbitrarily large size, with hundreds of thousands of links and more, 40 times faster than real time, even without supercomputer hardware. -- Based on our experience, we are confident that these predictions will be lower bounds on performance: In the past, we have always found ways to make the code more efficient.


Nurhan Cetin
2001-05-31