The execution time of a parallel program is defined as the total time elapsed from the moment the first processor starts execution to the moment the last processor completes its execution. During execution, a processor is either computing or communicating. Therefore,
\[
T(p) = T_{cmp}(p) + T_{cmm}(p) \qquad (2)
\]
where T is the execution time, p is the number of processors, Tcmp is the computation time and Tcmm is the communication time.
The time required for the computation, Tcmp, can be estimated roughly in terms of the serial execution time (the run time of the algorithm on a single CPU) and the number of processors. Thus,
\[
T_{cmp}(p) = \frac{T_1}{p} \, \Big( 1 + f_{ovr}(p) + f_{dmn}(p) \Big) \qquad (3)
\]
where T1 is the serial execution time and p is the number of CPUs; fovr includes overhead effects (for example, split links need to be administered by both CPUs); fdmn = 1/edmn - 1 includes the effect of unequal domain sizes, as shown in Equation 1 in the graph partitioning section.
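As an illustration, here is a minimal Python sketch of this computation-time estimate. The function name and the numerical values in the example call are assumptions chosen purely for illustration, not measurements from this work.

```python
def t_cmp(p, t1, f_ovr=0.0, f_dmn=0.0):
    """Computation time per time step on p CPUs, following Eq. (3).

    t1    -- serial execution time of one time step [s]
    f_ovr -- overhead fraction (e.g. split links administered by both CPUs)
    f_dmn -- penalty from unequal domain sizes, f_dmn = 1/e_dmn - 1
    (both fractions in general depend on p)
    """
    return t1 / p * (1.0 + f_ovr + f_dmn)

# Illustrative, assumed numbers: 4 s serial time, 5% overhead, 10% imbalance.
print(t_cmp(16, t1=4.0, f_ovr=0.05, f_dmn=0.10))  # -> 0.2875 s
```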
Time for communication typically has two contributions: latency and bandwidth. Latency is the time necessary to initiate the communication, and in consequence it is independent of the message size. Bandwidth describes the number of bytes that can be communicated per second. The time for one message is therefore
\[
T_{msg} = T_{lt} + \frac{S_{msg}}{b}
\]
where Tlt is the latency, Smsg is the message size, and b is the bandwidth.
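A correspondingly small sketch of the per-message cost; the example values (1 ms latency, a 2000-byte message, 40 Mbit/s bandwidth) are assumed for illustration only.

```python
def t_msg(s_msg, t_lt, b):
    """Time for one message: start-up latency plus size over bandwidth [s]."""
    return t_lt + s_msg / b

# Assumed example: 1 ms latency, 2000-byte message, 40 Mbit/s bandwidth.
print(t_msg(s_msg=2000 * 8, t_lt=1e-3, b=40e6))  # bits / (bit/s) -> 0.0014 s
```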
However, for many of today's computer architectures, bandwidth is given by at least two contributions: node bandwidth, and network bandwidth. Node bandwidth is the bandwidth of the connection from the CPU to the network. If two computers communicate with each other, this is the maximum bandwidth they can reach. For that reason, this is sometimes also called the ``point-to-point'' bandwidth.
The network bandwidth is given by the technology and topology of the network. Typical technologies are 10Mbit Ethernet, 100Mbit Ethernet, FDDI, etc. Typical topologies are bus topologies, switched topologies, two-dimensional topologies (e.g. grid/torus), hypercube topologies, etc. A traditional Local Area Network (LAN) uses 10Mbit Ethernet, and it has a shared bus topology. In a shared bus topology, all communication goes over the same medium; that is, if several pairs of computers communicate with each other, they have to share the bandwidth.
For example, in our 100 Mbit FDDI network at Los Alamos National Laboratory (i.e. a network bandwidth of bnet = 100 Mbit/sec), we found node bandwidths of about bnd = 40 Mbit/sec. That means that two pairs of computers could communicate at full node bandwidth, i.e. using 80 of the 100 Mbit/sec, while three or more pairs were limited by the network bandwidth. For example, five pairs of computers could get at most 100/5 = 20 Mbit/sec each.
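The shared-medium arithmetic above amounts to the following: each communicating pair obtains the smaller of the node bandwidth and its share of the network bandwidth. A small sketch, reusing the two bandwidth figures quoted above as defaults:

```python
def pair_bandwidth(n_pairs, b_nd=40e6, b_net=100e6):
    """Effective bandwidth per communicating pair on a shared medium [bit/s]."""
    return min(b_nd, b_net / n_pairs)

for n in (1, 2, 3, 5):
    print(n, pair_bandwidth(n) / 1e6, "Mbit/s")  # 40, 40, 33.3, 20
```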
A switched topology is similar to a bus topology, except that the network bandwidth is given by the backplane of the switch. Often, the backplane bandwidth is high enough to have all nodes communicate with each other at full node bandwidth, and for practical purposes one can thus neglect the network bandwidth effect for switched networks.
If computers become massively parallel, switches with enough backplane bandwidth become too expensive. As a compromise, such supercomputers usually use a communications topology where communication to ``nearby'' nodes can be done at full node bandwidth, whereas global communication suffers some performance degradation. Since we partition our traffic simulations in a way that communication is local, we can assume that we do communication with full node bandwidth on a supercomputer. That is, on a parallel supercomputer, we can neglect the contribution coming from the bnet-term. This assumes, however, that the allocation of street network partitions to computational nodes is done in some intelligent way which maintains locality.
As a result of this discussion, we assume that the communication time per time step is
\[
T_{cmm}(p) = N_{sub} \left( n_{nb}(p) \, T_{lt} + \frac{N_{spl}(p)}{p} \, \frac{S_{bnd}}{b_{nd}} + N_{spl}(p) \, \frac{S_{bnd}}{b_{net}} \right) \qquad (4)
\]
where Nsub is the number of sub-time-steps. Since we do two boundary exchanges per time step, Nsub = 2 for the 1999 TRANSIMS micro-simulation implementation.
nnb is the number of neighbor domains each CPU talks to. All information which goes to the same CPU is collected and sent as a single message, thus incurring the latency only once per neighbor domain. For p = 1, nnb is zero since there is no other domain to communicate with. For p = 2, it is one. For larger p, and assuming that domains are always connected, Euler's theorem for planar graphs says that the average number of neighbors cannot become more than six. Based on a simple geometric argument, we use
\[
n_{nb}(p) = 2 \, \frac{(3\sqrt{p} - 1)(\sqrt{p} - 1)}{p} \;,
\]
which correctly has nnb(1) = 0 and approaches six for large p. Note that the METIS library for graph partitioning does not necessarily generate connected partitions, making this potentially more complicated.
Tlt is the latency (or start-up time) of each message. Values of Tlt between 0.5 and 2 milliseconds are typical for PVM on a LAN. Next come the terms that describe our two bandwidth effects: Nspl(p) is the number of split links in the whole simulation; accordingly, Nspl(p)/p is the number of split links per computational node. Sbnd is the size of the message per split link. bnd and bnet are the node and network bandwidths, as discussed above.
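Putting these terms together, a hedged Python sketch of Eq. (4) might look as follows; the function and parameter names are our own, and the neighbor-count formula is the reconstructed geometric estimate given above.

```python
from math import sqrt

def n_nb(p):
    """Average number of neighbor domains (reconstructed geometric estimate)."""
    return 2.0 * (3.0 * sqrt(p) - 1.0) * (sqrt(p) - 1.0) / p

def t_cmm(p, n_spl, s_bnd, t_lt, b_nd, b_net, n_sub=2):
    """Communication time per time step, following Eq. (4).

    n_spl -- split links in the whole simulation (grows with p)
    s_bnd -- message size per split link [bit]
    t_lt  -- latency per message [s]
    b_nd, b_net -- node and network bandwidth [bit/s]
    n_sub -- boundary exchanges per time step (2 for the 1999 implementation)
    """
    return n_sub * (n_nb(p) * t_lt
                    + n_spl / p * s_bnd / b_nd
                    + n_spl * s_bnd / b_net)
```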
In consequence, the combined time for one time step is
\[
T(p) = \frac{T_1}{p} \, \Big( 1 + f_{ovr}(p) + f_{dmn}(p) \Big) + N_{sub} \left( n_{nb}(p) \, T_{lt} + \frac{N_{spl}(p)}{p} \, \frac{S_{bnd}}{b_{nd}} + N_{spl}(p) \, \frac{S_{bnd}}{b_{net}} \right).
\]
According to what we have discussed above, for large p the number of neighbors approaches a constant and the number of split links in the simulation scales as √p. In consequence, for fovr and fdmn small enough, we have
\[
T(p) \sim \frac{T_1}{p} + T_{lt} + \frac{1}{\sqrt{p}} \, \frac{S_{bnd}}{b_{nd}} + \sqrt{p} \, \frac{S_{bnd}}{b_{net}} \;.
\]
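Combining the two sketches above (and keeping their assumed parameter names), the total per-step time is simply the sum of the computation and communication estimates:

```python
def t_step(p, t1, n_spl, s_bnd, t_lt, b_nd, b_net,
           f_ovr=0.0, f_dmn=0.0, n_sub=2):
    """Total time for one time step: computation (Eq. 3) plus communication (Eq. 4)."""
    return (t_cmp(p, t1, f_ovr, f_dmn)
            + t_cmm(p, n_spl, s_bnd, t_lt, b_nd, b_net, n_sub))
```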
The curves in Fig. 6 are results from this prediction for a switched 100 Mbit Ethernet LAN; dots and triangles show actual performance results [6]. The top graph shows the time for one time step, i.e. T(p), and the individual contributions to this value. One can clearly see that for more than 64 CPUs, the dominant time contribution comes from the latency. The bottom graph shows the real time ratio (RTR)
\[
RTR(p) = \frac{\Delta t}{T(p)} \;,
\]
i.e. how much faster than real time the simulation runs; Delta t is the duration of one simulation time step (1 sec in the TRANSIMS micro-simulation).
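Reusing the hypothetical t_step helper from the sketches above, the real time ratio can be evaluated for a few CPU counts. All parameter values below are illustrative assumptions (including the split-link scaling coefficient), not the values behind Fig. 6.

```python
from math import sqrt

DT = 1.0  # assumed duration of one simulation time step [s]

def rtr(p, t1, c_spl, s_bnd, t_lt, b_nd, b_net):
    """Real time ratio DT / T(p); split links assumed to grow as c_spl * sqrt(p)."""
    return DT / t_step(p, t1, c_spl * sqrt(p), s_bnd, t_lt, b_nd, b_net)

# Purely illustrative, assumed parameters:
for p in (4, 16, 64, 256):
    print(p, round(rtr(p, t1=4.0, c_spl=250, s_bnd=800,
                       t_lt=1e-3, b_nd=40e6, b_net=100e6), 1))
```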
Fig. 7 shows some predicted real time ratios for other computing architectures. For simplicity, we assume that all of them except for one special case explained below use the same 500MHz Pentium compute nodes. The difference is in the networks: We assume 10Mbit non-switched, 10Mbit switched, 1Gbit non-switched, and 1Gbit switched. The curves for 100Mbit are in between and were left out for clarity; values for switched 100Mbit Ethernet were already in Fig. 6. One clearly sees that for this problem and with today's computers, it is nearly impossible to reach any speed-up on a 10Mbit Ethernet, even when switched. Gbit Ethernet is somewhat more efficient than 100Mbit Ethernet for small numbers of CPUs, but for larger numbers of CPUs, switched Gbit Ethernet saturates at exactly the same computational speed as the switched 100Mbit Ethernet. This is due to the fact that we assume that latency remains the same - after all, there was no improvement in latency when moving from 10 to 100Mbit Ethernet. FDDI is supposedly even worse [2].
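The qualitative shape of such curves can be reproduced by sweeping the rtr sketch above over assumed network parameters; switched networks are approximated by an effectively unlimited network bandwidth, as argued earlier. The numbers below are placeholders, not the values used for Fig. 7.

```python
# Assumed placeholder networks: (label, node bandwidth, network bandwidth, latency).
networks = [
    ("10 Mbit switched",  10e6,  float("inf"), 1e-3),
    ("100 Mbit switched", 100e6, float("inf"), 1e-3),
    ("1 Gbit switched",   1e9,   float("inf"), 1e-3),
]
for label, b_nd, b_net, t_lt in networks:
    print(label, [round(rtr(p, t1=4.0, c_spl=250, s_bnd=800,
                            t_lt=t_lt, b_nd=b_nd, b_net=b_net), 1)
                  for p in (4, 16, 64, 256)])
```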
The thick line in Fig. 7 corresponds to the ASCI Blue Mountain parallel supercomputer at Los Alamos National Laboratory. On a per-CPU basis, this machine is slower than a 500 MHz Pentium. The higher bandwidth and in particular the lower latency make it possible to use higher numbers of CPUs efficiently, and in fact one should be able to reach a real time ratio of 128 according to this plot. By then, however, the granularity effect of the unequal domains (Fig. 3) would have set in, limiting the computational speed probably to about 100 times real time with 128 CPUs. We actually have some speed measurements on that machine for up to 96 CPUs, but with a considerably slower code from summer 1998. We omit those values from the plot in order to avoid confusion.
Fig. 7 (bottom) shows predictions for the higher fidelity Portland 200 000 links network with the same computer architectures. The assumption was that the time for one time step, i.e. T1 of Eq. (3), increases by a factor of eight due to the increased load. This has not been verified yet. However, the general message does not depend on the particular details: when problems become larger, larger numbers of CPUs become more efficient. Note that we again saturate, with the switched Ethernet architecture, at 80 times faster than real time, but this time we need about 64 CPUs with switched Gbit Ethernet in order to get 40 times faster than real time -- for the smaller Portland network of Fig. 6 with switched Gbit Ethernet, 8 of the same CPUs would suffice to reach the same real time ratio. In short and somewhat simplified: as long as we have enough CPUs, we can micro-simulate road networks of arbitrarily large size, with hundreds of thousands of links and more, 40 times faster than real time, even without supercomputer hardware. Based on our experience, we are confident that these predictions will be lower bounds on performance: in the past, we have always found ways to make the code more efficient.