
PARALLEL COMPUTING FOR TRANSPORTATION SIMULATIONS

Discussion of Parallel Computing with a Special View Toward Transportation Simulation

Ultimately, we are interested in the agent-based simulation of large scale transportation scenarios. A typical scenario would be the 24-hour (about $10^5$ seconds) simulation of a metropolitan area consisting of 10 million travelers. Typical computational speeds of micro-simulations with 1-second update steps are 100000 vehicles in real time (10,11,5). This results in a computation time of $10^5 \times 10^7 / 10^5 = 10^7$ seconds $\approx 100$ days. This number is just a rough estimate and subject to the following changes: increases in CPU speed will reduce the number; more realistic driving logic will increase the number; smaller time steps (13,12) will increase the number.

This means that the traffic simulation is too slow for practical or academic treatment of large scale problems. In addition, computer time is needed for activity generation, route generation, learning, etc. In consequence, it makes sense to explore parallel/distributed computing as an option.

The idea behind parallel computing is that a task can be achieved faster if it is divided into a set of subtasks, each of which is assigned to a different processor. A possible parallel computation environment is a cluster of standard Pentium computers, coupled via standard 100 Mbit Ethernet LAN (Local Area Network). In order to generate a parallel program, one must think about (i) how to partition the tasks into subtasks, and (ii) how to provide the data exchange between the subtasks. Since (i) depends on (ii), we start the discussion with (ii).

There are, in general, two main approaches to inter-processor communication. One is message passing between processors; the alternative is a shared address space, where variables are kept in a common pool and are globally available to all processors. Each paradigm has its own advantages and disadvantages.

In the shared-address space approach, all variables are globally accessible by all processors. Although the processors operate independently, they share the same memory resources. This makes it simpler for the user to achieve parallelism; but since the memory bandwidth is limited, severe bottlenecks become unavoidable as the number of processors increases, or alternatively such shared memory parallel computers become very expensive. For those reasons, our work concentrates on message passing.

In the message passing approach, there are independent cooperating processors. Each processor has a private local memory in which it keeps its variables and data, and can thus access local data very rapidly. If an exchange of information is needed between processors, they communicate and synchronize by passing messages, which are simple send and receive instructions. Message passing can be imagined as similar to sending a letter. The following phases happen during a message passing operation:

1. The message needs to be packed. Here, one tells the computer which data needs to be sent.

2. The message is sent away.

3. The message then may take some time on the network until it finally arrives in the receiver's inbox.

4. The receiver has to officially receive the message, i.e. take it out of the inbox.

5. The receiver has to unpack the message and tell the computer where to store the received data.

There are time delays associated with each of these phases. It is important to note that some of these time delays are incurred even for an empty message (``latency''), whereas others depend on the size of the message (``bandwidth restriction''). We will come back to this later.

The communication among the processors can be achieved by using a message passing library which provides the functions to send and receive data. There are several libraries, such as MPI (14) (Message Passing Interface) or PVM (15) (Parallel Virtual Machine), for this purpose. Both PVM and MPI are software packages/libraries that allow heterogeneous PCs interconnected by a computer network to exchange data. Both define an interface for different programming languages such as C/C++ or Fortran. For the purposes of parallel traffic simulation, the differences between PVM and MPI are negligible; we use MPI since it has a slightly stronger focus on computational performance. In principle, CORBA (Common Object Request Broker Architecture; www.corba.org) would be an alternative to MPI or PVM, in particular for task parallelization (see below). In practice, our own and other people's experience is that it is difficult to use and, because of its strict client-server paradigm, not well suited to our simulations, which assume that all tasks are on equal hierarchical levels.
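To make the five phases concrete, below is a minimal sketch of such an exchange using the MPI C interface; the comments map the calls to the phases listed above. The payload (positions of vehicles crossing a domain boundary) and all variable names are illustrative and not taken from any particular simulation code.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Hypothetical payload: positions of vehicles that cross
           a domain boundary in this time step. */
        double positions[3] = {12.5, 47.0, 88.25};

        if (rank == 0) {
            /* Phases 1+2: "pack" (buffer, count, type) and send. */
            MPI_Send(positions, 3, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Phases 4+5: receive the message and unpack it into local
               memory (recv_buf); phase 3, the network transit, happens
               between the two calls. */
            double recv_buf[3];
            MPI_Recv(recv_buf, 3, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %.2f %.2f %.2f\n",
                   recv_buf[0], recv_buf[1], recv_buf[2]);
        }

        MPI_Finalize();
        return 0;
    }

Such a program would be compiled with mpicc and started on two processors, e.g. with mpirun -np 2.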

Because of cost/benefit reasons, our work concentrates on clusters of coupled PCs. We expect this to be the dominant technology in the area for many years to come. Near the end of this paper, we will explore what performance gains can be expected when improving the communication hardware, for example via Myrinet. Two general strategies are possible for parallelization on such an architecture:

Task parallelization - The different modules of a transportation simulation package (traffic simulation, routing, activity generation, learning, pre-/postprocessing) are run on different computers. This approach is for example used by DynaMIT (16) or DYNASMART (17).

The advantage of this approach is that it is conceptually straightforward and fairly insensitive to network bottlenecks. The disadvantage is that the slowest module dominates the computing speed - for example, if the traffic simulation uses up most of the computing time, then task parallelization of the modules will not help.

Domain decomposition - In this approach, each module is distributed across several CPUs. In fact, for most of the modules, this is straightforward since in current practical implementations activity generation, route generation, and learning are done for each traveler separately. Only the traffic simulation has tight interaction between the travelers. This will be considered in the following.

For clusters of PCs, the most costly communication operation is the initiation of a message (``latency''). In consequence, one needs to minimize the number of CPUs that need to communicate with each other. This is achieved with a domain decomposition (see Fig. 2(c)) of the traffic network graph. As long as the domains remain compact, each CPU will on average have at most six neighbors (Euler's theorem for planar graphs). Since network graphs are irregular structures, one needs a method to deal with this irregularity. METIS (18) is a software package that specifically deals with decomposing graphs for parallel computation.

The quality of the graph decomposition has consequences for parallel efficiency (load balancing): if one CPU has a lot more work to do than all the others, then all other CPUs will need to wait for it, which is inefficient. For our current work with $\sim 100$ CPUs and networks with $\sim 20\,000$ links, the ``latency problem'' (see below) always dominates load balancing issues; however, it is generally useful to use the actual computational load per network entity for the graph decomposition (19).
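As an illustration, the sketch below partitions a toy graph with the METIS C API. It uses METIS_PartGraphKway as defined in METIS version 5; the work described here used an earlier METIS release, whose call signature differs. The 6-node ring is a stand-in for a real network graph, and the vwgt argument (here NULL, i.e. unit weights) is where measured computational loads per network entity, as in (19), would be supplied.

    #include <metis.h>
    #include <stdio.h>

    int main(void) {
        /* Toy network: 6 nodes in a ring, stored in CSR format. */
        idx_t nvtxs = 6, ncon = 1, nparts = 2, edgecut;
        idx_t xadj[]   = {0, 2, 4, 6, 8, 10, 12};
        idx_t adjncy[] = {1, 5, 0, 2, 1, 3, 2, 4, 3, 5, 4, 0};
        idx_t part[6];

        /* NULLs request unit vertex/edge weights; a load-balanced
           decomposition would pass computational loads via vwgt. */
        int status = METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                                         NULL /*vwgt*/, NULL /*vsize*/,
                                         NULL /*adjwgt*/, &nparts,
                                         NULL /*tpwgts*/, NULL /*ubvec*/,
                                         NULL /*options*/, &edgecut, part);
        if (status != METIS_OK) return 1;

        for (idx_t i = 0; i < nvtxs; i++)
            printf("node %d -> CPU %d\n", (int)i, (int)part[i]);
        return 0;
    }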

For shared memory machines, other forms of parallelization become possible, for example based on individual network links or individual travelers. One could have a dispatcher that distributes links for computation in a round-robin fashion to the CPUs of the shared memory machine (20); technically, one would use threads (21) for this (see the sketch below). This would be called fine-grained parallelism, as opposed to the coarse-grained parallelism that is more appropriate for message passing architectures. As noted above, the main drawback of this method is that one needs an expensive machine if one wants to use large numbers of CPUs.
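The sketch below illustrates this with POSIX threads; the static round-robin assignment (thread $t$ handles links $t$, $t+p$, $t+2p$, ...) is a simple stand-in for the dispatcher of (20), and the Link structure is purely illustrative.

    #include <pthread.h>

    #define NUM_LINKS   10000
    #define NUM_THREADS 4

    /* Illustrative per-link state; a real simulation would store
       vehicle queues, capacities, etc. */
    typedef struct { double load; } Link;
    static Link links[NUM_LINKS];

    static void *update_links(void *arg) {
        int t = (int)(long)arg;
        /* Round-robin: thread t updates links t, t+p, t+2p, ... */
        for (int i = t; i < NUM_LINKS; i += NUM_THREADS)
            links[i].load += 1.0;   /* stand-in for the link update */
        return NULL;
    }

    int main(void) {
        pthread_t threads[NUM_THREADS];
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&threads[t], NULL, update_links, (void *)t);
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(threads[t], NULL);
        return 0;
    }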

Important numbers for parallel implementations are real time ratio, speed-up, and efficiency:

Real time ratio (RTR) - describes how much faster than reality the simulation is. For example, an RTR of 100 means that 100 minutes of traffic are simulated in 1 minute of computing time. This number is important regardless of whether the simulation is parallel or not.

Speed-up - describes how much faster the parallel simulation is when compared to a simulation on a single CPU. For the single CPU algorithm one either uses the parallel algorithm running on a single CPU, or an algorithm specifically tailored to a single CPU system.

Efficiency - is obtained by dividing Speed-up by the number $p$ of CPUs that were used.
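In formulas (symbols introduced here for compactness; they are not from the original text): if $T_{sim}$ is the simulated time span and $T(p)$ the computing time on $p$ CPUs, then

$RTR = T_{sim} / T(p)$, $\quad S(p) = T(1) / T(p)$, $\quad E(p) = S(p) / p$.

For example, a run that computes $10^5$ seconds of traffic in $10^3$ seconds on 32 CPUs has an RTR of 100; if the corresponding single-CPU run takes $1.6 \times 10^4$ seconds, the speed-up is $S(32) = 16$ and the efficiency $E(32) = 0.5$.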

These numbers are all related, but they carry different meanings. As we will describe in more detail later, the main restriction that we are up against is the latency of the Ethernet communication hardware. Under the assumption of 1-second time-steps and 2 message exchanges per time step, this latency imposes a limit on the real time ratio of about 150. That is, no matter what kind of simulation model one uses, how good the load balancing is, or how many CPUs one adds, that number is a fixed quantity for this type of simulation and this type of parallel computer.
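As a back-of-the-envelope sketch of where such a bound comes from (the figures assumed here are illustrative; the careful derivation follows later in the paper): suppose each message initiation costs a latency of $t_{lat} \approx 0.5$ ms, each CPU has about 6 neighbors, and there are 2 message exchanges per time step. Each simulated second then costs at least $2 \times 6 \times 0.5\hbox{ ms} = 6\hbox{ ms}$ of pure latency per CPU, which caps the real time ratio at roughly $1\hbox{ s} / 6\hbox{ ms} \approx 170$, i.e. of the order of the 150 quoted above.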

Parallel Computing in Transportation Simulations

Parallel computing has been employed in several transportation simulation projects. One of the first was PARAMICS (22), which started as a high performance computing project on a Connection Machine CM-2. In order to fit the specific architecture of that machine, cars/travelers were not truly objects but particles with a limited amount of internal state information. PARAMICS was later ported to a CM-5, where it was simultaneously made more object-oriented. In ref. (23), a computational speed of 120000 vehicles at an RTR of 3 is reported on 32 CPUs of a Cray T3E.

At about the same time, our own work showed that on coupled workstation architectures it was possible to implement vehicles efficiently in an object-like fashion, and a parallel computing prototype with ``intelligent'' vehicles was written (11). This later resulted in the research code PAMINA (5), which was the technical basis for the parallel version of TRANSIMS (19). In our tests (with Ethernet only, on a network with 20000 links, and about 100000 vehicles simultaneously in the simulation), TRANSIMS ran about 10 times faster than real time with the default parameters, and about 65 times faster than real time after tuning. These numbers refer to 32 CPUs; adding more CPUs did not yield further improvement. The parallel concepts behind TRANSIMS are the same as behind the queue model from this paper, and in consequence TRANSIMS is up against the same latency problem as the queue model. However, for unknown reasons its computational speed is a factor of two smaller than predicted by latency alone.

Some other early implementations of parallel traffic simulations are (24,25). A parallel implementation of AIMSUN reports a speed-up of 3.5 on 8 CPUs using threads (which rely on the shared memory approach explained above) (26).

DYNEMO is a macroparticle model, similar to DYNASMART described below. A parallel version was implemented about two years ago (27). A speed-up of 15 on 19 CPUs connected by 100 Mbit Ethernet was reported on a traffic network of Berlin and Brandenburg with 13738 links. Larger numbers of CPUs were reported to be inefficient. The speed-up corresponded to a real time ratio of 6, implying that it was not the latency that caused the inefficiency, since the latency inefficiency should not set in before the RTR reaches 150 as explained earlier.

DynaMIT uses functional decomposition (task parallelization) as a parallelization concept (16). This means that different modules, such as the router, the traffic (supply) simulation, the demand estimation, etc., can be run in parallel, but the traffic (supply) simulation runs on a single CPU only. Functional decomposition is outside the scope of this paper. DYNASMART also reports the intention to implement functional decomposition (17).

