Next: Distributed computing and truly Up: Parallel computing Previous: Other modules Contents

Summary

This paper explains the parallel implementation of the Transims micro-simulation. Since other modules are computationally less demanding and also simpler to parallelize, the parallel implementation of the micro-simulation is the most important and most complicated piece of parallelization work. The parallelization method for the Transims micro-simulation is domain decomposition, that is, the network graph is cut into as many domains as there are CPUs, and each CPU simulates the traffic on its domain. We cut the network graph in the middle of the links rather than at nodes (intersections), in order to separate the traffic dynamics complexity at intersections from the complexity of the parallel implementation. We explain how the cellular automata (CA) or any technique with a similar time depencency scheduling helps to design such split links, and how the message exchange in Transims works.

The network graph needs to be partitioned into domains in a way that the time for message exchange is minimized. Transims uses the METIS library for this goal. Based on partitionings of two different networks of Portland (Oregon), we calculate the number of CPUs where this approach would become inefficient just due to this criterion. For a network with 200000 links, we find that due to this criterion alone, up to 1024 CPUs would be efficient. We also explain how the Transims micro-simulation adapts the partitions from one run to the next during feedback iterations (adaptive load balancing).

We finally demonstrate how computing time for the Transims micro-simulation (and therefore for all of Transims) can be systematically predicted. An important result is that the Portland 20024 links network runs about 40 times faster than real time on 16 dual 500 MHz Pentium computers connected via switched 100 Mbit Ethernet. These are regular desktop/LAN technologies. When using the next generation of communications technology, i.e. Gbit Ethernet, we predict the same computing speed for a much larger network of 200000 links with 64 CPUs.

[[in particular look at Nurhan's ``mistakes'' and clarify this stuff. In particular, clarify where on a beowulf the time is spent, and how this degrades performance if one uses a master etc.]]

[[maybe also explain quantify.]]

[[multi-crit metis??]]

[[threads?? at least mention and discuss.]]

Next: Distributed computing and truly Up: Parallel computing Previous: Other modules Contents

2004-02-02