Next: Preliminary results Up: A pedestrian simulation for Previous: Mental Layer and Learning



Communication between the modules

Introduction

Traditional implementations of transportation planning software, even when microscopic, are monolithic software packages (e.g. PAMINA (Rickert, 1998), EMME/2 (Babin et al., 1982), VISSIM (VISSIM www page, accessed 2003)). By this we do not dispute that these packages may use advanced modular software engineering techniques; we are rather referring to the user view, which is that one has to start one executable on one CPU and then all functionality is available from there. The disadvantage of that approach is twofold:

A first step to overcome these problems is to make all modules completely stand-alone, and to couple them via files. Such an approach is for example used by TRANSIMS (TRANSIMS www page, accessed 2003). The two disadvantages of that approach are:

The approach taken in this project is to couple the modules by messages rather than via files. In this way, each module can run on a different computer using its own CPU and memory resources. It is even possible (as it is with file-based interfaces) to make the modules themselves distributed; for example, we have mobility simulations (for traffic) which run on parallel Myrinet-equipped clusters of workstations.

It is clear that this now allows real-time interaction between the modules: for example, if an agent is blocked in congestion, the mental layer modules can react to this new situation and submit new routes or activities during a simulation run.

As our system is made up of many different module types and is designed to scale up to very large simulations, it is important to carefully consider how the modules communicate with each other. In simulations with tens of millions of agents, issues such as bandwidth usage, packet loss, and latency become increasingly important. As a result, we use different network protocols and implementations tailored to the specific requirements of inter-module communication.

Our general design goals are to:

Events

On their way from the starting location (e.g. a hotel) to their individual destination (e.g. a restaurant, the peak of a mountain), pedestrians encounter different situations which they might enjoy more or less. As mentioned earlier, these perceptions are sent, as ``events'', to the decision-making modules, which record them and are thus able to decide how much a pedestrian enjoyed his trip. Typical events are spatial (``How many mountains can I see from here?'') or computed directly by the simulation (``How many agents are near me?'').

The mental modules listen to all those events. Depending on their different functionalities, they extract different types of information from them, as described in Sec. 4.1.

It is important to note that the task of the simulation of the physical system is simply to send out events about what happens; all interpretation is left to the mental modules. In contrast to most other simulations in the area of mobility research, the simulation itself does not perform any kind of data aggregation. For example, link travel times are not aggregated into time bins, but instead link entry and link exit events are communicated every time they happen. If some external module, e.g. the router, wants to construct aggregated link travel times from this information, it is up to that module to perform the necessary aggregation. Other modules, however, may need different information, for example specific progress reports for individual agents, which they can extract from the same stream of events. This would no longer be possible if the simulation had aggregated the link entry/exit information into link travel times.
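As a sketch of such module-side aggregation (written in Python for illustration; the paper does not prescribe a language), a router could fold the raw entry/exit event stream into per-link, time-binned mean travel times. The event tuple layout and the 300-second bin size are assumptions made here, not part of the original system:

```python
from collections import defaultdict

BIN_SIZE = 300  # seconds per time bin; an assumed aggregation granularity


def aggregate_travel_times(events):
    """Fold raw link entry/exit events into mean travel times per (link, bin).

    `events` is an iterable of (time, agent_id, link_id, kind) tuples, where
    kind is 'enter' or 'leave' -- a hypothetical encoding of the event stream.
    """
    entry_time = {}            # (agent, link) -> time the agent entered the link
    sums = defaultdict(float)  # (link, bin)   -> summed travel times
    counts = defaultdict(int)  # (link, bin)   -> number of completed traversals
    for time, agent, link, kind in events:
        if kind == 'enter':
            entry_time[(agent, link)] = time
        elif kind == 'leave' and (agent, link) in entry_time:
            t0 = entry_time.pop((agent, link))
            b = int(t0 // BIN_SIZE)          # bin by entry time
            sums[(link, b)] += time - t0
            counts[(link, b)] += 1
    return {key: sums[key] / counts[key] for key in counts}
```

A module needing individual progress reports would instead keep the per-agent events as they are, which is exactly why the simulation leaves aggregation to the consumers.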

Despite this relatively clean separation - the mobility simulation computes ``events'', all interpretation is left to mental modules - there are conceptual and computational limits to this approach. For example, reporting everything that an agent sees in every time step would be computationally too slow to be useful. In consequence, some filtering has to take place ``at the source'' (i.e. in the simulation), which corresponds to a kind of preprocessing similar to what real people's brains do. This is once more related to human intelligence, which is not well understood. Once more, however, it is possible to make pragmatic progress. For example, one can report only a random fraction of the objects that the agent ``sees'', or delegate the analysis of the views to a separate module (Sec. 4.2). Calibration and validation of these approaches will be interesting future projects.

Figure 8: Agents hiking from a hotel to the top of a mountain. They report what they see during the hike. These events are rendered in different colors here; green, for example, means that the agents enjoyed a forest.
\includegraphics[width=0.80\hsize]{events.AB-gz.eps}

Possible Protocols

This section discusses different possible implementation techniques for the messages. For some of the more promising technologies, the technological limitations are discussed and their performance is measured in practice with our particular application in mind.

Existing Message Passing Tools

When a single module is distributed across multiple computational nodes, one often uses MPI (MPI www page, accessed 2003) or PVM (www.epm.ornl.gov/pvm/pvm_home.html, accessed 2003). For example, our traffic simulation module (not described in this paper) is distributed to 64 hosts or more. It is also possible to use MPI for the communication between the modules, as described in this paper. That approach has, however, the disadvantage that one is bound to the relatively inflexible options that MPI offers. For example, options to add or remove modules have only recently been added to the MPI standard, and multicast (see below) is not possible at all.

Reliable Data Streams

TCP is a connection-based, reliable IP protocol. Initially, a connection from the sender to the receiver must be opened. Once established, the connection is symmetric, so both sides can send messages over it. TCP guarantees that the messages arrive in the correct order and without errors. The main disadvantage of TCP is that a connection must be opened for every pair of hosts. If one side closes the connection (due to a program crash or a system reboot), the other side has to reopen it. This requires close attention from the programmer to handle all such cases gracefully in a large system.
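A minimal sketch of such a reliable stream between two modules, in Python for illustration (the message content and the echo behavior are invented here): one side accepts a connection and echoes messages back; the other connects, sends an event, and reads the ordered, error-checked reply.

```python
import socket
import threading

HOST, PORT = "127.0.0.1", 0  # port 0: let the OS pick a free port


def serve_once(server_sock):
    """Accept one connection and echo every message back (a toy stand-in
    for a module on the other end of the stream)."""
    conn, _ = server_sock.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:       # empty read: the peer closed the connection
                break
            conn.sendall(data)


server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind((HOST, PORT))
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=serve_once, args=(server,), daemon=True).start()

# The other module opens the (symmetric) connection, sends, and receives.
with socket.create_connection((HOST, port)) as client:
    client.sendall(b"event: agent 42 entered link 7")
    reply = client.recv(4096)
```

Note that the programmer-facing burden the text mentions lives exactly in `serve_once`: every disconnect and reconnect case must be handled explicitly for each host pair.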

Unreliable Data Packets

UDP offers, in comparison to TCP, no protection against packet loss. UDP can be used to transmit single packets, but there is no guarantee that a sent packet will arrive. Compared to TCP's overhead for arrival checking, this is not always a disadvantage. A packet that does arrive, however, is guaranteed to be error free, since the Ethernet layer includes a checksum.
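The fire-and-forget nature of UDP can be sketched in a few lines of Python (the event payload is invented for illustration): no connection is opened, the sender simply emits a datagram, and the receiver picks it up if it was not lost along the way.

```python
import socket

# Receiver: bind a UDP socket; port 0 lets the OS pick a free port.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.settimeout(5.0)  # do not block forever if the datagram is lost
port = rx.getsockname()[1]

# Sender: no connection, no delivery guarantee -- just send the datagram.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"pos: agent 7 at (12.3, 45.6)", ("127.0.0.1", port))

# One event easily fits into a single datagram.
packet, addr = rx.recvfrom(1500)
```

On the loopback interface the packet arrives essentially always; across a busy network, as discussed below, it may silently be dropped.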

The amount of packet loss depends strongly on the overall number of packets in the network. In state-of-the-art networks, which today often use 1 Gbit Ethernet, there is hardly any packet loss in the network itself. Losses occur mainly in the sending and receiving network interface cards, due to overflowing buffers. This happens, for example, when the CPU is busy and cannot read the packets from the buffer in time. The more packets are sent, the higher the chance that one is lost.

With Gbit Ethernet communication, up to 160'000 packets can be sent per second without any losses (Fig. 9 right). This means that we are able to distribute that many events per second from the simulation to other modules. Since the mobility simulation runs more than 100 times faster than wallclock time, this results in 1'600 events per simulated second. This is not that much, however, considering that 500 or more agents in the simulation report their perceptions.

On a cluster with 100 Mbit Ethernet communication, the number drops to 100'000 packets per second, resulting in 1'000 events per simulated second. It is at this point unclear why the difference in bandwidth is not a factor of 10, as the difference between 100 Mbit and 1 Gbit would suggest.

There are situations in which there is no need to retransmit lost packets. For example, if an agent reports that he is blocked in unexpected congestion (e.g. waiting for a cable car), he needs a new route instantly. If his request is lost or delayed, it makes no sense for the system to buffer the request, since the agent has moved on, and the location in the original request may no longer be valid. A new route computed from the old information would be invalid as well. It is the agent's responsibility to restate his position if he does not receive a new route after a certain time has elapsed (Gloor, 2001).

It is easier to deal with error conditions using UDP. We use UDP to transmit the agent positions from the recorder/player module to the visualizers. If the network is down for a few seconds, the simulation does not need to slow down because of lost packets. Once the viewer is back on line, it will receive the latest positions.

Figure 9: UDP packets sent at different rates using 100 Mbit/s (left) and 1 Gbit/s (right) networks, plotted as sent vs. received messages. Some packets are lost in the network, but most of them are lost in the overflowing buffers of the NICs. Each UDP packet can carry one event from the mobility simulation, e.g. a new position of an agent.
\includegraphics[angle=270,width=0.5\hsize]{lossrate.xib-gz.eps} \includegraphics[angle=270,width=0.5\hsize]{lossrate.cop-gz.eps}

Multicasting

Often there is a need to send the same packet to more than one receiver. This can be achieved by opening multiple TCP connections or, more easily, by sending multiple UDP packets to the receivers. However, in large simulations the network interface card (NIC) of the sending host quickly becomes the bottleneck, as it cannot send out packets fast enough to supply all receivers.

Multicasting makes it possible to send a packet to many hosts on the local network at once. Multicasting is a relatively recent addition to the network standards; it is particularly useful for streaming data such as radio or television broadcasts over the network. Its advantage is that the duplication of the packets for multiple receivers is not done by the sender's NIC but by the network itself. This avoids the NIC bottleneck.

A drawback of multicasting is that, like UDP, it has no arrival control. That is, there is no feedback to the sender on whether all packets arrived at all destinations, or even at any destination at all. Consequently, multicasting is not useful when message arrival must be guaranteed.

To cope with this problem, it is often possible to implement some form of flow control in the application. This may seem a hard task and might introduce performance issues, but often there is no need for the full flow control that TCP provides, and a lightweight solution can increase performance substantially. The task is simplified by the fact that in most local networks packet loss is almost zero as long as the network is not saturated.
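One such lightweight scheme (sketched here in Python; not the scheme actually used in our system) is to tag each packet with a sequence number and let the receiver detect gaps, requesting retransmission only for the missing packets instead of acknowledging every one:

```python
def detect_gaps(received_seqnos):
    """Given the sequence numbers of the packets that arrived (in any order),
    return the sequence numbers that are missing.

    A receiver could periodically run this and ask the sender to retransmit
    only the gaps -- far cheaper than TCP's per-packet acknowledgements.
    """
    seen = set(received_seqnos)
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]
```

Since local-network loss is nearly zero below saturation, the gap list is almost always empty, and the scheme costs next to nothing in the common case.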

Multicasting provides groups of hosts that are addressed using special IP addresses (224.0.0.0 and above). The sender chooses one of these groups and sends a single packet to its IP address. A receiver must first explicitly join a group, telling its NIC and the operating system to listen for packets sent to that group.

An advantage of this addressing scheme is that the sender does not need to know the IP addresses of the receivers. This simplifies the configuration of the system substantially. The network routers ensure that the packets find their way from the sender to the receivers once the receivers have joined the multicast group.
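The join-then-receive mechanics can be sketched in Python (the group address, port, and payload are arbitrary choices for this example, and the loopback interface is used so the sketch is self-contained; a real deployment would use the cluster network instead):

```python
import socket

GROUP, PORT = "224.1.2.3", 49152  # assumed multicast group and port

# Receiver: bind the port, then explicitly join the group on an interface.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
rx.bind(("", PORT))
membership = socket.inet_aton(GROUP) + socket.inet_aton("127.0.0.1")
rx.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
rx.settimeout(5.0)

# Sender: a single sendto() to the group address reaches every joined host;
# it never needs to know any receiver's own IP address.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
              socket.inet_aton("127.0.0.1"))
tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
tx.sendto(b"pos: agent 7 at (12.3, 45.6)", (GROUP, PORT))

packet, _ = rx.recvfrom(1500)
```

Each additional receiver only performs the join; the sender's code and its bandwidth use stay unchanged, which is the property exploited for the viewers below.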

Sending agent data to the viewers is an instance where multicasting is extremely effective. With our current pedestrian datasets, where 500 agents are simulated, each viewer requires 1.5 Mbit/s of bandwidth. By using multicasting, this bandwidth can effectively be shared between viewers, especially when they view approximately the same location. As we scale our datasets to realistic sizes (thousands of pedestrians), this bandwidth saving will become increasingly important.

The project presented here is a collaboration between two institutes at ETH Zürich. One of them is located more than 5 km away from where our computational cluster is. For every viewer that is connected to the simulation, extra bandwidth is needed. Using multicast, we cannot reduce the bandwidth used by a single viewer. But as soon as multiple viewers look at the same general area, the total bandwidth remains almost constant.

File Based Communication

When integrating existing modules from other simulation systems, it is often not possible or desirable to make modifications to their code. Typically, these systems use input and output files to pass data between modules.

Since we do not want to change our implementation for every ``foreign'' module, we use a wrapper around such modules. This wrapper reads the input from our modules as messages over the network, converts it into files understandable by the foreign module, and executes the foreign module. The corresponding output file is then read by the wrapper and reintegrated into our network based system.

Another use of such a wrapper is as a substitute for modules that have not yet been implemented. As we have not completed the activities generator module yet, we use a simple file, which contains ``handmade'' activity chains. A wrapper reads this file and sends the chains to the activity database module.


2003-12-20