Adaptation, Learning and Feedback

As is well known, there is a mutual dependence between strategy generation and mobility simulation. For example, congestion is the result of (the execution of) plans, but plans are based on (the anticipation of) congestion. The traditional approach to this kind of problem, both in transportation and in economics, has been the postulation of a Nash Equilibrium (NE). For route assignment, a NE is reached when no traveler can improve its travel time by selecting a different route.

Static approaches

In the traditional static assignment approach, under some circumstances it can be proven that there is only one solution (in terms of the link flows) to this problem (e.g. 13,52). In consequence, any computational procedure which finds that solution is a valid one.

An extension is the so-called Stochastic User Equilibrium (SUE; e.g. 13,52). Instead of selecting the fastest path, travelers select a path according to a probability

\begin{displaymath}
p_i \propto e^{-\beta \, T_i} \ , \qquad (1)
\end{displaymath}

where $T_i$ is the travel time of path $i$. Instead of $T_i$, some generalized cost $C_i$ can be used. $\beta$ can be seen as a tuning parameter: for $\beta \to \infty$, standard static assignment is obtained again; for $\beta = 0$, all paths are selected with equal probability. The theoretical justification for $\beta$ stems from the assumption of a certain variability of the travel times, which can come from many sources, including real variability of the travel times, perceived variability of the travel times, or variability of so-called unobserved attributes. In practice, $\beta$ is best obtained as part of a multinomial logit model estimation from stated or revealed preference data (e.g. 7).
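
As a minimal illustration of Eq. (1), and not part of any model specification in this paper, the following Python sketch computes path selection probabilities from hypothetical travel times; the function name, the travel times, and the values of $\beta$ are made up.

\begin{verbatim}
import math

def path_choice_probabilities(travel_times, beta):
    """Selection probabilities p_i proportional to exp(-beta * T_i), normalized."""
    weights = [math.exp(-beta * t) for t in travel_times]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical travel times (minutes) of three alternative paths.
print(path_choice_probabilities([20.0, 25.0, 40.0], beta=0.2))  # intermediate beta
print(path_choice_probabilities([20.0, 25.0, 40.0], beta=0.0))  # beta = 0: uniform choice
print(path_choice_probabilities([20.0, 25.0, 40.0], beta=5.0))  # large beta: fastest path dominates
\end{verbatim}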

Dynamic approaches

Static assignment does not possess any dynamics, that is, traffic is represented by time-independent streams. This precludes, for example, the representation of queue spill-back, or the representation of time-dependent traffic management strategies such as those used by ITS. As a result, newer work in this area, in particular when related to ITS, has used a dynamic representation of traffic.

It is possible to maintain the NE or SUE problem statements for the dynamic approaches, in the sense that for a given departure time, different paths have different (average) travel times, and the traveler either selects the fastest path (NE) or she/he selects according to Eq. (1) (SUE). The solution is, however, no longer unique (19), and therefore the solution process is no longer a purely mathematical exercise but now also has behavioral aspects. Two agent-based approaches, which allow a behavioral interpretation, are described next.

``Basic'' agent-based feedback

Both basic static assignment and static SUE define a state of the system, but no algorithm to get there. Since, as noted above, the solution is unique (in terms of the link flows), any algorithm which solves the problem is valid. Most, if not all, of the algorithms turn out to be iterative.

Similarly, a possible way to solve the consistency problem between mobility simulation and strategy generation is to use systematic relaxation (8,34,39). This is a cycle in which all agents select plans (e.g. route plans) and execute them in the traffic simulation; some agents then revise their plans, the plans are executed again, and so on, until some kind of stopping criterion is fulfilled. A possible version of this cycle is the following (a minimal code sketch is given after the list):

  1. The system begins with an initial set of plans, one per agent, based on some kind of initial guess.

  2. The mobility simulation is executed with the current set of plans.

  3. 10% of the population requests new plans from the strategy level, which bases the decision on performance information from the mobility simulation. The new plans then replace the old plans for the ``replanned'' agents in the set of current plans.

  4. This cycle (i.e. steps 2 through 3) is run many times, until some kind of stopping criterion is fulfilled.
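
The following Python sketch shows the structure of this relaxation cycle; it is a schematic, not the actual implementation. The callables initial_plan, run_mobility_sim, and replan stand in for the strategy generation and mobility simulation modules, and the iteration count and the 10% replanning fraction are illustrative.

\begin{verbatim}
import random

def systematic_relaxation(agents, initial_plan, run_mobility_sim, replan,
                          iterations=50, replan_fraction=0.1):
    """Iterate: execute plans, let a fraction of agents replan, repeat.

    initial_plan(agent)         -> initial plan                        (step 1)
    run_mobility_sim(plans)     -> performance information             (step 2)
    replan(agent, performance)  -> new plan based on last simulation   (step 3)
    """
    plans = {agent: initial_plan(agent) for agent in agents}
    for _ in range(iterations):                               # step 4: iterate
        performance = run_mobility_sim(plans)                 # step 2: execute plans
        n_replan = max(1, int(replan_fraction * len(agents)))
        for agent in random.sample(list(agents), n_replan):   # step 3: 10% replan
            plans[agent] = replan(agent, performance)
    return plans
\end{verbatim}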


The agent database

One problem with the basic approach is that it gives rise to oscillations from one iteration to the next: If, say, route A is faster in one iteration, then many travelers will switch to it, making it slower, and some other route faster. These oscillations can be suppressed by making the replanning fraction smaller with higher iteration numbers, but this is implausible with respect to real-world systems, where one would assume that the replanning fraction is given by individual behavioral rules, not by a system-wide requirement.

An alternative approach is to use Eq. (1) to choose between different plans. Where, however, do the different plans come from? In the following, we present an approach where additional plans are found by the system as it goes and are added to the repertoire. In addition, this exploration is done by each agent individually; this has particular advantages when agents are very diverse, as exemplified by a high-resolution network (each link is a possible starting and ending point) and quickly changing temporal dynamics. This approach is called the ``agent database''.

The agent database gives the agents a memory of their past plans, and the outcome (performance) of those plans. The agent database stores the performance of a plan as a numerical ``score'' associated with that plan. The score can be, for example, the travel time of the plan, or can be calculated from some utility or fitness function. The specific steps in the relaxation cycle are:

  1. The system begins with an initial set of plans, one per agent. In the case of the results presented in Sec. 5, a plan is simply a route, and the initial set of routes is generated based on free speed travel times, which represent a network with no congestion.

  2. For each agent, the new plan is stored in the agent database (48,47), which represents its memory of previously tried plans. Since the agents at this point have only one plan apiece, they automatically select this as their next plan to execute.

  3. The mobility simulation is executed with the set of selected plans.

  4. Each agent calculates the score of his/her plan based on the outcome of the simulation. In Sec. 5, ``performance'' will mean the total travel time of the entire trip, with lower travel times meaning better performance. This information is stored for all the agents in the agent database, associated with the plan that was used.

  5. A fraction of the population (10% at present) requests new plans from the strategy modules, which base them on information from the last mobility simulation. The new plans are then stored in the agent database and, being new, are mandatorily selected by the agents.

  6. Agents who did not request new plans choose a previously tried plan from the agent database by comparing the performance values of the different plans, without knowing anything else about the plans. Specifically, they use a multinomial logit model $ p_i \propto e^{\beta S_i} $ for the probability $p_i$ to select route $i$, where $S_i$ is the corresponding memorized score and $\beta$ is an empirical constant. This process is explained in detail by Raney and Nagel (48,47); a minimal code sketch is given after the list.

  7. This cycle (i.e. steps 3 through 6) is run many times, until some kind of stopping criterion is fulfilled.
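
A minimal Python sketch of the plan selection in steps 5 and 6, assuming that scores are stored per plan and that higher scores are better (e.g. utilities or negative travel times), consistent with $p_i \propto e^{\beta S_i}$; the class name and the value of $\beta$ are made up.

\begin{verbatim}
import math
import random

class AgentMemory:
    """Per-agent store of previously tried plans and their scores."""

    def __init__(self, beta=2.0):
        self.beta = beta
        self.entries = []          # list of [plan, score]; score is None until executed

    def add_plan(self, plan):
        self.entries.append([plan, None])

    def set_score(self, plan, score):
        for entry in self.entries:
            if entry[0] == plan:
                entry[1] = score

    def select_plan(self):
        # Step 5: a plan that has never been executed is selected mandatorily.
        for plan, score in self.entries:
            if score is None:
                return plan
        # Step 6: multinomial logit over the memorized scores, p_i ~ exp(beta * S_i).
        weights = [math.exp(self.beta * score) for _, score in self.entries]
        plans = [plan for plan, _ in self.entries]
        return random.choices(plans, weights=weights, k=1)[0]
\end{verbatim}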

An advantage of the agent database is that the system is considerably more robust than it is without one (48,47). Without the agent database, it is imperative that the strategy generation modules generate plans which are an improvement over the previous solution. This means, for example, that the router needs to generate different paths with probabilities that reflect actual use of those different paths. This places very high design requirements on the strategy generation modules, which will be very difficult to fulfill. Use of the agent database means that the strategy generation modules can be much more creative in the generation of new plans: plans which turn out to be bad will simply be evaluated once by the agent and then never be touched again.

The agent database brings our agent-based simulation closer to a classifier system as known from Complex Adaptive Systems (54). From here, alternative methods of agent learning, such as Machine Learning or Artificial Intelligence, can be explored.

Illustration of learning and feedback

Figure 2 shows an example of how the feedback mechanism works. In this scenario, many travelers travel from ``home'' to ``work'', as indicated in the figure. Travelers have nine different options, all with the same characteristics. Initially, all agents use the central route. Over time, they learn about the other options. Eventually, the system reaches a state that is similar to a Nash (or user) equilibrium: no matter which path the agents take, all have approximately the same travel time. One also notices in the figure that the situation is not fully symmetric between the different paths. This is due to the stochasticity both in the choice behavior and in the traffic simulation. More details can be found in Raney and Nagel (49).

Figure 2: Diagram of the testing network overlaid onto example snapshots of the same time-of-day (a) before and (b) after many iterations of route replanning. After replanning, the agents have spread onto the different available routes between home and work.
[(a) Before replanning]  [(b) After replanning]


Day-to-day vs. within-day replanning

The literature (e.g. 12) discusses the difference between day-to-day and within-day replanning. The difference is that in the former, the agents modify their plans only ``over night'' (i.e. between runs of the mobility simulation), while in the latter they modify their plans while the mobility simulation is running.

It is obvious that for the simulation of ITS systems, some version of within-day replanning is necessary, because otherwise travelers in the simulation simply do not react to ITS measures. There are essentially three ways to implement within-day replanning into a multi-agent system:

To make matters worse, the conceptual decision is interwoven with the computational implementation. A straightforward implementation treats within-day replanning as a subroutine of the mobility simulation. This, however, means that it has to be written in the same programming language. In addition, it means that it will use the same memory and CPU as the mobility simulation, which may not be possible for large scale scenarios. That is, in our view no robust technology currently exists for the implementation of within-day replanning into multi-agent (traffic) simulations that is at the same time modular and capable of handling large systems. Future plans relating to this issue are discussed in Sec. 7.


Toward a theory of multi-agent learning

Despite some efforts both in computer science (e.g. 57,23) and in traffic (11,8,12), a general theory of multi-agent learning seems to be missing. Some aspects of agent learning are covered by behavioral aspects, which are treated in other parts of this workshop. Here, we will concentrate on how multi-agent learning interacts with the computation. There are two sub-problems to solve: single-agent learning, and multi-agent co-evolution. We will treat them one by one.

Single-agent learning

Here, one should consider the problem of a single learning agent when confronted with a stochastic environment. In order to make this problem better defined, let us assume that this environment is stationary; that is, the random quantities do not ``drift''. An example, relevant to traffic, would be a traveler who is going through a typical weekday over and over again and needs to find a good or the best solution in terms of activities, mode choice, and route choice.

This problem can be formulated as a reinforcement learning problem (RLP). A typical formulation of the RLP is that there are states, $s$, and that for each state there are several possible actions $a$. The actions influence the state transition probabilities, i.e. $T(s,s',a)$. For each state-action pair there is a reward, $R(s,a)$. The goal of the agent is to find actions $\tilde a(s)$ such that the time-averaged reward flow is maximized. For traffic, a state could for example be ``being at intersection X at 8am'' or ``having worked for seven hours at 4pm''. A transition could be the next network link to take, or a transition to another activity (16).

A possible solution method for the RLP is Q-learning (e.g. 50). In Q-learning, the agent repeatedly samples all possible transitions, back-propagating future rewards along the transitions it takes. A detailed description is beyond the scope of this paper. Q-learning solves the RLP optimally only when the system is ergodic, that is, when any state can be reached from any other state via a sequence of transitions. In practice, the state space must also not be too large, because otherwise sampling the complete state space becomes impossible in plausible time, and Q-learning will find a solution that is locally but not necessarily globally optimal. It is this second problem that haunts us in traffic simulation: it will not be possible to explore, for each agent in the traffic simulation, the complete state space of all options. In this situation of limited search time, mathematical properties of the system, such as the said ergodicity or the fact that a Markovian system converges to a steady-state density, are no longer useful, and building a useful theory becomes considerably more difficult.
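
For illustration of the general idea, and not of any specific traffic implementation, the following sketch shows tabular Q-learning with epsilon-greedy exploration and a discounted reward; the environment interface (reset, actions, step) and all parameter values are assumptions.

\begin{verbatim}
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning.

    Assumed environment interface:
      env.reset()      -> initial state
      env.actions(s)   -> list of possible actions in state s
      env.step(s, a)   -> (next_state, reward, done)
    """
    Q = defaultdict(float)                                # Q[(state, action)], initially 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            actions = env.actions(s)
            if random.random() < epsilon:                 # explore
                a = random.choice(actions)
            else:                                         # exploit current estimate
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, reward, done = env.step(s, a)
            future = 0.0 if done else max(Q[(s_next, b)] for b in env.actions(s_next))
            # back-propagate the (discounted) future reward along the sampled transition
            Q[(s, a)] += alpha * (reward + gamma * future - Q[(s, a)])
            s = s_next
    return Q
\end{verbatim}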

Q-learning is a useful general purpose algorithm to illustrate the problem, but it fails when the search space becomes too large. In such cases, it is necessary to look for alternatives. For example, Q-learning for route generation is orders of magnitude slower than the Dijkstra algorithm (Sec. 3). In other words, when the structure of the problem is known, better algorithms than Q-learning can be used. All these algorithms fall under the category of best-reply algorithms, since they compute the agent's best reply to the last iteration.

Often, it is behaviorally not plausible that agents always select the best option. In such cases, discrete choice theory (7) assigns probabilities to all possible options, and the agent makes a random draw according to those probabilities. The main disadvantage of discrete choice theory is that the need to compute probabilities for all options makes it even more expensive computationally than best-reply algorithms.

Since optimal solutions cannot always be computed, there is a growing number of heuristics that find good but non-optimal solutions. These algorithms are often inspired by evolutionary biology and are therefore called evolutionary or bio-inspired algorithms; typical representatives are Genetic Algorithms (17,29) or Ant Colony Optimization. As long as such algorithms are used to model human behavior, the fact that they do not find optimal solutions is not a disadvantage. Nevertheless, there are open questions with respect to validation and calibration.

Finally, there are path-dependent algorithms, where what the agent does depends on what the agent knows (mental maps). For example, the agent can memorize which parts of the network it has already seen and prefer those parts in future decisions. Such algorithms are currently implemented in some places (35,2), but their effect on large scale transportation planning applications is entirely untested.
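
As a hedged sketch of the mental-map idea, and not of the implementations cited above, an agent could remember which links it has traversed and bias its route choice toward familiar links; all names and parameter values below are invented for illustration.

\begin{verbatim}
import math
import random

def familiarity_biased_choice(route_options, seen_links, beta=1.0, bonus=1.0):
    """Choose among routes, preferring those that use already-seen links.

    route_options: dict mapping route id -> list of link ids
    seen_links:    set of link ids the agent has traversed before
    """
    routes = list(route_options)
    scores = []
    for route in routes:
        links = route_options[route]
        familiarity = sum(1 for link in links if link in seen_links) / max(1, len(links))
        scores.append(bonus * familiarity)
    weights = [math.exp(beta * s) for s in scores]
    chosen = random.choices(routes, weights=weights, k=1)[0]
    seen_links.update(route_options[chosen])   # the chosen route extends the mental map
    return chosen
\end{verbatim}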

Note once more that all of these approaches can also be used for the strategy generation of a traffic management center. The only difference is that a TMC will put more emphasis on exact and good algorithms, and less on behavioral realism.


Co-evolution

In traffic, it is not true that a single agent is learning while all other agents have fixed strategies; rather, all agents learn simultaneously. This is typically termed co-evolution. It means that a learning agent is no longer faced with a stationary environment, but one which ``drifts'' because the other agents also learn.

A possible theory for co-evolutionary systems is developed in (31). That book essentially concentrates on biological systems. Different strategies are represented by different species. Encounters between species are rated by scores (``fitness''), and the replication rate of a species depends on these scores. Consider for example the so-called hawk-dove game, where encounters between two hawks result in severe negative scores for both, encounters between two doves result in zero scores, and encounters between a hawk and a dove result in a positive score for the hawk and a negative score for the dove. Clearly, a population of doves can be ``invaded'' by a hawk, while a population of only hawks cannot survive. A steady state can be reached with a small number of hawks (which ensures that hawks encounter each other only rarely) and a large number of doves.

In this theory, the learning system is just an example of a dynamical system. As is well known, a dynamical system can reach fixed-point attractors, periodic attractors, or chaotic attractors. As it turns out, for certain learning dynamics the fixed points are Nash Equilibria, meaning that one recovers some of the solutions of static game theory. The other attractors, however, have no counterpart in static game theory, which implies that simulations may display long-term behavior that static game theory cannot describe.
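
A minimal numerical sketch of the hawk-dove dynamics described above, using replicator dynamics; the score values are invented and only reproduce the qualitative ordering (hawk-hawk severely negative, dove-dove zero, hawk-dove positive for the hawk and negative for the dove).

\begin{verbatim}
def hawk_dove_replicator(x_hawk=0.5, steps=5000, dt=0.01):
    """Replicator dynamics for the hawk-dove encounter scores described in the text."""
    # Illustrative scores: hawk-hawk, hawk-dove, dove-hawk, dove-dove.
    s_hh, s_hd, s_dh, s_dd = -20.0, 2.0, -2.0, 0.0
    x = x_hawk                                    # fraction of hawks in the population
    for _ in range(steps):
        f_hawk = x * s_hh + (1 - x) * s_hd        # expected score of a hawk
        f_dove = x * s_dh + (1 - x) * s_dd        # expected score of a dove
        f_mean = x * f_hawk + (1 - x) * f_dove
        x += dt * x * (f_hawk - f_mean)           # replication proportional to relative score
    return x

print(hawk_dove_replicator())   # converges to about 0.1: few hawks, many doves
\end{verbatim}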

Although this shows a possible way toward a theory of multi-agent learning, the theory cannot be used directly. It assumes that there is a certain number of members of each species, and that their scores determine whether these numbers increase or decrease. Translated to traffic, this would mean that travelers with unsuccessful strategies get weeded out while those with successful strategies replicate. Although this may work as a metaphor, it is certainly not directly useful for real-world implementations.

An issue with co-evolution is that different agents can learn with different learning speeds. This becomes particularly critical once there are entities on different hierarchy levels of the design. For example, there could be travelers that adapt to measures of a traffic management center, and the traffic management center could react to the behavior of the travelers. Depending on ``who learns faster'', the results can be quite different (e.g. 42,59). The issue is related to ``sequential games'', ``subgame perfection'', and ``Stackelberg games'' in game theory.

As a somewhat extreme illustration, let us assume that there are two routes to a destination, and both are served by greedy road pricing agencies that attempt to maximize profit. If one agency raises prices, revenues will initially go up. But eventually more travelers will use the alternate route and (assuming there are no additional congestion effects) the agency may make less money than before. In consequence, the agency is driven back to the lower price. If, however, the agency evaluates the effect of the price change immediately after one day, it will conclude that the price increase was successful, and will stick with it.
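
A back-of-the-envelope sketch of this illustration, with a linear long-run demand response and gradual day-to-day diversion to the alternate route; all numbers (demand, sensitivity, adaptation rate) are invented and only serve to show that a one-day evaluation and a relaxed evaluation can point in opposite directions.

\begin{verbatim}
def toll_revenue_after_increase(price_old=1.0, price_new=1.5, days=50,
                                base_demand=1000.0, sensitivity=0.8,
                                adaptation_rate=0.2):
    """Daily revenue on a tolled route after a price increase, while travelers
    gradually divert to the alternate route (illustrative numbers only)."""
    long_run_demand = base_demand * (1.0 - sensitivity * (price_new - price_old))
    demand = base_demand
    revenues = []
    for _ in range(days):
        demand += adaptation_rate * (long_run_demand - demand)  # day-to-day adaptation
        revenues.append(price_new * demand)
    return revenues

revenues = toll_revenue_after_increase()
print("revenue before the increase:", 1.0 * 1000.0)
print("revenue one day after      :", revenues[0])    # higher: looks like a success
print("relaxed (long-run) revenue :", revenues[-1])   # lower: actually a loss
\end{verbatim}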

