As is well known, there is a mutual dependence between strategy generation and mobility simulation. For example, congestion is the result of (the execution of) plans, but plans are based on (the anticipation of) congestion. The traditional approach to this kind of problem, both in transportation and in economics, has been the postulation of a Nash Equilibrium (NE). For route assignment, a NE is reached when no traveler can improve his or her travel time by unilaterally selecting a different route.
In the traditional static assignment approach, under some circumstances it can be proven that there is only one solution (in terms of the link flows) to this problem (e.g. 13,52). In consequence, any computational procedure which finds that solution is a valid one.
An extension is the so-called Stochastic User Equilibrium (SUE; e.g. 13,52). Instead of selecting the fastest path, travelers select a path $i$ according to a probability, typically of the logit form
$$p_i = \frac{e^{-\beta T_i}}{\sum_j e^{-\beta T_j}} \;, \qquad (1)$$
where $T_i$ is the (average) travel time of path $i$ and $\beta$ determines how strongly travelers favor faster paths.
Static assignment does not possess any dynamics, that is, traffic is represented by time-independent streams. This precludes, for example, the representation of queue spill-back, or the representation of time-dependent traffic management strategies such as those used by ITS. As a result, newer work in this area, in particular when related to ITS, has used a dynamic representation of traffic.
It is possible to maintain the NE or SUE problem statements for the dynamic approaches, in the sense that for a given departure time, different paths have different (average) travel times, and the traveler either selects the fastest path (NE) or selects according to Eq. (1) (SUE). The solution is, however, no longer unique (19), and therefore the solution process is no longer a purely mathematical exercise but now also has behavioral aspects. Two agent-based approaches, which allow a behavioral interpretation, are described next.
Both basic static assignment and static SUE define a state of the system, but no algorithm to get there. Since, as noted above, the solution is unique (in terms of the link flows), any algorithm which solves the problem is valid. Most, if not all, of the algorithms turn out to be iterative.
Similarly, a possible way to solve the consistency problem between mobility simulation and strategy generation is to use systematic relaxation (8,34,39). This is a cycle in which all agents select plans (e.g. route plans), execute them in the traffic simulation, then some agents revise their plans, and they are executed again, etc., until some kind of stopping criterion is fulfilled. A possible version of this is:
1. All agents select or compute initial plans (e.g. route plans).
2. The mobility simulation executes these plans.
3. A fraction of the agents revises its plans, based on the outcome (e.g. the travel times) of the last mobility simulation run.
4. Steps 2 and 3 are repeated until some stopping criterion is fulfilled.
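A minimal sketch of such a relaxation loop is given below. The interfaces generate_plan() and run_mobility_simulation(), as well as the 10% replanning fraction, are illustrative assumptions and not part of any particular implementation.

    import random

    def systematic_relaxation(agents, generate_plan, run_mobility_simulation,
                              n_iterations=50, replan_fraction=0.1):
        """One possible version of the relaxation cycle (sketch).

        generate_plan(agent, travel_times) is assumed to come from a
        strategy generation module; run_mobility_simulation(plans) is
        assumed to execute the plans and return (link) travel times."""
        # 1. All agents select initial plans (e.g. routes on the empty network).
        plans = {a: generate_plan(a, travel_times=None) for a in agents}

        for _ in range(n_iterations):
            # 2. The mobility simulation executes the current plans.
            travel_times = run_mobility_simulation(list(plans.values()))

            # 3. A fraction of the agents revises its plan, based on the
            #    travel times of the last run.
            n_replan = int(replan_fraction * len(agents))
            for a in random.sample(list(agents), n_replan):
                plans[a] = generate_plan(a, travel_times)

        # 4. Stop after a fixed number of iterations; one could instead test
        #    whether average travel times have stopped changing.
        return plans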
One problem with the basic approach is that it gives rise to oscillations from one iteration to the next: If, say, route A is faster in one iteration, then many travelers will switch to it, making it slower, and some other route faster. These oscillations can be suppressed by making the replanning fraction smaller with higher iteration numbers, but this is implausible with respect to real-world systems, where one would assume that the replanning fraction is given by individual behavioral rules, not by a system-wide requirement.
An alternative approach is to use Eq. (1) to choose between different plans. Where, however, do the different plans come from? In the following, we present an approach where additional plans are found by the system as it goes along and are added to the repertoire. In addition, this exploration is done by each agent individually; this has particular advantages when agents are very diverse, as exemplified by a high-resolution network (each link is a possible starting and ending point) and quickly changing temporal dynamics. This approach is called the ``agent database''.
The agent database gives the agents a memory of their past plans and the outcome (performance) of those plans. The agent database stores the performance of a plan as a numerical ``score'' associated with that plan. The score can be, for example, the travel time of the plan, or it can be calculated from some utility or fitness function. The specific steps in the relaxation cycle are:
1. The mobility simulation executes the plans that the agents have currently selected.
2. Each agent stores the resulting score of its executed plan in its database.
3. Some agents obtain new plans from the strategy generation modules and add them to their database; the other agents select one of the plans already in their database, for example according to Eq. (1) applied to the scores.
4. Steps 1 through 3 are repeated until some stopping criterion is fulfilled.
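As an illustration, a possible (much simplified) data structure for such an agent database is sketched below; the class and method names are our own assumptions. Each agent keeps its plans together with their scores, updates the score of the plan it has just executed, and either re-uses a stored plan or tries a newly generated one.

    import random

    class AgentDatabase:
        """Memory of an agent's past plans and their scores (sketch)."""

        def __init__(self):
            self.entries = []          # list of [plan, score] pairs

        def add_plan(self, plan, score=float("-inf")):
            # New plans start with a very low score until executed once.
            self.entries.append([plan, score])

        def update_score(self, executed_plan, score):
            # Store the performance (e.g. utility, or negative travel time)
            # of the plan just executed in the mobility simulation.
            for entry in self.entries:
                if entry[0] is executed_plan:
                    entry[1] = score

        def select_plan(self, new_plan=None, explore_prob=0.1):
            # Occasionally try a newly generated plan; otherwise re-use the
            # best-scoring stored plan.  (Instead of taking the best plan,
            # one could also draw among the stored plans according to
            # Eq. (1) applied to the scores.)
            if new_plan is not None and (not self.entries
                                         or random.random() < explore_prob):
                self.add_plan(new_plan)
                return new_plan
            return max(self.entries, key=lambda e: e[1])[0]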
An advantage of the agent database is that the system is considerably more robust than without it (48,47). Without the agent database, it is imperative that the strategy generation modules generate plans which are an improvement over the previous solution. This means, for example, that the router needs to generate different paths with probabilities that reflect the actual use of those different paths. This puts very high design requirements on the strategy generation modules, which will be very difficult to fulfill. Use of the agent database means that the strategy generation modules can be much more creative in the generation of new plans: plans which turn out to be bad will simply be evaluated once by the agent and then never be touched again.
The agent database brings our agent-based simulation closer to a classifier system as known from Complex Adaptive Systems (54). From here, alternative methods of agent learning, such as those from Machine Learning or Artificial Intelligence, can be explored.
Figure 2 shows an example of how the feedback mechanism works. In this scenario, many travelers travel from ``home'' to ``work'', as indicated in the figure. Travelers have nine different route options, all with the same characteristics. Initially, all agents use the central route. Over time, they learn about the other options. Eventually, the system reaches a state that is similar to a Nash (or user) equilibrium: no matter which path the agents take, all have approximately the same travel time. One also notices in the figure that the situation is not fully symmetric between the different paths. This is due to the stochasticity both in the choice behavior and in the traffic simulation. More details can be found in Raney and Nagel (49).
[Figure 2: panels ``Before replanning'' and ``After replanning''.]
The literature (e.g. 12) discusses the difference between day-to-day and within-day replanning. The difference is that in the former, the agents modify their plans only ``over night'' (i.e. between runs of the mobility simulation), while in the latter they modify their plans while the mobility simulation is running.
It is obvious that for the simulation of ITS systems, some version of within-day replanning is necessary, because otherwise travelers in the simulation simply do not react to ITS measures. There are essentially three ways to implement within-day replanning into a multi-agent system:
To make matters worse, the conceptual decision is interwoven with the computational implementation. A straightforward implementation realizes within-day replanning as a subroutine of the mobility simulation. This, however, means that it has to be written in the same programming language. In addition, it means that it will use the same memory and CPU as the mobility simulation, which may not be feasible for large-scale scenarios. That is, in our view at this point no robust technology exists for the implementation of within-day replanning into multi-agent (traffic) simulations that is at the same time modular and large-system capable. Future plans relating to this issue are discussed in Sec. 7.
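For illustration, the ``subroutine'' variant mentioned above could look roughly as follows; the time step, the replanning interval, and the agent/network interfaces are assumptions of this sketch.

    def run_mobility_simulation(agents, network, end_time, replan_interval=600):
        """Time-stepped mobility simulation with within-day replanning
        implemented as a subroutine (sketch)."""
        t = 0
        while t < end_time:
            network.move_vehicles(t)            # advance traffic by one step

            if t % replan_interval == 0:
                # Within-day replanning: agents may revise the remainder of
                # their plan based on current (not average) traffic conditions.
                for agent in agents:
                    if agent.wants_to_replan(t, network):
                        agent.revise_remaining_plan(t, network)

            t += 1
        # Because this runs inside the simulation loop, it shares the
        # simulation's language, CPU, and memory, which may not scale
        # to large scenarios.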
Despite some efforts both in computer science (e.g. 57,23) and in traffic (11,8,12), a general theory of multi-agent learning seems to be missing. Some aspects of agent learning are covered by behavioral aspects, which are treated in other parts of this workshop. Here, we will concentrate on how multi-agent learning interacts with the computation. There are two sub-problems to solve: single-agent learning, and multi-agent co-evolution. We will treat them one by one.
Here, one should consider the problem of a single learning agent when confronted with a stochastic environment. In order to make this problem better defined, let us assume that this environment is stationary; that is, the random quantities do not ``drift''. An example, relevant to traffic, would be a traveler who is going through a typical weekday over and over again and needs to find a good or the best solution in terms of activities, mode choice, and route choice.
This problem can be formulated as a reinforcement learning problem (RLP). A typical formulation of the RLP is that there are states, $s$, and that for each state there are several possible actions, $a$. The actions influence the state transition probabilities, i.e. $p(s' \mid s, a)$. For each state-action pair there is a reward, $r(s,a)$. The goal of the agent is to find actions such that the time-averaged reward flow is maximized. For traffic, a state could for example be ``being at intersection X at 8am'' or ``having worked for seven hours at 4pm''. A transition could be the next network link to take, or a transition to another activity (16).
A possible solution method for the RLP is Q-learning (e.g. 50). In Q-learning, the agent samples all possible transitions many times, while propagating future rewards backwards along those transitions. A detailed description is beyond the scope of this paper. Q-learning solves the RLP optimally only when the system is ergodic, that is, when any state can be reached from any other state via a sequence of transitions. In practice, the state space also must not be too large, because otherwise sampling the complete state space becomes impossible in plausible time, and Q-learning will find a solution that is locally but not necessarily globally optimal. It is this second problem that haunts us in traffic simulation: it will not be possible to explore, for each agent in the traffic simulation, the complete state space of all options. In this situation of limited search time, mathematical properties of the system such as the above ergodicity, or the fact that a Markovian system converges to a steady-state density, are no longer useful, and building a useful theory becomes considerably more difficult.
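For illustration, a minimal tabular Q-learning loop is sketched below. The environment interface (reset() and step()) is an assumption of the sketch, and discounted rewards are used instead of the time-averaged reward flow mentioned above, since that is the most common textbook formulation.

    import random
    from collections import defaultdict

    def q_learning(env, actions, n_episodes=1000, alpha=0.1, gamma=0.95, eps=0.1):
        """Tabular Q-learning (sketch). env.reset() returns a start state,
        env.step(s, a) returns (next_state, reward, done)."""
        Q = defaultdict(float)                      # Q[(state, action)]

        for _ in range(n_episodes):
            s, done = env.reset(), False
            while not done:
                # Epsilon-greedy: mostly exploit, sometimes explore.
                if random.random() < eps:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda x: Q[(s, x)])

                s_next, r, done = env.step(s, a)

                # Back-propagate the expected future reward along the transition.
                best_next = max(Q[(s_next, x)] for x in actions)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s_next
        return Q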
Q-learning is a useful general-purpose algorithm to illustrate the problem, but it fails when the search space becomes too large. In such cases, it is necessary to look for alternatives. For example, Q-learning for route generation is orders of magnitude slower than the Dijkstra algorithm (Sec. 3). In other words, when the structure of the problem is known, better algorithms than Q-learning can be used. All these algorithms fall into the category of best-reply algorithms, since they find the best reply of the agent to the last iteration.
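As an example of exploiting known problem structure, a best-reply route for one agent can be computed with a standard Dijkstra search on the link travel times of the last iteration. The graph representation below is an assumption of the sketch.

    import heapq

    def fastest_path(graph, origin, destination):
        """Dijkstra's algorithm on a graph given as
        {node: [(neighbor, link_travel_time), ...]} (sketch).
        Assumes the destination is reachable from the origin."""
        dist = {origin: 0.0}
        prev = {}
        queue = [(0.0, origin)]

        while queue:
            d, node = heapq.heappop(queue)
            if node == destination:
                break
            if d > dist.get(node, float("inf")):
                continue                      # outdated queue entry
            for neighbor, tt in graph.get(node, []):
                nd = d + tt
                if nd < dist.get(neighbor, float("inf")):
                    dist[neighbor] = nd
                    prev[neighbor] = node
                    heapq.heappush(queue, (nd, neighbor))

        # Reconstruct the path from destination back to origin.
        path, node = [destination], destination
        while node != origin:
            node = prev[node]
            path.append(node)
        return list(reversed(path)), dist[destination]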
Often, it is behaviorally not plausible that agents always select the best option. In such cases, discrete choice theory (7) assigns probabilities to all possible options, and the agent makes a random draw according to those probabilities. The main disadvantage of discrete choice theory is that the need to compute probabilities for all options makes it computationally even more demanding than best-reply algorithms.
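A minimal example of such a probabilistic choice, using a logit form analogous to Eq. (1) over the scores (e.g. negative travel times) of the available options, is sketched below.

    import math
    import random

    def logit_choice(options, scores, beta=1.0):
        """Draw one option with probability proportional to exp(beta * score),
        analogous to Eq. (1) (sketch)."""
        # Subtract the maximum score for numerical stability.
        m = max(scores)
        weights = [math.exp(beta * (s - m)) for s in scores]
        total = sum(weights)
        probabilities = [w / total for w in weights]

        # random.choices draws one element according to the given weights.
        return random.choices(options, weights=probabilities, k=1)[0]

    # Example: three routes with travel times of 10, 12, and 15 minutes.
    route = logit_choice(["A", "B", "C"], scores=[-10, -12, -15], beta=0.5)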
Since optimal solutions cannot always be computed, there is a growing number of heuristics that find good but non-optimal solutions. These algorithms are often inspired by evolutionary biology and are therefore called evolutionary or bio-inspired algorithms; typical representatives are Genetic Algorithms (17,29) or Ant Colony Optimization. As long as such algorithms are used to model human behavior, the fact that they do not find optimal solutions is not a disadvantage. Nevertheless, there are open questions with respect to validation and calibration.
Finally, there are path-dependent algorithms, where what the agent does depends on what the agent knows (mental maps). For example, the agent can memorize which parts of the network it has already seen and prefer those parts of the network in future decisions. Such algorithms are currently being implemented in some places (35,2), but their effect on large-scale transportation planning applications is entirely untested.
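One very simple way to encode such a mental map is sketched below: the agent remembers which links it has already traversed and, in a hypothetical generalized cost function for the path search, perceives unknown links as more costly than familiar ones (the penalty factor is an illustrative assumption).

    def perceived_cost(link_id, travel_time, known_links, unknown_penalty=1.5):
        """Generalized link cost for a path search biased towards familiar
        parts of the network (sketch)."""
        if link_id in known_links:
            return travel_time                   # familiar link: cost as measured
        return unknown_penalty * travel_time     # unfamiliar link: perceived as worse

    # After each executed plan, the agent extends its mental map, e.g.:
    # known_links.update(executed_route)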
Note once more that all of these approaches can also be used for the strategy generation of a traffic management center. The only difference is that a TMC will put more emphasis on exact and good algorithms, and less on behavioral realism.
In traffic, it is not true that a single agent is learning while all other agents have fixed strategies; rather, all agents learn simultaneously. This is typically termed co-evolution. It means that a learning agent is no longer faced with a stationary environment, but one which ``drifts'' because the other agents also learn.
A possible theory for co-evolutionary systems is developed in (31). That book essentially concentrates on biological systems. Different strategies are represented by different species. Encounters between species are rated by scores (``fitness''), and the replication rate of a species depends on these scores. Consider for example the so-called hawk-dove game, where encounters between two hawks result in severe negative scores for both, encounters between two doves result in zero scores, and encounters between a hawk and a dove result in a positive score for the hawk and a negative score for the dove. Clearly, a population of doves can be ``invaded'' by a hawk, while a population of only hawks cannot survive. A steady state can be reached with a small number of hawks (which ensures that hawks encounter each other only rarely) and a large number of doves. In this theory, the learning system is just an example of a dynamical system. As is well known, a dynamical system can reach fixed-point attractors, periodic attractors, or chaotic attractors. As it turns out, for certain learning dynamics the fixed points are Nash Equilibria, meaning that one recovers some of the solutions of static game theory. The other attractors, however, have no counterpart in static game theory, which implies that such simulations may display long-term behavior that static game theory cannot describe.
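The following sketch illustrates this kind of dynamics using replicator dynamics for a hawk-dove game; the payoff numbers are purely illustrative and chosen so that the stable mixture contains few hawks and many doves.

    def replicator_hawk_dove(p=0.5, steps=2000, dt=0.01,
                             hh=-10.0, hd=1.0, dh=-1.0, dd=0.0):
        """Replicator dynamics for a hawk-dove game (sketch).
        p is the share of hawks; payoffs are illustrative:
        hawk vs hawk = hh (severe loss), hawk vs dove = hd (gain),
        dove vs hawk = dh (loss), dove vs dove = dd (nothing)."""
        for _ in range(steps):
            f_hawk = p * hh + (1 - p) * hd        # expected score of a hawk
            f_dove = p * dh + (1 - p) * dd        # expected score of a dove
            # Strategies with above-average scores increase their share.
            p += dt * p * (1 - p) * (f_hawk - f_dove)
        return p

    # With these payoffs the hawk share relaxes to about 0.1: a small number
    # of hawks coexisting with a large number of doves.
    print(replicator_hawk_dove())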
Although this points to a possible direction for a theory of multi-agent learning, the theory cannot be used directly. It assumes that there is a number of members of each species, and that their scores determine whether these numbers increase or decrease. Translated to traffic, this would mean that travelers with unsuccessful strategies get weeded out while travelers with successful strategies replicate. Although this may work as a metaphor, it is certainly not directly useful for real-world implementations.
An issue with co-evolution is that different agents can learn with different learning speeds. This becomes particularly critical once there are entities on different hierarchy levels of the design. For example, there could be travelers that adapt to measures of a traffic management center, and the traffic management center could react to the behavior of the travelers. Depending on ``who learns faster'', the results can be quite different (e.g. 42,59). The issue is related to ``sequential games'', ``subgame perfection'', and ``Stackelberg games'' in game theory.
As a somewhat extreme illustration, let us assume that there are two routes to a destination, and both are served by greedy road pricing agencies that attempt to maximize profit. If one agency raises prices, revenues will initially go up. But eventually more travelers will use the alternate route and (assuming there are no additional congestion effects) the agency may make less money than before. In consequence, the agency is driven back to the lower price. If, however, the agency evaluates the effect of the price change immediately after one day, it will conclude that the price increase was successful, and will stick with it.
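As a toy calculation of this effect (all numbers are made up for illustration): if the agency evaluates revenues one day after a price increase, before demand has adapted, the increase looks profitable; after travelers have switched to the alternative route, it does not.

    def revenue(price, travelers):
        return price * travelers

    # Hypothetical numbers: 1000 travelers on the priced route at price 1.0.
    old_price, new_price = 1.0, 1.5
    travelers_before = 1000

    # Day 1 after the price increase: demand has not yet adapted.
    day1_revenue = revenue(new_price, travelers_before)        # 1500 > 1000

    # After travelers have learned about the alternative, only 600 remain.
    travelers_after = 600
    long_run_revenue = revenue(new_price, travelers_after)     # 900 < 1000

    print(revenue(old_price, travelers_before), day1_revenue, long_run_revenue)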