Taking advantage of both SMP and distributed computing: Parallelizing the automatic creation of robocup soccer programs on the CLUMPs
CS 267 -- Project Proposal
Distributed workstations (such as those in
the NOW) are appealingly cheap because they use off-the-shelf hardware, but
proponents of SMP shared-memory architectures argue that the ease of programming
and the high communication speeds of their systems more than justify
their expense. One partial compromise between these systems is to
use clusters of SMP machines, such as the Berkeley CLUMPs, whereby applications
can take advantage of both the high communication speeds and shared memory
of each SMP and the savings from using a comparatively low-speed
network to connect the SMPs. Despite the appeal of this
idea, few researchers appear to have found applications that truly
benefit from the hybrid approach.
This project aims to demonstrate such
an application.
Robocup, the Robot World Cup Initiative,
is a challenge to the artificial intelligence and intelligent robotics
community to create teams capable of playing soccer, in both the real-world
domain using robotic players, and in the virtual domain using a soccer
simulator. The simulator league is by no means a toy problem -- the
simulator is complex, modelling wind, rain, endurance, etc. accurately.
Each team consists of 11 separate soccer client programs that communicate
only through the server. Games run in real time, and tournament games
last 10 minutes. Given the complexity of the problem and the slowness of
the simulations, using machine learning methods to automatically
synthesize soccer-playing programs is a compelling challenge. Because
most machine learning methods require a great deal of experience (often
thousands to millions of games), a time scale of 10 minutes per game
is out of the question. The question, then, is how far
parallelization can reduce the length of each simulation.
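The scale of the problem is easy to quantify. The helpers below convert a number of real-time games into serial wall-clock days, and into the factor by which each game would have to be accelerated to fit a time budget. The two-day budget in the loop is only an illustration, and the function names are hypothetical.

```python
MINUTES_PER_DAY = 60 * 24

def serial_days(n_games, minutes_per_game=10):
    """Days of wall-clock time to play n_games one after another in real time."""
    return n_games * minutes_per_game / MINUTES_PER_DAY

def speedup_needed(n_games, budget_days, minutes_per_game=10):
    """Factor by which each game must be accelerated to fit n_games into budget_days."""
    return serial_days(n_games, minutes_per_game) / budget_days

# Illustrative budget of two days of wall-clock time.
for n in (1_000, 1_000_000):
    print(f"{n:>9} games: {serial_days(n):8.1f} days serially, "
          f"needs {speedup_needed(n, 2):8.1f}x to fit in 2 days")
```

At 10 minutes per game, 1,000 games already take about a week of real time, and a million games take roughly 19 years.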
Genetic programming is a powerful technique
for automatic program synthesis: a parallel search technique, modelled after natural
evolution, that automatically synthesizes computer programs to solve a given
problem. In recent years genetic programming has proven quite
effective, producing solutions to problems in a variety of fields, including machine learning, mechanical engineering,
analog circuit design, molecular biology, pattern recognition, theorem
proving, and cellular automata. Genetic programming
also seems able to solve harder problems when given more
computational resources. Numerous approaches to parallelizing genetic
programming have been tried, and most have been quite successful, often
showing linear or even super-linear speedup. (Look here
for more information on GP and parallel GP.) However, most successful
applications of genetic programming have used simulations or program
evaluations that take on the order of a second of compute time on a modern
processor. Even then, most applications of GP to difficult problems
have still required considerable computational effort (hours to days).
Thus, without improving the performance of the simulator, it seems hopeless
to attempt to learn successful soccer agents using genetic programming.
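For readers unfamiliar with GP, the sketch below shows the basic generational loop on a toy symbolic-regression task (target x*x + x) rather than soccer: programs are expression trees, parents are chosen by tournament selection, and children come from subtree crossover or subtree mutation. The primitive set, depth limit, and rates are illustrative assumptions. The fitness-evaluation line is the step whose cost dominates once each evaluation is a full soccer game.

```python
import operator
import random

# Toy primitive set (assumption): target function is x*x + x.
FUNCS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
TERMINALS = ["x", 1.0]
XS = [-2.0, -1.0, 0.0, 1.0, 2.0]
MAX_DEPTH = 6

def random_tree(rng, depth):
    """Grow a random program: a terminal, or a (func, left, right) tuple."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    name = rng.choice(list(FUNCS))
    return (name, random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def run_program(tree, x):
    if isinstance(tree, tuple):
        name, left, right = tree
        return FUNCS[name](run_program(left, x), run_program(right, x))
    return x if tree == "x" else tree

def fitness(tree):
    """Sum of absolute errors against the target; lower is better."""
    return sum(abs(run_program(tree, x) - (x * x + x)) for x in XS)

def node_paths(tree, path=()):
    """Yield the path (sequence of child indices) to every node."""
    yield path
    if isinstance(tree, tuple):
        for i in (1, 2):
            yield from node_paths(tree[i], path + (i,))

def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def replace_subtree(tree, path, new):
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace_subtree(tree[path[0]], path[1:], new)
    return tuple(parts)

def depth(tree):
    if isinstance(tree, tuple):
        return 1 + max(depth(tree[1]), depth(tree[2]))
    return 0

def crossover(rng, a, b):
    """Subtree crossover: graft a random subtree of b onto a random point of a."""
    donor = get_subtree(b, rng.choice(list(node_paths(b))))
    child = replace_subtree(a, rng.choice(list(node_paths(a))), donor)
    return child if depth(child) <= MAX_DEPTH else a  # reject oversized children

def mutate(rng, a):
    """Subtree mutation: replace a random node with a fresh random subtree."""
    child = replace_subtree(a, rng.choice(list(node_paths(a))), random_tree(rng, 2))
    return child if depth(child) <= MAX_DEPTH else a

def tournament(rng, pop, fits, k=3):
    return pop[min(rng.sample(range(len(pop)), k), key=lambda i: fits[i])]

def evolve(pop_size=200, generations=30, seed=1):
    rng = random.Random(seed)
    pop = [random_tree(rng, 4) for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        # Fitness evaluation: the step parallel GP farms out to processors.
        fits = [fitness(t) for t in pop]
        elite = min(range(pop_size), key=lambda i: fits[i])
        best_history.append(fits[elite])
        new_pop = [pop[elite]]  # elitism keeps the best program
        while len(new_pop) < pop_size:
            parent = tournament(rng, pop, fits)
            if rng.random() < 0.9:
                new_pop.append(crossover(rng, parent, tournament(rng, pop, fits)))
            else:
                new_pop.append(mutate(rng, parent))
        pop = new_pop
    return best_history
```

`evolve()` returns the best error in each generation; because the best program is copied unchanged into the next generation, that sequence never increases.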
To truly show off the benefits of a distributed SMP system, we need an application that can exploit both layers of parallelism: some parallelism on a tight time scale involving a relatively large amount of communication, and some on a slower time scale involving, presumably, a smaller amount of data. As we will see, learning Robocup players with parallel genetic programming should fit this bill. At the level of speeding up the Robocup simulator, we can use the SMP to parallelize both the execution of the player agents and the server itself; at the level of evaluating a population of teams, we can use the slower communication facilities of the distributed network.
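The two layers can be sketched as nested pools. In this hypothetical example, the coarse level deals teams out to nodes round-robin, and the fine level has each node run every client's per-cycle decision concurrently on one shared thread pool. Python threads stand in for both the SMP processors and the networked nodes, so the sketch shows the structure rather than real speedup; `play_game`, the toy score, and the team representation (a list of player functions) are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def play_game(team, pool, n_cycles=3):
    """Hypothetical game: each cycle, every client picks its action concurrently
    on the node's shared-memory thread pool (the fine-grained SMP level).
    The 'server' then sums the actions into a toy score."""
    score = 0
    for cycle in range(n_cycles):
        actions = list(pool.map(lambda player: player(cycle), team))
        score += sum(actions)
    return score

def evaluate_on_node(teams, threads):
    """One SMP node: plays its share of games, sharing one thread pool."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return [play_game(team, pool) for team in teams]

def evaluate_population(population, n_nodes, threads_per_node):
    """Coarse-grained level: deal teams out to nodes round-robin. Each node is
    simulated here by a thread; on a real cluster this would be a message to
    a remote SMP carrying only team programs and fitness values."""
    shares = [population[i::n_nodes] for i in range(n_nodes)]
    with ThreadPoolExecutor(max_workers=n_nodes) as nodes:
        results = list(nodes.map(
            lambda share: evaluate_on_node(share, threads_per_node), shares))
    fitness = [None] * len(population)
    for i, res in enumerate(results):
        fitness[i::n_nodes] = res  # put results back in population order
    return fitness
```

Only team descriptions and fitness values cross the coarse level, while the per-cycle traffic between clients and server stays inside one node, which matches the split of communication the paragraph describes.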
The Robocup soccer simulator (available
here) is a client-server system in which the server runs the simulation and there is a separate
client for each soccer player. In the standard system, communication
takes place through UDP/IP sockets, so the clients can be written
in any language that has a UDP/IP interface. The simulator
updates the positions of the players on the field based on their
commands and sends the players noisy perceptual information.
It also keeps track of the position of the ball and
the endurance of each player. The clients can perform arbitrary computation
in their code, but cannot communicate with one another
except through the server.
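A minimal sketch of the client side of this arrangement: commands are formatted as parenthesized text messages such as `(dash 100)` and sent as UDP datagrams. The "server" here is a hypothetical mock on the loopback interface that simply acknowledges each command; it does not implement the real soccer server's protocol, and the argument syntax and `(ok ...)` reply are assumptions.

```python
import socket

def make_command(name, *args):
    """Format a client command as a parenthesized message, e.g. (dash 100)."""
    return "(" + " ".join([name, *map(str, args)]) + ")"

def loopback_round_trip(message):
    """Send one command to a mock UDP server on localhost and return its reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.settimeout(2.0)
        client.settimeout(2.0)
        client.sendto(message.encode("ascii"), server.getsockname())
        data, client_addr = server.recvfrom(4096)
        # The mock server just acknowledges by wrapping the command it received.
        server.sendto(b"(ok " + data + b")", client_addr)
        reply, _ = client.recvfrom(4096)
        return reply.decode("ascii")
    finally:
        client.close()
        server.close()
```

Every command and every sensory message makes a trip through the kernel's network stack in this design, even when client and server share a machine.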
The server runs the simulation in discrete
timesteps (100 ms of real time in the standard version), and it
executes only one action per agent per timestep (with a few exceptions).
The server sends perceptual information (visual; auditory, i.e. shouts of teammates;
and proprioceptive, e.g. how tired the player is and how many commands it issued in the last timestep)
to the players on a different time scale (approximately every 300 ms of
real time). The timing of the agents is critical and complicates
this project. The clients can send motion commands
such as dash, turn, and kick, as well as shouts to other players (shout).
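The timing rules can be sketched as a mapping from timestamped client commands onto server cycles: with 100 ms cycles and perceptual updates roughly every 300 ms, updates fall on every third cycle. Which command wins when an agent sends several in one cycle is an assumption here (the first one), and the function and message names are hypothetical.

```python
STEP_MS = 100   # length of one simulation cycle in the standard server
SENSE_MS = 300  # approximate interval between perceptual updates

def schedule(commands, n_cycles):
    """Map timestamped client commands onto server cycles.

    commands: iterable of (agent, time_ms, command) tuples.
    Returns (executed, sense_cycles), where executed[cycle] maps each agent to
    the single command run for it in that cycle, and sense_cycles lists the
    cycles at which perceptual information would be sent.
    """
    executed = {cycle: {} for cycle in range(n_cycles)}
    for agent, time_ms, command in sorted(commands, key=lambda c: c[1]):
        cycle = time_ms // STEP_MS
        # Assumption: the first command an agent sends in a cycle wins;
        # any later ones in the same cycle are dropped.
        if 0 <= cycle < n_cycles and agent not in executed[cycle]:
            executed[cycle][agent] = command
    every = SENSE_MS // STEP_MS  # perceptual updates every 3rd cycle
    sense_cycles = list(range(0, n_cycles, every))
    return executed, sense_cycles
```

A client that acts on every cycle therefore has to choose two out of every three actions without fresh visual information, which is one reason agent timing matters.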
To speed up the system substantially, we
will probably need to use something other than UDP/IP communication (although
we may need to simulate the possibility of dropped packets). A large
portion of execution time is spent in the simulator, so we can
probably achieve further speedups by parallelizing the simulator itself.
The simulator is mostly particle-based, so many of the
particle-simulation techniques we have learned apply. We can also run the
clients in parallel, but we can probably achieve further speedups by managing
context switching ourselves rather than leaving it to the OS.
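Managing context switching ourselves can be sketched with Python generators: each client is a coroutine that a round-robin scheduler resumes exactly once per cycle, so switches happen only where we choose. The sketch also replaces UDP with direct in-process delivery and simulates dropped packets with a fixed drop probability. The client policy, message strings, and `drop_prob` parameter are hypothetical.

```python
import random

def toy_client():
    """Hypothetical client coroutine: resumed with one sense message per cycle
    (None if the packet was dropped), it replies with one command."""
    command = None
    while True:
        sense = yield command
        # Toy policy: dash when we saw the field, turn to look around otherwise.
        command = "(dash 100)" if sense is not None else "(turn 30)"

def run_cycles(n_clients, n_cycles, drop_prob, seed=0):
    """Round-robin scheduler: every client runs exactly once per cycle,
    with no OS thread switches. Returns the commands issued each cycle."""
    rng = random.Random(seed)
    clients = [toy_client() for _ in range(n_clients)]
    for c in clients:
        next(c)  # run each coroutine up to its first yield
    log = []
    for cycle in range(n_cycles):
        commands = []
        for c in clients:
            # In-process delivery replaces UDP; drops are simulated explicitly.
            sense = None if rng.random() < drop_prob else f"(see {cycle})"
            commands.append(c.send(sense))
        log.append(commands)
    return log
```

Because the scheduler, not the OS, decides when each client runs, every client is guaranteed exactly one turn per simulated cycle, and packet loss becomes a controllable parameter instead of an accident of network load.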
Although parallelizing genetic programming is fairly straightforward
(as discussed here), several assumptions
may limit its applicability. First, a breeding group (deme) is generally
held to need a certain minimum population size; demes of at least 1000 individuals are typically
desirable. These sizes allow the population within a deme to
maintain enough diversity for the search to remain efficient. Second,
runs are assumed to require approximately 50 generations and to
finish in under two days of real time (so approximately one generation
must be processed per hour). For a population of 1000, this means
that each individual can take at most 3.6 seconds to fully evaluate, which places
a harsh limit on the complexity of the fitness evaluation. Even
if we can get by with 300 individuals per deme, we still have
only about 12 seconds (3600 s / 300) to evaluate each individual.
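The budget arithmetic can be made explicit. The sketch below divides one hour per generation among the individuals of a deme and compares the result with a full 10-minute game, under the simplifying assumption (not stated in the proposal) that each individual is evaluated by exactly one game played serially.

```python
def seconds_per_individual(seconds_per_generation, population):
    """Serial wall-clock budget for fully evaluating one individual."""
    return seconds_per_generation / population

def required_speedup(game_seconds, budget_seconds):
    """How much faster than real time one game must run to fit the budget."""
    return game_seconds / budget_seconds

HOUR = 3600
GAME = 10 * 60  # one 10-minute tournament game, in seconds

for pop in (1000, 300):
    budget = seconds_per_individual(HOUR, pop)
    print(f"population {pop}: {budget:.1f} s per individual, "
          f"a full game must run {required_speedup(GAME, budget):.0f}x faster")
```

Under the one-game-per-individual assumption, a deme of 1000 needs games to run about 167 times faster than real time, and even a deme of 300 needs a 50-fold speedup.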
Thus, to use GP for
difficult problems, some form of parallelization at the fitness-evaluation
level is required. Although this has been explored to some extent in genetic programming
(our work on evolving FPGA configurations, for example), it is neither the
predominant method nor a solved problem.
Another possibility is to spread each deme over multiple processors
and perform the reproductive operations in parallel. This is clearly feasible,
although considerably more communication-intensive than assigning a single
deme to each processor.
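For comparison, the baseline of one deme per processor is usually organized as an island model: demes evolve independently and occasionally exchange a few migrants, so little data crosses the network. The sketch below uses a bit-string genetic algorithm on the OneMax problem as a stand-in for GP trees, with ring migration; the problem, migration topology, and all parameters are illustrative assumptions.

```python
import random

def onemax(bits):
    """Toy fitness (stand-in for a soccer evaluation): number of 1 bits."""
    return sum(bits)

def evolve_deme(rng, deme, generations):
    """Evolve one deme in isolation: binary tournaments plus bit-flip mutation."""
    for _ in range(generations):
        new = [max(deme, key=onemax)]  # elitism
        while len(new) < len(deme):
            a, b = rng.sample(deme, 2)
            parent = max(a, b, key=onemax)
            new.append([bit ^ (rng.random() < 0.05) for bit in parent])
        deme = new
    return deme

def island_model(n_demes=4, deme_size=30, length=40, epochs=5,
                 gens_per_epoch=5, migrants=2, seed=0):
    rng = random.Random(seed)
    demes = [[[rng.randint(0, 1) for _ in range(length)]
              for _ in range(deme_size)] for _ in range(n_demes)]
    best_per_epoch = []
    for _ in range(epochs):
        # Each deme could run on its own processor; only migrants cross the network.
        demes = [evolve_deme(rng, d, gens_per_epoch) for d in demes]
        emigrants = [sorted(d, key=onemax, reverse=True)[:migrants] for d in demes]
        for i, d in enumerate(demes):
            # Ring migration: the best of deme i-1 replace the worst of deme i.
            d.sort(key=onemax)
            d[:migrants] = [list(m) for m in emigrants[i - 1]]
        best_per_epoch.append(max(onemax(ind) for d in demes for ind in d))
    return best_per_epoch
```

Here each epoch moves only `migrants` individuals per deme between processors; splitting a single deme across processors would instead require exchanging individuals for every selection and crossover step.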
To evaluate the distributed SMP system, we will compare it, both in theory and experimentally, with systems that use only one of the two technologies (i.e., either SMP alone or distributed alone). We expect our hybrid system to achieve the fastest run times.
Reasons why distributed alone won't work
Reasons why SMP alone won't work