Multi-level parallelism in automatically synthesizing soccer-playing programs for Robocup using genetic programming
David Andre
CS 267 – Course Project
dandre@cs.berkeley.edu
Spring 1998
Introduction
Genetic Programming
Robocup
The implementation (stage 1)
The implementation (stage 2)
Empirical Results (single SMP case)
The multi-clump disk-based system
Empirical Results (multi-SMP case)
Conclusions and Future Work
Bibliography
Abstract:
Many of the various proposals for tomorrow's supercomputers have included
clusters of multiprocessors as an essential component. However, when designing
the systems of the future, it is important to ensure that the nature of
the parallelism provided matches up with some relevant and important set
of algorithms. This project presents empirical program synthesis as an
algorithm that can successfully exploit the multiple levels of interconnect
present in a multi-SMP cluster system. When applying program synthesis
techniques to difficult problems, it is often the case that two distinct
levels of parallelism will emerge. First, many example programs must be
tested -- and can often be tested in parallel. This matches up with the
"slow" interconnect on a clump-based system. Second, the execution of a
particular program can often be parallelized, especially if the program
is complicated or requires interactions with a complex simulation. This
level of parallelism, in contrast to the first, often requires fine-grained
communication. Thus, this matches up with the "fast" level of the clump-based
system.
In particular, this project presents a multi-level parallel system for
the automatic program synthesis of soccer-playing agents for the Robocup
simulator competition using genetic programming. The system utilizes both
the fast shared-memory communication of the SMP system as well as a much
slower mechanism for the inter-SMP communication. The system is benchmarked
on a variety of configurations, and speedup curves are presented. Additionally,
a simple LogP analysis comparing the performance of the designed system
with a single-processor based NOW system is presented. Finally, the Robocup
project is reviewed and the future work outlined.
Introduction:
Distributed workstations (such as in the NOW) are appealingly cheap due
to their "commodity off the shelf" hardware, but proponents of the SMP
shared memory architectures argue that the programming ease and low communication
latency of their systems more than justify their expense. In addition,
some chip manufacturers (including Intel) are making dual and quad systems
as commodity parts (although in considerably smaller volume than the single-processor
boards). One partial compromise between these systems is to use clusters
of SMP machines, such as the Berkeley CLUMPs, whereby applications can
take advantage of both the high communication speeds and shared memory
system of the SMP and the savings from using a comparatively low-speed
network to connect the distributed SMPs. In fact, depending on the
memory configuration and the network utilized for a NOW-like system, the
SMPs might actually be cheaper (measured per processor). Some authors have
prognosticated that clusters of multiprocessors promise to be the supercomputers
of the future [Lumetta et al., 1997], and several large-scale systems have
been built [ASCI, 1998]. Despite the appeal of this idea, it appears that
finding applications that benefit from the hybrid approach is a non-trivial
task. In many cases, algorithms must be modified to take advantage of the
fast intra-node communication.
This project presents empirical program synthesis as a class of applications
that can naturally benefit from the multi-level parallelism inherent in
the multi-clump design. The essential idea behind empirical program synthesis
is to generate programs to solve a problem by testing them in a test environment.
For many problems, this testing can be time consuming. Often, an entire
batch of programs must be tested before the program synthesis algorithm
can generate a new set of programs to examine. Thus, there are naturally
two distinct levels of potential parallelism: (1) the parallel execution
of the set of programs under consideration, and (2) any parallelism possible
in the execution of an individual program. Generally, if the execution
of an individual is reasonably complicated, perhaps involving a simulation
or many possible training cases, the bandwidth required for distributing
the programs is not large. In this case, the first level of parallelism
matches well with the "slow" communication on the network level interconnect.
The parallelism required within an execution of a program is likely to
require much more intense communication, and is thus well suited to the
shared memory interconnect within an SMP.
To illustrate this idea, we consider a highly parallel program synthesis
method called genetic programming (GP), which has shown both aptitude for
program synthesis across a wide range of problems [Koza, 1992; Koza, Andre,
Bennett, and Keane, 1998] and amenability to parallelization [Andre and
Koza, 1996]. The problem domain we consider for this project is the Robocup
soccer simulator, part of the international Robocup Challenge Project (Kitano
et al., 1997). Robocup is described in more detail below. To demonstrate
the system, we parallelized the Robocup soccer simulator (Noda, 1998) using
threads, and implemented a simple algorithm to distribute games over a
slow network. The results, both theoretical and empirical, indicate that
using a shared memory system for the parallelization of the simulator provides
excellent (even superlinear) speedup, while utilizing a disk-based distribution
system incurs only a minor penalty.
In addition to the main goal of demonstrating a multi-level application
suitable for a NOW of SMPs, this project had several other secondary goals.
First, the author is involved in a project to automatically synthesize
soccer agents for this year’s Robocup competition. The computational demands
of GP applied to Robocup are not negligible, and in fact, without parallelization,
the project has little chance of success. Thus, speeding up the process
is an important goal of the project. Second, given the nature of the computing
environment at most universities, many workstations are often idle overnight.
Additionally, most graduate students do not have long-term uninterrupted
access to the tens to hundreds of machines necessary for a realistic run
of GP on Robocup. Thus, we implemented a simple file-based scheme to allow
machines to be added and deleted during a run. This allows some flexibility
in obtaining compute cycles that might not otherwise be available. Finally,
we suggest that GP (and the general field of program synthesis) is a perfect
candidate for an algorithm that can take advantage of whatever large parallel
machines happen to exist in the future. Given the concern expressed by
some [David Bailey’s lecture] that the machines of the future and the algorithms
of today do not quite match, this finding is especially relevant.
The paper is organized as follows. First, we present a brief overview of
genetic programming and the issues relating to its parallelization. Next,
we present the Robocup soccer simulator and explain why it is
a good choice for illustrating the multi-clump model. We then discuss the
implementation of the code in the order in which it was developed,
present a LogP analysis of the Robocup code,
and report empirical results for the single SMP system. After that, we describe
the disk-based methodology for the distribution of games over a network
of SMPs, and present some speedup curves for these results. Finally, we
conclude and sketch out the future work for the Robocup project.
Figure 1. An example of the crossover operation being performed on Boolean
trees.
Genetic Programming:
Genetic programming is a beam-search method for the automatic synthesis
of computer programs that is based on the ideas of natural evolution. Essentially,
GP is a method for the guided stochastic search of program space. Given
a description of the problem to be solved (a fitness function) and a set
of programmatic primitives, GP proceeds by first randomly generating a
population of programs. GP then tests the fitness of each program and
uses this information to recombine the programs using artificial analogs
of natural crossover, mutation, and natural selection. The programs in
GP are represented as program trees (which can be thought of as parse trees
or as LISP programs). An example of the crossover operation is shown in
Fig. 1. The computer programs in Fig. 1a,b are the parents for the crossover
operation, and were chosen probabilistically based on their fitness on
the particular Boolean function regression problem being solved. The crossover
operation exchanges a subtree from one parent with a subtree from the other
parent to create the two offspring. The heavy lines in Fig. 1 indicate
the subtrees that were exchanged to create the two children shown in Fig.
1c,d.
The programmatic ingredients need not be as simple as those shown in
Figure 1 – they can be as complex as iteration, memory, subroutines, and
signal processing primitives. Typically, a run of genetic programming consists
of the following steps:
- Randomly create a population of programs using the programmatic ingredients.
- Loop until a solution is found:
  - Execute each program in the population and evaluate its fitness on the
    given problem.
  - Create a new population:
    - Choose parents for an operation (crossover, mutation) probabilistically
      based on fitness.
    - Perform the operation and put the offspring into the new population.
Genetic programming has been analyzed as a form of stochastic beam search.
It has been successful on a wide range of problems, including problems
in molecular biology and analog circuit design.
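The steps listed above can be sketched in a few dozen lines. The following is a minimal illustration, not the GP system used in this project: the Boolean primitive set, the nested-tuple tree representation, and every function name are assumptions made for the example, mutation is omitted, and selection is plain fitness-proportionate sampling.

```python
import random

# Hypothetical primitive set for a Boolean problem like the one in Figure 1.
FUNCTIONS = {"and": 2, "or": 2, "not": 1}   # name -> number of arguments
TERMINALS = ["a", "b"]
CASES = [(a, b) for a in (False, True) for b in (False, True)]

def random_tree(rng, depth):
    """Grow a random program tree, stored as nested tuples such as ("and", "a", "b")."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    name = rng.choice(sorted(FUNCTIONS))
    return (name,) + tuple(random_tree(rng, depth - 1) for _ in range(FUNCTIONS[name]))

def evaluate(tree, env):
    """Interpret a program tree on one assignment of the inputs."""
    if isinstance(tree, str):
        return env[tree]
    args = [evaluate(child, env) for child in tree[1:]]
    if tree[0] == "and":
        return args[0] and args[1]
    if tree[0] == "or":
        return args[0] or args[1]
    return not args[0]

def fitness(tree, target):
    """Number of the four input cases on which the program matches the target."""
    return sum(evaluate(tree, {"a": a, "b": b}) == target(a, b) for a, b in CASES)

def paths(tree, prefix=()):
    """Yield the index path to every node of the tree."""
    yield prefix
    if not isinstance(tree, str):
        for i in range(1, len(tree)):
            yield from paths(tree[i], prefix + (i,))

def subtree(tree, path):
    """Return the node reached by following path from the root."""
    for i in path:
        tree = tree[i]
    return tree

def graft(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (graft(tree[i], path[1:], new),) + tree[i + 1:]

def crossover(rng, mom, dad):
    """Swap one randomly chosen subtree of each parent, producing two offspring."""
    p, q = rng.choice(list(paths(mom))), rng.choice(list(paths(dad)))
    return graft(mom, p, subtree(dad, q)), graft(dad, q, subtree(mom, p))

def run_gp(target, pop_size=40, generations=10, seed=0):
    """Generational loop: evaluate, select by fitness, recombine, repeat."""
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(t, target) for t in population]
        if max(scores) == len(CASES):          # a perfect program was found
            break
        offspring = []
        while len(offspring) < pop_size:
            mom, dad = rng.choices(population, weights=[s + 1 for s in scores], k=2)
            offspring.extend(crossover(rng, mom, dad))
        population = offspring[:pop_size]
    return max(population, key=lambda t: fitness(t, target))
```

Calling `run_gp(lambda a, b: a != b)` searches for a program computing exclusive-or. The fitness evaluations inside each generation are independent of one another, which is the property the parallelization discussed next relies on.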
One potential downside to GP is the large computational expense of solving
a difficult problem. Often, performing a run of GP will involve the testing
and thus execution of millions of programs. Naturally, parallelism has
been utilized as a means to combat the large computational expense. In
many cases, the method of parallelization has been to use the "deme" model.
This model consists of a large number of separate sub-populations of programs,
each evolving separately and occasionally exchanging a small number of
programs (migration). Although such demes can be utilized on a serial machine,
they are much more natural on a parallel system and provide a trivial means
of parallelization, provided that the amount of migration is reasonably small.
For most problems of reasonable complexity, this migration consists
of only 8 KB/s, so even simple interconnects
can handle this inter-generational migration. Several successful systems
have been built using this model (Andre and Koza, 1998); they show linear
speedups from the parallel hardware, and additional speedups from using
the deme model. Based on the success of these techniques, a 1000-node machine
of single-processor DEC Alpha boxes is being built this summer at Stanford
(Koza, 1998).
Although it is certainly the case that GP is embarrassingly parallel, there
are some limitations on its parallelization. First, the sub-populations
often need to be of reasonable size in order to obtain any likelihood of
success. Secondly, the length of time required for executing a given program
is often a problem for difficult domains (such as many types of simulation,
for example). For debugging and economic reasons, we often wish for GP
to succeed within a reasonable length of time (say, 2 days). This then
leads to the requirement that testing an individual program take on the
order of 1-10 seconds or less. For many domains, this is simply impossible
(molecular simulation, for example, can take days even with some parallelization).
For other domains, getting into the right order of magnitude can be achieved
by shared memory parallelization.
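The arithmetic behind the 1-10 second requirement can be made concrete. The sketch below is only an illustration: the figure of one million evaluations is a hypothetical stand-in for "millions of programs", perfect parallel speedup is assumed, and the function name is invented for the example.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

def processors_needed(num_evaluations, seconds_per_eval, wall_clock_days):
    """Processors required to finish every evaluation within the wall-clock
    budget, assuming perfect speedup and no other overhead."""
    total_seconds = num_evaluations * seconds_per_eval
    return math.ceil(total_seconds / (wall_clock_days * SECONDS_PER_DAY))

# One million evaluations (hypothetical) in a 2-day run:
print(processors_needed(1_000_000, 1, 2))     # 1 s per program
print(processors_needed(1_000_000, 10, 2))    # 10 s per program
print(processors_needed(1_000_000, 140, 2))   # a fitness test lasting minutes
```

Under these assumptions the requirement grows from 6 processors at 1 second per program to 58 at 10 seconds and 811 at 140 seconds, which is why the per-program time has to be brought down before adding machines becomes practical.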
Figure 2. Screen shot of the Robocup soccer server.
Robocup:
Robocup, the Robot World Cup Initiative, is designed as a challenge to
the artificial intelligence and intelligent robotics community to create
teams capable of playing soccer, in both the real-world domain using robotic
players and in the virtual domain using a soccer simulator. The simulator
league is by no means a toy problem -- the simulator is complex, modeling
wind, rain, endurance, and shouting. Last August, teams from around
the world competed in Nagoya, Japan at the first Robocup annual competition.
The Robocup soccer simulator is a client-server system where the server
runs the simulation and there is a separate client for each soccer player.
The communication in the standard system takes place through UDP/IP sockets,
and so the clients can be written in any system that has a UDP/IP interface.
The games run in real-time, and the length of the games for tournament
play is set at 10 minutes per game. The server runs the simulation
through timesteps (100ms real time in the standard version), and the simulator
will only execute one action (with a few exceptions) per agent per timestep.
The simulator handles the updating of the positions of the players on the
field based on the commands of the player and sends the players noisy perceptual
information. The simulator also keeps track of the position
of the ball and the endurance of each player. The clients can perform
arbitrary calculations in their code,
but cannot communicate with one another except through the server.
The server sends perceptual information (visual, auditory (shouts of teammates),
and proprioceptive (how tired the player is, how many commands it issued last timestep,
etc.)) to the players on a different time scale (approximately every 150ms
real time). The clients can send motion commands (dash, turn,
kick) as well as shouts to other players (shout). Given the complexity
of the problem and the length of time of a game, attempting to use machine
learning methods to automatically synthesize soccer playing programs is
a compelling challenge. Given that most machine learning methods
require a great deal of experience (often thousands to millions of games),
the time scale of 10 minutes per game is out of the question. Thus,
the question is to what degree the length of a simulation can be reduced
by parallelization.
The soccer server, as provided by the Robocup international community,
allows the timescale of the simulation to be sped up. Thus, on a fast machine,
games can run at 2-5 times real time without any change in the nature of
simulation. However, the amount of speed-up is limited by several factors.
The time scale must be slow enough for each player and the server (simulator)
to accomplish all of their tasks. For example, on a Sun Ultra (167 MHz), it
turns out that the simulator may only be sped up by a factor of 2 or so
before some time-steps take "longer" than they should because of the amount
of computation required. Another problem is that the time-scale is not
adaptive – some timesteps are more computationally intensive than others,
but a fixed time-step size does not account for this and thus wastes time.
Thus, in order to achieve significant speedup, we determined that we had
to modify the simulator such that it could utilize adaptive time steps.
Learning Robocup players using GP is a perfect example for the clump-based
model. The simulation of a game requires a fast, tight model of parallelization
that allows for moderate communications. The algorithm for determining
which games to run is highly distributed, and requires considerably less
communication. This is thus well matched to the multi-level interconnect
of a NOW of SMPs.
The implementation (stage 1):
The project of applying GP to the Robocup domain began by using the soccer
server as provided by the international Robocup committee. In joint work
with Astro Teller from Carnegie Mellon University, we modified our GP system
to write out files in C that represented each individual in the population.
These programs consisted of programmatic ingredients such as memory, arithmetic,
subroutines, perceptual information, action-setting primitives, random
constants, and conditionals. The system then compiled these programs, and
ran them in the original soccer environment at 3 times real time on an
8-processor Sun Enterprise 5000 server (one of the clumps). Games would
take on the order of 3 minutes. Using this system, we were only able to
run with populations of size 200 (where a standard population size might
be 10,000). Additionally, the compiles took too long. Also, the SMP provided
less of a win here than we might have hoped, as the simulator dominates
the computation (as we will see below).
The next step was to modify the soccer simulator code itself to run the
players and server as a single process that took exactly the amount of
time needed for computation. In other words, the system was changed from
a real-time simulation to a computational simulation. The UDP/IP layer
was replaced by a communications class in memory, and all timing information
was no longer utilized. It was important in this process to double-check
the behavior of players such that they performed similarly under both simulators.
This ended up having some effect on performance – games now took on the
order of 140 seconds, rather than around 180. To parallelize this code,
the players were distributed out among different processors. We had to
add several barriers and a mutex mechanism in the communications class
to prevent race conditions. In this first version, the server would run
on one processor, and the other processors would each have a fair number
of players. This stage of the implementation required modifying the original
5000 lines of code to support parallelism (eliminating global variables,
etc), and adding the code for the communication class and parallel control
structure (barriers, etc). Overall, approximately 2000 lines of code were
added during this stage. The thread barrier routines provided by David
Martin for an earlier assignment were utilized.
| Server | Player | Coach |
| SimulateStep | GetPerceptualInfo | GetPerceptualInfo |
| Send PerceptInfo to Players | Calculate Action | Calculate Action |
| | Send Command | Send Command (if any) |
| Receive Commands | | |
Figure 3. Simple algorithm for parallelization.
At this point, the algorithm was as shown in figure 3. Simultaneously,
the server can be performing a simulation step while the players and coach
are receiving their perceptual information and determining their next actions.
The server then sends out the perceptual information (that will be received
on the next timestep), and then waits to receive commands from each of
the players. The coach only executes once every 10 time steps, so it doesn’t
factor into our analysis.
Figure 4. Timing analysis of the first stage of implementation.
With this version working, we did a timing analysis. It turns out that
very poor speedup was being obtained, because the server was taking 75%
of the time. The total time (without parallelization) for executing all
of the players was only 25% of the total time, and thus parallelization
was bounded harshly by the execution time of the server. The results of
the timing analysis are shown in figure 4. Sending the information to the
players takes the majority of the time. This action consists of scanning
the field for each player and determining which objects (other players,
the ball, lines of the field, etc) can be seen based on which direction
the player is looking. The server adds some noise to these percepts as
well. Contrary to initial predictions, the simulation step of the server
took very little time.
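The harsh bound mentioned above follows from Amdahl's law. The sketch below is an illustration rather than part of the original timing analysis: it assumes that the server's 75% share is entirely serial and that the players' 25% share parallelizes perfectly, and the function name is hypothetical.

```python
def amdahl_speedup(serial_fraction, procs):
    """Best-case speedup when only the non-serial fraction is divided among procs."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / procs)

SERVER_SHARE = 0.75   # share of total time spent in the server, from the timing analysis

for procs in (2, 4, 8):
    print(procs, round(amdahl_speedup(SERVER_SHARE, procs), 3))
```

Under these assumptions 8 processors yield at most a 1.28x speedup, and no number of processors can exceed 1/0.75, or about 1.33x, so the server itself has to be parallelized.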
The implementation (stage 2):
Based on an analysis of the server code, it seemed that both the send and
receive phases of the server were 22-way parallelizable. With some modification
of the simulator code, the send phase could be performed independently
by many different processors, assuming that each processor had access to
the information about the field that was computed during the previous simulation
step. Similarly, the receive-phase could be parallelized as long as the
player information was then communicated to the processor performing the
send step. This analysis suggested that the "players" be split up among
the processors, where the notion of "player" is expanded to include the
code for sending and receiving. Thus, nearly all of the code can be parallelized
– only the 0.14% of the code contained in the simulator step remains a
necessary serial step.
Actually making this change to parallelize the various pieces of the
server required changing a relatively small amount of code, but it also
required a careful search to root out all of the global and static variables
hidden in the simulator. When multiple threads could be executing the same
piece of code, globals and static variables could cause significant problems.
Pseudo code for the revised algorithm is shown below:
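while (GameNotOver()) {
    barrier();          // all commands from the previous step have been received
    if (I_am_server)
        SimStep();      // the only serial section
    barrier();          // the updated field state is available to every processor
    LoopOverPlayers(i from 0 to 21)
        SendTo(i);
    LoopOverPlayers(i from 0 to 21)
        CalculateAction(i);
    LoopOverPlayers(i from 0 to 21)
        ReceiveFrom(i);
}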
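The control structure of the pseudocode can be exercised with a small runnable sketch. This is not the project's code: the real send, action, and receive phases are replaced by counters, the round-robin assignment of the 22 players to threads is an assumption, and `run_game` is a hypothetical name. Because of Python's global interpreter lock the sketch shows only the synchronization pattern, not any speedup.

```python
import threading

NUM_PLAYERS = 22

def run_game(num_threads, num_steps):
    """Run the two-barrier loop with counters standing in for the real phases."""
    barrier = threading.Barrier(num_threads)
    lock = threading.Lock()
    counts = {"sim_steps": 0, "sent": 0, "actions": 0, "received": 0}

    def bump(key):
        with lock:                      # protects the shared counters
            counts[key] += 1

    def worker(tid):
        # Assumed static split: thread tid owns players tid, tid + T, tid + 2T, ...
        my_players = range(tid, NUM_PLAYERS, num_threads)
        for _ in range(num_steps):
            barrier.wait()              # previous step's commands are all in
            if tid == 0:
                bump("sim_steps")       # serial simulation step on one thread
            barrier.wait()              # field state is ready for every thread
            for _player in my_players:
                bump("sent")            # stands in for SendTo(i)
            for _player in my_players:
                bump("actions")         # stands in for CalculateAction(i)
            for _player in my_players:
                bump("received")        # stands in for ReceiveFrom(i)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

With 4 threads and 5 steps, `run_game(4, 5)` reports 5 simulation steps and 110 executions of each per-player phase, that is, 22 players times 5 steps, whatever the thread interleaving.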
The coach can more or less be treated as another player, and, as mentioned
before, was excluded from our analysis. The communication required at each
of the stages is as follows. Between the Receive step and the SimStep,
the processor performing the SimStep must receive the information about
the player that was modified during the Receive step. This consists of
approximately 128 bytes from each player. Each processor, prior to performing
the SendTo step, must receive the field information (essentially a list
of the visible objects), comprising approximately 5K of data. Additionally,
this data is not stored contiguously in memory, and is sent as 80
separate messages. The perceptual information sent with the SendTo step
comprises approximately 512B of data, but is only sent two out of every
three time steps. The information corresponding to the ReceiveFrom step
comprises only about 30B of memory. All in all, the communication requirements
consist of sending approximately 101 messages totaling 11K bytes, although
the exact totals vary depending on the system because of different minimum
message sizes.
Using this information, and the measured number of microseconds required
to perform a single step on a serial machine (43800), we performed a LogP
analysis of the system for the SMPs, the NOW, and for a mythical ethernet
system. We obtained the appropriate values for the parameters from papers
by the NOW group at Berkeley [Lumetta et al., 1997; Keeton et al., 1995].
Combining these numbers into alpha/beta values, we then obtained the following
timing curves.
Figure 5. Sequential LogP analysis of the SMP model.
Figure 6. Parallel (naïve) LogP analysis of the SMP model
Figure 6. "Realistic" analysis of the SMP model.
To model the communication strategy, we considered three cases: (1)
sequential access, (2) parallel (naïve) access, and (3) "realistic"
analysis. For the sequential access case, we assumed that only one set
of messages could be on the network at a given time. For the parallel case,
we assumed that all necessary messages could be on the network at a given
time. For the realistic graph, we assumed that the SMP had the memory bandwidth
to do all of its messages at a given time, but that the NOW and ethernet
models performed some sort of tree broadcast to require no more than 2
serial accesses. Figure 5 and the two figures numbered 6 show the timing curves predicted by
the three LogP models. As we will see, the SMP probably
does not quite have the memory bandwidth to support 6 or 7 simultaneous
messages.
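The shape of such a model can be sketched as follows. This is a simplified alpha/beta cost model rather than the analysis actually performed: the per-message cost `alpha_us` and per-byte cost `beta_us_per_byte` are placeholders that must be supplied for each interconnect, treating the 101 messages and 11K bytes as the traffic of one communication round is an assumption, and `serial_rounds` is a hypothetical knob for how many such rounds must be serialized (1 for the naive parallel case, 2 for a tree broadcast, more for sequential access).

```python
SERIAL_STEP_US = 43800        # measured serial time per step, from the text
MESSAGES_PER_STEP = 101       # approximate message count, from the text
BYTES_PER_STEP = 11 * 1024    # "11K bytes", from the text

def step_time_us(procs, alpha_us, beta_us_per_byte, serial_rounds=1):
    """Predicted time of one step: evenly divided compute plus communication."""
    compute = SERIAL_STEP_US / procs
    if procs == 1:
        return compute                      # nothing to communicate
    one_round = MESSAGES_PER_STEP * alpha_us + BYTES_PER_STEP * beta_us_per_byte
    return compute + serial_rounds * one_round

def speedup(procs, alpha_us, beta_us_per_byte, serial_rounds=1):
    """Serial step time divided by the predicted parallel step time."""
    return SERIAL_STEP_US / step_time_us(procs, alpha_us, beta_us_per_byte, serial_rounds)
```

Sweeping `procs` from 1 to 8 with parameter values for each interconnect produces curves of the kind plotted in the figures; the larger the per-message cost, the earlier the curve flattens.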
Empirical Results (Single SMP):
After implementing the system and tweaking the compile flags to achieve
slightly better performance, we ran several tests of the single SMP system.
The timing/speedup curves are shown in Figure 7.
Figure 7. Time/speedup curves for a single game on a single SMP.
The achieved times are better than expected for 2-6 processors, but
take a dip for 7 and 8 processors. It seems that cache benefits allow super-linear
performance in the 2-6 case, but the communication overhead starts to overcome
the benefit of additional processors above 6. This might be a function
of the memory bandwidth restrictions on the CLUMPs, as not all of the memory
slots are filled, or it might be related to load balance problems. The
load balance scheme utilized was static, and the resulting imbalance was certainly significant
for the larger numbers of processors: some processors would have 3/2 the
load of another processor, for example.
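The size of the static imbalance can be estimated by counting players. The sketch below assumes that the 22 players are divided as evenly as possible and that every player costs the same; the exact split used in the implementation is not stated, and both function names are hypothetical. Cache effects, which the measurements suggest are substantial, are ignored.

```python
def static_split(num_players, procs):
    """Players per processor under an as-even-as-possible static assignment."""
    base, extra = divmod(num_players, procs)
    return [base + 1] * extra + [base] * (procs - extra)

def balance_limited_speedup(num_players, procs):
    """Speedup of the per-player work if the busiest processor sets the pace."""
    return num_players / max(static_split(num_players, procs))

for procs in (6, 7, 8):
    print(procs, static_split(22, procs), round(balance_limited_speedup(22, procs), 2))
```

With 8 processors the split is six processors with 3 players and two with 2, which is the 3/2 ratio mentioned above. With 7 processors one processor carries 4 players, so by this count alone the seventh processor adds nothing over six.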
Figure 8. Percentage of time spent in each section of the code.
Figure 9. Absolute time spent in each section of the code.
Timing analyses were performed for the single SMP case as well. Figure
8 shows the relative time spent in each section of the code. The time spent
in barriers goes up considerably as the number of processors increases.
This mainly indicates time spent waiting for load imbalance, as the step
time is relatively insignificant, even when the number of processors is
large. The relative time spent on receiving doesn’t change much, but the
time spent on players and on sending decreases as the time spent on barriers
increases. Figure 9 shows the absolute time spent as the number of processors
increases. The receiving time nearly vanishes, the sending time consistently
decreases, and the player time decreases as well. In order to examine the
issue of why the speedup tapers off for 7 and 8 processors, we investigated
how each of the separate times changed as the number of processors increased.
This is shown in figure 10.
Figure 10. Speedup/slowdown curves for each part of the code in the single
SMP case.
In looking at the data represented in figure 10, we were surprised by several
items. First, the speedup curve for the sending cycle is almost a super-linear
straight line. This indicates no real slowdown caused by distributing the
field information to the different processors. Secondly, the speedup curve
for the player code is the curve that most dramatically tapers off at large
numbers of processors. Noticeably, it matches the overall curve and corresponds
with the increase in barrier time. Some of this time might well be due
to problems with averaging times over processors with slightly different
loads – but a similar curve is obtained even when viewing the minimum time
spent in player code for each thread. Thus, some factor, perhaps related
to code swapping in the instruction cache, is causing a slowdown. The obvious
suspect is communication, but there is very little communication directly
in the player code: only the reading in of the messages from
the server. Given that a corresponding slowdown is seen in both the receive
code and the step code, the instruction-cache explanation makes some sense, but more detailed
analysis of what is occurring on the SMP is required. Detailed cache and
memory analysis would likely reveal the source of the problem.
A brief attempt at discerning the cause of the problem revealed that
the time spent per player undergoes an interesting change as the number
of processors increases. Figure 11 shows these results. The fact that the
time spent per player decreases at first is probably due to the benefits
of having multiple caches. The later slowdown could be due to the fact
that the costs of bringing the general player code into the cache are distributed
over fewer players per processor. This hypothesis could be checked with
more careful memory analysis.
Figure 11. Average and minimum cumulative time spent per player.
The multi-clump system: a simple file-based interconnect.
In order to demonstrate the efficiency of the entire system, we designed
a simple client-server file-based mechanism for distributing teams that
need to be played in a game. Although clearly there exist faster and cleaner
mechanisms for communicating (MPI, SplitC, or even UDP/IP, as examples),
using a disk-based system was both easy to code and provides a very flexible
system for performing runs on a hybrid collection of machines. Additionally,
the file based system easily allows machines to be added or deleted during
the course of a run of genetic programming. Also, the system allows for
a diverse set of machines, such as the NOW (Berkeley’s network of single
processor ultra-sparcs), the CLUMPs (the Sun Enterprise 5000 servers),
and the MILLs (4-way Pentium SMPs), to be used together on a single run.
The system consists of a client that is the general GP engine, and a set
of game-servers that each run games in the soccer simulator. Prior to starting
the client, as many servers as desired are started. The servers each write
a unique identifier into a common shared subdirectory. The client reads
these identifiers when it starts up. The communication between client and
each server is mediated by a flag file, which stores the current state
of the interaction. When the client has written a pair of teams into a
file named according to the server’s identifier, the client also writes
a 1 to the flag file. The server waits for this flag to be set. Upon noticing
that it is set, it reads in the teams and starts a game. When the server is
done, it writes the results of the game (some scoring information) into
a file named according to the server’s identifier, and writes a 0 to the
flag file. When the client notices this flag has been set, it reads in
the scoring information and, if possible, writes a new set of teams that
the given server can play.
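The handshake described above can be sketched as follows. This is a minimal illustration, not the project's 400 lines: the file names (`<id>.id`, `<id>.flag`, `<id>.teams`, `<id>.result`), the class and method names, the write-then-rename step, and the use of the result file's existence to tell a finished game from an idle server are all assumptions made for the example.

```python
import os
import tempfile

def write_text(path, text):
    """Write via a temporary file and rename, so readers never see a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(text)
    os.replace(tmp, path)

def read_text(path):
    with open(path) as f:
        return f.read()

class GameServer:
    def __init__(self, shared_dir, ident):
        self.shared_dir, self.ident = shared_dir, ident
        write_text(self.path("flag"), "0")   # 0 = idle or finished
        write_text(self.path("id"), ident)   # announce this server to the client

    def path(self, kind):
        return os.path.join(self.shared_dir, self.ident + "." + kind)

    def poll(self, play_game):
        """Play one game if the client has raised the flag; report whether one ran."""
        if read_text(self.path("flag")) != "1":
            return False
        teams = read_text(self.path("teams"))
        write_text(self.path("result"), play_game(teams))
        write_text(self.path("flag"), "0")   # result first, flag last
        return True

class Client:
    def __init__(self, shared_dir):
        self.shared_dir = shared_dir

    def path(self, ident, kind):
        return os.path.join(self.shared_dir, ident + "." + kind)

    def servers(self):
        """Rescan the directory, so servers can be added or removed during a run."""
        names = os.listdir(self.shared_dir)
        return sorted(n[:-len(".id")] for n in names if n.endswith(".id"))

    def submit(self, ident, teams):
        write_text(self.path(ident, "teams"), teams)
        write_text(self.path(ident, "flag"), "1")   # teams first, flag last

    def collect(self, ident):
        """Return the game result once the server has reset the flag, else None."""
        result = self.path(ident, "result")
        if read_text(self.path(ident, "flag")) != "0" or not os.path.exists(result):
            return None
        text = read_text(result)
        os.remove(result)
        return text

def demo():
    """One round trip in a scratch directory, with a stand-in for the simulator."""
    shared_dir = tempfile.mkdtemp()
    server, client = GameServer(shared_dir, "clump1"), Client(shared_dir)
    client.submit(client.servers()[0], "teamA vs teamB")
    server.poll(lambda teams: "3-1 for " + teams)
    return client.collect("clump1")
```

In each direction the data file is written before the flag is changed, so whichever side sees the flag change can rely on the data being complete. In a real run the client and each server would call `collect` and `poll` in a loop, pausing briefly between checks.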
The system allows for servers to be added or deleted during the course
of a run. This is accomplished by either adding or deleting the identifier
files from the common subdirectory. This system was implemented by writing
400 lines of code to perform the various file reading and writing tasks.
After obtaining some sample timing numbers, we calculated the effect
of using the disk-based system for various numbers of processors. We
found that it took an average of 0.23 seconds to read and write the data
(including the flags, the teams, and the scores) for a single game. We
assumed a single disk system and serial access with no assumed overlap
of disk writes and computation. In reality, it is unlikely that all processors
would be waiting at the same time for the client to write files. Figure
12 shows the results of this analysis. When running with up to 64 nodes,
this worst case analysis indicates that only 10% of the time of a single
game would be spent on communication, still allowing for significant speedup
from 64 processors. Clearly, to go beyond 128 processors, either multiple
disk systems or the use of a faster communications system would be necessary.
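The worst-case estimate can be reproduced with a one-line model. The sketch below assumes that every game server queues on the single disk once per game and that a game lasts about 140 seconds, the figure quoted earlier for the single-process simulator; the game length actually used in the analysis is not stated, and the function name is hypothetical.

```python
DISK_SECONDS_PER_GAME = 0.23   # measured read/write time per game, from the text

def worst_case_disk_fraction(num_servers, game_seconds):
    """Serialized disk time for all servers, as a fraction of one game's duration."""
    return num_servers * DISK_SECONDS_PER_GAME / game_seconds

for servers in (16, 64, 128):
    print(servers, round(worst_case_disk_fraction(servers, 140.0), 3))
```

Under these assumptions 64 servers cost about 10.5% of a game's duration and 128 servers about 21%. The fraction grows further if parallelizing the simulator shortens each game, since the disk time per game stays fixed.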
Figure 12. Theoretical analysis of disk-time for multi-SMP system.
Empirical Results (Multi-SMP):
Using the file-based system described above, the performance of the system
was tested by running a set of six 3000-timestep games on a 1-SMP system,
a 2-SMP system, and a 3-SMP system. The results are shown in figures 13
and 14. Both figures indicate that the multi-SMP model performs well,
achieving slightly super-linear speedup in one set of runs and slightly below
perfect linear speedup in the other. As predicted, using
the disk for communication at the level of distributing the teams
incurs very little penalty for small numbers of clumps.
Figure 13. Time/Speedup curves for the multi-SMP
system. Circled data points utilize multiple clumps.
Figure 14. Time/Speedup curves for the multi-SMP
system. This is not the same data as in figure 13; it represents a different
set of runs. In this case, the results are slightly below linear.
To further test the scalability and reliability of the disk-based system,
we tested its performance on the NOW, using only one thread per processor.
This test of course exercises only a single level of the ideal parallel
system, the slow level, but it provides a test of scalability that cannot be
easily performed with the available set of clumps. The results of this
test are shown in figure 15.
Figure 15. Scalability test of the disk-based system on the NOW.
The results presented in figure 15 indicate that our disk-based mechanism
allows nearly linear speedup as processors are added. The run time at 16
processors is about 3.5 seconds longer than perfect linear speedup would
give, which is approximately what was predicted
by the disk-based model. In addition, load balance was somewhat of an issue:
some processors appeared to be faster than others, which can account
for some of the shortfall in speedup. The performance at 8 and 4 processors
appears to be slightly superlinear, but this is probably an artifact of
random variance in the run times. More tests should be performed here to
validate the numbers and reduce the variance of the results.
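The comparison with the disk model can be checked by hand: with 0.23 seconds of serialized disk time per game, the worst case at 16 processors is 16 × 0.23 = 3.68 seconds, close to the roughly 3.5-second gap reported. The sketch below shows how speedup, efficiency, and the gap from ideal could be computed from run times. The function names are illustrative, and the timings in the example are made up to reproduce a 3.5-second gap; they are not measurements from figure 15.

```python
def speedup(t_one, t_p):
    """Speedup of a p-processor run relative to the one-processor run."""
    return t_one / t_p

def efficiency(t_one, t_p, p):
    """Speedup divided by processor count; 1.0 means perfect linear speedup."""
    return t_one / (t_p * p)

def gap_from_ideal(t_one, t_p, p):
    """Seconds by which the p-processor run exceeds the ideal time t_one / p."""
    return t_p - t_one / p

if __name__ == "__main__":
    # Hypothetical timings chosen only for illustration.
    t_one, t_sixteen, p = 160.0, 13.5, 16
    print("speedup:   ", round(speedup(t_one, t_sixteen), 2))
    print("efficiency:", round(efficiency(t_one, t_sixteen, p), 2))
    print("gap (s):   ", gap_from_ideal(t_one, t_sixteen, p))
    print("disk model worst case (s):", round(p * 0.23, 2))
```

A negative gap would correspond to the apparent superlinear behavior noted at 4 and 8 processors, which repeated runs would help confirm or rule out.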
Conclusions and Future Work
The results presented in this paper represent a success for most of the
goals set forth for the parallel-programming aspect of the Robocup project.
By utilizing a multi-level parallel architecture, we can achieve scalable
and nearly linear speedup. Based on the results in this paper, it appears
that by using a 3-clump system, we can achieve performance of running 72
games in 160 seconds, compared with a single game using our old system.
Thus, this work provides a critical speedup for the Robocup project.
However, several issues could benefit from further work
on the parallel algorithm front. First, a better understanding of the source
of the slowdown at 7 and 8 processors might enable further speedups, and
examining the reason for the large increase in barrier times could conceivably
yield greater performance. Second, a cleaner and better system for achieving
the network-level parallelism should be investigated. Finally, the "slow"
inter-SMP network parallelism exploited here was at the individual program
level. It might be worthwhile to add the sub-population layer
of parallelism often exploited by genetic programming approaches.
For the Robocup project, the future work entails utilizing this system
as much as possible to attempt to learn soccer-playing programs that will
be highly competitive in this year’s Robocup tournament. Also, we expect
to port the system to the 70+ existing nodes of the alpha-system being
built at Stanford by Dr. John Koza.
Finally, it should be noted that the main point of this paper is that
program synthesis (and not just GP) working on a complex problem (not just
Robocup) can benefit from and make use of a multi-level parallel system
such as a network of SMPs.
Bibliography
Accelerated Strategic Computing Initiative. (1996).
The Red Hardware. Information available at http://mephisto.ca.sandia.gov/TFLOP/sc96/node2.html.
Andre, D., and Koza, J. R. (1998). Exploiting the fruits
of parallelism: An implementation of parallel genetic programming that
achieves super-linear performance. Information Science Journal,
Elsevier, In press.
Keeton, K.K., Anderson, T.A., and Patterson, D.A. (1995).
LogP quantified: the case for low-overhead local area networks. Presented
at Hot Interconnects III: A symposium on High Performance Interconnects,
Stanford University.
Kitano, H., et al. (1997). RoboCup Synthetic Agent Challenge
97. In Proceedings of the International Joint Conference on Artificial
Intelligence. Nagoya, Japan, August 1997.
Koza, J.R., (1992). Genetic Programming: on the programming
of computers by means of natural selection, Cambridge, Mass: MIT Press.
Koza, J.R., (1998). Personal communication.
Koza, J.R., Andre, D., Bennett, F.H., and Keane, M.A. (1998).
Genetic programming III: automatic programming and circuit synthesis. Morgan
Kaufmann. In press.
Lumetta, S.S., Mainwaring, A.M., and Culler, D.E. (1997).
Multi-protocol active messages on a cluster of SMPs. In Proceedings of
the 1997 SuperComputing conference.
Noda, I. (1998). Robocup Soccer Server. Information
available from http://ci.etl.go.jp/~noda/soccer/server/index.html.
