Multi-level parallelism in automatically synthesizing soccer-playing programs for Robocup using genetic programming

 

David Andre
CS 267 – Course Project
dandre@cs.berkeley.edu
Spring 1998


Introduction
Genetic Programming
Robocup
The implementation (stage 1)
The implementation (stage 2)
Empirical Results (single SMP case)
The multi-clump disk-based system
Empirical Results (multi-SMP case)
Conclusions and Future Work
Bibliography
Abstract:
Many of the proposals for tomorrow's supercomputers have included clusters of multiprocessors as an essential component. However, when designing the systems of the future, it is important to ensure that the nature of the parallelism provided matches some relevant and important set of algorithms. This project presents empirical program synthesis as an algorithm that can successfully exploit the multiple levels of interconnect present in a multi-SMP cluster system. When applying program synthesis techniques to difficult problems, two distinct levels of parallelism often emerge. First, many example programs must be tested, and they can often be tested in parallel. This matches the "slow" interconnect on a clump-based system. Second, the execution of a particular program can often be parallelized, especially if the program is complicated or requires interactions with a complex simulation. This level of parallelism, in contrast to the first, often requires fine-grained communication, and thus matches the "fast" level of the clump-based system. In particular, this project presents a multi-level parallel system for the automatic program synthesis of soccer-playing agents for the Robocup simulator competition using genetic programming. The system utilizes both the fast shared-memory communication of the SMP system and a much slower mechanism for the inter-SMP communication. The system is benchmarked on a variety of configurations, and speedup curves are presented. Additionally, a simple LogP analysis comparing the performance of the designed system with a single-processor based NOW system is presented. Finally, the Robocup project is reviewed and the future work outlined.

 

Introduction:

Distributed workstations (such as in the NOW) are appealingly cheap due to their "commodity off the shelf" hardware, but proponents of the SMP shared memory architectures argue that the programming ease and low communication costs of their systems more than justify their expense. In addition, some chip manufacturers (including Intel) are making dual and quad systems as commodity parts (although in considerably smaller volume than the single-processor boards). One partial compromise between these systems is to use clusters of SMP machines, such as the Berkeley CLUMPs, whereby applications can take advantage of both the high communication speeds and shared memory system of the SMP and the savings from using a comparatively low-speed network to connect the distributed SMPs. In fact, depending on the memory configuration and the network utilized for a NOW-like system, the SMPs might actually be cheaper (measured per processor). Some authors have prognosticated that clusters of multiprocessors promise to be the supercomputers of the future [Lumetta et al., 1997], and several large scale systems have been built [ASCI, 1998]. Despite the appeal of this idea, it appears that finding applications that benefit from the hybrid approach is a non-trivial task. In many cases, algorithms must be modified to take advantage of the fast intra-node communication.
This project presents empirical program synthesis as a class of applications that can naturally benefit from the multi-level parallelism inherent in the multi-clump design. The essential idea behind empirical program synthesis is to generate programs to solve a problem by testing them in a test environment. For many problems, this testing can be time consuming. Often, an entire batch of programs must be tested before the program synthesis algorithm can generate a new set of programs to examine. Thus, there are naturally two distinct levels of potential parallelism: (1) the parallel execution of the set of programs under consideration, and (2) any parallelism possible in the execution of an individual program. Generally, if the execution of an individual is reasonably complicated, perhaps involving a simulation or many possible training cases, the bandwidth required for distributing the programs is not large. In this case, the first level of parallelism matches well with the "slow" communication on the network level interconnect. The parallelism required within an execution of a program is likely to require much more intense communication, and is thus well suited to the shared memory interconnect within an SMP.
To illustrate this idea, we consider a highly parallel program synthesis method called genetic programming (GP), which has shown both aptitude for program synthesis across a wide range of problems [Koza, 1992; Koza, Andre, Bennett, and Keane, 1998] and amenability to parallelization [Andre and Koza, 1996]. The problem domain we consider for this project is the Robocup soccer simulator, part of the international Robocup Challenge Project (Kitano et al., 1997). Robocup is described in more detail below. To demonstrate the system, we parallelized the Robocup soccer simulator (Noda, 1998) using threads, and implemented a simple algorithm to distribute games over a slow network. The results, both theoretical and empirical, indicate that using a shared memory system for the parallelization of the simulator provides excellent (even superlinear) speedup, while utilizing a disk-based distribution system incurs only a minor penalty.
In addition to the main goal of demonstrating a multi-level application suitable for a NOW of SMPs, this project had several secondary goals. First, the author is involved in a project to automatically synthesize soccer agents for this year’s Robocup competition. The computational demands of GP applied to Robocup are not negligible, and in fact, without parallelization, the project has little chance of success. Thus, speeding up the process is an important goal of the project. Second, given the nature of the computing environment at most universities, many workstations are often idle overnight. Additionally, most graduate students do not have long-term uninterrupted access to the tens to hundreds of machines necessary for a realistic run of GP on Robocup. Thus, we implemented a simple file-based scheme to allow machines to be added and deleted during a run. This allows some flexibility in obtaining compute cycles that might not otherwise be available. Finally, we suggest that GP (and the general field of program synthesis) is a perfect candidate for an algorithm that can take advantage of whatever large parallel machines happen to exist in the future. Given the concern expressed by some [David Bailey’s lecture] that the machines of the future and the algorithms of today do not quite match, this finding is especially relevant.
The paper is organized as follows. First, we present a brief overview of genetic programming and the issues relating to its parallelization. Next, we present the Robocup soccer simulator and explain why it is a good choice for illustrating the multi-clump model. We then discuss the implementation of the code in the order in which it was developed, present a LogP analysis of the Robocup code, and give empirical results for the single SMP system. After that, we describe the disk-based methodology for distributing games over a network of SMPs and present speedup curves for it. Finally, we conclude and sketch out the future work for the Robocup project.

Figure 1. An example of the crossover operation being performed on Boolean trees.

Genetic Programming:

Genetic programming is a beam-search method for the automatic synthesis of computer programs that is based on the ideas of natural evolution. Essentially, GP is a method for the guided stochastic search of program space. Given a description of the problem to be solved (a fitness function) and a set of programmatic primitives, GP proceeds by first randomly generating a population of programs. GP then tests the fitness of each program and uses this information to recombine the programs using artificial analogs of natural crossover, mutation, and selection. The programs in GP are represented as program trees (which can be thought of as parse trees or as LISP programs). An example of the crossover operation is shown in Fig. 1. The computer programs in Fig. 1a,b are the parents for the crossover operation, and were chosen probabilistically based on their fitness on the particular Boolean function regression problem being solved. The crossover operation exchanges a subtree from one parent with a subtree from the other parent to create the two offspring. The heavy lines in Fig. 1 indicate the subtrees that were exchanged to create the two children shown in Fig. 1c,d.
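The subtree exchange described above can be sketched in a few lines of Python. This is a minimal illustration only, not the GP system used in this project: the nested-list tree encoding, the function names, and the two Boolean parent trees are assumptions chosen for the example and are not the trees drawn in Fig. 1.

```python
import copy
import random

# A program tree is either a terminal (a string such as "D0") or a list
# [function_name, child, child, ...], mirroring a LISP S-expression.

def subtree_paths(tree, path=()):
    """Return the path (tuple of child indices) to every node in the tree."""
    paths = [path]
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            paths.extend(subtree_paths(child, path + (i,)))
    return paths

def get_subtree(tree, path):
    """Follow a path of child indices down from the root."""
    for i in path:
        tree = tree[i]
    return tree

def replace_subtree(tree, path, new_subtree):
    """Return a copy of tree with the node at path replaced by new_subtree."""
    if not path:
        return copy.deepcopy(new_subtree)
    tree = copy.deepcopy(tree)
    parent = get_subtree(tree, path[:-1])
    parent[path[-1]] = copy.deepcopy(new_subtree)
    return tree

def crossover(parent_a, parent_b, rng):
    """Swap one randomly chosen subtree between the two parents."""
    path_a = rng.choice(subtree_paths(parent_a))
    path_b = rng.choice(subtree_paths(parent_b))
    child_a = replace_subtree(parent_a, path_a, get_subtree(parent_b, path_b))
    child_b = replace_subtree(parent_b, path_b, get_subtree(parent_a, path_a))
    return child_a, child_b

def count_nodes(tree):
    return len(subtree_paths(tree))

# Illustrative parents (hypothetical, not the trees of Fig. 1).
parent_a = ["OR", ["NOT", "D1"], ["AND", "D0", "D1"]]
parent_b = ["OR", ["OR", "D1", ["NOT", "D0"]], ["AND", ["NOT", "D0"], ["NOT", "D1"]]]
child_a, child_b = crossover(parent_a, parent_b, random.Random(0))
print(child_a)
print(child_b)
```

Because the operation only moves subtrees between the two programs, the total number of nodes across the two children equals the total across the two parents, and the parents themselves are left unmodified.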

The programmatic ingredients need not be as simple as those shown in Figure 1 – they can be as complex as iteration, memory, subroutines, and signal processing primitives. Typically, a run of genetic programming consists of the following steps:

Genetic programming has been analyzed as a form of stochastic beam search. It has been successful on a wide range of problems, including problems in molecular biology and analog circuit design.
One potential downside to GP is the large computational expense of solving a difficult problem. Often, performing a run of GP will involve the testing and thus execution of millions of programs. Naturally, parallelism has been utilized as a means to combat the large computational expense. In many cases, the method of parallelization has been to use the "deme" model. This model consists of a large number of separate sub-populations of programs, each evolving separately and occasionally exchanging a small number of programs (migration). Although such demes can be utilized on a serial machine, they are much more natural on a parallel system and provide a trivial means of parallelization, provided that the amount of migration is reasonably small. Because this migration amounts to only 8 KB/s for most problems of reasonable complexity, even simple interconnects can handle it. Several successful systems have been built using this model (Andre and Koza, 1998); they show linear speedups from parallel hardware, and additional speedups from the deme model itself. Based on the success of these techniques, a 1000 node machine of single processor DEC Alpha boxes is being built this summer at Stanford (Koza, 1998).
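A migration step in the deme model might look like the following sketch. The ring topology, the choice to send the best individuals, the replacement of the worst individuals in the receiving deme, and the use of plain numbers as stand-in "programs" are all assumptions made for illustration; the paragraph above does not specify these details.

```python
import random

def migrate(demes, n_migrants, fitness):
    """Copy each deme's best n_migrants into the next deme in a ring,
    replacing that deme's worst individuals. Lower fitness is better."""
    emigrants = [sorted(d, key=fitness)[:n_migrants] for d in demes]
    new_demes = []
    for i, deme in enumerate(demes):
        incoming = emigrants[(i - 1) % len(demes)]       # from the previous deme
        survivors = sorted(deme, key=fitness)[:len(deme) - n_migrants]
        new_demes.append(survivors + incoming)
    return new_demes

rng = random.Random(1)
# Stand-in "programs": plain numbers whose fitness is their distance from 0.
demes = [[rng.uniform(-10, 10) for _ in range(6)] for _ in range(4)]
after = migrate(demes, n_migrants=2, fitness=abs)
print([round(min(d, key=abs), 2) for d in after])
```

Only the migrants cross the interconnect, which is why the bandwidth requirement stays small even when each deme is large.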
Although it is certainly the case that GP is embarrassingly parallel, there are some limitations on its parallelization. First, the sub-populations often need to be of reasonable size in order to obtain any likelihood of success. Second, the length of time required for executing a given program is often a problem for difficult domains (such as many types of simulation). For debugging and economic reasons, we often wish for GP to succeed within a reasonable length of time (say, 2 days). This then leads to the requirement that testing an individual program take on the order of 1-10 seconds or less. For many domains, this is simply impossible (for example, molecular simulation can take days, even with some parallelization). For other domains, getting into the right order of magnitude can be achieved by shared memory parallelization.
 
Figure 2. Screen shot of the Robocup soccer server.

Robocup:

Robocup, the Robot World Cup Initiative, is designed as a challenge to the artificial intelligence and intelligent robotics community to create teams capable of playing soccer, both in the real-world domain using robotic players and in the virtual domain using a soccer simulator. The simulator league is by no means a toy problem: the simulator is complex, modeling wind, rain, endurance, and shouting. Last August, teams from around the world competed in Nagoya, Japan at the first Robocup annual competition.
The Robocup soccer simulator is a client-server system in which the server runs the simulation and there is a separate client for each soccer player. The communication in the standard system takes place through UDP/IP sockets, and so the clients can be written in any system that has a UDP/IP interface. The games run in real time, and the length of the games for tournament play is set at 10 minutes per game. The server runs the simulation through timesteps (100 ms real time in the standard version), and the simulator will only execute one action (with a few exceptions) per agent per timestep. The simulator updates the positions of the players on the field based on the commands of the players and sends the players noisy perceptual information. The simulator also keeps track of the position of the ball and the endurance of each player. The clients can do any arbitrary calculation in their code, but cannot communicate with one another except through the server. The server sends perceptual information (visual, auditory (shouts of teammates), and proprioceptive (how tired I am, how many commands I did last timestep, etc.)) to the players on a different time scale (approximately every 150 ms real time). The clients can send motion commands (dash, turn, kick) as well as shouts to other players (shout).
Given the complexity of the problem and the length of time of a game, attempting to use machine learning methods to automatically synthesize soccer playing programs is a compelling challenge. Given that most machine learning methods require a great deal of experience (often thousands to millions of games), the time scale of 10 minutes per game is out of the question. Thus, the question is to what degree the length of the simulation can be reduced by parallelization.
The soccer server, as provided by the Robocup international community, allows the timescale of the simulation to be sped up. Thus, on a fast machine, games can run at 2-5 times real time without any change in the nature of the simulation. However, the amount of speedup is limited by several factors. The time scale must be slow enough for each player and the server (simulator) to accomplish all of their tasks. For example, on a Sun Ultra (167 MHz), it turns out that the simulator may only be sped up by a factor of 2 or so before some timesteps take "longer" than they should because of the amount of computation required. Another problem is that the time scale is not adaptive: some timesteps are more computationally intensive than others, but a fixed timestep size does not account for this and thus wastes time. Thus, in order to achieve significant speedup, we determined that we had to modify the simulator so that it could utilize adaptive timesteps.
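The cost of a fixed timestep can be illustrated with a small calculation. The per-timestep costs below are hypothetical numbers invented for the example, not measurements from the soccer server, and the function names are likewise illustrative.

```python
def fixed_step_time(step_costs, step_length):
    """Wall time when every timestep is padded to the same fixed length.
    The length must cover the most expensive step, or that step overruns."""
    if step_length < max(step_costs):
        raise ValueError("fixed step is too short for the slowest timestep")
    return step_length * len(step_costs)

def adaptive_step_time(step_costs):
    """Wall time when each timestep lasts exactly as long as its computation."""
    return sum(step_costs)

# Hypothetical per-timestep compute costs in milliseconds (not measured data).
costs_ms = [12, 15, 48, 11, 14, 50, 13, 12]
fixed = fixed_step_time(costs_ms, step_length=50)
adaptive = adaptive_step_time(costs_ms)
print(fixed, adaptive)
```

With these made-up costs, the fixed schedule is sized for the two expensive steps and idles through the cheap ones, while the adaptive schedule only pays for the computation actually performed.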

Learning Robocup players using GP is a perfect example for the clump-based model. The simulation of a game requires a fast, tight model of parallelization that allows for moderate communications. The algorithm for determining which games to run is highly distributed, and requires considerably less communication. This is thus well matched to the multi-level interconnect of a NOW of SMPs.

The implementation (stage 1):

The project of applying GP to the Robocup domain began by using the soccer server as provided by the international Robocup committee. In joint work with Astro Teller from Carnegie Mellon University, we modified our GP system to write out files in C that represented each individual in the population. These programs consisted of programmatic ingredients such as memory, arithmetic, subroutines, perceptual information, action-setting primitives, random constants, and conditionals. The system then compiled these programs, and ran them in the original soccer environment at 3 times real time on an 8-processor Sun Enterprise 5000 server (one of the clumps). Games would take on the order of 3 minutes. Using this system, we were only able to run with populations of size 200 (where a standard population size might be 10,000). Additionally, the compiles took too long. Also, the SMP provided less of a win here than we might have hoped, as the simulator dominates the computation (as we will see below).
The next step was to modify the soccer simulator code itself to run the players and server as a single process that took exactly the amount of time needed for computation. In other words, the system was changed from a real-time simulation to a computational simulation. The UDP/IP layer was replaced by a communications class in memory, and all timing information was no longer utilized. It was important in this process to double-check the behavior of the players to confirm that they performed similarly under both simulators. This change had some effect on performance: games now took on the order of 140 seconds, rather than around 180. To parallelize this code, the players were distributed among different processors. We had to add several barriers and a mutex mechanism in the communications class to prevent race conditions. In this first version, the server would run on one processor, and the other processors would each have a fair share of the players. This stage of the implementation required modifying the original 5000 lines of code to support parallelism (eliminating global variables, etc.), and adding the code for the communication class and parallel control structure (barriers, etc.). Overall, approximately 2000 lines of code were added during this stage. The thread barrier routines provided by David Martin for an earlier assignment were utilized.
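The idea of replacing the UDP/IP layer with a mutex-protected communications class can be sketched as follows. This is not the project's code: the class name `Mailboxes`, its methods, and the message format are hypothetical, and a single lock over all queues is simply the most direct design to show.

```python
import threading
from collections import deque

class Mailboxes:
    """Hypothetical stand-in for the in-memory communications class:
    one message queue per endpoint, guarded by a single mutex."""

    def __init__(self, endpoints):
        self._lock = threading.Lock()
        self._queues = {name: deque() for name in endpoints}

    def send(self, to, message):
        with self._lock:                      # one writer at a time
            self._queues[to].append(message)

    def receive_all(self, name):
        """Drain and return every message waiting for this endpoint."""
        with self._lock:
            messages = list(self._queues[name])
            self._queues[name].clear()
        return messages

N_PLAYERS = 22
mail = Mailboxes(["server"] + [f"player{i}" for i in range(N_PLAYERS)])

def player(i):
    # Each player thread posts one command for the server.
    mail.send("server", (i, "dash 50"))

threads = [threading.Thread(target=player, args=(i,)) for i in range(N_PLAYERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
commands = mail.receive_all("server")
print(len(commands))
```

The lock keeps concurrent sends from corrupting a queue; the barriers mentioned above would additionally be needed so that the server drains its queue only after every player has posted its command for the timestep.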
 
Server                        Player              Coach
SimulateStep                  GetPerceptualInfo   GetPerceptualInfo
Send PerceptInfo to Players   Calculate Action    Calculate Action
                              Send Command        Send Command (if any)
Receive Commands
Figure 3. Simple algorithm for parallelization.
At this point, the algorithm was as shown in figure 3. Simultaneously, the server can be performing a simulation step while the players and coach are receiving their perceptual information and determining their next actions. The server then sends out the perceptual information (that will be received on the next timestep), and then waits to receive commands from each of the players. The coach only executes once every 10 time steps, so it doesn’t factor into our analysis.
Figure 4.  Timing analysis of the first stage of implementation.
With this version working, we did a timing analysis. It turns out that very poor speedup was being obtained, because the server was taking 75% of the time. The total time (without parallelization) for executing all of the players was only 25% of the total time, and thus parallelization was bounded harshly by the execution time of the server. The results of the timing analysis are shown in figure 4. Sending the information to the players takes the majority of the time. This action consists of scanning the field for each player and determining which objects (other players, the ball, lines of the field, etc) can be seen based on which direction the player is looking. The server adds some noise to these percepts as well. Contrary to initial predictions, the simulation step of the server took very little time.

The implementation (stage 2):

Based on an analysis of the server code, it seemed that both the send and receive phases of the server were 22-way parallelizable. With some modification of the simulator code, the send phase could be performed independently by many different processors, assuming that each processor had access to the information about the field that was computed during the previous simulation step. Similarly, the receive phase could be parallelized as long as the player information was then communicated to the processor performing the simulation step. This analysis suggested that the "players" be split up among the processors, where the notion of "player" is expanded to include the code for sending and receiving. Thus, nearly all of the code can be parallelized; only the 0.14% of the code contained in the simulator step remains a necessary serial step.
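Amdahl's law gives a rough sense of how much this restructuring matters. The sketch below compares the two stages under a simplifying assumption that is not stated in the text: that the server's 75% share in stage 1 and the 0.14% simulator step in stage 2 can both be treated as serial fractions of run time. The function name is illustrative.

```python
def amdahl_speedup(serial_fraction, processors):
    """Upper bound on speedup when serial_fraction of the run cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

# Assumption: both percentages are read as fractions of execution time.
for label, serial in [("stage 1 (server serial)", 0.75),
                      ("stage 2 (SimStep serial)", 0.0014)]:
    bounds = ", ".join(f"p={p}: {amdahl_speedup(serial, p):.2f}" for p in (2, 4, 8))
    print(f"{label}: {bounds}")
```

Under that reading, the stage 1 design cannot exceed a speedup of about 1.33 no matter how many processors are used, whereas the stage 2 design leaves room for nearly linear speedup on an 8-processor SMP before communication and load balance are taken into account.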

Actually making this change to parallelize the various pieces of the server required changing a relatively small amount of code, but it also required a careful search to root out all of the global and static variables hidden in the simulator. When multiple threads could be executing the same piece of code, globals and static variables could cause significant problems.

Pseudo code for the revised algorithm is shown below:

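while (GameNotOver()) {
    barrier();                          /* wait until every thread has finished the previous step */
    if (I_am_server)
        SimStep();                      /* the only serial section */
    barrier();                          /* field state is now up to date for all threads */
    LoopOverPlayers(i from 0 to 21)     /* each thread handles its share of the 22 players */
        SendTo(i);
    LoopOverPlayers(i from 0 to 21)
        CalculateAction(i);
    LoopOverPlayers(i from 0 to 21)
        ReceiveFrom(i);
}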
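
For readers who want to experiment with this control structure, here is a runnable sketch of the same loop using Python threads and `threading.Barrier`. It only mimics the synchronization pattern: the thread count, the round-robin assignment of players, the fixed number of steps, and the stand-in bodies for SimStep, SendTo, CalculateAction, and ReceiveFrom are all assumptions, not the simulator's code.

```python
import threading

N_PLAYERS = 22
N_THREADS = 4          # assumed thread count for this sketch
N_STEPS = 5            # stand-in for GameNotOver()

barrier = threading.Barrier(N_THREADS)
field = {"step": 0}                     # shared state written only by SimStep
percepts = [None] * N_PLAYERS           # one slot per player, so no write sharing
commands = [None] * N_PLAYERS
received = [None] * N_PLAYERS

def worker(tid):
    my_players = range(tid, N_PLAYERS, N_THREADS)   # static round-robin split
    for _ in range(N_STEPS):
        barrier.wait()                  # all commands from the last step are in
        if tid == 0:                    # I_am_server
            field["step"] += 1          # SimStep(): the only serial section
        barrier.wait()                  # field is now consistent for everyone
        for i in my_players:            # SendTo(i)
            percepts[i] = (field["step"], i)
        for i in my_players:            # CalculateAction(i)
            commands[i] = f"turn {percepts[i][0] + i}"
        for i in my_players:            # ReceiveFrom(i)
            received[i] = commands[i]

threads = [threading.Thread(target=worker, args=(t,)) for t in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(field["step"], received[:3])
```

The two barriers are what make the unsynchronized reads of `field` safe: no thread can begin the next simulation step until every thread has finished reading the current one.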

The coach can more or less be treated as another player, and, as mentioned before, was excluded from our analysis. The communication required at each of the stages is as follows. Between the Receive step and the SimStep, the processor performing the SimStep must receive the information about the player that was modified during the Receive step. This consists of approximately 128 bytes from each player. Each processor, prior to performing the SendTo step, must receive the field information (essentially a list of the visible objects), comprising approximately 5K of data. Additionally, this data is not stored contiguously in memory, and is sent as 80 separate messages. The perceptual information sent with the SendTo step comprises approximately 512B of data, but is only sent two out of every three timesteps. The information corresponding to the ReceiveFrom step comprises only about 30B of memory. All in all, the communication requirements consist of sending approximately 101 messages totaling 11K bytes, although the exact totals vary depending on the system because of different minimum message sizes.

Using this information, and the measured time required to perform a single step on a serial machine (43,800 microseconds), we performed a LogP analysis of the system for the SMPs, the NOW, and for a mythical ethernet system. We obtained the appropriate values for the parameters from papers by the NOW group at Berkeley [Lumetta et al., 1997; Keeton et al., 1995]. Combining these numbers into alpha/beta values, we then obtained the following timing curves.
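
The shape of such an alpha/beta estimate can be sketched as follows. The step time, message count, and byte count are the figures quoted above; the alpha and beta values are placeholders invented for this example and are NOT the parameters taken from the cited NOW papers. The model also charges every message serially, which is only one of several possible contention assumptions.

```python
def step_time_us(serial_compute_us, processors, messages, total_bytes,
                 alpha_us, beta_us_per_byte):
    """Predicted time for one simulator step: perfectly divided computation
    plus a per-message cost (alpha) and a per-byte cost (beta)."""
    compute = serial_compute_us / processors
    if processors == 1:
        return compute                      # nothing to communicate
    return compute + messages * alpha_us + total_bytes * beta_us_per_byte

SERIAL_STEP_US = 43_800     # measured serial step time quoted in the text
MESSAGES = 101              # approximate messages per step, from the text
TOTAL_BYTES = 11 * 1024     # approximately 11K bytes per step, from the text

# PLACEHOLDER interconnect parameters, not the values used in the paper.
interconnects = {
    "fast (shared-memory-like)": {"alpha_us": 1.0, "beta_us_per_byte": 0.01},
    "slow (ethernet-like)": {"alpha_us": 500.0, "beta_us_per_byte": 1.0},
}
for name, params in interconnects.items():
    times = [step_time_us(SERIAL_STEP_US, p, MESSAGES, TOTAL_BYTES, **params)
             for p in (1, 2, 4, 8)]
    print(name, [round(SERIAL_STEP_US / t, 2) for t in times])
```

Even with made-up parameters, the qualitative point survives: when the per-message cost is large relative to the 43,800-microsecond step, roughly a hundred messages per step can erase the benefit of dividing the computation, so fine-grained parallelism of this kind belongs on the fast interconnect.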

Figure 5. Sequential LogP analysis of the SMP model.

Figure 6. Parallel (naïve) LogP analysis of the SMP model

Figure 6. "Realistic" analysis of the SMP model.

To model the communication strategy, we considered three cases: (1) sequential access, (2) parallel (naïve) access, and (3) "realistic" analysis. For the sequential access case, we assumed that only one set of messages could be on the network at a given time. For the parallel case, we assumed that all necessary messages could be on the network at a given time. For the realistic graph, we assumed that the SMP had the memory bandwidth to do all of its messages at a given time, but that the NOW and ethernet models performed some sort of tree broadcast to require no more than 2 serial accesses. The three preceding figures show the timing curves predicted by the three LogP models. As we will see, it turns out that the SMP probably doesn’t quite have the memory bandwidth to support 6 or 7 simultaneous messages.

Empirical Results (Single SMP):

After implementing the system and tweaking the compile flags to achieve slightly better performance, we ran several tests of the single SMP system. The timing/speedup curves are shown in Figure 7.

Figure 7. Time/speedup curves for a single game on a single SMP.

The achieved times are better than expected for 2-6 processors, but take a dip for 7 and 8 processors. It seems that cache benefits allow super-linear performance in the 2-6 case, but the communication overhead starts to overcome the benefit of additional processors above 6. This might be a function of the memory bandwidth restrictions on the CLUMPs, as not all of the memory slots are filled, or it might be related to load balance problems. The load balance scheme utilized was static, and the resulting imbalance was certainly significant for the larger numbers of processors; for example, some processors would have 3/2 the load of another processor.
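
The 3/2 imbalance follows directly from dividing 22 players among the processors. The sketch below assumes an even static split in which every player costs the same and the server's own work is ignored; the exact assignment used in the project is not described, so this is an illustration rather than a reconstruction. It also ignores the cache effects credited above for the super-linear results.

```python
def static_loads(n_players, processors):
    """Players per processor under an even static split."""
    base, extra = divmod(n_players, processors)
    return [base + 1] * extra + [base] * (processors - extra)

def balance_bound(n_players, processors):
    """Best possible speedup if each step waits for the most loaded processor."""
    return n_players / max(static_loads(n_players, processors))

for p in range(1, 9):
    loads = static_loads(22, p)
    print(p, loads, f"max/min = {max(loads)}/{min(loads)}",
          f"bound = {balance_bound(22, p):.2f}")
```

Under these assumptions, 8 processors receive either 3 or 2 players each, which is the 3/2 ratio mentioned above, and the bound at 7 processors is no better than at 6 because one processor must still handle 4 players.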

Figure 8. Percentage of time spent in each section of the code.

Figure 9. Absolute time spent in each section of the code.

Timing analyses were performed for the single SMP case as well. Figure 8 shows the relative time spent in each section of the code. The time spent in barriers goes up considerably as the number of processors increases. This mainly indicates time spent waiting because of load imbalance, as the step time is relatively insignificant, even when the number of processors is large. The relative time spent on receiving doesn’t change much, but the time spent on players and on sending decreases as the time spent on barriers increases. Figure 9 shows the absolute time spent as the number of processors increases. The receiving time nearly vanishes, the sending time consistently decreases, and the player time decreases as well. In order to examine why the speedup tapers off for 7 and 8 processors, we investigated how each of the separate times changed as the number of processors increased. This is shown in figure 10.

Figure 10. Speedup/slowdown curves for each part of the code in the single SMP case.
In looking at the data represented in figure 10, we were surprised by several items. First, the speedup curve for the sending cycle is almost a super-linear straight line. This indicates no real slowdown caused by distributing the field information to the different processors. Second, the speedup curve for the player code is the curve that most dramatically tapers off at large numbers of processors. Notably, it matches the overall curve and corresponds with the increase in barrier time. Some of this time might well be due to problems with averaging times over processors with slightly different loads, but a similar curve is obtained even when viewing the minimum time spent in player code for each thread. Thus, some factor, perhaps related to code swapping in the instruction cache, is causing a slowdown. The obvious suspect is communication, but there is very little communication directly in the player code, which only reads in the messages from the server. Given that a corresponding slowdown is seen in both the receive code and the step code, the cache explanation makes some sense, but more detailed analysis of what is occurring on the SMP is required. Detailed cache and memory analysis would likely reveal the source of the problem.

A brief attempt at discerning the cause of the problem revealed that the time spent per player undergoes an interesting change as the number of processors increases. Figure 11 shows these results. The fact that the time spent per player decreases at first is probably due to the benefits of having multiple caches. The later slowdown could be due to the fact that the costs of bringing the general player code into the cache are distributed over fewer players per processor. This hypothesis could be checked with more careful memory analysis.

Figure 11. Average and minimum cumulative time spent per player.

 

The multi-clump system: a simple file-based interconnect.

In order to demonstrate the efficiency of the entire system, we designed a simple client-server file-based mechanism for distributing the teams that need to be played in a game. Although there clearly exist faster and cleaner mechanisms for communicating (MPI, Split-C, or even UDP/IP, as examples), a disk-based system was both easy to code and very flexible for performing runs on a hybrid collection of machines. Additionally, the file-based system easily allows machines to be added or deleted during the course of a run of genetic programming. Also, the system allows a diverse set of machines, such as the NOW (Berkeley’s network of single processor UltraSPARCs), the CLUMPs (the Sun Enterprise 5000 servers), and the MILLs (4-way Pentium SMPs), to be used together on a single run.
The system consists of a client that is the general GP engine, and a set of game-servers that each run games in the soccer simulator. Prior to starting the client, as many servers as desired are started. The servers each write a unique identifier into a common shared subdirectory. The client reads these identifiers when it starts up. The communication between the client and each server is mediated by a flag file, which stores the current state of the interaction. When the client has written a pair of teams into a file named according to the server’s identifier, the client also writes a 1 to the flag file. The server waits for this flag to be set. Upon noticing that it is set, the server reads in the teams and starts a game. When the server is done, it writes the results of the game (some scoring information) into a file named according to the server’s identifier, and writes a 0 to the flag file. When the client notices that the flag has been reset to 0, it reads in the scoring information and, if possible, writes a new set of teams that the given server can play.
The system allows for servers to be added or deleted during the course of a run. This is accomplished by either adding or deleting the identifier files from the common subdirectory. This system was implemented by writing 400 lines of code to perform the various file reading and writing tasks.
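The handshake can be sketched in Python as below. The file names, the helper functions, and the write-then-rename trick for avoiding partially written files are all assumptions made for this illustration; the paper describes the protocol only at the level of the flag values 0 and 1. The demo runs the client and one server sequentially in a single process, whereas in the real system each side would poll in its own loop on its own machine.

```python
import os
import tempfile

def write_text(path, text):
    """Write via a temporary file and rename so readers never see a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(text)
    os.replace(tmp, path)

def read_text(path):
    with open(path) as f:
        return f.read()

def paths(shared_dir, server_id):
    # Hypothetical file naming; the actual names are not given in the paper.
    return {kind: os.path.join(shared_dir, f"{server_id}.{kind}")
            for kind in ("id", "flag", "teams", "scores")}

def server_register(shared_dir, server_id):
    p = paths(shared_dir, server_id)
    write_text(p["flag"], "0")          # 0 = idle, or results are ready
    write_text(p["id"], server_id)      # presence of this file adds the server

def client_assign(shared_dir, server_id, teams):
    p = paths(shared_dir, server_id)
    write_text(p["teams"], teams)
    write_text(p["flag"], "1")          # 1 = a game is waiting to be played

def server_poll(shared_dir, server_id, play_game):
    """One polling pass by a game server; returns True if a game was played."""
    p = paths(shared_dir, server_id)
    if read_text(p["flag"]) != "1":
        return False
    write_text(p["scores"], play_game(read_text(p["teams"])))
    write_text(p["flag"], "0")          # tell the client the scores are ready
    return True

def client_collect(shared_dir, server_id):
    """Return the scores if the server has finished a game, else None."""
    p = paths(shared_dir, server_id)
    if read_text(p["flag"]) != "0" or not os.path.exists(p["scores"]):
        return None
    return read_text(p["scores"])

shared = tempfile.mkdtemp()
server_register(shared, "smp0")
client_assign(shared, "smp0", "teamA vs teamB")
played = server_poll(shared, "smp0", play_game=lambda teams: f"{teams}: 2-1")
result = client_collect(shared, "smp0")
print(played, result)
```

Removing a server in this sketch would amount to deleting its `.id` file, so that the client stops assigning it games.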

After obtaining some sample timing numbers, we calculated the effect of using the disk-based system for various numbers of processors. We found that it took an average of 0.23 seconds to read and write the data (including the flags, the teams, and the scores) for a single game. We assumed a single disk system and serial access, with no overlap of disk writes and computation. In reality, it is unlikely that all processors would be waiting at the same time for the client to write files. Figure 12 shows the results of this analysis. When running with up to 64 nodes, this worst case analysis indicates that only 10% of the time of a single game would be spent on communication, still allowing for significant speedup from 64 processors. Clearly, to go beyond 128 processors, either multiple disk systems or a faster communications system would be necessary.
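
A back-of-the-envelope version of this worst-case model is sketched below. The 0.23-second I/O cost is the measured value quoted above; the game length is an ASSUMED parameter, set here to the roughly 140-second figure from the stage 1 timings, and the analysis behind Figure 12 may have used a different value.

```python
def disk_fraction(nodes, io_seconds_per_game, game_seconds):
    """Worst-case share of a round spent on file I/O when one disk serves
    every node in turn and no I/O overlaps with computation."""
    io = nodes * io_seconds_per_game
    return io / (io + game_seconds)

IO_PER_GAME = 0.23      # seconds, measured value quoted in the text
GAME_SECONDS = 140.0    # ASSUMED game length, borrowed from the stage 1 timing

for nodes in (1, 8, 16, 32, 64, 128):
    share = 100 * disk_fraction(nodes, IO_PER_GAME, GAME_SECONDS)
    print(nodes, f"{share:.1f}%")
```

With these inputs the I/O share at 64 nodes comes out just under 10%, in line with the figure quoted above, and it roughly doubles at 128 nodes, which illustrates why a single shared disk stops being adequate at that scale.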

Figure 12. Theoretical analysis of disk-time for multi-SMP system.

Empirical Results (Multi-SMP):

Using the file-based system described above, the performance of the system was tested by running a set of six 3000-timestep games on a 1-SMP system, a 2-SMP system, and a 3-SMP system. The results are shown in figures 13 and 14. Both figures indicate that the multi-SMP model performs well, achieving slightly super-linear speedup in one set of runs and slightly below perfect linear speedup in the other. As predicted, using the disk to distribute the teams incurs very little penalty for small numbers of clumps.
Figure 13. Time/Speedup curves for the multi-SMP system. Circled data points utilize multiple clumps.
Figure 14. Time/Speedup curves for the multi-SMP system. This is not the same data as in figure 13; it represents a different set of runs. In this case, the results are slightly below linear.

To further test the scalability and reliability of the disk-based system, we tested its performance on the NOW, using only one thread per processor. This test is of course only of a single level of the ideal parallel system – the slow level, but it provides a test of scalability that cannot be easily performed with the available set of clumps. The results of this test are shown in figure 15.

Figure 15. Scalability test of the disk-based system on the NOW.

The results presented in figure 15 indicate that our disk-based mechanism allows for nearly linear speedup across multiple processors. In fact, the performance at 16 processors is about 3.5 seconds off of perfect linear speedup, which is approximately what was predicted by the disk-based model. In addition, load balance was somewhat of an issue: some processors appeared to be faster than others, which can account for some of the failure to speed up. The performance at 8 and 4 processors appears to be slightly super-linear, but this is probably an artifact of random variance in the run times. More tests should be performed here to validate the numbers and reduce the variance of the results.
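The comparison against perfect linear speedup used in this discussion can be made explicit with a small helper. This is a sketch only: the function names are hypothetical, and the timings in the usage example are invented placeholders, not measurements from the runs reported here.

```python
# Speedup bookkeeping relative to a single-processor baseline time t1.


def ideal_time(t1, p):
    """Run time on p processors under perfect linear speedup."""
    return t1 / p


def speedup(t1, tp):
    """Observed speedup: baseline time divided by parallel time."""
    return t1 / tp


def efficiency(t1, tp, p):
    """Speedup per processor; 1.0 is linear, above 1.0 is super-linear."""
    return speedup(t1, tp) / p


def gap_from_linear(t1, tp, p):
    """Seconds by which the parallel run misses (or beats, if negative) linear."""
    return tp - ideal_time(t1, p)


if __name__ == "__main__":
    # PLACEHOLDER timings in seconds, for illustration only.
    t1 = 100.0
    runs = {4: 24.0, 8: 12.4, 16: 8.0}
    for p, tp in runs.items():
        print(f"p={p:2d}  speedup={speedup(t1, tp):5.2f}  "
              f"efficiency={efficiency(t1, tp, p):4.2f}  "
              f"gap={gap_from_linear(t1, tp, p):+.2f}s")
```

A negative gap corresponds to an apparently super-linear run. With only a few runs per configuration, small negative gaps are hard to tell apart from run-to-run variance, which is the caveat raised above.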

Conclusions and Future Work

The results presented in this paper represent a success on most of the goals set forth for the parallel-programming aspect of the Robocup project. By utilizing a multi-level parallel architecture, we can achieve scalable and nearly linear speedup. Based on the results in this paper, it appears that a 3-CLUMP system can run 72 games in 160 seconds, compared with a single game using our old system. Thus, this work provides a critical speedup for the Robocup project.

However, there are several issues that could benefit from further work on the parallel algorithm front. First, a better understanding of the source of slowdown at 7 and 8 processors might enable further speedups. Also, examining the reason for the large increase in barrier times might yield greater performance. Second, a cleaner and better system for achieving the network-level parallelism should be investigated. Finally, the "slow" network inter-SMP parallelism exploited here was at the individual program level. It might be worthwhile and important to add in the sub-population layer of parallelism often exploited by genetic programming approaches.

For the Robocup project, the future work entails utilizing this system as much as possible to attempt to learn soccer-playing programs that will be highly competitive in this year’s Robocup tournament. Also, we expect to port the system to the 70+ existing nodes of the alpha-system being built at Stanford by Dr. John Koza.

Finally, it should be noted that the main point of this paper is that program synthesis (and not just GP) working on a complex problem (not just Robocup) can benefit from and make use of a multi-level parallel system such as a network of SMPs.

Bibliography

Accelerated Strategic Computing Initiative. (1996). The Red Hardware. Information available at http://mephisto.ca.sandia.gov/TFLOP/sc96/node2.html.

Andre, D., and Koza, J. R. (1998). Exploiting the fruits of parallelism: An implementation of parallel genetic programming that achieves super-linear performance. Information Science Journal, Elsevier, In press.

Keeton, K.K., Anderson, T.A., and Patterson, D.A. (1995). LogP quantified: the case for low-overhead local area networks. Presented at Hot Interconnects III: A Symposium on High Performance Interconnects, Stanford University.

Kitano, H., et al. (1997). RoboCup Synthetic Agent Challenge 97. In Proceedings of the International Joint Conference on Artificial Intelligence. Nagoya, Japan, August 1997.

Koza, J.R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, Mass.: MIT Press.

Koza, J.R. (1998). Personal communication.

Koza, J.R., Andre, D., Bennett, F.H., and Keane, M.A. (1998). Genetic programming III: automatic programming and circuit synthesis. Morgan Kaufmann. In press.

Lumetta, S.S., Mainwaring, A.M., and Culler, D.E. (1997). Multi-protocol active messages on a cluster of SMPs. In Proceedings of the 1997 SuperComputing conference.

Noda, I. (1998). Robocup Soccer Server. Information available from http://ci.etl.go.jp/~noda/soccer/server/index.html.