Taking advantage of both SMP and distributed computing: Parallelizing the automatic creation of Robocup soccer programs on the CLUMPs

David Andre

CS 267 -- Project Proposal

 

Introduction:

    Distributed workstations (such as those in the NOW) are appealingly cheap because they use off-the-shelf hardware, but proponents of SMP shared-memory architectures argue that the programming ease and fast communication of their systems more than justify the expense.  One partial compromise is to use clusters of SMP machines, such as the Berkeley CLUMPs, whereby applications can take advantage of both the high communication speeds and shared memory of each SMP and the savings from using a comparatively low-speed network to connect the SMPs.  Despite the appeal of this idea, it appears that few researchers have found applications that truly benefit from the hybrid approach.
    This project aims to demonstrate such an application.
    Robocup, the Robot World Cup Initiative, is a challenge to the artificial intelligence and intelligent robotics community to create teams capable of playing soccer, both in the real-world domain using robotic players and in the virtual domain using a soccer simulator.  The simulator league is by no means a toy problem -- the simulator is complex, accurately modelling wind, rain, endurance, etc.  Each team consists of 11 separate soccer client programs that communicate only through the server.  The games run in real time, and the length of a game in tournament play is set at 10 minutes.  Given the complexity of the problem and the slowness of the simulations, using machine learning methods to automatically synthesize soccer-playing programs is a compelling challenge.  Because most machine learning methods require a great deal of experience (often thousands to millions of games), a time scale of 10 minutes per game is out of the question.  Thus, the question is: to what degree can the length of a simulation be reduced by parallelization?
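    To make the time-scale problem concrete, here is a minimal sketch of the arithmetic, assuming games are played one after another at the real-time rate of 10 minutes each.  The function name `serial_days` is hypothetical and only serves the illustration.

```python
MINUTES_PER_GAME = 10  # tournament game length stated above


def serial_days(num_games, minutes_per_game=MINUTES_PER_GAME):
    """Wall-clock days needed to play num_games back to back in real time."""
    return num_games * minutes_per_game / (60 * 24)


if __name__ == "__main__":
    for games in (1_000, 1_000_000):
        print(f"{games:>9,} games: {serial_days(games):10.1f} days")
```

    Under these assumptions, a thousand games take roughly a week of wall-clock time and a million games take about 19 years, which is why the simulation itself has to be sped up.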
    Genetic programming is a powerful technique for automatic program synthesis.  Essentially, genetic programming is a parallel search technique modelled after natural evolution that automatically synthesizes computer programs to solve a given problem.  Genetic programming has proven quite effective in recent years, producing solutions to a variety of problems in different fields, including machine learning, mechanical engineering, analog circuit design, molecular biology, pattern recognition, theorem proving, and cellular automata.  It also seems able to solve harder problems when given more computational resources.  Numerous approaches to parallelizing genetic programming have been taken, and most have been quite successful, often showing linear or even super-linear speedup.  (Look here for more information on GP and parallel GP.)  However, most successful applications of genetic programming have used simulations or program evaluations that take on the order of a second of compute time on a modern processor.  Even then, most applications of GP to difficult problems have still required considerable (hours to days) computational effort.  Thus, without improving the performance of the simulator, it seems hopeless to attempt to learn successful soccer agents using genetic programming.

The ideal distributed SMP application:

    In order to truly show off the benefits of a distributed SMP system, we require an application that can take advantage of both layers of parallelism.  The application should have some parallelism on a tight time scale involving a relatively large amount of communication, and some parallelism on a slower time scale involving, presumably, a smaller amount of data.  As we will see, learning Robocup players using parallel genetic programming should fit this bill.  At the level of speeding up the Robocup simulator, we can use the SMP to parallelize both the execution of the player agents and the server itself.  At the level of evaluating a population of teams, we can use the slower communication facilities of the distributed network.
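    The two layers can be sketched as nested worker pools.  This is only an illustration under stated assumptions: thread pools stand in for both the SMP processors and the cluster nodes, `player_action` is a hypothetical placeholder for a real client's decision, and the "score" is a dummy value rather than a soccer result.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

PLAYERS_PER_TEAM = 11


def player_action(team_id, player_id, step):
    """Hypothetical stand-in for one client's decision at one timestep."""
    return (team_id * 31 + player_id * 7 + step) % 3


def evaluate_team(team_id, steps=5):
    """Inner, fine-grained level: fan out the 11 players at every timestep.

    On a real SMP these would be threads sharing the simulator state.
    """
    score = 0
    with ThreadPoolExecutor(max_workers=PLAYERS_PER_TEAM) as smp:
        for step in range(steps):
            act = partial(player_action, team_id, step=step)
            score += sum(smp.map(act, range(PLAYERS_PER_TEAM)))
    return score


def evaluate_population(team_ids, nodes=4):
    """Outer, coarse-grained level: one task per team; only a score comes back.

    On a real cluster each task would run on a different SMP node.
    """
    with ThreadPoolExecutor(max_workers=nodes) as cluster:
        return list(cluster.map(evaluate_team, team_ids))


if __name__ == "__main__":
    print(evaluate_population([0, 1, 2]))
```

    The point of the structure is the difference in traffic: the inner level synchronizes at every timestep, while the outer level exchanges only a team identifier and a single number per evaluation.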

Robocup:

 

    The Robocup soccer simulator (available here) is a client-server system in which the server runs the simulation and there is a separate client for each soccer player.  Communication in the standard system takes place through UDP/IP sockets, so the clients can be written in any system that has a UDP/IP interface.  The simulator updates the positions of the players on the field based on the players' commands and sends the players noisy perceptual information.  The simulator also keeps track of the position of the ball and the endurance of each player.  The clients can perform arbitrary calculations in their code, but cannot communicate with one another except through the server.
    The server advances the simulation in timesteps (100ms real time in the standard version), and the simulator will execute only one action (with a few exceptions) per agent per timestep.  The server sends perceptual information (visual, auditory (shouts of teammates), and proprioceptive (how tired I am, how many commands I did last timestep, etc.)) to the players on a different time scale (approximately every 300ms real time).  The timing of the agents is critical and is a complication for this project.  The clients can send motion commands (dash, turn, kick) as well as shouts to other players (shout).
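    The timing described above can be sketched as a simple lockstep loop.  This is a simplification, not the actual server: real clients run asynchronously, the "few exceptions" to the one-action rule are ignored, and percepts are assumed to arrive exactly every third timestep.  The names `run_server` and `eager_agent` are hypothetical.

```python
STEP_MS = 100     # simulation timestep from the text
PERCEPT_MS = 300  # approximate interval between percepts from the text


def run_server(num_steps, agents):
    """Run a toy lockstep simulation.

    agents maps a player name to a callable (now_ms, percept_or_None) that
    returns a list of commands. Only the first command per agent per
    timestep is executed; any extra commands are discarded.
    """
    executed = []
    for step in range(num_steps):
        now = step * STEP_MS
        for name, agent in agents.items():
            percept = {"time_ms": now} if now % PERCEPT_MS == 0 else None
            commands = agent(now, percept)
            if commands:
                executed.append((now, name, commands[0]))
    return executed


def eager_agent(now, percept):
    """Hypothetical client that always tries to issue two commands."""
    return ["dash", "kick"]


if __name__ == "__main__":
    for entry in run_server(4, {"player1": eager_agent}):
        print(entry)
```

    Even in this toy version, an agent must act on two out of every three timesteps without fresh perceptual information, which hints at why agent timing complicates any attempt to run the simulation faster than real time.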
    To successfully speed up the system, we will probably need to use something other than UDP/IP communication (although we may need to simulate the possibility of dropped packets).  A large portion of the execution time is spent in the simulator, and thus we can probably achieve further speedups by parallelizing the simulator itself.  The simulator is mostly particle based, and thus we can apply many of the techniques that we have learned.  Of course, we can run each of the clients in parallel, but we can probably achieve additional speedups by managing the context switching ourselves rather than trusting the OS to do this for us.
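    One way to replace the sockets while still simulating dropped packets is an in-memory queue that discards messages at random.  The sketch below is an assumption about how this could look, not part of the simulator; the class `LossyChannel` and its `drop_prob` and `seed` parameters are hypothetical.

```python
import random
from collections import deque


class LossyChannel:
    """In-memory, one-way message queue that drops messages at random."""

    def __init__(self, drop_prob, seed=0):
        if not 0.0 <= drop_prob <= 1.0:
            raise ValueError("drop_prob must be between 0 and 1")
        self.drop_prob = drop_prob
        self.rng = random.Random(seed)  # seeded so runs are repeatable
        self.queue = deque()
        self.dropped = 0

    def send(self, message):
        """Enqueue message, or silently lose it with probability drop_prob."""
        if self.rng.random() < self.drop_prob:
            self.dropped += 1
            return False
        self.queue.append(message)
        return True

    def receive(self):
        """Return the oldest delivered message, or None if nothing is waiting."""
        return self.queue.popleft() if self.queue else None


if __name__ == "__main__":
    channel = LossyChannel(drop_prob=0.1)
    for command in ["(dash 80)", "(turn 30)", "(kick 100 0)"]:
        channel.send(command)
    print(channel.receive(), "| dropped so far:", channel.dropped)
```

    A fixed seed keeps the losses repeatable from run to run, which may matter if fitness scores of evolved teams are to be compared fairly.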

Issues in parallelizing Genetic Programming for Robocup:

    Although parallelizing genetic programming is fairly straightforward (as discussed here), several assumptions might limit its applicability. First, it is known that a breeding group (deme) must have a certain minimum population size, and that deme population sizes of at least 1000 are typically desirable. These size requirements allow the population within a deme to maintain enough diversity for the search to remain efficient. Second, it is assumed that runs require approximately 50 generations and must finish in under two days of real time (thus approximately 1 generation must be processed per hour). For a population size of 1000, this means that each individual can take only 3.6 seconds to evaluate fully. Thus, the complexity of the fitness evaluation is fairly harshly limited. Even assuming that we can get by with 300 individuals per deme, we still have only about 12 seconds to evaluate each individual.
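    The budget above can be checked with a few lines of arithmetic.  The sketch assumes the rounded figure of one generation per hour and, as a further assumption, that an unaccelerated evaluation is a single 10-minute (600-second) game; the function names are hypothetical.

```python
SECONDS_PER_GENERATION = 3600  # "approximately 1 generation per hour"
REAL_TIME_GAME_SECONDS = 600   # assumption: one 10-minute game per evaluation


def seconds_per_individual(deme_size, generation_seconds=SECONDS_PER_GENERATION):
    """Time available to evaluate one individual if evaluations run serially."""
    return generation_seconds / deme_size


def required_speedup(deme_size, game_seconds=REAL_TIME_GAME_SECONDS):
    """Factor by which one evaluation must be accelerated to fit the budget."""
    return game_seconds / seconds_per_individual(deme_size)


if __name__ == "__main__":
    for size in (1000, 300):
        print(f"deme of {size}: {seconds_per_individual(size):.1f} s each, "
              f"speedup needed {required_speedup(size):.0f}x")
```

    With a deme of 1000 the budget is 3.6 seconds per individual, and with 300 it is 12 seconds.  Under the single-game assumption, that corresponds to speedups of roughly 167x and 50x over real time, which must come from some combination of a faster simulator and evaluating individuals in parallel.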
    Thus, in order to use GP for difficult problems, some sort of parallelization at the fitness evaluation level is required. Although this has been explored to some extent in the field of genetic programming (our work on evolving FPGA configurations, for example), it is neither the predominant method nor a solved problem. Another possibility is to spread each deme across multiple processors and do the reproductive operations in parallel. This is clearly feasible, although considerably more communication intensive than keeping a whole deme on a single processor.

Details:

    In order to test the distributed SMP system, we will compare it (both in theory and with experimental results) against systems that use only one of the two technologies (i.e., either SMP alone or distributed computing alone).  We expect the hybrid system to achieve the fastest run times of the three configurations.

Reasons why distributed alone won't work

Reasons why SMP alone won't work

Known issues, possible problems: