Run 5 learners and 10 actors in a cluster

Setting up and running the full system is an involved procedure. If anything is unclear, please open an issue in the issue tracker.

Install dependencies for the code

The complete system requires several dependencies.

An installation script that installs them is provided in universe-starter-agent/install/install.sh

Modify the cluster configuration

The multiple-learner component is implemented with distributed TensorFlow.

The learner configuration is hard-coded in universe-starter-agent/ccvl_cluster_spec.py. Edit this file to match your cluster.

In the following, the parameter server runs on ccvl2, and the five machines ccvl1-5 each run a learner (so ccvl2 hosts both a learner and the parameter server). The parameter server is responsible for coordinating weights between the learners.
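For illustration, the layout above could be written down as a cluster spec like the one below. This is a hypothetical sketch, not the actual contents of ccvl_cluster_spec.py; the variable names and the ports 2222/2223 are assumptions.

```python
# Hypothetical sketch of a cluster spec for the layout described above.
# The real ccvl_cluster_spec.py may use different names and ports.

PS_PORT = 2222      # assumed port for the parameter server
WORKER_PORT = 2223  # assumed port for each learner

cluster_spec = {
    # One parameter server, hosted on ccvl2.
    "ps": ["ccvl2:%d" % PS_PORT],
    # Five learners, one per machine; the worker id is the list index.
    "worker": ["ccvl%d:%d" % (i, WORKER_PORT) for i in range(1, 6)],
}

# With TensorFlow available, tf.train.ClusterSpec(cluster_spec) turns this
# dict into the object distributed TensorFlow expects.
```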

Start the parameter server

On the parameter server machine, ccvl2, start the parameter server with

cd universe-starter-agent/
sh run_ps.sh

universe-starter-agent/run_ps.sh starts ps_run.py with the proper parameters.
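A minimal parameter-server entry point in distributed TensorFlow typically just creates a server for the "ps" job and blocks. The sketch below shows that shape; it is an assumption about what ps_run.py does, not its actual contents, and the TF 1.x calls are left in comments so the sketch stays self-contained.

```python
# Hypothetical sketch of a parameter-server entry point; the real
# ps_run.py may differ. Ports are assumptions.

def ps_address(cluster_spec):
    """Return the host:port the parameter server should bind to."""
    return cluster_spec["ps"][0]

cluster_spec = {
    "ps": ["ccvl2:2222"],
    "worker": ["ccvl%d:2223" % i for i in range(1, 6)],
}

if __name__ == "__main__":
    # With TensorFlow 1.x installed, the body would read:
    #
    #   import tensorflow as tf
    #   server = tf.train.Server(tf.train.ClusterSpec(cluster_spec),
    #                            job_name="ps", task_index=0)
    #   server.join()  # block forever, serving variables to the learners
    #
    print("parameter server would bind to", ps_address(cluster_spec))
```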

Start five learners

On each machine ccvl1-ccvl5, start a learner with

sh run_learner.sh 0

The argument 0 is the worker id for ccvl1; use 1 on ccvl2, and so on up to 4 on ccvl5.
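The hostname-to-id convention can be captured in a tiny helper. This helper is hypothetical (the real run_learner.sh simply takes the id as a command-line argument); it only documents the mapping described above.

```python
# Hypothetical helper encoding the worker-id convention above:
# ccvl1 -> 0, ccvl2 -> 1, ..., ccvl5 -> 4.

def worker_id(hostname):
    """Map a ccvlN hostname to its worker id (N - 1)."""
    if not hostname.startswith("ccvl"):
        raise ValueError("expected a ccvlN hostname, got %r" % hostname)
    return int(hostname[len("ccvl"):]) - 1

# Example: on ccvl3 you would run `sh run_learner.sh 2`.
```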

The learner will wait until all actors are connected.

Start all actors and start learning

Start the Docker containers that hold the neonrace virtual environment. The following script starts two containers, each running one neonrace virtual environment.

sh run_docker.sh

Start the actor code with

sh run_actor.sh

run_actor.sh will run actor.py with proper parameters.
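At its core, an actor runs the usual act/observe loop against the neonrace environment and feeds the experience to its learner. The sketch below shows only that loop, using a stub in place of the real VNC environment; the stub's interface and everything in it are assumptions, not the actual actor.py code.

```python
import random

# Stub standing in for the neonrace VNC environment started above.
# Its interface (reset/step) is an assumption for illustration only.
class StubEnv:
    def __init__(self, episode_len=5):
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # initial observation

    def step(self, action):
        self.t += 1
        obs, reward = float(self.t), 1.0
        done = self.t >= self.episode_len
        return obs, reward, done

def run_episode(env, policy):
    """Run one act/observe episode and return the total reward."""
    obs, done, total = env.reset(), False, 0.0
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total

# A random policy over two dummy actions, just to drive the loop.
total = run_episode(StubEnv(), policy=lambda obs: random.choice([0, 1]))
```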

Check the learning result

The learning procedure can be visualized by connecting to a docker container over VNC.

Use a TurboVNC client to connect to ccvl1.ccvl.jhu.edu:13000. Change the address to match your own configuration.

The trained models are stored in the train-log folder. Use TensorBoard to visualize the results, or use the code in neonrace to run a trained model.