Recently I was playing around with OpenAI Gym and the Keras Reinforcement Learning library (keras-rl). I was able to train an AI agent for the task of landing on the Moon.
OpenAI Gym provides all sorts of environments to explore with AI bots. One of them is Lunar Lander, which I'll focus on here.
Keras, on the other hand, is a high-level library built on top of TensorFlow (or Theano). It provides mechanisms for constructing deep learning models easily.
Creating a new environment using OpenAI Gym is as easy as this:
import gym

env = gym.make('LunarLander-v2')
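Once created, the environment exposes the standard Gym interface: reset() returns the initial observation (an 8-dimensional state vector for LunarLander-v2) and step() applies one of 4 discrete actions. Here is a minimal sketch of the interaction loop, with random actions just to exercise the API:

observation = env.reset()  # initial 8-dimensional state vector
done = False
while not done:
    action = env.action_space.sample()  # pick one of the 4 discrete actions at random
    observation, reward, done, info = env.step(action)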
Here's how you can add a video recorder to capture progress during training (record_video_every controls how often episodes are recorded; the value below is just an example):
record_video_every = 100  # example value: record every 100th episode

env = gym.wrappers.Monitor(env,
                           'recording',
                           resume=True,
                           video_callable=lambda count: count % record_video_every == 0)
The model itself is a quite simple DQN agent with a LinearAnnealedPolicy. The most important layer is the dense 512-neuron hidden layer, which is responsible for understanding the current situation during landing. The small dense layer on top of it is responsible for the final decisions related to the lunar lander's actions (steering the engines).
Here's how it can be instantiated:
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory

WINDOW_LENGTH = 1                # past observations fed to the network per step
nb_actions = env.action_space.n  # 4 discrete actions in LunarLander-v2

model = Sequential()
model.add(Flatten(input_shape=(WINDOW_LENGTH,) + env.observation_space.shape))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
model.summary()
memory = SequentialMemory(limit=1000000, window_length=WINDOW_LENGTH)
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1., value_min=.1, value_test=.05, nb_steps=1000000)
dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy, memory=memory, nb_steps_warmup=50000, gamma=.99, target_model_update=10000, train_interval=4, delta_clip=1.)
dqn.compile(Adam(lr=.00025), metrics=['mae'])
Here's a summary of the model:
Layer (type)               Output Shape    Param #    Connected to
===========================================================================
flatten_1 (Flatten)        (None, 8)       0          flatten_input_1[0][0]
dense_1 (Dense)            (None, 512)     4608       flatten_1[0][0]
activation_1 (Activation)  (None, 512)     0          dense_1[0][0]
dense_2 (Dense)            (None, 4)       2052       activation_1[0][0]
activation_2 (Activation)  (None, 4)       0          dense_2[0][0]
===========================================================================
Total params: 6,660
Trainable params: 6,660
Non-trainable params: 0
Training can take a lot of time. In this example I used 3.5 million steps.
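For completeness, the training run itself boils down to a single keras-rl call; here is a sketch (the weights filename is just an example):

dqn.fit(env, nb_steps=3500000, log_interval=10000)
dqn.save_weights('dqn_lunar_lander_weights.h5f', overwrite=True)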
The outcome is quite reasonable landing behavior for the lunar lander.
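To watch the trained agent in action, keras-rl can run evaluation episodes with rendering enabled:

dqn.test(env, nb_episodes=10, visualize=True)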
Note that depending on the environment, learning can take less time (if the environment is not very tricky). In this case I found that the difficulty lies in the sensitivity of the last phase of landing (touchdown). It took some time for the model to figure this part out.
As a side note, OpenAI also provides a lunar lander agent that follows an optimal trajectory. In that example, the agent polls the environment to figure out which actions give the highest output of the Q function.
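To illustrate the idea, here is a hypothetical sketch (the helper function is mine, not OpenAI's code) of picking the action with the highest Q value from a trained network like the one above:

import numpy as np

def greedy_action(model, observation):
    # Reshape to the (batch, window, features) = (1, 1, 8) layout the network expects.
    q_values = model.predict(np.array([[observation]]))[0]
    return int(np.argmax(q_values))  # index of the highest-valued action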