The model itself is quite simple DQN Agent with LinearAnnealedPolicy. The most important layer is Dense 512 neuron internal layer. It's responsible for understanding of the current situation during landing. Next small, dense layer on top of it is responsible for final decisions related to lunar lander actions (steering the engines).
Here's how it can be instantiated:
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
memory = SequentialMemory(limit=1000000, window_length=WINDOW_LENGTH)
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1., value_min=.1, value_test=.05, nb_steps=1000000)
dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy, memory=memory, nb_steps_warmup=50000, gamma=.99, target_model_update=10000, train_interval=4, delta_clip=1.)
Training can take a lot of time. In this example I used 3.5 million steps.
The outcome gives quite reasonable behavior for lunar lander.
Note that depending on environment, learning can take less time (if environement is not very tricky). In this case I found that difficulties are related to sensitivity of the last phase of landing (touch down). It took some time for the model to figure this part out.
As a side note, OpenAI provides lunar lander Agent using optimal trajectory. In this example, the Agent polls environment to figure out actions that give highest output of Q function.