AWS DeepRacer Training Algorithm

AWS DeepRacer uses the Proximal Policy Optimization (PPO) algorithm to train the reinforcement learning model. PPO uses two neural networks during training: a policy network and a value network. The policy network (also called the actor network) decides which action to take given an image as input. The value network (also called the critic network) estimates the cumulative reward the agent is likely to collect given the image as input. Only the policy network interacts with the simulator and gets deployed to the real agent, namely an AWS DeepRacer vehicle.

                Image: Graphical illustration of PPO.

Below we explain how the actor and critic work together mathematically.

PPO is a derivative of the policy gradient method. In its most basic form, the policy gradient method trains the agent to move along a track by searching for the optimal policy π*(a|s; θ*). The optimization aims to maximize a policy score function J(θ), which can be expressed in terms of the immediate reward r(s,a) of taking action a in state s, averaged over the state probability distribution ρ(s) and the action probability distribution π(a|s; θ):

                Image: Policy score function
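Written out, a score function matching the description above could take the following form. This is a sketch based on the surrounding definitions, not a transcription of the figure; for a discrete action space the integral over a would become a sum.

```latex
J(\theta) = \int\!\!\int \rho(s)\, \pi(a \mid s; \theta)\, r(s, a)\, \mathrm{d}a\, \mathrm{d}s
```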

The optimal policy, as represented by the optimal policy network weights θ*, can be expressed as follows:

                Image: Optimal policy

The maximization can proceed by following the policy gradient ascent over episodes of training data (s, a, r):

                Image: Policy gradient ascent update

where α is known as the learning rate and ∇_θ J(θ_τ) is the policy gradient with respect to θ evaluated at step τ.
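For reference, a standard way to write the optimal policy weights and the gradient ascent update described above is shown below. It is a sketch consistent with the surrounding definitions rather than a copy of the original figures.

```latex
\theta^{*} = \arg\max_{\theta} J(\theta),
\qquad
\theta_{\tau+1} = \theta_{\tau} + \alpha\, \nabla_{\theta} J(\theta_{\tau})
```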

In terms of the total future reward:

                Image: Total future reward

where γ is the discount factor, ranging between 0 and 1, τ indexes an experience (s_τ, a_τ, r_τ) at step τ, and the summation covers the experiences in an episode that starts at time t = 0 and ends at time t = H, when the agent goes off-track or reaches the finish line. With this definition, the score function becomes the expected total future reward averaged over the policy distribution π across many episodes of experiences:

                Image: Policy score function over experiences
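The total future reward can be computed for every step of an episode with a single backward pass over the rewards. The following is a minimal sketch, assuming the usual discounted sum R_t = r_t + γ·R_{t+1}; the function name `discounted_returns` and the sample rewards are illustrative and do not come from the AWS DeepRacer code base.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted total future reward for every step t of one episode.

    Computes R_t = sum over tau >= t of gamma**(tau - t) * r_tau by walking
    the episode backwards, so each step reuses the return of the next step.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical three-step episode with a reward of 1.0 at every step.
episode_rewards = [1.0, 1.0, 1.0]
print(discounted_returns(episode_rewards, gamma=0.5))  # [1.75, 1.5, 1.0]
```

A smaller γ makes the agent weigh near-term rewards more heavily, while γ close to 1 makes rewards far along the track count almost as much as immediate ones.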

From this definition of J(θ), the policy weight updates can be expressed as follows:

                Image: Policy gradient over experiences.

Here, averaging over π is approximated by the sample average over N episodes, each consisting of H experiences, where H may differ from one episode to the next.
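A common way to write this sample-average approximation is given below. It is a sketch under the definitions above; the per-episode length H_i is notation introduced here to make the "possibly unequal" episode lengths explicit.

```latex
\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{H_i}
  \nabla_{\theta} \log \pi(a_{i,t} \mid s_{i,t}; \theta)\, R(s_{i,t}, a_{i,t}),
\qquad
R(s_{i,t}, a_{i,t}) = \sum_{\tau=t}^{H_i} \gamma^{\tau - t}\, r(s_{i,\tau}, a_{i,\tau})
```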

The update rule for a policy network weight then becomes:

                Image: policy weight update over experiences.
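To make the update concrete, here is a minimal sketch of one policy gradient ascent step. It assumes a deliberately simplified policy: a softmax over a fixed vector of logits θ that ignores the state, whereas the real policy network maps camera images to action probabilities. All function names and sample values are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to action probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) with respect to the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def policy_gradient_step(theta, episodes, alpha):
    """Apply theta <- theta + alpha * (1/N) * sum_i sum_t grad log pi(a_it) * R_it.

    `episodes` is a list of N episodes; each episode is a list of
    (action, total_future_reward) pairs and may have a different length.
    """
    grad = [0.0] * len(theta)
    for episode in episodes:
        for action, ret in episode:
            g = grad_log_softmax(theta, action)
            for k in range(len(theta)):
                grad[k] += g[k] * ret
    n = len(episodes)
    return [th + alpha * gk / n for th, gk in zip(theta, grad)]


theta = [0.0, 0.0]  # two actions, initially equally likely
episodes = [[(0, 1.0), (1, 0.0)], [(0, 2.0)]]  # action 0 earned the rewards
new_theta = policy_gradient_step(theta, episodes, alpha=0.1)
print(softmax(new_theta))  # action 0 is now slightly more likely than action 1
```

Because action 0 was followed by higher total future reward in the sampled episodes, the step raises its logit and lowers the logit of action 1.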

The policy gradient method outlined above is of limited utility in practice because the total future reward R(s_{i,t}, a_{i,t}) used in the score function has high variance: the agent can take many different paths from a given state. To get around this, one uses a critic network with weights ϕ to estimate it. Let V_ϕ(s) be the output of the critic network, where s describes a state and ϕ denotes the value network weights. To train this value network, the target value y_{i,t} of state s_{i,t} at step t in episode i is taken to be the immediate reward for taking action a_{i,t} in state s_{i,t} plus the discounted value of the state at the next step t+1 in the same episode:

                Image: Estimated value network

The loss function for the value network weights is:

                Image: Value network loss function
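The target values and the value network loss can be sketched in a few lines. This example assumes the one-step target y = r + γ·V(s') described above and a sum-of-squared-errors loss (a mean over samples is an equally common choice); the critic outputs are plain numbers here rather than the output of a real network, and all names and values are hypothetical.

```python
def td_targets(rewards, next_values, gamma):
    """Target y_t = r_t + gamma * V(s_{t+1}) for each step of one episode."""
    return [r + gamma * v_next for r, v_next in zip(rewards, next_values)]


def value_loss(values, targets):
    """Sum of squared differences between critic estimates and their targets."""
    return sum((v - y) ** 2 for v, y in zip(values, targets))


# Hypothetical critic estimates V(s_0), V(s_1), V(s_2) for a three-step episode.
values = [0.5, 0.4, 0.2]
# Value of the next state at each step; the state after the last step is terminal.
next_values = values[1:] + [0.0]
rewards = [1.0, 1.0, 1.0]

targets = td_targets(rewards, next_values, gamma=0.5)
loss = value_loss(values, targets)
print(targets, loss)
```

Minimizing this loss with respect to ϕ pulls each V_ϕ(s_t) toward its target, so the critic's estimates become consistent from one step to the next.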

Using the estimated values, the policy gradient for updating the policy network weights θ becomes:

                Image: Actor-Critic policy gradient

This formulation adjusts the policy network weights to encourage actions that yield higher rewards than the critic's prior estimate and to discourage actions that yield lower ones.
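The effect of the critic on the policy update can be illustrated with a single experience. The sketch below assumes the weighting term is the difference between the target y = r + γ·V(s') and the critic's estimate V(s), and it reuses a simplified, state-independent softmax policy; neither the function names nor the numbers come from AWS DeepRacer.

```python
import math


def softmax(logits):
    """Convert logits to action probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(logits, action, reward, value, next_value, gamma, alpha):
    """Update the logits from one experience, weighted by (target - value)."""
    target = reward + gamma * next_value  # estimated value after taking the action
    advantage = target - value            # better (+) or worse (-) than expected
    probs = softmax(logits)
    return [
        x + alpha * advantage * ((1.0 if k == action else 0.0) - p)
        for k, (x, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0]
# The action did better than the critic expected: its probability goes up.
better = actor_critic_step(logits, action=0, reward=1.0, value=0.2,
                           next_value=0.5, gamma=0.9, alpha=0.1)
# The action did worse than the critic expected: its probability goes down.
worse = actor_critic_step(logits, action=0, reward=0.0, value=1.0,
                          next_value=0.5, gamma=0.9, alpha=0.1)
print(softmax(better)[0], softmax(worse)[0])
```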

Every reinforcement learning algorithm needs to balance exploration and exploitation. The agent needs to explore the state and action space to learn which actions lead to high rewards in unexplored regions of the state space. The agent should also exploit what it has learned by taking the actions that lead to high rewards, so that the model converges to a stable solution. In our algorithm, the policy network outputs the probability of taking each action, and during training the action is chosen by sampling from this probability distribution (for example, an action with probability 0.5 is chosen about half the time). During evaluation, the agent picks the action with the highest probability.
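The difference between the two selection modes can be sketched as follows. The function `select_action` and the probability values are hypothetical; the sketch only assumes that the policy network has already produced one probability per discrete action.

```python
import random


def select_action(probs, training, rng=random):
    """Pick an action index from the policy's output probabilities.

    During training the action is sampled in proportion to its probability
    (exploration); during evaluation the most probable action is taken.
    """
    if training:
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(probs)), key=lambda a: probs[a])


probs = [0.2, 0.5, 0.3]  # hypothetical policy output for three actions

rng = random.Random(0)  # seeded so that the run is repeatable
samples = [select_action(probs, training=True, rng=rng) for _ in range(10000)]
print(samples.count(1) / len(samples))       # close to 0.5
print(select_action(probs, training=False))  # 1
```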

In addition to the above actor-critic framework, PPO uses importance sampling with clipping, adds Gauss-Markov noise to encourage exploration, and uses generalized advantage estimation. To learn more, see the original paper.
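As an illustration of importance sampling with clipping, the sketch below evaluates the clipped surrogate objective from the PPO paper for a single experience. The ratio is the probability of the action under the new policy divided by its probability under the policy that collected the data. The clipping range ε = 0.2 is an assumed example value, not a setting taken from AWS DeepRacer.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped objective for one experience.

    ratio     -- pi_new(a|s) / pi_old(a|s), the importance sampling weight
    advantage -- estimate of how much better the action was than expected
    epsilon   -- clipping range (assumed value for illustration)
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any incentive to move the ratio
    # outside [1 - epsilon, 1 + epsilon] in the direction that helps.
    return min(ratio * advantage, clipped_ratio * advantage)


print(clipped_surrogate(1.5, 1.0))   # gain capped at the clipped ratio
print(clipped_surrogate(0.5, 1.0))   # unclipped term is smaller, so it is kept
print(clipped_surrogate(0.5, -1.0))  # penalty evaluated at the clipped ratio
```

Clipping keeps each update from moving the new policy too far from the policy that generated the training episodes, which is what makes reusing those episodes for several gradient steps reasonably safe.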