AWS DeepRacer
Developer Guide

This is prerelease documentation for a service in preview release. It is subject to change.

Systematically Tune Hyperparameters for Optimal Training Performance

The following table shows the hyperparameters that can be adjusted to tune the performance of training using a supported algorithm:

Algorithm-specific hyperparameters and their effects

Algorithm Hyperparameters Description
PPO

Batch size

The number of images used for each gradient descent update. A larger Batch size value leads to more stable updates, but slower training (see the sketch after this entry).

Required

Yes

Valid values

Positive integer in (32, 64, 128, 256, 512)

Default value

32
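
The following is a minimal Python sketch, not DeepRacer service code, that only illustrates what the batch size controls: how many stored experiences are drawn for one gradient descent update. The buffer contents and function names are hypothetical placeholders.

  import random

  def sample_batch(experience_buffer, batch_size=32):
      # One mini-batch feeds a single gradient descent update; a larger
      # batch averages the gradient over more samples, which gives
      # smoother (more stable) but slower progress.
      return random.sample(experience_buffer, k=min(batch_size, len(experience_buffer)))

  # Hypothetical buffer of (observation, action, reward) tuples.
  buffer = [("obs_%d" % i, i % 5, 1.0) for i in range(1000)]
  batch = sample_batch(buffer, batch_size=64)
  print(len(batch))  # 64 experiences go into one weight update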

PPO

Number of epochs

The number of passes through the experience buffer during gradient descent. A larger Number of epochs value promotes more stable updates, but slower training. A larger value is acceptable when the batch size is large (see the example after this entry).

Required

No

Valid values

Positive integer between 3 and 10

Default value

3
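
As a rough worked example (the buffer size here is hypothetical), the number of epochs and the batch size together determine how many gradient descent updates happen in each training iteration:

  def updates_per_iteration(buffer_size, batch_size, num_epochs):
      # Each epoch is one full pass over the experience buffer in
      # mini-batches, so the number of gradient updates grows linearly
      # with the number of epochs.
      batches_per_epoch = buffer_size // batch_size
      return num_epochs * batches_per_epoch

  # Example: 2,000 stored experiences, batch size 64.
  print(updates_per_iteration(2000, 64, num_epochs=3))   # 93 updates
  print(updates_per_iteration(2000, 64, num_epochs=10))  # 310 updates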

PPO

Learning rate

Controls how much each gradient descent update contributes to the network weights. A larger value can increase the training speed, but may prevent the expected rewards from converging if it is too large (illustrated after this entry).

Required

No

Valid values

Real number between 10^-5 and 10^-3

Default value

0.001
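
To make the effect concrete, here is a toy single-weight update (the weight and gradient values are made up); the learning rate simply scales the step taken along the gradient:

  def gradient_step(weight, gradient, learning_rate):
      # One gradient descent update: the learning rate scales how far
      # the weight moves against the gradient.
      return weight - learning_rate * gradient

  w, g = 0.5, 2.0  # hypothetical weight and gradient
  print(gradient_step(w, g, learning_rate=1e-5))  # ~0.49998 (tiny, stable step)
  print(gradient_step(w, g, learning_rate=1e-3))  # ~0.498   (larger, faster step)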

PPO

Exploration

A type of exploration used in training reinforcement learning models. Use the categorical exploration (CategoricalParameters) for a discrete action space and use the epsilon exploration (EpsilonGreedy) for a continuous action space:

  • The categorical exploration takes an action according to the probability distribution over the action space output by the policy network.

  • The epsilon exploration takes a random action with probability epsilon. Epsilon starts at 1 and is linearly decreased to 0.1 over X steps, where X is typically between 10,000 and 100,000.

As the training progresses, exploration helps prevent the neural network from being trapped in parts of the action space (local maxima). As a result, the neural network learns with increasing certainty and confidence which actions to take. Both exploration schemes are sketched after this entry.

Required

No

Valid values

String literal of CategoricalParameters or EpsilonGreedy

Default value

EpsilonGreedy
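
The sketch below shows, under simplified assumptions, how the two exploration schemes choose actions; the helper names and the decay schedule (10,000 steps) are illustrative, not the service's actual implementation:

  import random

  def categorical_action(action_probabilities):
      # Categorical exploration: sample an action index according to the
      # probability distribution produced by the policy network.
      return random.choices(range(len(action_probabilities)),
                            weights=action_probabilities, k=1)[0]

  def epsilon_greedy_action(best_action, num_actions, step,
                            decay_steps=10_000, eps_start=1.0, eps_end=0.1):
      # Epsilon exploration: epsilon decays linearly from 1.0 to 0.1 over
      # decay_steps; with probability epsilon a random action is taken.
      epsilon = max(eps_end, eps_start - (eps_start - eps_end) * step / decay_steps)
      if random.random() < epsilon:
          return random.randrange(num_actions)
      return best_action

  print(categorical_action([0.1, 0.7, 0.2]))   # usually returns 1
  print(epsilon_greedy_action(best_action=3, num_actions=5, step=9000))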

PPO

Entropy

The degree of randomness with which the agent takes actions. The larger the entropy, the more random the actions the agent takes for exploration (see the sketch after this entry).

Required

No

Valid values

Real number between 10^-4 and 10^-2

Default value

0.5
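
For reference, the entropy of the policy's action distribution can be computed as below; the two example distributions are hypothetical. In PPO-style training an entropy bonus weighted by this hyperparameter is typically added to the objective, so a larger value keeps the policy more random for longer.

  import math

  def entropy(action_probabilities):
      # Shannon entropy of the action distribution; higher entropy means
      # the agent's action choices are more random.
      return -sum(p * math.log(p) for p in action_probabilities if p > 0)

  print(entropy([0.25, 0.25, 0.25, 0.25]))  # ~1.386 (maximally random over 4 actions)
  print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.168 (nearly deterministic)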

PPO

Discount factor

A factor that specifies how much future rewards contribute to the expected rewards. The larger the Discount factor value, the farther out the contributions the agent considers when taking an action. If it is 0, only the immediate reward is considered. If it is 1, the full range of future rewards is included. As a simplified example, if the agent takes on the order of 100 steps to complete a turn, a discount factor of 0.99 is a good value; if it takes 1,000 steps, 0.999 is a good value (see the worked example after this entry).

Required

No

Valid values

Real number between 0 and 1

Default value

0.999
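
The worked example below (with a made-up constant reward of 1 per step) shows why the description pairs roughly 100 steps with 0.99 and 1,000 steps with 0.999: the effective horizon of the discounted return is about 1 / (1 - discount factor).

  def discounted_return(rewards, discount_factor):
      # Sum of rewards weighted by discount_factor ** t; rewards far in
      # the future contribute less the smaller the discount factor is.
      return sum((discount_factor ** t) * r for t, r in enumerate(rewards))

  rewards = [1.0] * 1000  # hypothetical reward of 1 at every step
  print(discounted_return(rewards, 0.0))    # 1.0   (only the immediate reward counts)
  print(discounted_return(rewards, 0.99))   # ~100  (roughly a 100-step horizon)
  print(discounted_return(rewards, 0.999))  # ~632  (approaching a 1,000-step horizon)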

PPO

Loss type

The type of objective function used for optimization. A good training algorithm makes incremental changes to the agent's strategy so that it gradually transitions from taking random actions to taking strategic actions that increase the reward. But if it makes too big a change, the training becomes unstable and the agent ends up not learning. The Huber loss and Mean squared error loss types behave similarly for small updates. But as the updates become larger, Huber loss takes smaller increments than Mean squared error loss does. When you have convergence problems, use the Huber loss type. When convergence is good and you want to train faster, use the Mean squared error loss type (the sketch after this entry compares the two).

Required

No

Valid values

(Huber loss, Mean squared error loss)

Default value

Huber loss
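
The comparison below uses the textbook forms of the two losses (with a Huber delta of 1, chosen only for illustration) to show why Huber loss reacts more gently to large errors:

  def squared_error_loss(error):
      # Mean squared error style loss: grows quadratically with the error.
      return error ** 2

  def huber_loss(error, delta=1.0):
      # Huber loss: quadratic for small errors, linear beyond delta, so a
      # large error produces a smaller update than with squared error.
      if abs(error) <= delta:
          return 0.5 * error ** 2
      return delta * (abs(error) - 0.5 * delta)

  for error in (0.1, 1.0, 5.0):
      print(error, squared_error_loss(error), huber_loss(error))
  # Both losses are quadratic (and tiny) at error 0.1; at error 5.0 the
  # Huber loss (4.5) is far smaller than the squared error (25.0).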

PPO

Number of episodes between each training

This parameter tells the agent how frequently it should train its neural network. If this value is low, the agent gathers only a small amount of experience between updates. For problems that are easy to solve, a small value suffices and learning is fast. For more complex problems, it is better to gather more experience so that the agent observes more variations of the effects of its actions. Learning is slower in this case, but more stable (see the sketch after this entry).

Recommended values are (10, 20, 40).

Required

No

Valid values

Integer between 1 and 1000

Default value

20
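
The outline below shows, in simplified form, where this hyperparameter sits in the training loop: the agent collects the configured number of episodes of experience, then performs one policy update on that batch. The collect and update functions are placeholder stand-ins, not the actual DeepRacer internals.

  def train(num_iterations, episodes_between_training, collect_episode, update_policy):
      # Alternate between gathering experience and updating the policy:
      # more episodes per iteration means slower but steadier learning.
      for iteration in range(num_iterations):
          experience = []
          for _ in range(episodes_between_training):
              experience.extend(collect_episode())  # drive one episode
          update_policy(experience)                 # then train on the collected batch

  # Placeholder stand-ins so the sketch runs end to end.
  train(num_iterations=2,
        episodes_between_training=20,
        collect_episode=lambda: [("obs", 0, 1.0)],
        update_policy=lambda exp: print("updating on %d steps" % len(exp)))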