Systematically Tune Hyperparameters for Optimal Training Performance
The following list describes the hyperparameters that you can adjust to tune training performance with a supported algorithm. Each entry names the algorithm, the hyperparameter, and its effect.

Algorithm-specific hyperparameters and their effects

PPO: Batch size
The number of images (experience steps), sampled at random from the most recent experience, that are used for each gradient descent update of the deep neural network. It corresponds to the amount of experience used to update the network. A larger batch size leads to more stable updates, but slower training.

PPO: Number of epochs
The number of passes through the experience buffer during gradient descent; that is, the number of times the algorithm runs through the sampled batch and updates the network weights each time training is performed.

PPO: Learning rate
Controls how much each gradient descent update changes the network weights. A larger value can increase the training speed, but the expected rewards may not converge if it is too large.
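The batch size, number of epochs, and learning rate all act inside the same gradient descent loop. The following is a minimal sketch of that loop, assuming a hypothetical experience buffer of recent steps and a generic grad_loss function supplied by the caller; it is not the exact update used by any particular training service.

```python
import random

def policy_update(buffer, weights, grad_loss,
                  batch_size, num_epochs, learning_rate):
    """Sketch of how the three hyperparameters drive one policy update."""
    # Batch size: how many recent steps are sampled for the update.
    batch = random.sample(buffer, min(batch_size, len(buffer)))

    # Number of epochs: how many passes are made over the sampled batch.
    for _ in range(num_epochs):
        grads = grad_loss(weights, batch)
        # Learning rate: how far each gradient step moves the weights.
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
    return weights
```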

PPO: Exploration
The method used to trade off exploitation and exploration; it determines how much the training follows the model's current prescriptions (exploitation) versus acting at random (exploration). Depending on the type of action space, you can use, for example, categorical exploration with a discrete action space.
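With categorical exploration, actions are sampled from the policy's probability distribution instead of always picking the highest-rated action. A minimal sketch, using a made-up probability vector:

```python
import random

# Hypothetical action probabilities produced by the policy network.
action_probs = [0.70, 0.20, 0.10]

# Exploitation: always take the action the model currently rates highest.
greedy_action = action_probs.index(max(action_probs))

# Categorical exploration: sample in proportion to the probabilities, so
# lower-rated actions still get tried occasionally.
explored_action = random.choices(range(len(action_probs)), weights=action_probs)[0]

print(greedy_action, explored_action)
```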

PPO: Entropy
The degree of randomness added to the probability distribution from which the agent selects its actions. The larger the entropy, the more random the actions the agent takes for exploration.
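Entropy here can be read as the Shannon entropy of the action distribution: a near-deterministic policy has low entropy, while a uniform policy has the maximum. A small illustration with made-up distributions:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A nearly deterministic policy has low entropy ...
print(entropy([0.98, 0.01, 0.01]))  # ~0.11
# ... while a uniform policy over 3 actions has the maximum entropy.
print(entropy([1/3, 1/3, 1/3]))     # ~1.10 (= ln 3)
```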

PPO: Discount factor
A factor that specifies how much future rewards contribute to the expected return. The larger the discount factor, the farther into the future the agent looks when choosing an action. If it is 0, only the immediate reward is considered. If it is 1, the full range of future rewards is included. As a simplified example, with a discount factor of 0.9 the agent considers rewards on the order of 10 future steps to make a move; with 0.99, on the order of 100 future steps; and with 0.999, on the order of 1,000 future steps.
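The order-of-magnitude rule of thumb above comes from the geometric weighting of future rewards: the weights gamma**k sum to 1 / (1 - gamma), which gives the rough planning horizon. A quick check of that rule, assuming nothing beyond the definition of the discount factor:

```python
# Effective planning horizon for a few discount factors: the geometric
# weights gamma**k sum to 1 / (1 - gamma), so that ratio approximates the
# number of future steps that meaningfully influence the expected return.
for gamma in (0.9, 0.99, 0.999):
    horizon = 1.0 / (1.0 - gamma)
    print(f"discount factor {gamma}: roughly {horizon:.0f} future steps")
```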

PPO: Loss type
The type of objective function used to update the network weights. A good training algorithm makes incremental changes to the agent's strategy so that it gradually transitions from taking random actions to taking strategic actions that increase the reward. But if it makes too big a change, the training becomes unstable and the agent ends up not learning. The Huber loss and Mean squared error loss types behave similarly for small updates. But as the updates become larger, Huber loss takes smaller increments than Mean squared error loss. When you have convergence problems, use the Huber loss type. When convergence is good and you want to train faster, use the Mean squared error loss type.
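The difference is easiest to see numerically: both losses are quadratic for small errors, but beyond a threshold the Huber loss grows only linearly, so large errors produce smaller updates. A minimal comparison, assuming the common threshold of 1.0 (the 0.5 factors are just the conventional scaling):

```python
def mse_loss(error):
    # Quadratic everywhere: large errors produce very large values/gradients.
    return 0.5 * error ** 2

def huber_loss(error, delta=1.0):
    # Quadratic for small errors, linear beyond the delta threshold.
    if abs(error) <= delta:
        return 0.5 * error ** 2
    return delta * (abs(error) - 0.5 * delta)

for err in (0.1, 0.5, 2.0, 10.0):
    print(f"error={err}: mse={mse_loss(err):.2f}, huber={huber_loss(err):.2f}")
```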

PPO: Number of episodes between each training
This parameter determines how much experience, in the form of episodes, to obtain before a training iteration is run. After each training step, the newly trained model is used to obtain more episodes, and the process continues iteratively. Recommended values are
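The collect-then-train cycle described here can be sketched as follows; run_episode and train_policy are hypothetical placeholders for the simulation rollout and the policy update, not real API calls:

```python
def training_loop(policy, episodes_between_training, total_iterations,
                  run_episode, train_policy):
    """Sketch of the outer loop: gather a fixed number of episodes with the
    current policy, train on them, then repeat with the updated policy."""
    for _ in range(total_iterations):
        # Collect the configured amount of experience before training.
        experience = []
        for _ in range(episodes_between_training):
            experience.extend(run_episode(policy))
        # Update the policy on the collected episodes, then reuse it
        # to gather the next round of episodes.
        policy = train_policy(policy, experience)
    return policy
```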
