Population-Based Training¶
Sample Factory contains an implementation of the Population-Based Training algorithm. See the PBT paper and the original Sample Factory paper for more details.
PBT is a hyperparameter optimization algorithm that can be used to train RL agents. Instead of manually tuning all hyperparameters, you can let an optimization method do it for you. This can include not only learning parameters (e.g. learning rate, entropy coefficient), but also environment parameters (e.g. reward function coefficients).
It is common in RL to use a sophisticated (shaped) reward function that guides the exploration process. However, such a reward function can distract the agent from the actual final goal.
PBT allows you to optimize with respect to some sparse final objective (which we call "true_objective") while still using a shaped reward function. Theoretically, the algorithm should find hyperparameters (including shaping coefficients) that lead to the best final objective. This can be, for example, directly optimizing for winning a match in a multiplayer game, which would be very difficult with regular RL because of the sparsity of such an objective. This type of PBT algorithm is implemented in the FTW agent by DeepMind.
Algorithm¶
PBT works similarly to a genetic algorithm. A population of agents is trained simultaneously with roughly the following approach (see the sketch after the list):
- Each agent is assigned a set of hyperparameters (e.g. learning rate, entropy coefficient, reward function coefficients, etc.)
- Each agent is trained for a fixed number of steps (e.g. 5M steps)
- At the end of this meta-training epoch, the performance of all agents is ranked:
    - Agents in the top K% of performance are left unchanged; we just keep training them.
    - Agents in the bottom K% of performance are replaced by a copy of a random top-K% agent with mutated hyperparameters.
    - Agents in the middle keep their weights but also get mutated hyperparameters.
- Proceed to the next meta-training epoch.
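Below is a minimal sketch of the selection/mutation step of one meta-training epoch. It is illustrative only, not Sample Factory's actual implementation; `population`, `replace_fraction`, and `perturb_range` are hypothetical names, and each agent is represented as a plain dictionary.

```python
import random
from copy import deepcopy


def pbt_meta_epoch_update(population, replace_fraction=0.3, perturb_range=(0.8, 1.2)):
    """One selection/mutation step over a list of agents: {"weights", "hyperparams", "objective"}."""
    ranked = sorted(population, key=lambda agent: agent["objective"], reverse=True)
    k = max(1, int(len(ranked) * replace_fraction))
    top, middle, bottom = ranked[:k], ranked[k:-k], ranked[-k:]

    def mutate(hyperparams):
        # Perturb each numeric hyperparameter by a random multiplicative factor.
        return {name: value * random.uniform(*perturb_range) for name, value in hyperparams.items()}

    # Bottom agents: copy weights from a random top agent and mutate that agent's hyperparameters.
    for agent in bottom:
        donor = random.choice(top)
        agent["weights"] = deepcopy(donor["weights"])
        agent["hyperparams"] = mutate(donor["hyperparams"])

    # Middle agents: keep their own weights, but mutate their hyperparameters.
    for agent in middle:
        agent["hyperparams"] = mutate(agent["hyperparams"])

    # Top agents are left unchanged; training then continues for another meta-epoch.
```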
The current version of PBT is implemented for a single machine. The ideal setup is a multi-GPU server that can train multiple agents at the same time. For example, we can train a population of 8 agents on a 4-GPU machine, with 2 agents per GPU.
PBT is perfect for multiplayer game scenarios where training a population of agents against one another yields much more robust results compared to self-play with a single policy.
Providing "True Objective" to PBT¶
In order to optimize for a true objective, you need to return it from the environment. Just add it to the `info` dictionary returned by the environment at the last step of the episode, e.g.:
```python
def step(self, action):
    info = {}
    ...
    info['true_objective'] = self.compute_true_objective()
    return obs, reward, terminated, truncated, info
```
In the absence of `true_objective` in the `info` dictionary, PBT will use the regular reward as the objective.
Learning parameters optimized by PBT¶
See `population_based_training.py`:
```python
HYPERPARAMS_TO_TUNE = {
    "learning_rate",
    "exploration_loss_coeff",
    "value_loss_coeff",
    "max_grad_norm",
    "ppo_clip_ratio",
    "ppo_clip_value",
    # gamma can be added with a CLI parameter (--pbt_optimize_gamma=True)
}
```
During training, the current learning parameters are saved to `f"policy_{policy_id:02d}_cfg.json"` files in the experiment directory.
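For example, you can load these files to see how the hyperparameters of each policy diverge over the course of training. This is a minimal sketch; the experiment path is hypothetical, and the exact set of keys in the JSON depends on your configuration.

```python
import json
from pathlib import Path

# Hypothetical location: depends on your --train_dir and --experiment settings.
experiment_dir = Path("train_dir/my_pbt_experiment")

for cfg_path in sorted(experiment_dir.glob("policy_*_cfg.json")):
    with cfg_path.open() as f:
        policy_cfg = json.load(f)
    # Print a couple of the tuned parameters for each policy in the population.
    print(cfg_path.name, policy_cfg.get("learning_rate"), policy_cfg.get("exploration_loss_coeff"))
```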
Optimizing environment parameters¶
Besides learning parameters, we can also optimize parameters of the environment with respect to some "true objective". In order to do that, your environment should implement the `RewardShapingInterface` in addition to the `gym.Env` interface.
```python
from typing import Any, Dict, Optional


class RewardShapingInterface:
    def get_default_reward_shaping(self) -> Optional[Dict[str, Any]]:
        """Should return a dictionary of string:float key-value pairs defining the current reward shaping scheme."""
        raise NotImplementedError

    def set_reward_shaping(self, reward_shaping: Dict[str, Any], agent_idx: int | slice) -> None:
        """
        Sets the new reward shaping scheme.
        :param reward_shaping: dictionary of string-float key-value pairs
        :param agent_idx: integer agent index (for multi-agent envs). Can be a slice if we're training in batched mode
            (set a single reward shaping scheme for a range of agents)
        """
        raise NotImplementedError
```
Any parameters in the dictionary returned by `get_default_reward_shaping` will be optimized by PBT. Note that although the dictionary is called "reward shaping", it can be used to optimize any environment parameters.
It is important that none of these parameters directly affect the objective calculation; otherwise, all PBT will do is increase the coefficients all the way to infinity.
As an example, suppose your shaped reward function contains a term for picking up a weapon in a game like Quake or VizDoom.
If the true objective is 1.0 for winning the game and 0.0 otherwise, then PBT can optimize these weapon preference coefficients to maximize success. But if the true objective is not specified (so the env reward itself is used as the objective), then PBT will simply keep increasing the coefficients to grow the reward without bound.
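A sketch of what this could look like in an environment class is shown below. The import path for `RewardShapingInterface`, the `_simulate()` helper, and the coefficient names are assumptions for illustration; a real environment would plug in its own game logic.

```python
from typing import Any, Dict, Optional

import gymnasium as gym

# Assumed import path; RewardShapingInterface is the class shown above.
from sample_factory.envs.env_utils import RewardShapingInterface


class MyShapedEnv(gym.Env, RewardShapingInterface):
    """Illustrative single-agent env with a PBT-tunable weapon pickup bonus."""

    def __init__(self):
        super().__init__()
        # Shaping coefficients PBT is allowed to mutate; names and initial values are hypothetical.
        self.reward_shaping: Dict[str, Any] = {"weapon_pickup_reward": 0.15}

    def get_default_reward_shaping(self) -> Optional[Dict[str, Any]]:
        return self.reward_shaping

    def set_reward_shaping(self, reward_shaping: Dict[str, Any], agent_idx) -> None:
        # Single-agent env, so agent_idx is ignored here.
        self.reward_shaping = reward_shaping

    def step(self, action):
        # _simulate() is a hypothetical helper standing in for the actual game logic.
        obs, reward, terminated, truncated, info = self._simulate(action)

        # The shaped reward uses the (possibly mutated) coefficient...
        if info.get("picked_up_weapon", False):
            reward += self.reward_shaping["weapon_pickup_reward"]

        # ...while the true objective stays sparse and independent of the shaping coefficients.
        if terminated:
            info["true_objective"] = 1.0 if info.get("won_match", False) else 0.0

        return obs, reward, terminated, truncated, info
```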
Configuring PBT¶
Please see the Configuration parameter reference. Parameters with the `--pbt_` prefix are related to PBT. Use `--with_pbt=True` to enable PBT. It is also important to set `--num_policies` to the number of agents in the population.
Command-line examples¶
Training on DMLab-30 with a 4-agent population on a 4-GPU machine:
```bash
python -m sf_examples.dmlab.train_dmlab --env=dmlab_30 --train_for_env_steps=10000000000 --algo=APPO --gamma=0.99 --use_rnn=True --num_workers=90 --num_envs_per_worker=12 --num_epochs=1 --rollout=32 --recurrence=32 --batch_size=2048 --benchmark=False --max_grad_norm=0.0 --dmlab_renderer=software --decorrelate_experience_max_seconds=120 --encoder_conv_architecture=resnet_impala --encoder_conv_mlp_layers=512 --nonlinearity=relu --rnn_type=lstm --dmlab_extended_action_set=True --num_policies=4 --pbt_replace_reward_gap=0.05 --pbt_replace_reward_gap_absolute=5.0 --pbt_period_env_steps=10000000 --pbt_start_mutation=100000000 --with_pbt=True --experiment=dmlab_30_resnet_4pbt_w90_v12 --dmlab_one_task_per_worker=True --set_workers_cpu_affinity=True --max_policy_lag=35 --pbt_target_objective=dmlab_target_objective --dmlab30_dataset=~/datasets/brady_konkle_oliva2008 --dmlab_use_level_cache=True --dmlab_level_cache_path=/home/user/dmlab_cache
```
PBT for VizDoom (8 agents, 4-GPU machine):
python -m sf_examples.vizdoom.train_vizdoom --env=doom_deathmatch_bots --train_for_seconds=3600000 --algo=APPO --use_rnn=True --gamma=0.995 --env_frameskip=2 --num_workers=80 --num_envs_per_worker=24 --num_policies=8 --batch_size=2048 --res_w=128 --res_h=72 --wide_aspect_ratio=False --with_pbt=True --pbt_period_env_steps=5000000 --experiment=doom_deathmatch_bots