
Population-Based Training

Sample Factory contains an implementation of the Population-Based Training (PBT) algorithm. See the PBT paper and the original Sample Factory paper for more details.

PBT is a hyperparameter optimization algorithm that can be used to train RL agents. Instead of manually tuning all hyperparameters, you can let an optimization method do it for you. This can include not only learning parameters (e.g. learning rate, entropy coefficient), but also environment parameters (e.g. reward function coefficients).

It is common in RL to use a sophisticated (shaped) reward function to guide the exploration process. However, such a reward function can also distract the agent from the actual final goal.

PBT allows you to optimize with respect to some sparse final objective (which we call "true_objective") while still using a shaped reward function. Theoretically, the algorithm should find hyperparameters (including shaping coefficients) that lead to the best final objective. This can be, for example, directly optimizing for just winning a match in a multiplayer game, which would be very difficult to do with regular RL alone because of the sparsity of such an objective. This type of PBT algorithm is implemented in the FTW agent by DeepMind.

Algorithm

PBT works similarly to a genetic algorithm. A population of agents is trained simultaneously with roughly the following approach (a schematic sketch of one meta-training epoch follows the list):

  • Each agent is assigned a set of hyperparameters (e.g. learning rate, entropy coefficient, reward function coefficients, etc.)
  • Each agent is trained for a fixed number of steps (e.g. 5M steps)
  • At the end of this meta-training epoch, the performance of all agents is ranked:
    • Agents in the top K% by performance are left unchanged; we just keep training them.
    • Agents in the bottom K% are replaced by a copy of a random top-K% agent with mutated hyperparameters.
    • Agents in the middle keep their weights but get mutated hyperparameters.
  • Proceed to the next meta-training epoch.
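
Below is a minimal, schematic sketch of one such meta-training epoch. The agent attributes (score, weights, hyperparams), the mutate() helper, and all constants are hypothetical, chosen for illustration only; this is not Sample Factory's actual implementation.

import copy
import random

def mutate(hyperparams, mutation_prob=0.15):
    # Randomly perturb each hyperparameter up or down with probability mutation_prob.
    mutated = dict(hyperparams)
    for name, value in mutated.items():
        if random.random() < mutation_prob:
            mutated[name] = value * random.choice([0.8, 1.25])
    return mutated

def pbt_meta_epoch(population, top_fraction=0.3):
    # population: list of agents with .score, .weights, .hyperparams attributes.
    ranked = sorted(population, key=lambda agent: agent.score, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    top = ranked[:k]
    middle = ranked[k:len(ranked) - k]
    bottom = ranked[len(ranked) - k:]

    for agent in bottom:
        # Bottom agents: copy weights from a random top agent and mutate its hyperparameters.
        parent = random.choice(top)
        agent.weights = copy.deepcopy(parent.weights)
        agent.hyperparams = mutate(parent.hyperparams)

    for agent in middle:
        # Middle agents: keep their weights, only mutate their hyperparameters.
        agent.hyperparams = mutate(agent.hyperparams)

    # Top agents are left untouched and simply keep training.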

The current version of PBT is implemented for a single machine. The perfect setup is a multi-GPU server that can train multiple agents at the same time. For example, we can train a population of 8 agents on a 4-GPU machine, training 2 agents on each GPU.

PBT is perfect for multiplayer game scenarios where training a population of agents against one another yields much more robust results compared to self-play with a single policy.

Providing "True Objective" to PBT

In order to optimize for a true objective, you need to return it from the environment. Just add it to the info dictionary returned by the environment at the last step of the episode, e.g.:

def step(self, action):
    info = {}
    ...
    # the sparse final objective, e.g. 1.0 for a win and 0.0 otherwise
    info['true_objective'] = self.compute_true_objective()
    return obs, reward, terminated, truncated, info

In the absence of true_objective in the info dictionary, PBT will use the regular reward as the objective.

Learning parameters optimized by PBT

See population_based_training.py:

HYPERPARAMS_TO_TUNE = {
    "learning_rate",
    "exploration_loss_coeff",
    "value_loss_coeff",
    "max_grad_norm",
    "ppo_clip_ratio",
    "ppo_clip_value",
    # gamma can be added with a CLI parameter (--pbt_optimize_gamma=True)
}

During training, the current learning parameters of each policy are saved to f"policy_{policy_id:02d}_cfg.json" files in the experiment directory.
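
Assuming these files are flat JSON dictionaries of parameter values (and using a made-up experiment path), you can inspect how the population's hyperparameters drift with a short script like this:

import json
from pathlib import Path

# Hypothetical location; replace with your own <train_dir>/<experiment> directory.
experiment_dir = Path("train_dir/my_pbt_experiment")

# Each policy periodically writes its current (possibly mutated) parameters
# to policy_XX_cfg.json, so we can compare them across the population.
for cfg_file in sorted(experiment_dir.glob("policy_*_cfg.json")):
    params = json.loads(cfg_file.read_text())
    print(cfg_file.name, "learning_rate:", params.get("learning_rate"))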

Optimizing environment parameters

Besides learning parameters, we can also optimize parameters of the environment with respect to the "true objective".

In order to do that, your environment should implement the RewardShapingInterface in addition to the gym.Env interface.

from typing import Any, Dict, Optional


class RewardShapingInterface:
    def get_default_reward_shaping(self) -> Optional[Dict[str, Any]]:
        """Should return a dictionary of string:float key-value pairs defining the current reward shaping scheme."""
        raise NotImplementedError

    def set_reward_shaping(self, reward_shaping: Dict[str, Any], agent_idx: int | slice) -> None:
        """
        Sets the new reward shaping scheme.
        :param reward_shaping: dictionary of string-float key-value pairs
        :param agent_idx: integer agent index (for multi-agent envs). Can be a slice if we're training in batched mode
        (set a single reward shaping scheme for a range of agents)
        """
        raise NotImplementedError

Any parameters in the dictionary returned by get_default_reward_shaping will be optimized by PBT. Note that although the dictionary is called "reward shaping", it can be used to optimize any environment parameters.
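
A minimal sketch of an environment exposing a couple of shaping coefficients to PBT might look like the following. The class name, coefficient names, and the use of the Gymnasium API are assumptions for illustration; only the two interface methods matter here (reset(), step(), and the observation/action spaces are omitted):

from typing import Any, Dict, Optional

import gymnasium as gym  # assumption: the env follows the Gymnasium API


class MyShapedEnv(gym.Env, RewardShapingInterface):
    def __init__(self):
        super().__init__()
        # Hypothetical shaping coefficients; PBT will mutate these values over time.
        self.reward_shaping = {"weapon_pickup_bonus": 0.1, "damage_dealt_coeff": 0.01}

    def get_default_reward_shaping(self) -> Optional[Dict[str, Any]]:
        # Expose the current shaping scheme so PBT can read and mutate it.
        return dict(self.reward_shaping)

    def set_reward_shaping(self, reward_shaping: Dict[str, Any], agent_idx) -> None:
        # Single-agent env: agent_idx can be ignored, just apply the new scheme.
        self.reward_shaping = dict(reward_shaping)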

It is important that none of these parameters directly affect the objective calculation; otherwise, all PBT will do is increase the coefficients all the way to infinity.

As an example, suppose your shaped reward function contains a term for picking up a weapon in a game like Quake or VizDoom. If the true objective is 1.0 for winning the game and 0.0 otherwise, PBT can optimize these weapon preference coefficients to maximize success. But if no true objective is specified (so the env reward itself is used as the objective), PBT would simply increase the coefficients to inflate the reward without bound.
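
Continuing the hypothetical environment sketched above, this separation could look roughly like the following: shaping coefficients scale intermediate events inside the dense reward, while the true objective is reported independently and never touches those coefficients (the method and event names are made up):

def _compute_step_outputs(self, events, won, done):
    # Dense, shaped reward: intermediate events scaled by PBT-tunable coefficients.
    reward = 0.0
    reward += self.reward_shaping["weapon_pickup_bonus"] * events.get("weapons_picked_up", 0)
    reward += self.reward_shaping["damage_dealt_coeff"] * events.get("damage_dealt", 0.0)

    info = {}
    if done:
        # Sparse objective, independent of any shaping coefficient:
        # 1.0 for winning the match, 0.0 otherwise.
        info["true_objective"] = 1.0 if won else 0.0
    return reward, info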

Configuring PBT

Please see the Configuration parameter reference. Parameters with the --pbt_ prefix are related to PBT. Use --with_pbt=True to enable PBT. It is also important to set --num_policies to the number of agents in the population.

Command-line examples

Training on DMLab-30 with a 4-agent population on a 4-GPU machine:

python -m sf_examples.dmlab.train_dmlab --env=dmlab_30 --train_for_env_steps=10000000000 --algo=APPO --gamma=0.99 --use_rnn=True --num_workers=90 --num_envs_per_worker=12 --num_epochs=1 --rollout=32 --recurrence=32 --batch_size=2048 --benchmark=False --max_grad_norm=0.0 --dmlab_renderer=software --decorrelate_experience_max_seconds=120 --encoder_conv_architecture=resnet_impala --encoder_conv_mlp_layers=512 --nonlinearity=relu --rnn_type=lstm --dmlab_extended_action_set=True --num_policies=4 --pbt_replace_reward_gap=0.05 --pbt_replace_reward_gap_absolute=5.0 --pbt_period_env_steps=10000000 --pbt_start_mutation=100000000 --with_pbt=True --experiment=dmlab_30_resnet_4pbt_w90_v12 --dmlab_one_task_per_worker=True --set_workers_cpu_affinity=True --max_policy_lag=35 --pbt_target_objective=dmlab_target_objective --dmlab30_dataset=~/datasets/brady_konkle_oliva2008 --dmlab_use_level_cache=True --dmlab_level_cache_path=/home/user/dmlab_cache

PBT for VizDoom (8 agents, 4-GPU machine):

python -m sf_examples.vizdoom.train_vizdoom --env=doom_deathmatch_bots --train_for_seconds=3600000 --algo=APPO --use_rnn=True --gamma=0.995 --env_frameskip=2 --num_workers=80 --num_envs_per_worker=24 --num_policies=8 --batch_size=2048 --res_w=128 --res_h=72 --wide_aspect_ratio=False --with_pbt=True --pbt_period_env_steps=5000000 --experiment=doom_deathmatch_bots