Sample Factory on Slurm¶
This section contains instructions for running Sample Factory experiments using Slurm.
Setting up¶
Login to your Slurm login node using ssh with your username and password. Start an interactive job with srun
to install files to your NFS.
Note that you may get a message groups: cannot find name for group ID XXXX
which is not an error.
Install Miniconda:
- Download installer using
wget
from https://docs.conda.io/en/latest/miniconda.html#linux-installers - Run the installer with
bash {Miniconda...sh}
Make new conda environment conda create --name sf2
then conda activate sf2
Download Sample Factory and install dependencies, for example:
git clone https://github.com/alex-petrenko/sample-factory.git
cd sample-factory
git checkout sf2
pip install -e .
# install additional env dependencies here if needed
Necessary scripts in Sample Factory¶
To run a custom launcher script for Sample Factory on slurm, you may need to write your own slurm_sbatch_template and/or launcher script.
slurm_sbatch_template is a bash script that run by slurm before your python script. It includes commands to activate your conda environment etc. See an example at ./sample_factory/launcher/slurm/sbatch_timeout.sh
. Variables in the bash script can be added in sample_factory.launcher.run_slurm
.
The launcher script controls the Python command slurm will run. Examples are located in sf_examples
. You can run multiple experiments with different parameters using ParamGrid
.
Timeout script¶
If your slurm cluster has time limits for jobs, you can use the sbatch_timeout.sh
bash script to launch jobs that timeout and requeue themselves before the time limit.
The time limit can be set with the --slurm_timeout
command line argument. It defaults to 0
which runs the job with no time limit.
It is recommended the timeout be set to slightly less than the time limit of your job. For example, if the time limit is 24 hours, you should set --slurm_timeout=23h
Running launcher scripts on Slurm¶
Activate your conda environment conda activate sf2
then cd sample-factory
Run your launcher script - an example mujoco launcher (replace run, slurm_sbatch_template, and slurm_workdir with appropriate values)
python -m sample_factory.launcher.run --run=sf_examples.mujoco.experiments.mujoco_all_envs --backend=slurm --slurm_workdir=./slurm_mujoco --experiment_suffix=slurm --slurm_gpus_per_job=1 --slurm_cpus_per_gpu=16 --slurm_sbatch_template=./sample_factory/launcher/slurm/sbatch_timeout.sh --pause_between=1 --slurm_print_only=False
The slurm_gpus_per_job
and slurm_cpus_per_gpu
determine the resources allocated to each job. You can view the jobs without running them by setting slurm_print_only=True
.
You can view the status of your jobs on nodes or the queue with squeue
and view the outputs of your experiments with tail -f {slurm_workdir}/*.out
. Cancel your jobs with scancel {job_id}