Tutorial M: Efficient parallelization with the toolkit
Why this tutorial?
A frequently asked question is how to efficiently parallelize code when running the Seldonian Engine and Experiments libraries. The answer depends on the application. In this short tutorial, we discuss the different scenarios in which you might want to parallelize and the best strategy for each.
Parallelization scenarios
Scenario #1: Supervised learning using CPUs
This scenario is where we have a supervised learning problem that we can solve without using GPUs. This is typically the case when our dataset and model are relatively small (<1 GB), or we simply do not have access to GPUs. The Engine library is explicitly set up to run on a single CPU. However, if you monitor CPU usage on a multi-CPU machine while the engine is running, you may often see that multiple cores are in use. This is due to the implicit parallelization that NumPy performs when doing vector and matrix arithmetic, combined with our heavy use of NumPy throughout the engine. This is fine for single engine runs (although in practice it does not actually seem to speed them up), but it massively impedes performance when running Seldonian Experiments. When running the Experiments library in this scenario, it is very important to disable NumPy's implicit parallelization by including a few lines of code in your top-level experiment script, before you import NumPy.
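For example (a sketch: OMP_NUM_THREADS covers the common OpenMP-backed NumPy builds, and the OpenBLAS and MKL variables are included in case your NumPy build links against those backends):

```python
import os

# These must be set BEFORE numpy is imported for the first time
# anywhere in the process.
os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP-backed BLAS
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS backend
os.environ["MKL_NUM_THREADS"] = "1"       # Intel MKL backend

import numpy as np
```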
This will turn off NumPy's internal parallelization and dramatically speed up your experiment code, assuming you have access to multiple CPUs. We suspect the reason for this is that NumPy's internal parallelization hogs threads that would otherwise be available for parallelization in the experiment, causing extra overhead for Python's explicit multiprocessing pools that we use in the Experiments library.
Setting the n_workers parameter of the SupervisedPlotGenerator object to the number of CPU cores you want to use in your experiment will then allow you to run your experiment trials in parallel.
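As a sketch, wiring this up might look like the following (spec, data_fracs, perf_eval_fn, and perf_eval_kwargs are assumed to come from your own experiment setup, as in the earlier supervised learning tutorials):

```python
from experiments.generate_plots import SupervisedPlotGenerator

plot_generator = SupervisedPlotGenerator(
    spec=spec,                  # your completed spec object (assumed defined)
    n_trials=20,                # number of trials per data fraction
    data_fracs=data_fracs,      # fractions of the dataset to use per trial
    n_workers=8,                # run up to 8 trials at once on 8 CPU cores
    datagen_method="resample",
    perf_eval_fn=perf_eval_fn,  # your performance evaluation function
    results_dir="./results",
    perf_eval_kwargs=perf_eval_kwargs,
)
plot_generator.run_seldonian_experiment(verbose=True)
```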
Scenario #2: Supervised learning using the GPU
The toolkit is set up for running the engine and experiments library using a single GPU via PyTorch and TensorFlow (see Tutorial G: Creating your first Seldonian PyTorch model). In this case, you do not want to parallelize your experiments across CPUs, so keep n_workers=1. This parameter refers to the number of CPU workers, and because the compute-intensive work is done on the GPU, sending more work to the GPU from a second CPU will just add overhead to your experiments. Whether you use NumPy's implicit parallelization (see the previous scenario) does not matter much in this scenario. We are considering data parallelism as a way to run the toolkit on multiple GPUs, but that is still a work in progress. If you are especially interested in that scenario, please open an issue or submit a pull request.
Scenario #3: Reinforcement learning using CPUs
Currently, reinforcement learning (RL) policies are only supported on the CPU. Deep RL policy base classes that run on the GPU are technically possible with PyTorch or TensorFlow, but we have not implemented them yet. If you are especially interested in a deep RL application, please open an issue or submit a pull request. As in Scenario #1: Supervised learning using CPUs, distributing experiments across multiple CPUs can dramatically speed up your RL experiments, and you should again turn off NumPy's implicit parallelization to get the most efficiency when parallelizing them.
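The same lines as in Scenario #1 work here (again, a sketch assuming a NumPy build whose BLAS backend reads these variables); put them at the top of your experiment script, before NumPy is imported:

```python
import os

# Set before the first numpy import anywhere in the process
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np
```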
In RL experiments, the computational bottleneck is not necessarily running the Seldonian algorithm in each trial. Instead, it can sometimes be the episode generation, both when creating the initial trial datasets and when evaluating a new policy's performance and safety. For this reason, we introduced n_workers_for_episode_generation, a new key of the hyperparameter_and_setting_dict dictionary that is an input to the RLPlotGenerator class. This key controls how many CPUs are used for the episode generation steps.

Note that if n_workers_for_episode_generation > 1, then n_workers should be kept at 1, and vice versa. Otherwise, you could have child processes forking more child processes, which massively slows down experiments. If the bottleneck in your experiments is running the Seldonian algorithm in each trial rather than the episode generation, then set n_workers_for_episode_generation = 1 and set n_workers to the number of CPUs you have available.
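Putting this together for the episode-generation-bound case, a sketch might look like the following (my_env, my_agent, num_episodes, and the evaluation objects are placeholders; the exact contents of the dictionary depend on your RL application, as in the RL experiment tutorials):

```python
from experiments.generate_plots import RLPlotGenerator

# Episode generation is the bottleneck here, so it gets the CPU cores
# and the trial-level worker pool stays at a single worker.
hyperparameter_and_setting_dict = {
    "env": my_env,      # your RL environment object (assumed defined)
    "agent": my_agent,  # your RL agent object (assumed defined)
    "num_episodes": 1000,
    "n_workers_for_episode_generation": 8,
}

plot_generator = RLPlotGenerator(
    spec=spec,
    n_trials=20,
    data_fracs=data_fracs,
    n_workers=1,  # keep at 1 whenever n_workers_for_episode_generation > 1
    datagen_method="generate_episodes",
    hyperparameter_and_setting_dict=hyperparameter_and_setting_dict,
    perf_eval_fn=perf_eval_fn,
    results_dir="./results",
    perf_eval_kwargs=perf_eval_kwargs,
)
plot_generator.run_seldonian_experiment(verbose=True)
```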