Hyperparameter tuning¶
Optuna¶
Optuna is a software framework for hyperparameter tuning. It is agnostic to what you are trying to tune, and highly suitable for distributed computing.
Briefly, the way it works is that Optuna sets up or loads an object called a Study, which can be tracked using various database structures - in this case, we will be using a journal file, as this is the most robust approach with the fewest points of failure. The Study has a Sampler attached which contains an algorithm for picking hyperparameters - for example, a grid-based or random sampler. The study can then be used to generate a number of Trials, where each trial can query the Sampler for values for the set of hyperparameters to be tuned, based on the set of trials completed thus far in the study. The Trial is the argument to an objective function (a loss or acquisition function), in which the querying for hyperparameters is carried out; the objective function needs to return a numerical value by which the success of the trial is evaluated. The Study automatically keeps track of all hyperparameters that have been tried thus far, as well as the results of all trials that have come in; a variety of advanced usage, such as pruning specific trials, is possible.
Optuna is a flexible framework, but we will show some examples suitable for use on a Slurm cluster. An Optuna study can look like this, using sklearn as an example:
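A minimal sketch (the dataset, the search ranges, and the journal path are illustrative assumptions; the parameter names match the journal output shown below):

import optuna
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Query the study's sampler for a value for each hyperparameter
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 10, 100),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
    }
    X, y = load_iris(return_X_y=True)
    # The returned score is the value by which the trial is evaluated
    return cross_val_score(RandomForestClassifier(**params), X, y, cv=3).mean()

if __name__ == "__main__":
    # Journal-file storage, so several workers can share one study
    storage = optuna.storages.JournalStorage(
        optuna.storages.JournalFileStorage("db/journal.log")
    )
    study = optuna.create_study(
        study_name="my_study",
        storage=storage,
        direction="maximize",
        load_if_exists=True,
    )
    study.optimize(objective, n_trials=100)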
We can see some of the output from the journal log (here lightly edited for readability):
{"op_code": 5, "worker_id": "ec5b2781-8724-4351-8e7e-7cad2cc457f0-2259702416",
"trial_id": 100, "param_name": "min_samples_leaf", "param_value_internal": 1.0,
"distribution": {"name": "IntDistribution",
"attributes": {
"log": false,
"step": 1,
"low": 1,
"high": 10}}}
{"op_code": 8, "worker_id": "c86b6df2-2b1c-4a53-825f-94d22a4a7604-2328590375",
"trial_id": 98, "user_attr": {"timestamp": "2024-11-08 11:30:52"}}
{"op_code": 6, "worker_id": "c86b6df2-2b1c-4a53-825f-94d22a4a7604-2328590375",
"trial_id": 98, "state": 1, "values": [0.9298245614035088],
"datetime_complete": "2024-11-08T11:30:52.108614"}
{"op_code": 8, "worker_id": "ec5b2781-8724-4351-8e7e-7cad2cc457f0-2259702416",
"trial_id": 100, "user_attr": {"timestamp": "2024-11-08 11:30:52"}}
{"op_code": 6, "worker_id": "ec5b2781-8724-4351-8e7e-7cad2cc457f0-2259702416",
"trial_id": 100, "state": 1, "values": [0.9824561403508771],
"datetime_complete": "2024-11-08T11:30:52.242400"}
We can get the best result from the study via the Optuna CLI, here in YAML format:
$ ml Optuna
$ optuna --storage db/journal.log --storage-class JournalFileStorage \
--study-name my_study best-trial -f yaml
datetime_complete: "2024-11-08 11:30:38"
datetime_start: "2024-11-08 11:30:38"
duration: "0:00:00.342559"
number: 21
params:
  max_depth: 3
  min_samples_leaf: 8
  min_samples_split: 8
  n_estimators: 76
state: COMPLETE
user_attrs:
  timestamp: "2024-11-08 11:30:38"
value: 0.9912280701754386
In Optuna, we can use the best trial (or the best few trials) as a starting point for fine-tuning our hyperparameters. For example, you can use the number property to save the state of each run, and then load the state of your best run in your next optimization to continue tuning. Alternatively, you can use your trial scores to generate a new set of constraints for further optimization - there are many different options. You can even continue the same study and use the entirety of the random sampling to inform the other samplers available in Optuna, or you can write your own sampler thanks to Optuna's object-oriented structure.
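For instance, continuing the same study with a model-based sampler could look like this (a minimal sketch, assuming the study and the objective function from the example above):

import optuna

storage = optuna.storages.JournalStorage(
    optuna.storages.JournalFileStorage("db/journal.log")
)

# Reload the existing study with a model-based sampler attached; the
# trials already recorded in the journal file inform its suggestions
study = optuna.load_study(
    study_name="my_study",
    storage=storage,
    sampler=optuna.samplers.TPESampler(),
)

# Inspect the best trial found so far
print(study.best_trial.number, study.best_trial.params)

# Further trials are suggested by the new sampler, seeded by the random
# sampling completed so far; objective is the function defined above
study.optimize(objective, n_trials=50)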
Advanced usage¶
MPI¶
Using MPI with Optuna is straightforward: since all trial bookkeeping goes through the shared storage, each MPI rank can simply run its own optimization loop against the same study.
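A minimal sketch using mpi4py (the study name, journal path, and toy objective are illustrative assumptions):

from mpi4py import MPI
import optuna

comm = MPI.COMM_WORLD

def objective(trial):
    # Stand-in objective; substitute your own, e.g. the sklearn one above
    x = trial.suggest_float("x", -10.0, 10.0)
    return (x - 2.0) ** 2

storage = optuna.storages.JournalStorage(
    optuna.storages.JournalFileStorage("db/journal.log")
)

# Rank 0 creates (or loads) the study; the other ranks wait at the barrier
if comm.Get_rank() == 0:
    optuna.create_study(
        study_name="my_mpi_study", storage=storage, load_if_exists=True
    )
comm.Barrier()

# Every rank then runs its own optimization loop; the shared journal
# file coordinates which hyperparameters each trial receives
study = optuna.load_study(study_name="my_mpi_study", storage=storage)
study.optimize(objective, n_trials=25)

Launched with e.g. srun or mpirun, each rank then contributes trials to the same study.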
Services¶
Given a set of optimization scripts that are suitably written, you can run a series of batch optimizations programmatically. You can run a small service on the login node that submits jobs accordingly using the Optuna CLI, for example:
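The service itself is site-specific, but as a minimal sketch, a systemd user unit (placed at, e.g., ~/.config/systemd/user/optuna-slurm.service) wrapping a hypothetical submission script could look like this:

[Unit]
Description=Optuna-driven Slurm batch submission (example)

[Service]
Type=simple
# Hypothetical script that queries the study (e.g. via the Optuna CLI)
# and submits jobs with sbatch
ExecStart=%h/bin/optuna-submit-loop.sh
Restart=on-failure

[Install]
WantedBy=default.target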
Enable the service by running the following:
$ systemctl --user daemon-reload && \
systemctl --user enable optuna-slurm && \
systemctl --user start optuna-slurm
You can check the service by running systemctl --user status optuna-slurm. The advantage of such a service is that it can survive reboots of the login node, which is useful for very long-running batches. The job of distributing the workload is then left to Slurm.