
What can I do with this repository structure?

This document outlines all the advanced features and capabilities available in PRiSM.

⚡ Your Superpowers

Override any config parameter from command line
python src/main.py trainer.max_epochs=20 model.optimizer.lr=1e-4

Note: You can also add new parameters with the + sign.

python src/main.py +model.new_param="owo"
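Under the hood, an override like trainer.max_epochs=20 walks the nested config by its dotted key, and the + prefix marks keys that don't exist yet. A simplified stdlib sketch of these semantics (not Hydra's actual implementation; values stay strings here, whereas Hydra converts types):

```python
# Illustrative sketch of dotted-key override semantics:
# "a.b.c=value" updates an existing key; a "+" prefix is
# required to introduce a brand-new key.
def apply_override(cfg: dict, override: str) -> None:
    key, _, value = override.partition("=")
    allow_new = key.startswith("+")
    key = key.lstrip("+")
    *parents, leaf = key.split(".")
    node = cfg
    for p in parents:
        node = node.setdefault(p, {}) if allow_new else node[p]
    if not allow_new and leaf not in node:
        raise KeyError(f"unknown key '{leaf}' (use +{key} to add it)")
    node[leaf] = value

cfg = {"trainer": {"max_epochs": 10}, "model": {"optimizer": {"lr": 1e-3}}}
apply_override(cfg, "trainer.max_epochs=20")   # override existing key
apply_override(cfg, "+model.new_param=owo")    # add new key with "+"
```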
Train on CPU, GPU, multi-GPU and TPU
# train on CPU
python src/main.py trainer=cpu

# train on 1 GPU
python src/main.py trainer=gpu

# train on TPU
python src/main.py +trainer.tpu_cores=8

# train with DDP (Distributed Data Parallel) (4 GPUs)
python src/main.py trainer=ddp trainer.devices=4

# train with DDP (Distributed Data Parallel) (8 GPUs, 2 nodes)
python src/main.py trainer=ddp trainer.devices=4 trainer.num_nodes=2

# simulate DDP on CPU processes
python src/main.py trainer=ddp_sim trainer.devices=2

# accelerate training on mac
python src/main.py trainer=mps
Train with mixed precision
# train with pytorch native automatic mixed precision (AMP)
python src/main.py trainer=gpu +trainer.precision=16
Train model with any logger available in PyTorch Lightning, like W&B or TensorBoard
# set project and entity names in `configs/logger/wandb`
wandb:
  project: "your_project_name"
  entity: "your_wandb_team_name"
# train model with Weights&Biases (link to wandb dashboard should appear in the terminal)
python src/main.py logger=wandb

Note: Lightning provides convenient integrations with most popular logging frameworks. Learn more here.

Note: Using wandb requires you to set up an account first. After that, just complete the config as shown above.

Note: Click here to see example wandb dashboard generated with this template.

Train model with chosen experiment config
# For probing experiments (task_dataset_model format)
python src/main.py experiment=probing/lid_fleurs_powsm

# For inference experiments
python src/main.py experiment=inference/transcribe_powsm data=doreco data.dataset_name=voxangeles task_name=inf_voxangeles_powsm

Note: Experiment configs are organized in configs/experiment/ with subdirectories:

  • probing/ - Probing experiment configs (format: task_dataset_model)
  • inference/ - Inference experiment configs
  • cascade/ - Cascade experiment configs
Attach some callbacks to run
python src/main.py callbacks=default

Note: Callbacks can be used for things such as model checkpointing, early stopping, and many more.

Note: Callbacks that monitor metrics (e.g., model_checkpoint, early_stopping) are only active during training mode (train: True). They are not used during inference.

Note: Callbacks configs are placed in configs/callbacks/.

Use different tricks available in PyTorch Lightning
# gradient clipping may be enabled to avoid exploding gradients
python src/main.py +trainer.gradient_clip_val=0.5

# run validation loop 4 times during a training epoch
python src/main.py +trainer.val_check_interval=0.25

# accumulate gradients
python src/main.py +trainer.accumulate_grad_batches=10

# terminate training after 12 hours
python src/main.py +trainer.max_time="00:12:00:00"

Note: PyTorch Lightning provides over 40 useful trainer flags.
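The max_time value above follows a DD:HH:MM:SS format ("00:12:00:00" means 12 hours). A small sketch of how such a string maps to a duration (illustrative only; Lightning parses this internally):

```python
from datetime import timedelta

# Parse a "DD:HH:MM:SS" string like the one passed to trainer.max_time.
def parse_max_time(spec: str) -> timedelta:
    days, hours, minutes, seconds = (int(part) for part in spec.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)

limit = parse_max_time("00:12:00:00")  # 12-hour training budget
```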

Easily debug
# runs 1 epoch in default debugging mode
# uses Hydra's run directory for all outputs
# sets level of all command line loggers to 'DEBUG'
# enforces debug-friendly configuration
python src/main.py debug=default

# run 1 train, val and test loop, using only 1 batch
python src/main.py debug=fdr

# print execution time profiling
python src/main.py debug=profiler

# try overfitting to 1 batch
python src/main.py debug=overfit

# raise exception if there are any numerical anomalies in tensors, like NaN or +/-inf
python src/main.py +trainer.detect_anomaly=true

# use only 20% of the data
python src/main.py +trainer.limit_train_batches=0.2 \
+trainer.limit_val_batches=0.2 +trainer.limit_test_batches=0.2

Note: Visit configs/debug/ for different debugging configs.

Resume training from checkpoint
python src/main.py ckpt_path="/path/to/ckpt/name.ckpt"

Note: Checkpoint can be either a path or a URL.

Note: Currently, loading a ckpt doesn't resume the logger experiment, but it will be supported in a future Lightning release.

Evaluate checkpoint on test dataset
python src/main.py test=True train=False ckpt_path="/path/to/ckpt/name.ckpt" experiment=your_experiment

Note: Checkpoint can be either a path or a URL.

Note: You need to specify the experiment config that matches your training configuration.

Create a sweep over hyperparameters
# this will run 6 experiments one after the other,
# each with different combination of batch_size and learning rate
python src/main.py -m data.batch_size=32,64,128 model.lr=0.001,0.0005

Note: Hydra composes configs lazily at job launch time. If you change code or configs after launching a job/sweep, the final composed configs might be impacted.
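The comma-separated values expand into the Cartesian product of all overrides, which is why the command above launches 3 × 2 = 6 jobs. A sketch of the expansion:

```python
from itertools import product

# The "-m" sweep expands comma-separated override values into the
# Cartesian product of all combinations (illustration of the semantics).
batch_sizes = [32, 64, 128]
learning_rates = [0.001, 0.0005]
jobs = [
    {"data.batch_size": bs, "model.lr": lr}
    for bs, lr in product(batch_sizes, learning_rates)
]
```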

Create a sweep over hyperparameters with Optuna
# this will run hyperparameter search defined in `configs/hparams_search/mnist_optuna.yaml`
# over chosen experiment config
python src/main.py -m hparams_search=mnist_optuna experiment=probing/lid_fleurs_powsm

Note: Using Optuna Sweeper doesn't require you to add any boilerplate to your code, everything is defined in a single config file.

Warning: Optuna sweeps are not failure-resistant (if one job crashes then the whole sweep crashes).

Execute all experiments from folder
# Execute all probing experiments
python src/main.py -m 'experiment=probing/glob(*)'

# Execute all inference experiments
python src/main.py -m 'experiment=inference/glob(*)'

Note: Hydra provides special syntax for controlling behavior of multiruns. Learn more here. Experiment configs are organized in subdirectories under configs/experiment/.

Execute run for multiple different seeds
python src/main.py -m seed=1,2,3,4,5 trainer.deterministic=True logger=csv tags=["benchmark"]

Note: trainer.deterministic=True makes PyTorch behave deterministically, but it impacts performance.

Execute sweep on a remote AWS cluster

Note: This should be achievable with a simple config using the Ray AWS launcher for Hydra. An example is not implemented in this template.

Use Hydra tab completion

Note: Hydra allows you to autocomplete config argument overrides in the shell as you write them, by pressing the tab key. Read the docs.

Apply pre-commit hooks
pre-commit run -a

Note: Apply pre-commit hooks to do things like auto-formatting code and configs, performing code analysis, or removing output from Jupyter notebooks.

Update pre-commit hook versions in .pre-commit-config.yaml with:

pre-commit autoupdate
Run tests
# run all tests
pytest

# run tests from specific file
pytest tests/test_train.py

# run all tests except the ones marked as slow
pytest -k "not slow"
Use tags

Each experiment should be tagged in order to easily filter them across files or in the logger UI:

python src/main.py tags=["fleurs","powsm","lid"]

Note: Tags are structured as [dataset, model, task] in probing experiments for consistency.

Note: You might need to escape the bracket characters in your shell with python src/main.py tags=\["fleurs","powsm","lid"\].

If no tags are provided, you will be asked to input them from the command line:

>>> python src/main.py tags=[]
[2022-07-11 15:40:09,358][src.utils.utils][INFO] - Enforcing tags! <cfg.extras.enforce_tags=True>
[2022-07-11 15:40:09,359][src.utils.rich_utils][WARNING] - No tags provided in config. Prompting user to input tags...
Enter a list of comma separated tags (dev):

If no tags are provided for multirun, an error will be raised:

>>> python src/main.py -m +x=1,2,3 tags=[]
ValueError: Specify tags before launching a multirun!

Note: Appending lists from the command line is currently not supported in Hydra :(
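The behavior above can be summarized as: non-empty tags pass through, an empty tag list triggers a prompt in single-run mode, and raises an error in multirun mode. A simplified sketch (the real logic lives in src/utils; the function name here is illustrative):

```python
# Sketch of the tag-enforcement logic described above.
def enforce_tags(tags: list, multirun: bool) -> list:
    if tags:
        return tags
    if multirun:
        # Can't prompt interactively across multirun jobs.
        raise ValueError("Specify tags before launching a multirun!")
    entered = input("Enter a list of comma separated tags (dev): ") or "dev"
    return [t.strip() for t in entered.split(",")]

tags = enforce_tags(["fleurs", "powsm", "lid"], multirun=False)
```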

Hyperparameter Search

You can define hyperparameter search by adding new config file to configs/hparams_search.

Show example hyperparameter search config
# @package _global_

defaults:
  - override /hydra/sweeper: optuna

# choose metric which will be optimized by Optuna
# make sure this is the correct name of some metric logged in lightning module!
optimized_metric: "val/acc_best"

# here we define Optuna hyperparameter search
# it optimizes for value returned from function with @hydra.main decorator
hydra:
  sweeper:
    _target_: hydra_plugins.hydra_optuna_sweeper.optuna_sweeper.OptunaSweeper

    # 'minimize' or 'maximize' the objective
    direction: maximize

    # total number of runs that will be executed
    n_trials: 20

    # choose Optuna hyperparameter sampler
    # docs: https://optuna.readthedocs.io/en/stable/reference/samplers.html
    sampler:
      _target_: optuna.samplers.TPESampler
      seed: 1234
      n_startup_trials: 10 # number of random sampling runs before optimization starts

    # define hyperparameter search space
    params:
      model.optimizer.lr: interval(0.0001, 0.1)
      data.batch_size: choice(32, 64, 128, 256)
      model.net.lin1_size: choice(64, 128, 256)
      model.net.lin2_size: choice(64, 128, 256)
      model.net.lin3_size: choice(32, 64, 128, 256)
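The params block uses the Optuna sweeper's search-space syntax: interval(lo, hi) draws a float uniformly from a range, and choice(...) picks one of the listed options. A rough stdlib sketch of a single randomly sampled trial (illustrative only; the actual TPESampler is much smarter than uniform random):

```python
import random

# Draw one trial from a search space like the one above.
# interval(lo, hi) -> uniform float; choice(...) -> pick one option.
space = {
    "model.optimizer.lr": ("interval", 0.0001, 0.1),
    "data.batch_size": ("choice", 32, 64, 128, 256),
}

def sample_trial(space, rng):
    trial = {}
    for name, (kind, *args) in space.items():
        if kind == "interval":
            lo, hi = args
            trial[name] = rng.uniform(lo, hi)
        else:  # "choice"
            trial[name] = rng.choice(args)
    return trial

trial = sample_trial(space, random.Random(1234))
```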

Next, execute it with: python src/main.py -m hparams_search=mnist_optuna

Using this approach doesn't require adding any boilerplate to your code; everything is defined in a single config file. The only necessary thing is to return the optimized metric value from the launch file.

You can use different optimization frameworks integrated with Hydra, like Optuna, Ax or Nevergrad.

Optimization results will be saved under Hydra's run directory (see configs/hydra).

This approach doesn't support resuming an interrupted search or advanced techniques like pruning - for more sophisticated search and workflows, you should probably write a dedicated optimization task (without the multirun feature).

Distributed Training

Lightning supports multiple ways of doing distributed training. The most common one is DDP, which spawns a separate process for each GPU and averages gradients between them. To learn about other approaches, read the Lightning docs.

You can run DDP with 4 GPUs like this:

python src/main.py trainer=ddp trainer.devices=4

Note: When using DDP you have to be careful how you write your models - read the docs.
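Why averaging per-process gradients is valid: for a loss defined as a mean over samples, the average of gradients computed on equal-sized shards equals the full-batch gradient. A toy numeric check with a 1-D linear model (illustration only):

```python
# Each "process" computes a gradient on its own data shard; DDP then
# averages them. With equal-sized shards and a mean loss, this equals
# the full-batch gradient. Toy model: y = w * x with squared error.
def grad(w, shard):
    # d/dw mean((w*x - y)^2) = mean(2*(w*x - y)*x)
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0), (4.0, 9.0)]
w = 0.5
shards = [data[:2], data[2:]]                      # one shard per "GPU"
ddp_grad = sum(grad(w, s) for s in shards) / len(shards)
full_grad = grad(w, data)
```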

Experiment Tracking

PyTorch Lightning supports many popular logging frameworks: Weights&Biases, Neptune, Comet, MLflow, and TensorBoard.

These tools help you keep track of hyperparameters and output metrics and allow you to compare and visualize results. To use one of them simply complete its configuration in configs/logger and run:

python src/main.py logger=logger_name

You can use many of them at once (see configs/logger/many_loggers.yaml for example).

You can also write your own logger.

Lightning provides a convenient method for logging custom metrics from inside the LightningModule. Read the docs or take a look at the geolocation example.
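At its core, a metrics logger just records hyperparameters once and appends metric rows per step. A minimal stdlib sketch of that contract (illustrative only; a real custom logger would subclass Lightning's Logger base class and be selected via configs/logger):

```python
import csv
import io

# Minimal sketch of what a logger does: store hyperparameters and
# append one row per logged metric. Not the Lightning API.
class TinyCSVLogger:
    def __init__(self, stream):
        self.hparams = {}
        self.writer = csv.writer(stream)
        self.writer.writerow(["step", "name", "value"])

    def log_hyperparams(self, params: dict) -> None:
        self.hparams.update(params)

    def log_metrics(self, metrics: dict, step: int) -> None:
        for name, value in metrics.items():
            self.writer.writerow([step, name, value])

buf = io.StringIO()
logger = TinyCSVLogger(buf)
logger.log_hyperparams({"lr": 1e-3})
logger.log_metrics({"val/acc": 0.91}, step=10)
```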