Experiment management

In lumo, the Experiment class provides the guarantees needed to make experiments reproducible. Specifically, Experiment ensures reproducibility from four perspectives: path management, version control, parameter recording, and backup. It also lowers the barrier to use through visual panels, command-line interfaces, and other tools.

Path Management

To ensure that paths are never duplicated, Experiment assigns a unique experiment ID (test_name) to each experiment run. Experiment also provides three types of storage paths: one for information (info_dir), one for binary files (blob_dir), and one for temporary files (cache_dir). They are laid out as follows:

- <cache_root>
    - <exp_name>
        - <cache_dir>

- <info_root>
    - <exp_name>
        - <info_dir>

- <blob_root>
    - <exp_name>
        - <blob_dir>
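
The layout above can be sketched with pathlib. This is only an illustration, not lumo's implementation: the root directories and the helper name experiment_dirs are hypothetical, and naming the leaf directory after test_name is an assumption based on the retrieve example later in this page.

```python
from pathlib import Path


def experiment_dirs(roots, exp_name, test_name):
    """Compose <root>/<exp_name>/<test_name> for each storage kind.

    `roots` maps a kind ("info", "blob", "cache") to its root directory.
    Assumption: the per-run directory is named after test_name.
    """
    return {
        f"{kind}_dir": Path(root) / exp_name / test_name
        for kind, root in roots.items()
    }


# Hypothetical roots, chosen only for this example.
roots = {
    "info": "/tmp/lumo/info",
    "blob": "/tmp/lumo/blob",
    "cache": "/tmp/lumo/cache",
}
dirs = experiment_dirs(roots, "simclr.simclrexp", "230314.000.a3t")
for name, path in dirs.items():
    print(name, path)
```

Because test_name is unique per run, the three directories of one run never collide with those of another run of the same experiment.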

Version Control

The lifecycle of an Experiment consists of three stages: start, progress, and end. A series of ExpHook classes perform specific operations at each stage. Among them, lumo.exp.exphook.GitCommit handles git commits: at on_start it checks for file changes and, if any exist, commits a snapshot of the current files to the lumo_experiments branch. The commit information for the current code is recorded in the info_dir of the Experiment instance and can be viewed through exp.properties['git'].
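
The hook mechanism described above can be illustrated with a small, self-contained sketch. It is not lumo's ExpHook API: the classes Hook, SnapshotHook, and MiniExperiment are hypothetical, and a content hash stands in for a real git commit. It only shows the idea of checking for changes at on_start and recording a snapshot when something changed.

```python
import hashlib


class Hook:
    """Base hook: every lifecycle callback is a no-op by default."""

    def on_start(self, exp):
        pass

    def on_progress(self, exp, ratio):
        pass

    def on_end(self, exp):
        pass


class SnapshotHook(Hook):
    """Toy stand-in for a commit hook: snapshot only when content changed."""

    def __init__(self, files):
        self.files = files      # mapping: file name -> source text
        self.snapshots = []     # digests of the snapshots taken so far

    def on_start(self, exp):
        payload = repr(sorted(self.files.items())).encode()
        digest = hashlib.sha1(payload).hexdigest()[:8]
        if not self.snapshots or self.snapshots[-1] != digest:
            self.snapshots.append(digest)   # "commit" a new snapshot
        # Record which snapshot this run corresponds to.
        exp.properties["snapshot"] = digest


class MiniExperiment:
    """Minimal experiment that forwards lifecycle events to its hooks."""

    def __init__(self, hooks):
        self.hooks = list(hooks)
        self.properties = {}

    def start(self):
        for hook in self.hooks:
            hook.on_start(self)

    def progress(self, ratio):
        for hook in self.hooks:
            hook.on_progress(self, ratio)

    def end(self):
        for hook in self.hooks:
            hook.on_end(self)


files = {"train.py": "print('v1')"}
hook = SnapshotHook(files)
MiniExperiment([hook]).start()
MiniExperiment([hook]).start()       # unchanged files: no new snapshot
files["train.py"] = "print('v2')"
exp = MiniExperiment([hook])
exp.start()                          # changed files: a second snapshot
```

Keeping the snapshot logic in a hook rather than in the experiment class means it can be enabled or disabled without touching the training code.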

Information Recording

Information recording covers startup parameters, such as hyperparameters and program execution parameters, as well as runtime and post-run information, such as metrics, execution time, and other metadata. Everything except the hyperparameters is recorded automatically by Experiment at .start(). The hyperparameters of the experiment can be recorded with exp.dump_info('params', params_dict).

When using lumo.Trainer for training, the hyperparameters in use are recorded automatically under the params key.

Metrics can be recorded on the Experiment instance using .dump_metric() and .dump_metrics(), for example:

max_acc = exp.dump_metric("acc", acc, "cls_acc", cls_acc)
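
The variable name max_acc suggests that the call returns the best value recorded so far. Assuming that reading is right, the idea can be sketched as follows. The helper track_best and its cmp parameter are hypothetical and are not part of lumo; a plain dict stands in for the experiment's metric store.

```python
def track_best(store, key, value, cmp="max"):
    """Keep the best value seen for `key` in `store` and return it.

    `cmp` selects whether larger ("max") or smaller ("min") is better.
    """
    if cmp not in ("max", "min"):
        raise ValueError(f"cmp must be 'max' or 'min', got {cmp!r}")
    old = store.get(key)
    is_better = (
        old is None
        or (cmp == "max" and value > old)
        or (cmp == "min" and value < old)
    )
    if is_better:
        store[key] = value
    return store[key]


metrics = {}
for acc in (0.71, 0.78, 0.75):      # accuracy after each evaluation
    max_acc = track_best(metrics, "acc", acc)

for loss in (0.9, 0.4, 0.6):        # lower is better for a loss
    min_loss = track_best(metrics, "loss", loss, cmp="min")
```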

Here is an example of the contents of exp.properties:

{'agent': nan,
 'backup': {'23-03-17-161847': {'backend': 'github',
                                'number': 4,
                                'repo': 'sailist/image-classification'}},
 'deprecated': nan,
 'exception': nan,
 'execute': {'cwd': '~/python/image-classification-private',
             'exec_argv': ['train_ssl.py',
                           'train_ssl.py',
                           '--module=simclr',
                           '--device=2',
                           '--config=config/ssl/simclr/cifar100.yaml',
                           '--model=wrn282',
                           '--scan=ssl-2023.02.28'],
             'exec_bin': '~/miniconda3/bin/python3',
             'exec_file': 'train_ssl.py',
             'repo': '~/python/image-classification-private'},
 'exp_name': 'simclr.simclrexp',
 'git': {'commit': '294ccdac',
         'dep_hash': '404fc6044b2119d56a5e8b92ac02fc1c',
         'repo': '~/python/image-classification-private'},
 'hooks': {'Diary': {'loaded': True, 'msg': ''},
           'FinalReport': {'loaded': True, 'msg': ''},
           'GitCommit': {'loaded': True, 'msg': ''},
           'LastCmd': {'loaded': True, 'msg': ''},
           'LockFile': {'loaded': True, 'msg': ''},
           'RecordAbort': {'loaded': True, 'msg': ''}},
 'lock': {'accelerate': '0.16.0',
          'decorator': '5.1.1',
          'fire': '0.5.0',
          'hydra': '1.3.1',
          'joblib': '1.2.0',
          'lumo': '0.15.0',
          'numpy': '1.24.2',
          'omegaconf': '2.3.0',
          'psutil': '5.9.4',
          'torch': '1.8.1+cu101',
          'torch.version.cuda': '10.1'},
 'note': '',
 'params': {'apply_mixco': False,
            'apply_unmix': False,
            'config': 'config/ssl/simclr/cifar100.yaml',
            'dataset': 'cifar100',
            'detach_cls': True,
            'device': 2,
            'ema': True,
            'ema_alpha': 0.99,
            'epoch': 1000,
            'eval': {'batch_size': 512,
                     'num_workers': 8,
                     'pin_memory': True,
                     'shuffle': True},
            'feature_dim': 128,
            'hidden_feature_size': 128,
            'knn': True,
            'knn_k': 200,
            'knn_t': 0.1,
            'linear_eval': False,
            'lr_decay_end': 0.0005,
            'method': 'simclr',
            'model': 'wrn282',
            'module': 'simclr',
            'more_sample': True,
            'n_classes': 100,
            'optim': {'lr': 0.06,
                      'momentum': 0.9,
                      'name': 'SGD',
                      'weight_decay': 0.0005},
            'pretrain_path': None,
            'scan': 'ssl-2023.02.28',
            'seed': 1,
            'semi_eval': False,
            'stl10_unlabeled': True,
            'temperature': 0.1,
            'test': {'batch_size': 512,
                     'num_workers': 8,
                     'pin_memory': True,
                     'shuffle': False},
            'train': {'batch_size': 512,
                      'num_workers': 8,
                      'pin_memory': True,
                      'shuffle': True},
            'train_ending': 10,
            'train_linear': True,
            'train_strategy': 'ending',
            'warmup_epochs': 0,
            'warmup_from': 0.01,
            'with_bn': False},
 'pinfo': {'hash': '62ee6de98b381872e200e82901ad51f7',
           'obj': {'argv': ['~/miniconda3/bin/python3',
                            'train_ssl.py',
                            'train_ssl.py',
                            '--module=simclr',
                            '--device=2',
                            '--config=config/ssl/simclr/cifar100.yaml',
                            '--model=wrn282',
                            '--scan=ssl-2023.02.28'],
                   'pid': 27687,
                   'pname': 'python3',
                   'pstart': 1678763482.5},
           'pid': 27687},
 'progress': {'finished': False,
              'last_edit_time': '23-03-14-212932',
              'ratio': 1.0,
              'start': '23-03-14-111124',
              'update_from': None},
 'rerun': {'from': '230313.015.99t', 'repeat': 1},
 'test_name': '230314.000.a3t',
 ...
 }

Retrieve Experiment

Watcher consolidates information for all experiments, allowing users to search for a specific experiment.

from lumo import Watcher, Experiment

w = Watcher()
df = w.load() # all experiments

exp = Experiment.from_cache(df.iloc[0].to_dict())

For a known experiment with a given test_name, the Experiment instance can be retrieved directly using the retrieve method:

>>> w.retrieve('230306.012.d5t')
Experiment(info_dir=".../.lumo/experiments/moco.mocoexp/230306.012.d5t")

Visual Panel

A fixed-style panel can never satisfy everyone’s needs. lumo therefore provides dynamic panels built on pandas and panel: apart from a few fixed parts, all styling is added by the user:

from lumo import Watcher
w = Watcher()
df = w.load()

# ... filter operations ...

new_df = ...

w.panel(new_df)
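
The filter step elided above is ordinary pandas. A minimal sketch, assuming the loaded DataFrame has exp_name, test_name, and finished columns: these column names are borrowed from keys shown in exp.properties and may not match what Watcher.load() actually returns. The DataFrame below is a hand-built stand-in, and the third test_name is made up for the example.

```python
import pandas as pd

# Stand-in for the DataFrame returned by Watcher.load().
# Column names and the last row are assumptions for illustration only.
df = pd.DataFrame(
    [
        {"exp_name": "simclr.simclrexp", "test_name": "230314.000.a3t", "finished": False},
        {"exp_name": "moco.mocoexp", "test_name": "230306.012.d5t", "finished": True},
        {"exp_name": "moco.mocoexp", "test_name": "230307.001.b2t", "finished": False},
    ]
)

# Keep only the finished runs of one experiment.
mask = (df["exp_name"] == "moco.mocoexp") & df["finished"]
new_df = df[mask].reset_index(drop=True)
print(new_df)
```

The resulting new_df plays the role of the frame passed to w.panel(new_df) in the snippet above.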

Repetitive Experiment

Repetitive experiments mainly occur in two scenarios:

- To verify the stability of the results, the experiment is rerun with the same parameters and different random seeds.
- The experiment failed partway through because of memory, disk space, or other problems, and needs to be rerun with similar parameters.
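
The first scenario amounts to generating parameter sets that differ only in the seed. A minimal sketch in plain Python follows; the helper seed_variants is hypothetical and independent of lumo, and the parameter values are taken from the params example earlier in this page.

```python
import copy


def seed_variants(params, seeds):
    """Return one deep-copied params dict per seed.

    Everything except "seed" is identical across the variants, and nested
    dicts are copied so that changing one run cannot affect another.
    """
    variants = []
    for seed in seeds:
        variant = copy.deepcopy(params)
        variant["seed"] = seed
        variants.append(variant)
    return variants


base = {
    "dataset": "cifar100",
    "seed": 1,
    "optim": {"lr": 0.06, "name": "SGD"},
}
runs = seed_variants(base, seeds=[1, 2, 3])
for run in runs:
    print(run["seed"], run["optim"])
```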

Especially when scanning parameters, if only