Advanced configuration

Configuration is an important part of every machine learning project. Palladium makes it easy to keep configuration separate from code and to run the same code with different configurations.

Configuration files use Python syntax. For an introduction, please visit the Tutorial.

Palladium uses an environment variable called PALLADIUM_CONFIG to look up the location of the configuration file.
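
As a rough sketch (the specific factories and parameters below are illustrative assumptions, not a prescription), a minimal but complete configuration file and the environment variable pointing at it could look like this:

# config.py -- a minimal, complete configuration file.
# Run Palladium against it with, for example:
#   PALLADIUM_CONFIG=config.py pld-fit
{
    'dataset_loader_train': {
        '__factory__': 'palladium.dataset.Table',
        'path': 'data.csv',
        'target_column': 'target',
        },

    'model': {
        '__factory__': 'sklearn.linear_model.LogisticRegression',
        },

    'model_persister': {
        '__factory__': 'palladium.persistence.File',
        'path': 'model-{version}',
        },
    }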

Variables

Configuration files have access to environment variables through environ, which lets you pass in values such as database credentials from the environment:

'dataset_loader_train': {
    '__factory__': 'palladium.dataset.SQL',
    'url': 'mysql://{}:{}@localhost/test?encoding=utf8'.format(
        environ['DB_USER'], environ['DB_PASS'],
        ),
    'sql': 'SELECT ...',
    }

You also have access to here, which is the path of the directory that the configuration file lives in. In this example, we point the path entry to a file called data.csv in the same folder as the configuration file:

'dataset_loader_train': {
    '__factory__': 'palladium.dataset.Table',
    'path': '{}/data.csv'.format(here),
    }

Multiple configuration files

In larger projects, it’s useful to split the configuration up into multiple files. Imagine you have a common config-data.py file and several config-model-X.py type files, each of which uses the same data loader. When using multiple files, separate the filenames with commas: PALLADIUM_CONFIG=config-data.py,config-model-1.py.
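
As a sketch of how such a split could look (the exact entries are illustrative assumptions), config-data.py would hold only the shared data loader, and each config-model-X.py only its model definition:

# config-data.py -- entries shared by all models
{
    'dataset_loader_train': {
        '__factory__': 'palladium.dataset.Table',
        'path': '{}/data.csv'.format(here),
        },
    }

# config-model-1.py -- entries specific to one model variant
{
    'model': {
        '__factory__': 'sklearn.linear_model.LogisticRegression',
        'C': 1.0,
        },
    }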

If your configuration files share some entries (keys), then files coming later in the list win and override entries from files earlier in the list. Thus, if config-data.py contains {'a': 42, 'b': 6} and config-model-1.py contains {'b': 7, 'c': 99}, the resulting configuration will be {'a': 42, 'b': 7, 'c': 99}.
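
The override rule for the shared key 'b' behaves like a plain Python dict merge; this stand-alone snippet (an illustration only, not Palladium code) reproduces the result:

# Later files win for keys that appear in more than one file:
config_data = {'a': 42, 'b': 6}      # contents of config-data.py
config_model_1 = {'b': 7, 'c': 99}   # contents of config-model-1.py
merged = {**config_data, **config_model_1}
assert merged == {'a': 42, 'b': 7, 'c': 99}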

Avoiding duplication in your configuration

Even with multiple files, you’ll sometimes end up repeating portions of configuration between files. The __copy__ directive lets you copy an existing entry and override individual values within it. Imagine your dataset loaders for train and test are identical, except for the location of the CSV file:

'dataset_loader_train': {
    '__factory__': 'palladium.dataset.Table',
    'path': '{}/train.csv'.format(here),
    'many': '...',
    'more': {'...'},
    'entries': ['...'],
    }

'dataset_loader_test': {
    '__factory__': 'palladium.dataset.Table',
    'path': '{}/test.csv'.format(here),
    'many': '...',
    'more': {'...'},
    'entries': ['...'],
    }

With __copy__, you can reduce this down to:

'dataset_loader_train': {
    '__factory__': 'palladium.dataset.Table',
    'path': '{}/train.csv'.format(here),
    'many': '...',
    'more': {'...'},
    'entries': ['...'],
    }

'dataset_loader_test': {
    '__copy__': 'dataset_loader_train',
    'path': '{}/test.csv'.format(here),
    }

Here, dataset_loader_test copies all entries from dataset_loader_train, and the path given alongside __copy__ overrides the copied value. Reducing duplication in your configuration this way helps avoid errors.