ml4chem.data package

Submodules

ml4chem.data.handler module

ml4chem.data.parser module

ml4chem.data.preprocessing module

class ml4chem.data.preprocessing.Preprocessing(preprocessor, purpose)

Bases: object

A wrapper for preprocessing data with sklearn

This class is intended to be a wrapper around sklearn. The idea is to make it easier to preprocess data without placing too much burden on users.

Parameters:
  • preprocessor (tuple) – Tuple with structure: (‘name’, {kwargs}).

  • purpose (str) – Supported purposes are: ‘training’, ‘inference’.
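
The (‘name’, {kwargs}) structure can be illustrated with a minimal sketch that resolves such a tuple to a scikit-learn preprocessor. The build_preprocessor helper and its lookup table are hypothetical and exist only for this example; they are not ML4Chem's implementation, and the names ML4Chem accepts may differ.

```python
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Hypothetical lookup table, for illustration only.
AVAILABLE = {
    "minmaxscaler": MinMaxScaler,
    "standardscaler": StandardScaler,
    "normalizer": Normalizer,
}


def build_preprocessor(preprocessor):
    """Resolve a ('name', {kwargs}) tuple to an sklearn preprocessor instance."""
    name, kwargs = preprocessor
    try:
        cls = AVAILABLE[name.lower()]
    except KeyError:
        raise ValueError(f"Preprocessor {name!r} is not implemented") from None
    # The kwargs dictionary is forwarded unchanged to the sklearn constructor.
    return cls(**(kwargs or {}))


scaler = build_preprocessor(("MinMaxScaler", {"feature_range": (-1, 1)}))
scaled = scaler.fit_transform([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
```

Keeping the keyword arguments in a plain dictionary means any option documented for the sklearn class can be passed through without the wrapper needing to know about it.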

Notes

The list of preprocessing modules available on sklearn and options can be found at:

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

If you need a preprocessor that is not implemented yet, create a bug report or follow the structure shown below to implement it yourself (PRs are very welcome). In principle, all preprocessors can be implemented.

fit(stacked_features, scheduler)

Fit features

Parameters:
  • stacked_features (list) – List of stacked features.

  • scheduler (str) – The dask scheduler to use.

Returns:

scaled_features – Scaled features using requested preprocessor.

Return type:

list

save_to_file(preprocessor, path)

Save the preprocessor object to a file

Parameters:
  • preprocessor (obj) – Preprocessing object

  • path (str) – Path to save .prep file.
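
A minimal sketch of persisting a fitted preprocessor and restoring it later. It assumes joblib as the serializer purely for illustration, since the text above does not say which library save_to_file uses; the file name is also made up.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import StandardScaler

# Fit a scaler on some training features.
scaler = StandardScaler().fit([[1.0, 2.0], [3.0, 6.0]])

# Assumption: joblib is used here only as an example serializer.
path = os.path.join(tempfile.mkdtemp(), "scaler.prep")
joblib.dump(scaler, path)

# Later (for example at inference time) the same fitted object is restored.
restored = joblib.load(path)
```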

set(purpose)

Set a preprocessing method

Parameters:

purpose (str) – Supported purposes are: ‘training’, ‘inference’.

Return type:

Preprocessor object.

transform(raw_features)

Transform features to scaled features

Given a Preprocessor object, return the scaled features.

Parameters:

raw_features (list) – Unscaled features.

Returns:

scaled_features – Scaled features using the scaler set in self.set().

Return type:

list
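
The difference between fitting and transforming can be sketched with scikit-learn directly: a scaler is fitted once on training features and then reused, unchanged, on new features. This uses plain sklearn calls rather than the Preprocessing class, and the feature values are invented, so it illustrates the idea only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_features = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
new_features = np.array([[2.0, 3.0]])

scaler = StandardScaler()

# Training: learn the per-column mean and standard deviation, then scale.
scaled_train = scaler.fit_transform(train_features)

# Inference: reuse the statistics learned above; nothing is refitted.
scaled_new = scaler.transform(new_features)
```

Because new_features here equals the per-column mean of the training data, it maps to zeros after scaling.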

ml4chem.data.serialization module

ml4chem.data.serialization.dump(data, filename='data.db')

Serialize data

This function dumps data and ML4Chem dictionaries serialized with msgpack or torch (depending on the models).

Parameters:
  • data (dict or array) – A dictionary or array containing data to be saved to file using msgpack.

  • filename (str) – Name of the file to save to disk.

ml4chem.data.serialization.load(filename)

Load a msgpack file

Parameters:

filename (str) – Path of file to load from disk.
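
A round trip through msgpack, the format named above, might look as follows. This sketch calls the msgpack package directly instead of dump and load, and the dictionary contents are made up for illustration.

```python
import os
import tempfile

import msgpack

# Illustrative data; real ML4Chem dictionaries will have a different layout.
data = {"energies": [-1.5, -2.25], "symbols": ["H", "O"]}
filename = os.path.join(tempfile.mkdtemp(), "data.db")

# Serialize the dictionary to bytes and write it to disk.
with open(filename, "wb") as f:
    f.write(msgpack.packb(data, use_bin_type=True))

# Read the bytes back and deserialize them into Python objects.
with open(filename, "rb") as f:
    loaded = msgpack.unpackb(f.read(), raw=False)
```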

ml4chem.data.utils module

Module contents