ml4chem.data package

Submodules

ml4chem.data.handler module

ml4chem.data.parser module

ml4chem.data.preprocessing module

class ml4chem.data.preprocessing.Preprocessing(preprocessor, purpose)

Bases: object

A wrapper for preprocessing data with sklearn

This class is intended as a thin wrapper around sklearn. The idea is to make it easier to preprocess data without placing much burden on users.

Parameters:

preprocessor (tuple) – Tuple with structure: (‘name’, {kwargs}).
purpose (str) – Supported purposes are ‘training’ and ‘inference’.

Notes

The list of preprocessing modules available on sklearn and options can be found at:

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

If you need a preprocessor that is not implemented yet, open a bug report or follow the structure of the existing ones to implement it yourself (PRs are very welcome). In principle, all preprocessors can be implemented.
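
As a rough illustration of the (‘name’, {kwargs}) convention, the sketch below resolves such a tuple to a class in sklearn.preprocessing. The helper build_preprocessor is hypothetical and is not part of ML4Chem; the actual class may map names to sklearn objects differently.

```python
import sklearn.preprocessing as skp


def build_preprocessor(spec):
    """Hypothetical helper: turn ('name', {kwargs}) into an sklearn object."""
    name, kwargs = spec
    # Look the class up by name in sklearn.preprocessing; fail clearly if absent.
    cls = getattr(skp, name, None)
    if cls is None:
        raise ValueError(f"Unknown sklearn preprocessor: {name!r}")
    # Forward the keyword arguments unchanged to the sklearn constructor.
    return cls(**kwargs)


scaler = build_preprocessor(("MinMaxScaler", {"feature_range": (-1, 1)}))
```

A lookup by name like this means any class exposed by sklearn.preprocessing could be requested without a dedicated branch per preprocessor, at the cost of less control over which ones are allowed.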

fit(stacked_features, scheduler)

Fit features

Parameters:

stacked_features (list) – List of stacked features.
scheduler (str) – The dask scheduler to use.

Returns:

scaled_features – Scaled features using requested preprocessor.

Return type:

list

save_to_file(preprocessor, path)

Save the preprocessor object to file

Parameters:

preprocessor (obj) – Preprocessing object
path (str) – Path to save .prep file.
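
The on-disk format of the .prep file is not stated here. As a minimal sketch, assuming the fitted object is persisted with joblib (which is installed alongside scikit-learn), saving and restoring a scaler could look like this; the file name scaler.prep is only an example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on a tiny column of values so it has learned statistics to save.
scaler = StandardScaler().fit(np.array([[0.0], [2.0], [4.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler.prep")
    joblib.dump(scaler, path)      # assumption: joblib as the storage backend
    restored = joblib.load(path)   # the restored object keeps the fitted state

restored_mean = float(restored.mean_[0])
```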

set(purpose)

Set a preprocessing method

Parameters:

purpose (str) – Supported purposes are ‘training’ and ‘inference’.

Return type:

Preprocessor object.

transform(raw_features)

Transform features to scaled features

Given a Preprocessor object, return the scaled features.

Parameters:

raw_features (list) – Unscaled features.

Returns:

scaled_features – Scaled features using the scaler set in self.set().

Return type:

list
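
The split between fitting at ‘training’ time and transforming at ‘inference’ time can be sketched directly with sklearn. This example calls sklearn itself rather than the Preprocessing class, and the arrays are made-up values chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training features: 3 samples, 2 feature columns.
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
# A new, unseen sample to scale at inference time.
new = np.array([[2.5, 15.0]])

scaler = MinMaxScaler(feature_range=(0, 1))

# Training: learn per-column min/max and scale in one step.
scaled_train = scaler.fit_transform(train)

# Inference: reuse the learned min/max; do not refit on new data.
scaled_new = scaler.transform(new)  # [[0.25, 0.25]]
```

Reusing the fitted scaler keeps inference inputs on the same scale the model saw during training.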

ml4chem.data.serialization module

ml4chem.data.serialization.dump(data, filename='data.db')

Serialize data

This function dumps data and ML4Chem dictionaries, serialized with msgpack or torch (depending on the model).

Parameters:

data (dict or array) – A dictionary or array containing data to be saved to file using msgpack.
filename (str) – Name of the file to save to disk.

ml4chem.data.serialization.load(filename)

Load a msgpack file

Parameters:

filename (str) – Path of file to load from disk.
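
For orientation, a msgpack round trip of a plain dictionary can be sketched with the msgpack package directly. This is not the body of dump() or load(); it only shows the kind of serialization they are described as wrapping, and the dictionary contents are invented for the example.

```python
import os
import tempfile

import msgpack

# Invented example data; plain Python containers serialize without extra hooks.
data = {"energies": [-1.5, -2.25], "symbols": ["H", "O"]}

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "data.db")

    # Serialize to bytes and write them to disk in binary mode.
    with open(filename, "wb") as f:
        f.write(msgpack.packb(data, use_bin_type=True))

    # Read the bytes back and decode them; raw=False returns str, not bytes.
    with open(filename, "rb") as f:
        loaded = msgpack.unpackb(f.read(), raw=False)
```

Note that msgpack does not handle NumPy arrays or torch tensors out of the box, which may be why the description mentions torch as an alternative for some models.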
