ml4chem.data package

Submodules

ml4chem.data.handler module

class ml4chem.data.handler.Data(images, purpose=None)[source]

Bases: object

A Data class

An adequate data structure is essential for developing machine-learning models. In general, a model receives a data set (X) and a target vector (y). This class is intended to arrange both in a format that can be vectorized and that works not only with neural networks but also with support vector machines.

The central object here is the data set.

Parameters
  • images (list or object) – List of images. Supported format is from ASE.

  • purpose (str) – Whether the data is for training or inference. Supported strings are: “training” and “inference”.

get_data(purpose=None)[source]

A method to get data

Parameters

purpose (str) – The purpose of the data so that structure is prepared accordingly. Supported are: ‘training’, ‘inference’

Returns

  • self.images (dict) – Ordered dictionary of images, in the same order as the self.targets list.

  • self.targets (list) – Targets used for training the model.

get_total_number_atoms()[source]

Get the total number of atoms

get_unique_element_symbols(images=None, purpose=None)[source]

Unique element symbols in the data set

Parameters
  • images (list of images) – ASE objects.

  • purpose (str) – The supported categories are: ‘training’, ‘inference’.
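As a rough illustration of what collecting unique element symbols involves, the sketch below uses plain lists of chemical symbols as stand-ins for ASE objects. The helper name `unique_symbols` is hypothetical and not part of ML4Chem, and the real method may order or store its result differently.

```python
def unique_symbols(images):
    """Return the sorted unique chemical symbols found across all images.

    Each image is assumed to be an iterable of symbol strings, standing in
    for an ASE object (a hypothetical simplification).
    """
    symbols = set()
    for image in images:
        symbols.update(image)
    return sorted(symbols)


# Two toy "images": water and methane.
images = [["O", "H", "H"], ["C", "H", "H", "H", "H"]]
print(unique_symbols(images))
```

Sorting the result is a design choice made here so that the output is deterministic; a set alone would not guarantee any order.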

is_valid_structure(images)[source]

Check if the data has a valid structure

Parameters

images (list of atoms) – List of images.

Returns

valid – Whether or not the structure is valid.

Return type

bool

prepare_images(images, purpose=None)[source]

Function to prepare images to operate with ML4Chem

Parameters
  • images (list or object) – List of images.

  • purpose (str) – The purpose of the data so that structure is prepared accordingly. Supported are: ‘training’, ‘inference’

to_pandas()[source]

Convert data to pandas DataFrame

ml4chem.data.parser module

class ml4chem.data.parser.SinglePointCalculator(implemented_properties=None)[source]

Bases: ase.calculators.calculator.Calculator

A SinglePointCalculator class

This class creates a fake calculator that is used to populate calc.results dictionaries in ASE objects.

Parameters

implemented_properties (list) – List with supported properties.

static get_forces(atoms)[source]

Get atomic forces

Parameters

atoms (obj) – Atoms object.

Returns

The atomic forces of the molecule.

Return type

forces

static get_potential_energy(atoms)[source]

Get the potential energy

Parameters

atoms (obj) – Atoms object.

Returns

The energy of the molecule.

Return type

energy

ml4chem.data.parser.ani_to_ase(hdf5file, data_keys, trajfile=None)[source]

ANI to ASE

Parameters
  • hdf5file (hdf5, list) – hdf5 file loaded using pyanitools (or list of them).

  • data_keys (list) – List of keys to extract data.

  • trajfile (str, optional) – Name of trajectory file to be saved, by default None.

Returns

A list of Atoms objects.

Return type

atoms

ml4chem.data.parser.cjson_parser(cjsonfile, trajfile=None)[source]

Parse CJSON files

Parameters
  • cjsonfile (str) – Path to the CJSON file.

  • trajfile (str, optional) – Name of trajectory file to be saved, by default None.

Returns

A list of Atoms objects.

Return type

atoms

ml4chem.data.parser.cjson_to_ase(cjson)[source]
ml4chem.data.parser.get_total_energy(cjson)[source]

ml4chem.data.preprocessing module

class ml4chem.data.preprocessing.Preprocessing(preprocessor, purpose)[source]

Bases: object

A wrapper for preprocessing data with sklearn

This class is intended as a wrapper around sklearn. The idea is to make it easier to preprocess data without placing too much burden on users.

Parameters
  • preprocessor (tuple) – Tuple with structure: (‘name’, {kwargs}).

  • purpose (str) – Supported purposes are: ‘training’, ‘inference’.

Notes

The list of preprocessing modules available in sklearn, along with their options, can be found at:

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

If you need a preprocessor that is not implemented yet, create a bug report or follow the structure used in this class's source to implement it yourself (PRs are very welcome). In principle, all preprocessors can be implemented.
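The (‘name’, {kwargs}) tuple convention can be pictured as a lookup from a name to a class, followed by instantiation with the keyword arguments. The sketch below is an assumption about the general pattern, not ML4Chem's actual code: `TinyMinMaxScaler`, `REGISTRY`, and `build_preprocessor` are hypothetical names, and the scaler is a pure-Python stand-in for an sklearn preprocessor.

```python
class TinyMinMaxScaler:
    """Pure-Python stand-in for a min-max scaler (hypothetical, not sklearn)."""

    def __init__(self, feature_range=(0.0, 1.0)):
        self.low, self.high = feature_range

    def fit(self, rows):
        # Record the per-column minimum and maximum of the training rows.
        columns = list(zip(*rows))
        self.mins = [min(c) for c in columns]
        self.maxs = [max(c) for c in columns]
        return self

    def transform(self, rows):
        scaled = []
        for row in rows:
            new_row = []
            for value, lo, hi in zip(row, self.mins, self.maxs):
                span = hi - lo
                # Constant columns map to the lower bound to avoid dividing by zero.
                unit = (value - lo) / span if span else 0.0
                new_row.append(self.low + unit * (self.high - self.low))
            scaled.append(new_row)
        return scaled


# Hypothetical registry mapping lower-cased names to preprocessor classes.
REGISTRY = {"minmaxscaler": TinyMinMaxScaler}


def build_preprocessor(spec):
    """Instantiate a preprocessor from a ('name', {kwargs}) tuple."""
    name, kwargs = spec
    try:
        cls = REGISTRY[name.lower()]
    except KeyError:
        raise ValueError(f"Unsupported preprocessor: {name!r}") from None
    return cls(**kwargs)


scaler = build_preprocessor(("MinMaxScaler", {"feature_range": (-1.0, 1.0)}))
features = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
scaled = scaler.fit(features).transform(features)
print(scaled)
```

Under this pattern, supporting a new preprocessor would amount to adding one entry to the registry, which is one plausible reading of why all preprocessors can in principle be implemented.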

fit(stacked_features, scheduler)[source]

Fit features

Parameters
  • stacked_features (list) – List of stacked features.

  • scheduler (str) – The scheduler to be used in dask.

Returns

scaled_features – Scaled features using requested preprocessor.

Return type

list

save_to_file(preprocessor, path)[source]

Save the preprocessor object to file

Parameters
  • preprocessor (obj) – Preprocessing object

  • path (str) – Path to save .prep file.

set(purpose)[source]

Set a preprocessing method

Parameters

purpose (str) – Supported purposes are: ‘training’, ‘inference’.

Returns

Preprocessor object.

transform(raw_features)[source]

Transform features to scaled features

Given a Preprocessor object, return the scaled features.

Parameters

raw_features (list) – Unscaled features.

Returns

scaled_features – Scaled features using the scaler set in self.set().

Return type

list

ml4chem.data.serialization module

ml4chem.data.serialization.dump(data, filename='data.db')[source]

Serialize data

This function dumps data and ML4Chem dictionaries serialized with msgpack or torch (depending on the model).

Parameters
  • data (dict or array) – A dictionary or array containing data to be saved to file using msgpack.

  • filename (str) – Name of the file to save to disk.

ml4chem.data.serialization.load(filename)[source]

Load a msgpack file

Parameters

filename (str) – Path of file to load from disk.

ml4chem.data.utils module

ml4chem.data.utils.ase_to_xyz(atoms, comment='', file=True)[source]

Convert ASE to xyz

This function is useful for saving xyz data to a DataFrame.
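For readers unfamiliar with the xyz layout, the sketch below builds such a string from plain Python data: the atom count, a comment line, then one line per atom with its symbol and Cartesian coordinates. The function `to_xyz` is a hypothetical stand-in that takes symbols and positions directly rather than an ASE object; the number formatting is an assumption and may differ from what `ase_to_xyz` produces.

```python
def to_xyz(symbols, positions, comment=""):
    """Format symbols and Cartesian positions as an xyz string."""
    if len(symbols) != len(positions):
        raise ValueError("symbols and positions must have the same length")
    # Line 1: number of atoms. Line 2: free-form comment.
    lines = [str(len(symbols)), comment]
    for symbol, (x, y, z) in zip(symbols, positions):
        lines.append(f"{symbol} {x:.6f} {y:.6f} {z:.6f}")
    return "\n".join(lines) + "\n"


# A hydrogen molecule with a bond length of 0.74 angstrom.
xyz = to_xyz(["H", "H"], [(0, 0, 0), (0, 0, 0.74)], comment="H2")
print(xyz)
```

Because the result is a single string, it can be stored in one cell of a table, which fits the DataFrame use mentioned above.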

ml4chem.data.utils.split_data(images, training_name='training_images.traj', test_name='test_images.traj', randomize=True, test_set=20, logfile='data_split.log')[source]

Split Data

Parameters
  • images (str or object) – A path to an ASE trajectory file or a list of Atoms objects.

  • training_name (str, optional) – Name of the training set trajectory file, by default ‘training_images.traj’

  • test_name (str, optional) – Name of the test set file, by default ‘test_images.traj’

  • randomize (bool, optional) – Randomize indices of images, by default True

  • test_set (int, optional) – Percentage of the data to be used as the test set, by default 20

  • logfile (str, optional) – Log file name, by default ‘data_split.log’
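The effect of `randomize` and `test_set` can be sketched on bare indices. The function below is a minimal illustration under stated assumptions, not the library's implementation: `split_indices` and its `seed` argument are hypothetical, the rounding rule is a guess, and taking the test set from the front of the (shuffled) list is an arbitrary choice. Writing trajectory and log files is left out.

```python
import random


def split_indices(n_images, test_set=20, randomize=True, seed=None):
    """Split range(n_images) into (training, test) index lists.

    test_set is a percentage, mirroring the documented parameter; seed is a
    hypothetical addition for reproducibility.
    """
    if not 0 <= test_set <= 100:
        raise ValueError("test_set must be a percentage between 0 and 100")
    indices = list(range(n_images))
    if randomize:
        random.Random(seed).shuffle(indices)
    # Assumed rounding rule: nearest integer number of test images.
    n_test = int(round(n_images * test_set / 100))
    return indices[n_test:], indices[:n_test]


training, test = split_indices(10, test_set=20, seed=0)
print(len(training), len(test))
```

With 10 images and the default 20 percent, two indices go to the test set and eight to the training set; the two lists never share an index.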

Module contents