ml4chem.data package
Submodules
ml4chem.data.handler module
- class ml4chem.data.handler.Data(images, purpose=None)[source]
Bases:
object
A Data class
A well-designed data structure is essential for developing machine-learning models. In general, a model receives a data set (X) and a target vector (y). This class is intended to arrange both in a format that can be vectorized and used not only with neural networks but also with support vector machines.
The central object here is the data set.
- Parameters:
images (list or object) – List of images. Supported format is from ASE.
purpose (str) – Whether the data is for training or inference. Supported strings are: “training” and “inference”.
- get_data(purpose=None)[source]
A method to get data
- Parameters:
purpose (str) – The purpose of the data so that structure is prepared accordingly. Supported are: ‘training’, ‘inference’
- Returns:
self.images (dict) – Ordered dictionary of images, in the same order as the self.targets list.
self.targets (list) – Targets used for training the model.
- get_unique_element_symbols(images=None, purpose=None)[source]
Unique element symbols in the data set
- Parameters:
images (list) – List of images (ASE objects).
purpose (str) – The supported categories are: ‘training’, ‘inference’.
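The idea of collecting unique element symbols can be sketched without ML4Chem or ASE installed. In this minimal illustration each image is represented by a plain list of chemical symbols, similar to what ASE's Atoms.get_chemical_symbols() returns. The function name unique_element_symbols and the sorted output are assumptions made for the example, not the library's implementation.

```python
# Hypothetical stand-in: each "image" is a plain list of chemical symbols,
# mimicking the list returned by ASE's Atoms.get_chemical_symbols().
def unique_element_symbols(images):
    """Return the sorted unique chemical symbols found across all images."""
    symbols = set()
    for image in images:
        symbols.update(image)
    return sorted(symbols)


images = [
    ["O", "H", "H"],  # water
    ["C", "O", "O"],  # carbon dioxide
]
print(unique_element_symbols(images))  # ['C', 'H', 'O']
```

Sorting is only one possible convention; it gives a reproducible element order, which helps when the symbols are later used to index per-element models.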
- is_valid_structure(images)[source]
Check if the data has a valid structure
- Parameters:
images (list of atoms) – List of images.
- Returns:
valid – Whether or not the structure is valid.
- Return type:
bool
ml4chem.data.parser module
- class ml4chem.data.parser.SinglePointCalculator(implemented_properties=None)[source]
Bases:
Calculator
A SinglePointCalculator class
This class creates a fake calculator that is used to populate calc.results dictionaries in ASE objects.
- Parameters:
implemented_properties (list) – List with supported properties.
- ml4chem.data.parser.ani_to_ase(hdf5file, data_keys, trajfile=None)[source]
ANI to ASE
- Parameters:
hdf5file (hdf5, list) – hdf5 file loaded using pyanitools (or list of them).
data_keys (list) – List of keys to extract data.
trajfile (str, optional) – Name of trajectory file to be saved, by default None.
- Returns:
A list of Atoms objects.
- Return type:
atoms
ml4chem.data.preprocessing module
- class ml4chem.data.preprocessing.Preprocessing(preprocessor, purpose)[source]
Bases:
object
A wrapper for preprocessing data with sklearn
This class is intended as a wrapper around sklearn that makes it easier to preprocess data without placing much burden on users.
- Parameters:
preprocessor (tuple) – Tuple with structure: (‘name’, {kwargs}).
purpose (str) – Supported purposes are: ‘training’, ‘inference’.
Notes
The list of preprocessing modules available on sklearn and options can be found at:
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
If you need a preprocessor that is not implemented yet, file a bug report or follow the structure of the existing preprocessors in the module source to implement it yourself (PRs are very welcome). In principle, all preprocessors can be implemented.
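To illustrate the (‘name’, {kwargs}) tuple convention, here is a minimal sketch of how such a tuple might be unpacked and dispatched to a class. Everything in it is a hypothetical stand-in: the toy MinMaxScaler, the REGISTRY dictionary, and build_preprocessor are not part of ML4Chem or sklearn, and the real wrapper may resolve names differently.

```python
class MinMaxScaler:
    """Toy min-max scaler (hypothetical stand-in, not sklearn's class)."""

    def __init__(self, feature_range=(0.0, 1.0)):
        self.low, self.high = feature_range

    def fit_transform(self, values):
        vmin, vmax = min(values), max(values)
        span = (vmax - vmin) or 1.0  # avoid division by zero for constant input
        scale = self.high - self.low
        return [self.low + scale * (v - vmin) / span for v in values]


# Hypothetical registry mapping lowercase preprocessor names to classes.
REGISTRY = {"minmaxscaler": MinMaxScaler}


def build_preprocessor(preprocessor):
    """Unpack a ('name', {kwargs}) tuple and instantiate the matching class."""
    name, kwargs = preprocessor
    try:
        cls = REGISTRY[name.lower()]
    except KeyError:
        raise ValueError(f"Unsupported preprocessor: {name!r}") from None
    return cls(**(kwargs or {}))


scaler = build_preprocessor(("MinMaxScaler", {"feature_range": (-1.0, 1.0)}))
scaled = scaler.fit_transform([0.0, 5.0, 10.0])
print(scaled)  # [-1.0, 0.0, 1.0]
```

Keeping the keyword arguments in a dictionary lets the same tuple shape describe any preprocessor, at the cost of deferring argument validation to the class being instantiated.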
- fit(stacked_features, scheduler)[source]
Fit features
- Parameters:
stacked_features (list) – List of stacked features.
scheduler (str) – The scheduler to be used in dask.
- Returns:
scaled_features – Scaled features using requested preprocessor.
- Return type:
list
- save_to_file(preprocessor, path)[source]
Save the preprocessor object to file
- Parameters:
preprocessor (obj) – Preprocessing object
path (str) – Path to save .prep file.
ml4chem.data.serialization module
- ml4chem.data.serialization.dump(data, filename='data.db')[source]
Serialize data
This function dumps data and ML4Chem dictionaries serialized with msgpack or torch (depending on the model).
- Parameters:
data (dict or array) – A dictionary or array containing data to be saved to file using msgpack.
filename (str) – Name of the file to save to disk.
ml4chem.data.utils module
- ml4chem.data.utils.ase_to_xyz(atoms, comment='', file=True)[source]
Convert ASE to xyz
This function is useful for storing XYZ representations of structures in a DataFrame.
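For readers unfamiliar with the XYZ format, the sketch below builds an XYZ string from symbols and Cartesian positions: an atom count, a comment line, then one line per atom. The helper to_xyz, its signature, and the six-decimal formatting are assumptions for illustration; it does not reproduce ase_to_xyz or the behavior of its file argument.

```python
def to_xyz(symbols, positions, comment=""):
    """Build an XYZ-format string (hypothetical helper, not ML4Chem's API)."""
    # Line 1: number of atoms. Line 2: free-form comment.
    lines = [str(len(symbols)), comment]
    # Remaining lines: element symbol followed by x, y, z coordinates.
    for symbol, (x, y, z) in zip(symbols, positions):
        lines.append(f"{symbol} {x:.6f} {y:.6f} {z:.6f}")
    return "\n".join(lines) + "\n"


xyz = to_xyz(["H", "H"], [(0.0, 0.0, 0.0), (0.0, 0.0, 0.74)], comment="H2")
print(xyz)
```

Because the result is a single string, it can be stored directly in one cell of a table, one row per structure.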
- ml4chem.data.utils.split_data(images, training_name='training_images.traj', test_name='test_images.traj', randomize=True, test_set=20, logfile='data_split.log')[source]
Split Data
- Parameters:
images (str or object) – A path to an ASE trajectory file or a list of Atoms objects.
training_name (str, optional) – Name of the training set trajectory file, by default ‘training_images.traj’
test_name (str, optional) – Name of the test set file, by default ‘test_images.traj’
randomize (bool, optional) – Randomize indices of images, by default True
test_set (int, optional) – Percentage of the Data to be used as test set, by default 20
logfile (str, optional) – Log file name, by default ‘data_split.log’
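The meaning of test_set as a percentage can be sketched with the standard library alone. The function below splits image indices into training and test subsets. The name split_indices, the seed parameter, and the truncating rounding are assumptions for this example; split_data itself may round differently and also writes the trajectory and log files, which this sketch omits.

```python
import random


def split_indices(n_images, test_set=20, randomize=True, seed=None):
    """Split range(n_images) into (training, test) index lists.

    test_set is the percentage of images assigned to the test subset.
    seed is a hypothetical extra parameter for reproducible shuffling.
    """
    indices = list(range(n_images))
    if randomize:
        random.Random(seed).shuffle(indices)
    # Truncate toward zero; the library's rounding rule may differ.
    n_test = int(n_images * test_set / 100)
    return indices[n_test:], indices[:n_test]


train, test = split_indices(10, test_set=20, seed=0)
print(len(train), len(test))  # 8 2
```

With randomize=False the first n_test indices become the test set, which makes the split deterministic but sensitive to any ordering in the original data.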