mlpack.preprocess_split

preprocess_split(...): Split Data

>>> from mlpack import preprocess_split

This utility takes a dataset and, optionally, labels, and splits them into a training set and a test set. Before the split, the points in the dataset are randomly reordered. The fraction of the dataset to be used as the test set can be specified with the 'test_ratio' parameter; the default is 0.2 (20%).
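
To make the shuffle-then-split behavior concrete, here is a minimal pure-Python sketch of the idea. It is not mlpack's implementation: the helper name `shuffle_split` is hypothetical, and rounding the test set size down is an assumption made only for illustration.

```python
import random

def shuffle_split(points, test_ratio=0.2, seed=None):
    """Randomly reorder `points`, then split off a test set.

    Hypothetical helper for illustration; not part of mlpack.
    """
    if not 0.0 <= test_ratio <= 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    rng = random.Random(seed)
    shuffled = list(points)  # copy so the caller's data is left untouched
    rng.shuffle(shuffled)    # random reordering happens before the split
    # Assumption: the test set size is rounded down.
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]  # (training, test)

train, test = shuffle_split(range(10), test_ratio=0.2, seed=42)
print(len(train), len(test))
```

With 10 points and the default ratio of 0.2, the sketch places 2 points in the test set and the remaining 8 in the training set.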

The output training and test matrices may be saved with the 'training' and 'test' output parameters.

Optionally, labels can also be split along with the data by specifying the 'input_labels' parameter. Splitting labels works the same way as splitting the data. The output training and test labels may be saved with the 'training_labels' and 'test_labels' output parameters, respectively.
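
One way to picture how labels stay aligned with their points is to shuffle a single list of indices and apply it to both. The sketch below is pure Python and only illustrative: the function `shuffle_split_with_labels` is hypothetical, and the rounding of the test set size is an assumption, not documented mlpack behavior. The dictionary keys mirror the output parameter names described on this page.

```python
import random

def shuffle_split_with_labels(points, labels, test_ratio=0.2, seed=None):
    """Split points and labels with one shared random ordering.

    Hypothetical helper for illustration; not part of mlpack.
    """
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)  # one permutation for both inputs
    # Assumption: the test set size is rounded down.
    n_test = int(len(order) * test_ratio)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return {
        "training": [points[i] for i in train_idx],
        "training_labels": [labels[i] for i in train_idx],
        "test": [points[i] for i in test_idx],
        "test_labels": [labels[i] for i in test_idx],
    }

# Each point's first coordinate equals its label, so alignment is easy to check.
points = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
labels = [0, 1, 2, 3, 4]
result = shuffle_split_with_labels(points, labels, test_ratio=0.4, seed=7)
```

Because both lists are indexed by the same permutation, every point in the result still sits next to its original label.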

As a simple example, to split the dataset 'X' into 'X_train' and 'X_test' with 60% of the data in the training set and 40% of the dataset in the test set, we could run

>>> output = preprocess_split(input=X, test_ratio=0.4)

>>> X_train = output['training']

>>> X_test = output['test']

If we had a dataset 'X' and associated labels 'y', and we wanted to split these into 'X_train', 'y_train', 'X_test', and 'y_test', with 30% of the data in the test set, we could run

>>> output = preprocess_split(input=X, input_labels=y, test_ratio=0.3)

>>> X_train = output['training']

>>> y_train = output['training_labels']

>>> X_test = output['test']

>>> y_test = output['test_labels']

## input options

- input (numpy matrix or arraylike, float dtype): [required] Matrix containing data.
- copy_all_inputs (bool): If specified, all input parameters will be deep copied before the method is run. This is useful for debugging problems where the input parameters are being modified by the algorithm, but can slow down the code.
- input_labels (numpy matrix or arraylike, int/long dtype): Matrix containing labels.
- seed (int): Random seed; if 0, std::time(NULL) is used as the seed. Default value 0.
- test_ratio (float): Fraction of the dataset to place in the test set. Default value 0.2.
- verbose (bool): Display informational messages and the full list of parameters and timers at the end of execution.
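
The 'seed' convention above (0 means a time-based seed, any other value is used directly) can be sketched in pure Python as follows. This only illustrates the convention: mlpack uses its own random number generator, and the helper names `resolve_seed` and `permutation` are hypothetical.

```python
import random
import time

def resolve_seed(seed=0):
    """Return the seed to use: a time-based value when `seed` is 0."""
    return int(time.time()) if seed == 0 else seed

def permutation(n, seed=0):
    """Return a shuffled ordering of range(n) for the given seed."""
    order = list(range(n))
    random.Random(resolve_seed(seed)).shuffle(order)
    return order

# A fixed nonzero seed reproduces the same ordering on every call.
first = permutation(8, seed=123)
second = permutation(8, seed=123)
print(first == second)
```

In practice this suggests passing a fixed nonzero 'seed' whenever a split needs to be reproducible, and leaving it at 0 when a different split on each run is acceptable.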

## output options

The return value from the binding is a dict containing the following elements:

- test (numpy matrix, float dtype): Matrix to save test data to.
- test_labels (numpy matrix, int dtype): Matrix to save test labels to.
- training (numpy matrix, float dtype): Matrix to save training data to.
- training_labels (numpy matrix, int dtype): Matrix to save training labels to.
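
Since the result is a plain dict, a small helper can unpack it into the familiar four-variable form. The sketch below is hypothetical (`unpack_split` is not part of mlpack) and uses a stand-in dict instead of a real return value so that it runs without mlpack installed. It assumes the label entries may be missing when no 'input_labels' were given, which this page does not specify, so it falls back to None for them.

```python
def unpack_split(output):
    """Return (X_train, X_test, y_train, y_test) from the result dict.

    Hypothetical helper for illustration; not part of mlpack.
    """
    return (
        output["training"],
        output["test"],
        # Assumption: label entries may be absent when no labels were passed.
        output.get("training_labels"),
        output.get("test_labels"),
    )

# Stand-in for the dict returned by preprocess_split(); values are made up.
fake_output = {
    "training": [[1.0, 2.0], [3.0, 4.0]],
    "test": [[5.0, 6.0]],
    "training_labels": [0, 1],
    "test_labels": [1],
}
X_train, X_test, y_train, y_test = unpack_split(fake_output)
```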