mlpack

Data loading and I/O

mlpack provides the Load() and Save() functions to load and save Armadillo matrices (e.g. numeric and categorical datasets) and any mlpack object via the cereal serialization toolkit. A number of other utilities related to loading and saving data and objects are also available. The Load() and Save() functions have numerous options to configure load/save behavior and format detection/selection.

Load()

For some types of data, it is also possible to load multiple images at once from a set of files:


Simple example:

// See https://datasets.mlpack.org/iris.csv.
arma::mat x;
mlpack::Load("iris.csv", x);

std::cout << "Loaded iris.csv; size " << x.n_rows << " x " << x.n_cols << "."
    << std::endl;

Among other things, the file format can be easily specified:

// See https://datasets.mlpack.org/iris.csv.
arma::mat x;
mlpack::Load("iris.csv", x, mlpack::CSV);

std::cout << "Loaded iris.csv; size " << x.n_rows << " x " << x.n_cols << "."
    << std::endl;

See also the other examples for each supported load type:

Save()

Note: when saving images, it is possible to save a single matrix X into multiple image files. See image data for more details.


Simple example:

// Generate a 5-dimensional matrix of random data.
arma::mat dataset(5, 1000, arma::fill::randu);
mlpack::Save("dataset.csv", dataset);

std::cout << "Saved random data to 'dataset.csv'." << std::endl;

Among other things, the file format can be easily specified manually:

// Generate a 5-dimensional matrix of random data.
arma::mat dataset(5, 1000, arma::fill::randu);
mlpack::Save("dataset.csv", dataset, mlpack::CSV);

std::cout << "Saved random data to 'dataset.csv'." << std::endl;

See also the other examples for each supported save type:

Types

Support is available for loading and saving several kinds of data. Given an object X to be loaded or saved:

DataOptions

The Load() and Save() functions allow specifying options in a standalone manner or with an instantiated DataOptions object. Standalone options provide convenience:

// Individual standalone options can be combined with the + operator.
mlpack::Load("filename.csv", X, mlpack::CSV + mlpack::Fatal);
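As a rough illustration of how standalone options can be combined with `+`, the sketch below merges two partially filled option sets. `SketchOptions`, `SketchCSV`, and `SketchFatal` are hypothetical names invented for this example; this is not mlpack's implementation, only one plausible way such an operator could behave.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for an options object; NOT mlpack's DataOptions.
struct SketchOptions
{
  std::optional<bool> fatal;          // Unset means "use the default".
  std::optional<std::string> format;  // Unset means "autodetect".
};

// Combine two option sets: a field set on either side is kept.  If both sides
// set the same field, this sketch requires the values to agree.
inline SketchOptions operator+(const SketchOptions& a, const SketchOptions& b)
{
  SketchOptions out = a;
  if (b.fatal.has_value())
  {
    assert(!a.fatal.has_value() || *a.fatal == *b.fatal);
    out.fatal = b.fatal;
  }
  if (b.format.has_value())
  {
    assert(!a.format.has_value() || *a.format == *b.format);
    out.format = b.format;
  }
  return out;
}

// Hypothetical standalone options, analogous in spirit to CSV and Fatal.
inline const SketchOptions SketchCSV{std::nullopt, std::string("csv")};
inline const SketchOptions SketchFatal{true, std::nullopt};
```

With these definitions, `SketchCSV + SketchFatal` yields an option set with both the format and the failure behavior set, while any field left unset falls back to its default.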

The use of an instantiated DataOptions (or a child class relevant to the type of data being loaded) allows more complex options to be configured and for metadata resulting from a load or save operation to be stored:

// Different data types will use DataOptions, MatrixOptions, TextOptions,
// ModelOptions, or other types.  See the documentation for each class below.
mlpack::ImageOptions opts;
opts.Channels() = 1; // Force loading in grayscale.
mlpack::Load("filename.png", X, opts);
// Now, `opts.Width()` and `opts.Height()` will store the size of the loaded
// image.

The set of allowed standalone options differs depending on the type of data being loaded or saved; if using an instantiated options object, so does the type of opts:

DataOptions

The DataOptions class is the base class from which all options classes specific to data types are derived. It is default-constructible and provides the .Fatal() and .Format() members.

Any members or standalone operators available in DataOptions are also available when using other options types (e.g. TextOptions, ImageOptions, etc.).

DataOptions standalone operators and members

The options below can be used as standalone operators to the Load() and Save() functions, or as calls to set members of an instantiated DataOptions object.

Standalone operator | Member function | Available for | Description
Load/save behavior.
Fatal | opts.Fatal() = true; | All data types. | A std::runtime_error will be thrown on failure.
NoFatal (default) | opts.Fatal() = false; | All data types. | false will be returned on failure. A warning will also be printed if MLPACK_PRINT_WARN is defined.
Formats.
AutoDetect (default) | opts.Format() = mlpack::FileType::AutoDetect; | All data types. | The format of the file is autodetected using the extension of the filename and (if loading) by inspecting the file contents.
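The two failure modes in the table can be pictured with a small standard-library sketch. `OpenForReading` is a hypothetical helper written for this illustration and is not part of mlpack; it only mirrors the convention of throwing a std::runtime_error in the Fatal case and returning false otherwise.

```cpp
#include <cassert>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper, NOT part of mlpack: report a failure either by
// throwing (Fatal-like) or by returning false (NoFatal-like).
inline bool OpenForReading(const std::string& filename, const bool fatal)
{
  assert(!filename.empty());

  std::ifstream f(filename);
  if (f.is_open())
    return true;

  if (fatal)
    throw std::runtime_error("could not open '" + filename + "'");

  return false;
}
```

A caller using the NoFatal-like mode checks the returned bool, while a caller using the Fatal-like mode wraps the call in a try/catch block (or lets the exception propagate).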

MatrixOptions

The MatrixOptions class represents options specific to matrix types (numeric and categorical data). MatrixOptions is derived from DataOptions and thus any standalone operators or member functions from DataOptions (e.g. Fatal, NoFatal, and AutoDetect) can also be used with MatrixOptions.

Note: closely related is the TextOptions class, specifically for loading numeric or categorical data from plaintext formats. MatrixOptions is used for non-plaintext numeric data formats.

MatrixOptions standalone operators and members

The options below can be used as standalone operators to the Load() and Save() functions, or as calls to set members of an instantiated MatrixOptions object.

If an option is given that does not match the type of data being loaded or saved, an exception will be thrown if Fatal() is set; otherwise, a warning will be printed if MLPACK_PRINT_WARN is defined.

Standalone operator | Member function | Available for | Description
Load/save behavior.
Transpose (default) | opts.Transpose() = true; | Numeric and categorical data. | The matrix will be transposed to/from column-major form on load/save.
NoTranspose | opts.Transpose() = false; | Numeric and categorical data. | The matrix will not be transposed to column-major form on load/save.
Formats.
PGM | opts.Format() = mlpack::FileType::PGMBinary; | Numeric data. | Load/save in the PGM image format; data should have values in the range [0, 255]. The size of the image will be the same as the size of the matrix (after any transpose is applied).
PPM | opts.Format() = mlpack::FileType::PPMBinary; | Numeric data. | Load/save in the PPM image format; data should have values in the range [0, 255]. The size of the image will be the same as the size of the matrix (after any transpose is applied).
HDF5 | opts.Format() = mlpack::FileType::HDF5Binary; | Numeric data. | Load/save in the HDF5 binary format; only available if Armadillo is configured with HDF5 support.
ArmaBin | opts.Format() = mlpack::FileType::ArmaBinary; | Numeric data. | Load/save in the space-efficient arma_binary format (packed binary data).
RawBinary | opts.Format() = mlpack::FileType::RawBinary; | Numeric data. | Load/save as packed binary data with no header and no size information; data will be loaded as a single column vector (not recommended).
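To make the effect of the Transpose option concrete, the following sketch parses text with one point per line into a column-major matrix, so that each line of the file becomes one column. `ColMajorMatrix` and `ParsePointsPerLine` are hypothetical names used only for this illustration; they are not Armadillo or mlpack types, and mlpack's actual parser is more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container, NOT an Armadillo type: a dense matrix stored in
// column-major order, so all values of one column are contiguous.
struct ColMajorMatrix
{
  std::size_t nRows = 0;
  std::size_t nCols = 0;
  std::vector<double> values;

  double operator()(const std::size_t r, const std::size_t c) const
  {
    assert(r < nRows && c < nCols);
    return values[c * nRows + r];
  }
};

// Parse comma-separated text with one point per line.  Each line becomes one
// column of the result, which is the effect of transposing on load.
inline ColMajorMatrix ParsePointsPerLine(const std::string& text)
{
  ColMajorMatrix m;
  std::istringstream lines(text);
  std::string line;
  while (std::getline(lines, line))
  {
    if (line.empty())
      continue;

    std::istringstream fields(line);
    std::string field;
    std::size_t count = 0;
    while (std::getline(fields, field, ','))
    {
      m.values.push_back(std::stod(field));
      ++count;
    }

    if (m.nCols == 0)
      m.nRows = count;
    assert(count == m.nRows); // Every point needs the same dimensionality.
    ++m.nCols;
  }
  return m;
}
```

For the text "1,2,3\n4,5,6\n", the result has 3 rows (dimensions) and 2 columns (points), matching the convention that `n_cols` is the number of points.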

TextOptions

The TextOptions class represents options specific to matrix types stored in plaintext formats (numeric and categorical data). TextOptions is a child class and thus any standalone operators or members from its parent classes are also available:

TextOptions standalone operators and members

The options below can be used as standalone operators to the Load() and Save() functions, or as calls to set members of an instantiated TextOptions object.

If an option is given that does not match the type of data being loaded or saved, an exception will be thrown if Fatal() is set; otherwise, a warning will be printed if MLPACK_PRINT_WARN is defined.

Standalone operator | Member function | Available for | Description
Load/save behavior.
HasHeaders | opts.HasHeaders() = true; | Numeric and categorical data, only for the CSV format. | If true, the first row of the file contains column names instead of data. See note below.
Categorical | opts.Categorical() = true; | Categorical data, only for the CSV or ARFF formats. | If true, the data to be loaded or saved is mixed categorical data. See note below.
Semicolon | opts.Semicolon() = true; | Numeric and categorical data, only for the CSV, CoordAscii, and RawAscii formats. | If true, the field separator in the file is a semicolon instead of a comma.
MissingToNan | opts.MissingToNan() = true; | Numeric and categorical data. | If true, any missing data elements will be represented as NaN instead of 0.
Formats.
CSV | opts.Format() = mlpack::FileType::CSVASCII; | Numeric and categorical data. | CSV or TSV format. If loading a sparse matrix and the CSV has three columns, the data is interpreted as a coordinate list.
ArmaAscii | opts.Format() = mlpack::FileType::ArmaASCII; | Numeric data. | Space-separated values as saved by Armadillo with the arma_ascii format.
RawAscii | opts.Format() = mlpack::FileType::RawASCII; | Numeric data. | Space-separated values or tab-separated values (TSV) with no header.
CoordAscii | opts.Format() = mlpack::FileType::CoordAscii; | Numeric data where X is a sparse matrix (e.g. arma::sp_mat). | Coordinate list format for sparse data (see coord_ascii).
ARFF | opts.Format() = mlpack::FileType::ARFFAscii; | Categorical data. | ARFF filetype. Used specifically to load mixed categorical datasets. See ARFF documentation. Only for loading.
Metadata.
(n/a) | opts.Headers() | Numeric and categorical data. | Returns a std::vector<std::string> with headers detected after loading a CSV.
(n/a) | opts.DatasetInfo() | Categorical data. | Returns a DatasetInfo with dimension information after loading, or that will be used for dimension information during saving.

Notes:

ImageOptions

The ImageOptions class represents options specific to images. ImageOptions is a child class of DataOptions and thus any standalone operators or member functions from DataOptions (e.g. Fatal, NoFatal, and AutoDetect) can also be used with ImageOptions.

ImageOptions standalone operators and members

The options below can be used as standalone operators to the Load() and Save() functions, or as calls to set members of an instantiated ImageOptions object.

If an option is given that does not match the type of data being loaded or saved, an exception will be thrown if Fatal() is set; otherwise, a warning will be printed if MLPACK_PRINT_WARN is defined.

Standalone operator | Member function | Available for | Description
Formats.
Image | opts.Format() = mlpack::FileType::ImageType; | Image data. | Load in the image format detected by the header of the file; save in the image format specified by the filename's extension.
PNG | opts.Format() = mlpack::FileType::PNG; | Image data. | Load/save as a PNG image.
JPG | opts.Format() = mlpack::FileType::JPG; | Image data. | Load/save as a JPEG image.
TGA | opts.Format() = mlpack::FileType::TGA; | Image data. | Load/save as a TGA image.
BMP | opts.Format() = mlpack::FileType::BMP; | Image data. | Load/save as a BMP image.
PSD | opts.Format() = mlpack::FileType::PSD; | Image data. | Load as a PSD (Photoshop) image. Only for loading.
GIF | opts.Format() = mlpack::FileType::GIF; | Image data. | Load as a GIF image. Only for loading.
PIC | opts.Format() = mlpack::FileType::PIC; | Image data. | Load as a PIC (PICtor) image. Only for loading.
PNM | opts.Format() = mlpack::FileType::PNM; | Image data. | Load as a PNM (Portable Anymap) image. Only for loading.
Save behavior.
(n/a) | opts.Quality() | Image data with JPEG format. | Desired JPEG quality level for saving (a size_t in the range from 0 to 100).
Metadata.
(n/a) | opts.Height() | Image data. | Returns a size_t representing the height in pixels of the loaded image(s), or the desired height in pixels for saving.
(n/a) | opts.Width() | Image data. | Returns a size_t representing the width in pixels of the loaded image(s), or the desired width in pixels for saving.
(n/a) | opts.Channels() | Image data. | Returns a size_t representing the number of color channels of the loaded image(s), or the desired number of color channels for saving.

Notes:

ModelOptions

The ModelOptions class represents options specific to mlpack models and objects. ModelOptions is a child class of DataOptions and thus any standalone operators or member functions from DataOptions (e.g. Fatal, NoFatal, and AutoDetect) can also be used with ModelOptions.

ModelOptions standalone operators and members

The options below can be used as standalone operators to the Load() and Save() functions, or as calls to set members of an instantiated ModelOptions object.

If an option is given that does not match the type of data being loaded or saved, an exception will be thrown if Fatal() is set; otherwise, a warning will be printed if MLPACK_PRINT_WARN is defined.

Standalone operator | Member function | Available for | Description
Formats.
BIN | opts.Format() = mlpack::FileType::BIN; | mlpack models and objects. | Load/save the object using an efficient packed binary format.
JSON | opts.Format() = mlpack::FileType::JSON; | mlpack models and objects. | Load/save the object using human- and machine-readable JSON.
XML | opts.Format() = mlpack::FileType::XML; | mlpack models and objects. | Load/save the object using XML (warning: may be very large).

Notes:

Formats

The Load() and Save() functions support numerous different formats for loading and saving. Not all formats are relevant for all types of data. The table below lists standalone options that can be used to specify the format, as well as member functions for a DataOptions object.

When AutoDetect (the default) is specified as the format, the actual file format is auto-detected using the filename's extension and (if loading) by inspecting the file contents. Accepted filename extensions for each type are given in the table.

Standalone operator | Member function | Filename extensions | Available for | Description
AutoDetect (default) | opts.Format() = FileType::AutoDetect; | (n/a) | All data types. | The format of the file is autodetected as one of the formats below.
CSV | opts.Format() = FileType::CSVASCII; | .csv, .tsv | Numeric and categorical data. | CSV or TSV format. If loading a sparse matrix and the CSV has three columns, the data is interpreted as a coordinate list.
ArmaAscii | opts.Format() = FileType::ArmaASCII; | .txt, .csv | Numeric data. | Space-separated values as saved by Armadillo with the arma_ascii format.
RawAscii | opts.Format() = FileType::RawASCII; | .txt | Numeric data. | Space-separated values or tab-separated values (TSV) with no header.
CoordAscii | opts.Format() = FileType::CoordAscii; | .txt (if X is sparse) | Numeric data where X is a sparse matrix (e.g. arma::sp_mat). | Coordinate list format for sparse data (see coord_ascii).
ARFF | opts.Format() = FileType::ARFFAscii; | .arff | Categorical data. | ARFF filetype. Used specifically to load mixed categorical datasets. See ARFF documentation. Only for loading.
PGM | opts.Format() = FileType::PGMBinary; | .pgm | Numeric data. | Load/save in the PGM image format; data should have values in the range [0, 255]. The size of the image will be the same as the size of the matrix (after any transpose is applied).
PPM | opts.Format() = FileType::PPMBinary; | .ppm | Numeric data. | Load/save in the PPM image format; data should have values in the range [0, 255]. The size of the image will be the same as the size of the matrix (after any transpose is applied).
HDF5 | opts.Format() = FileType::HDF5Binary; | .h5, .hdf5, .hdf, .he5 | Numeric data. | Load/save in the HDF5 binary format; only available if Armadillo is configured with HDF5 support.
ArmaBin | opts.Format() = FileType::ArmaBinary; | .bin (if X is an Armadillo type) | Numeric data. | Load/save in the space-efficient arma_binary format (packed binary data).
RawBinary | opts.Format() = FileType::RawBinary; | (none) | Numeric data. | Load/save as packed binary data with no header and no size information; data will be loaded as a single column vector (not recommended).
Image | opts.Format() = FileType::ImageType; | (n/a) | Image data. | Load in the image format detected by the header of the file; save in the image format specified by the filename's extension.
PNG | opts.Format() = FileType::PNG; | .png | Image data. | Load/save as a PNG image.
JPG | opts.Format() = FileType::JPG; | .jpg, .jpeg | Image data. | Load/save as a JPEG image.
TGA | opts.Format() = FileType::TGA; | .tga | Image data. | Load/save as a TGA image.
BMP | opts.Format() = FileType::BMP; | .bmp | Image data. | Load/save as a BMP image.
PSD | opts.Format() = FileType::PSD; | .psd | Image data. | Load as a PSD (Photoshop) image. Only for loading.
GIF | opts.Format() = FileType::GIF; | .gif | Image data. | Load as a GIF image. Only for loading.
PIC | opts.Format() = FileType::PIC; | .pic | Image data. | Load as a PIC (PICtor) image. Only for loading.
PNM | opts.Format() = FileType::PNM; | .pnm | Image data. | Load as a PNM (Portable Anymap) image. Only for loading.
BIN | opts.Format() = FileType::BIN; | .bin | mlpack models and objects. | Load/save the object using an efficient packed binary format.
JSON | opts.Format() = FileType::JSON; | .json | mlpack models and objects. | Load/save the object using human- and machine-readable JSON.
XML | opts.Format() = FileType::XML; | .xml | mlpack models and objects. | Load/save the object using XML (warning: may be very large).
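The extension column of the table can be read as a lookup from filename extension to candidate formats. The sketch below encodes that lookup; `CandidateFormats` is a hypothetical helper, not mlpack's detection logic, and it assumes lowercase extensions. Where an extension appears in more than one row (.csv, .txt, .bin), several candidates are returned, which is where inspecting the file contents or the type of X would be needed to decide.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper, NOT mlpack's detection logic: return the standalone
// option names whose table row lists the extension of `filename`.
inline std::vector<std::string> CandidateFormats(const std::string& filename)
{
  assert(!filename.empty());

  static const std::map<std::string, std::vector<std::string>> table = {
    { "csv",  { "CSV", "ArmaAscii" } },
    { "tsv",  { "CSV" } },
    { "txt",  { "ArmaAscii", "RawAscii", "CoordAscii" } },
    { "arff", { "ARFF" } },
    { "pgm",  { "PGM" } },
    { "ppm",  { "PPM" } },
    { "h5",   { "HDF5" } },
    { "hdf5", { "HDF5" } },
    { "hdf",  { "HDF5" } },
    { "he5",  { "HDF5" } },
    { "bin",  { "ArmaBin", "BIN" } },
    { "png",  { "PNG" } },
    { "jpg",  { "JPG" } },
    { "jpeg", { "JPG" } },
    { "tga",  { "TGA" } },
    { "bmp",  { "BMP" } },
    { "psd",  { "PSD" } },
    { "gif",  { "GIF" } },
    { "pic",  { "PIC" } },
    { "pnm",  { "PNM" } },
    { "json", { "JSON" } },
    { "xml",  { "XML" } }
  };

  // Use the text after the last '.' as the extension.
  const std::size_t dot = filename.rfind('.');
  if (dot == std::string::npos)
    return {};

  const auto it = table.find(filename.substr(dot + 1));
  if (it == table.end())
    return {};

  return it->second;
}
```

When more than one candidate is returned, passing an explicit format option (such as CSV or ArmaBin) removes the ambiguity.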

Numeric data

Standard numeric data is represented in mlpack as a column-major matrix and a variety of formats for loading and saving are supported.

Numeric data load/save examples

Load two datasets, print information about them, modify them, and save them back to disk.

// Throw an exception if loading fails with the Fatal option.

// See https://datasets.mlpack.org/satellite.train.csv.
arma::mat dataset;
mlpack::Load("satellite.train.csv", dataset, mlpack::Fatal);

// See https://datasets.mlpack.org/satellite.train.labels.csv.
arma::Row<size_t> labels;
mlpack::Load("satellite.train.labels.csv", labels, mlpack::Fatal);

// Print information about the data.
std::cout << "The data in 'satellite.train.csv' has: " << std::endl;
std::cout << " - " << dataset.n_cols << " points." << std::endl;
std::cout << " - " << dataset.n_rows << " dimensions." << std::endl;

std::cout << "The labels in 'satellite.train.labels.csv' have: " << std::endl;
std::cout << " - " << labels.n_elem << " labels." << std::endl;
std::cout << " - A maximum label of " << labels.max() << "." << std::endl;
std::cout << " - A minimum label of " << labels.min() << "." << std::endl;

// Modify and save the data.  Add 2 to the data and drop the last column.
dataset += 2;
dataset.shed_col(dataset.n_cols - 1);
labels.shed_col(labels.n_cols - 1);

// Don't throw an exception if saving fails.  Technically there is no need to
// explicitly specify NoFatal, since it is the default.
mlpack::Save("satellite.train.mod.csv", dataset, mlpack::NoFatal);
mlpack::Save("satellite.train.labels.mod.csv", labels, mlpack::NoFatal);

Load a dataset stored in a binary format and save it as a CSV.

// See https://datasets.mlpack.org/iris.bin.
arma::mat dataset;
mlpack::Load("iris.bin",
    dataset, mlpack::Fatal + mlpack::ArmaBin);

// Save it back to disk as a CSV.
mlpack::Save("iris.converted.csv", dataset, mlpack::CSV);

Load a dataset that has a semicolon as a separator instead of a comma.

// First write the semicolon file to disk.
std::fstream f;
f.open("semicolon.csv", std::fstream::out);
f << "1; 2; 3; 4" << std::endl;
f << "5; 6; 7; 8" << std::endl;
f << "9; 10; 11; 12" << std::endl;

// Now create a TextOptions and specify that the separator is a semicolon.
// Since all of the elements are integers, we load into an `arma::umat` (a
// matrix that holds unsigned integers) instead of an `arma::mat`.
arma::umat dataset;
mlpack::TextOptions opts;
opts.Semicolon() = true;

// Note that instead of `opts` we could just specify `Semicolon` instead!
mlpack::Load("semicolon.csv", dataset, opts);
std::cout << "The data in 'semicolon.csv' has: " << std::endl;
std::cout << " - " << dataset.n_cols << " points." << std::endl;
std::cout << " - " << dataset.n_rows << " dimensions." << std::endl;

Load a dataset with missing elements, and replace the missing elements with NaN using the MissingToNan option.

// First write a CSV file to disk with some missing values.
std::fstream f;
f.open("missing_to_nan.csv", std::fstream::out);
// Missing 2 value in the first row.
f << "1, , 3, 4" << std::endl;
f << "5, 6, 7, 8" << std::endl;
f << "9, 10, 11, 12" << std::endl;

arma::mat dataset;
mlpack::TextOptions opts;
opts.MissingToNan() = true;

// Note that instead of `opts` we could just specify `MissingToNan` instead!
mlpack::Load("missing_to_nan.csv", dataset, opts);
// Print information about the data.
std::cout << "Loaded data:" << std::endl;
std::cout << dataset;
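The behavior of the separator and missing-value options on a single line of text can be sketched with the standard library alone. `ParseLine` below is a hypothetical helper, not mlpack's parser; it assumes that a field which is empty or contains only whitespace counts as missing.

```cpp
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, NOT mlpack's parser: split one line on `separator` and
// convert each field to a double.  A field that is empty or only whitespace
// is treated as missing; it becomes NaN if `missingToNan` is true, else 0.
inline std::vector<double> ParseLine(const std::string& line,
                                     const char separator,
                                     const bool missingToNan)
{
  const double missing = missingToNan ?
      std::numeric_limits<double>::quiet_NaN() : 0.0;

  std::vector<double> out;
  std::istringstream fields(line);
  std::string field;
  while (std::getline(fields, field, separator))
  {
    const bool blank =
        field.find_first_not_of(" \t\r") == std::string::npos;
    out.push_back(blank ? missing : std::stod(field));
  }
  return out;
}
```

Calling `ParseLine("1, , 3, 4", ',', true)` gives four values with NaN in the second position, while `ParseLine("5; 6; 7; 8", ';', false)` handles the semicolon-separated case.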

Load a CSV into a 32-bit floating point matrix and print the headers (column names).

// See https://datasets.mlpack.org/Admission_Predict.csv.
arma::fmat dataset;
// We have to make a TextOptions object so that we can recover the headers.
mlpack::TextOptions opts;
opts.Format() = mlpack::FileType::CSVASCII;
opts.HasHeaders() = true;
mlpack::Load("Admission_Predict.csv", dataset, opts);

std::cout << "Found " << opts.Headers().size() << " columns." << std::endl;
for (size_t i = 0; i < opts.Headers().size(); ++i)
{
  std::cout << " - Column " << i << ": '" << opts.Headers()[i] << "'."
      << std::endl;
}

Load a CSV containing a coordinate list into a sparse matrix and print the overall size of the loaded matrix.

// See https://datasets.mlpack.org/movielens-100k.csv.
arma::sp_mat dataset;
// A 3-column CSV into a sparse matrix is interpreted as a coordinate list.
mlpack::Load("movielens-100k.csv", dataset, mlpack::CSV);

std::cout << "Loaded data from movielens-100k.csv; matrix size: "
    << dataset.n_rows << " x " << dataset.n_cols << "." << std::endl;

Mixed categorical data

mlpack supports mixed categorical data, i.e., data where some dimensions take only categorical values (e.g. 0, 1, 2, etc.). When using mlpack, string data and other non-numerical data must be mapped to categorical values and represented as part of an arma::mat or other matrix type. Category metadata is stored in an auxiliary DatasetInfo object.

Categorical data is supported by a number of mlpack algorithms, including DecisionTree, HoeffdingTree, and RandomForest.

DatasetInfo

mlpack represents categorical data via the use of the auxiliary DatasetInfo object, which stores information about which dimensions are numeric or categorical and allows conversion from the original category values to the numeric values used to represent those categories.

For loading and saving categorical data, an instantiated TextOptions must be passed to Load() or Save(); this object contains a DatasetInfo object, accessible via the .DatasetInfo() method; e.g., opts.DatasetInfo().

Accessing and setting properties

This documentation uses info as the name of the DatasetInfo object, but if a categorical dataset has been loaded with Load(), it is instead suggested to use opts.DatasetInfo() in place of info.


Map to and from numeric values


Categorical data load/save examples

Load and manipulate an ARFF file.

// Load a categorical dataset.
arma::mat dataset;

// Define a TextOptions to load categorical data.
mlpack::TextOptions opts;
opts.Fatal() = true;
opts.Categorical() = true;

// See https://datasets.mlpack.org/covertype.train.arff.
mlpack::Load("covertype.train.arff", dataset, opts);

// Print information about the data.
std::cout << "The data in 'covertype.train.arff' has: " << std::endl;
std::cout << " - " << dataset.n_cols << " points." << std::endl;
std::cout << " - " << opts.DatasetInfo().Dimensionality() << " dimensions."
    << std::endl;

arma::Row<size_t> labels;
// We need a second options object, since we are loading a different data
// type from a file with a different extension.
mlpack::TextOptions labelOpts;
labelOpts.Fatal() = true;
// See https://datasets.mlpack.org/covertype.train.labels.csv.
mlpack::Load("covertype.train.labels.csv", labels, labelOpts);


// Print information about each dimension.
for (size_t d = 0; d < opts.DatasetInfo().Dimensionality(); ++d)
{
  if (opts.DatasetInfo().Type(d) == mlpack::Datatype::categorical)
  {
    std::cout << " - Dimension " << d << " is categorical with "
        << opts.DatasetInfo().NumMappings(d) << " categories." << std::endl;
  }
  else
  {
    std::cout << " - Dimension " << d << " is numeric." << std::endl;
  }
}

// Modify the 5th point.  Increment any numeric values, and set any categorical
// values to the string "hooray!".
for (size_t d = 0; d < opts.DatasetInfo().Dimensionality(); ++d)
{
  if (opts.DatasetInfo().Type(d) == mlpack::Datatype::categorical)
  {
    // This will create a new mapping if the string "hooray!" does not already
    // exist as a category for dimension d.
    dataset(d, 4) = opts.DatasetInfo().MapString<double>("hooray!", d);
  }
  else
  {
    dataset(d, 4) += 1.0;
  }
}

Manually create a DatasetInfo object and use it to save a categorical dataset.

// This will manually create the following data matrix (shown as it would appear
// in a CSV):
//
// 1, TRUE, "good", 7.0, 4
// 2, FALSE, "good", 5.6, 3
// 3, FALSE, "bad", 6.1, 4
// 4, TRUE, "bad", 6.1, 1
// 5, TRUE, "unknown", 6.3, 0
// 6, FALSE, "unknown", 5.1, 2
//
// Although the last dimension is numeric, we will take it as a categorical
// dimension.

arma::mat dataset(5, 6); // 6 data points in 5 dimensions.
mlpack::DatasetInfo info(5);

// Set types of dimensions.  By default they are numeric so we only set
// categorical dimensions.
info.Type(1) = mlpack::Datatype::categorical;
info.Type(2) = mlpack::Datatype::categorical;
info.Type(4) = mlpack::Datatype::categorical;

// The first dimension is numeric.
dataset(0, 0) = 1;
dataset(0, 1) = 2;
dataset(0, 2) = 3;
dataset(0, 3) = 4;
dataset(0, 4) = 5;
dataset(0, 5) = 6;

// The second dimension is categorical.
dataset(1, 0) = info.MapString<double>("TRUE", 1);
dataset(1, 1) = info.MapString<double>("FALSE", 1);
dataset(1, 2) = info.MapString<double>("FALSE", 1);
dataset(1, 3) = info.MapString<double>("TRUE", 1);
dataset(1, 4) = info.MapString<double>("TRUE", 1);
dataset(1, 5) = info.MapString<double>("FALSE", 1);

// The third dimension is categorical.
dataset(2, 0) = info.MapString<double>("good", 2);
dataset(2, 1) = info.MapString<double>("good", 2);
dataset(2, 2) = info.MapString<double>("bad", 2);
dataset(2, 3) = info.MapString<double>("bad", 2);
dataset(2, 4) = info.MapString<double>("unknown", 2);
dataset(2, 5) = info.MapString<double>("unknown", 2);

// The fourth dimension is numeric.
dataset(3, 0) = 7.0;
dataset(3, 1) = 5.6;
dataset(3, 2) = 6.1;
dataset(3, 3) = 6.1;
dataset(3, 4) = 6.3;
dataset(3, 5) = 5.1;

// The fifth dimension is categorical.  Note that `info` will choose to assign
// category values in the order they are seen, even if the category can be
// parsed as a number.  So, here, the value '4' will be assigned category '0',
// since it is seen first.
dataset(4, 0) = info.MapString<double>("4", 4);
dataset(4, 1) = info.MapString<double>("3", 4);
dataset(4, 2) = info.MapString<double>("4", 4);
dataset(4, 3) = info.MapString<double>("1", 4);
dataset(4, 4) = info.MapString<double>("0", 4);
dataset(4, 5) = info.MapString<double>("2", 4);

// Print the dataset with mapped categories.
dataset.print("Dataset with mapped categories");

// Print the mappings for the third dimension.
std::cout << "Mappings for dimension 3: " << std::endl;
for (size_t i = 0; i < info.NumMappings(2); ++i)
{
  std::cout << " - \"" << info.UnmapString(i, 2) << "\" maps to " << i << "."
      << std::endl;
}

// Now `dataset` is ready for use with an mlpack algorithm that supports
// categorical data.  We will save it to `categorical-data.csv`.
mlpack::TextOptions opts;
opts.Categorical() = true;
opts.DatasetInfo() = std::move(info);
mlpack::Save("categorical-data.csv", dataset, opts);
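The "assigned in the order they are seen" behavior noted in the example above can be pictured with a small standalone sketch. `CategoryMap` is a hypothetical class covering a single dimension; it is not mlpack's DatasetInfo, and only mimics the idea of mapping strings to consecutive numeric values and back.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one dimension of a DatasetInfo; NOT mlpack code.
class CategoryMap
{
 public:
  // Return the numeric value for `category`, creating a new mapping if the
  // string has not been seen before.  Values are assigned in order seen.
  std::size_t MapString(const std::string& category)
  {
    const auto it = forward.find(category);
    if (it != forward.end())
      return it->second;

    const std::size_t value = reverse.size();
    forward.emplace(category, value);
    reverse.push_back(category);
    return value;
  }

  // Return the original string for a numeric category value.
  const std::string& UnmapString(const std::size_t value) const
  {
    assert(value < reverse.size());
    return reverse[value];
  }

  // Number of distinct categories seen so far.
  std::size_t NumMappings() const { return reverse.size(); }

 private:
  std::unordered_map<std::string, std::size_t> forward;
  std::vector<std::string> reverse;
};
```

Mapping the strings "4", "3", "4", "1" in that order yields the values 0, 1, 0, 2: the string "4" receives category 0 because it is seen first, even though it could be parsed as a number.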

Image data

mlpack loads, saves, and modifies image data using the STB library. STB is a header-only library that is bundled with mlpack, but it is also possible to use a version of STB available on the system.

When loading images, each image is represented as a flattened single column vector in a data matrix; each row of the resulting vector will correspond to a single pixel value (between 0 and 255) in a single channel. If an ImageOptions was passed to Load(), it will be populated with the metadata of the image.

Images are flattened along rows, with channel values interleaved, starting from the top left. Thus, the value of the pixel at position (x, y) in channel c will be contained in element/row y * (width * channels) + x * (channels) + c of the flattened vector.
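The row-by-row, channel-interleaved layout can be written down as a pair of small index helpers, consistent with the index computed in the single-image example further below. `FlatIndex`, `FromFlatIndex`, and `PixelPosition` are hypothetical names for this sketch and are not provided by mlpack.

```cpp
#include <cassert>
#include <cstddef>

// Index of channel `c` of the pixel at column `x`, row `y` in a flattened
// image that is stored row by row with interleaved channels.
inline std::size_t FlatIndex(const std::size_t x,
                             const std::size_t y,
                             const std::size_t c,
                             const std::size_t width,
                             const std::size_t channels)
{
  assert(x < width && c < channels);
  return y * (width * channels) + x * channels + c;
}

// Hypothetical struct holding a pixel position and channel.
struct PixelPosition
{
  std::size_t x;
  std::size_t y;
  std::size_t c;
};

// Inverse of FlatIndex(): recover (x, y, c) from a flattened index.
inline PixelPosition FromFlatIndex(const std::size_t index,
                                   const std::size_t width,
                                   const std::size_t channels)
{
  assert(width > 0 && channels > 0);
  PixelPosition p;
  p.c = index % channels;
  p.x = (index / channels) % width;
  p.y = index / (width * channels);
  return p;
}
```

For an image that is 10 pixels wide with 3 channels, the first channel of the pixel at (x=3, y=4) is at index 4 * 30 + 3 * 3 + 0 = 129.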


When working with images, the following overload for Save() is also available:

Note: when loading and saving images, if the element type of X is not unsigned char (e.g. if X is not an arma::Mat<unsigned char>), then when loading, the data will be temporarily loaded as unsigned chars and then converted, and when saving, X will be converted to unsigned chars before saving.
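Because pixel values are stored as unsigned chars on disk, it is safest to bring floating-point data into the range [0, 255] before saving. The sketch below shows one way to do that conversion by hand. `ToPixelBytes` is a hypothetical helper; the rounding and clamping are assumptions of this sketch, and how mlpack itself treats out-of-range values during conversion is not specified here.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper, NOT mlpack's conversion: round each value to the
// nearest integer, clamp it to [0, 255], and store it as an unsigned char.
inline std::vector<unsigned char> ToPixelBytes(
    const std::vector<double>& values)
{
  std::vector<unsigned char> out;
  out.reserve(values.size());
  for (const double v : values)
  {
    const double clamped = std::min(255.0, std::max(0.0, std::round(v)));
    out.push_back(static_cast<unsigned char>(clamped));
  }
  return out;
}
```

For example, the values -5.0, 12.4, and 300.0 become 0, 12, and 255.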

Image data load/save examples

Load a single image, but don't store the metadata (so, e.g., height, width, and number of channels are unavailable after loading!).

// See https://www.mlpack.org/static/img/numfocus-logo.png.
arma::mat image;
mlpack::Load("numfocus-logo.png", image, mlpack::PNG);

// If we wanted image metadata, we would need to pass an ImageOptions.  See the
// next example.
//
// We could also specify `Image` instead of `PNG` if we did not care which image
// format was used, but just that *some* image format was used.

std::cout << "The image in 'numfocus-logo.png' has " << image.n_rows
    << " pixels." << std::endl;

Load and save a single image:

// See https://www.mlpack.org/static/img/numfocus-logo.png.
mlpack::ImageOptions opts;
opts.Fatal() = true;
arma::mat matrix;
mlpack::Load("numfocus-logo.png", matrix, opts /* format autodetected */);

// `matrix` should now contain one column.

// Print information about the image.
std::cout << "Information about the image in 'numfocus-logo.png': "
    << std::endl;
std::cout << " - " << opts.Width() << " pixels in width." << std::endl;
std::cout << " - " << opts.Height() << " pixels in height." << std::endl;
std::cout << " - " << opts.Channels() << " color channels." << std::endl;

std::cout << "Value at pixel (x=3, y=4) in the first channel: ";
const size_t index = (4 * opts.Width() * opts.Channels()) +
    (3 * opts.Channels());
std::cout << matrix[index] << "." << std::endl;

// Increment each pixel value, but make sure they are still within the bounds.
matrix += 1;
matrix.clamp(0, 255);

mlpack::Save("numfocus-logo-mod.png", matrix, opts);

Load and save multiple images:

// Load some favicons from websites associated with mlpack.
std::vector<std::string> images;
// See the following files:
// - https://datasets.mlpack.org/images/mlpack-favicon.png
// - https://datasets.mlpack.org/images/ensmallen-favicon.png
// - https://datasets.mlpack.org/images/armadillo-favicon.png
// - https://datasets.mlpack.org/images/bandicoot-favicon.png
images.push_back("mlpack-favicon.png");
images.push_back("ensmallen-favicon.png");
images.push_back("armadillo-favicon.png");
images.push_back("bandicoot-favicon.png");

mlpack::ImageOptions opts;
opts.Channels() = 1; // Force loading in grayscale.
opts.Fatal() = true;

arma::mat matrix;
mlpack::Load(images, matrix, opts);

// Print information about what we loaded.
std::cout << "Loaded " << matrix.n_cols << " images.  Images are of size "
    << opts.Width() << " x " << opts.Height() << " with " << opts.Channels()
    << " color channel." << std::endl;

// Invert images.
matrix = (255.0 - matrix);

// Save as compressed JPEGs with low quality.
opts.Quality() = 75;
std::vector<std::string> outImages;
outImages.push_back("mlpack-favicon-inv.jpeg");
outImages.push_back("ensmallen-favicon-inv.jpeg");
outImages.push_back("armadillo-favicon-inv.jpeg");
outImages.push_back("bandicoot-favicon-inv.jpeg");

mlpack::Save(outImages, matrix, opts);

mlpack models and objects

Machine learning models and any mlpack object (i.e. anything in the mlpack:: namespace) can be saved with Save() and loaded with Load(). Serialization is performed using the cereal serialization toolkit.

Note: when loading an object that was saved in the binary format (BIN), the C++ type of the object must be exactly the same (including template parameters) as the type used to save the object. If not, undefined behavior will occur, most likely a crash.

mlpack models and objects load/save examples

Simple example: create a math::Range object, then save and load it.

mlpack::math::Range r(3.0, 6.0);

// A DataOptions object can also be used when loading and saving objects.
mlpack::DataOptions opts;
opts.Fatal() = true;
opts.Format() = mlpack::FileType::BIN;

// Save the Range to 'range.bin', using the name "range".
mlpack::Save("range.bin", r, opts);

// Load the range into a new object.
mlpack::math::Range r2;
mlpack::Load("range.bin", r2, mlpack::BIN + mlpack::Fatal);

std::cout << "Loaded range: [" << r2.Lo() << ", " << r2.Hi() << "]."
    << std::endl;

// Modify and save the range as JSON.
r2.Lo() = 4.0;
mlpack::Save("range.json", r2, mlpack::JSON + mlpack::Fatal);

// Now 'range.json' will contain the following:
//
// {
//     "range": {
//         "cereal_class_version": 0,
//         "hi": 6.0,
//         "lo": 4.0
//     }
// }

Train a LinearRegression model and save it to disk, then reload it.

// See https://datasets.mlpack.org/admission_predict.csv.
arma::mat data;
mlpack::Load("admission_predict.csv", data, mlpack::NoFatal);

// See https://datasets.mlpack.org/admission_predict.responses.csv.
arma::rowvec responses;
mlpack::Load("admission_predict.responses.csv", responses, mlpack::Fatal);

// Train a linear regression model, fitting an intercept term and using an L2
// regularization parameter of 0.3.
mlpack::LinearRegression lr(data, responses, 0.3, true);

// Save the model using the binary format as a standalone parameter, throwing an
// exception on failure.
mlpack::Save("lr-model.bin", lr, mlpack::Fatal + mlpack::BIN);
std::cout << "Saved model to lr-model.bin." << std::endl;

// Now load the model back, using format autodetection on the filename
// extension.
mlpack::LinearRegression loadedModel;
if (!mlpack::Load("lr-model.bin", loadedModel))
{
  std::cout << "Model not loaded successfully from 'lr-model.bin'!"
      << std::endl;
}
else
{
  std::cout << "Model loaded successfully from 'lr-model.bin' with "
      << "intercept value of " << loadedModel.Parameters()[0] << "."
      << std::endl;
}