Data loading and I/O
mlpack provides the Load() and Save() functions to load and save
Armadillo matrices (e.g. numeric and categorical datasets) and
any mlpack object via the cereal
serialization toolkit. A number of other utilities related to loading and
saving data and objects are also available. The Load() and
Save() functions have numerous options to configure load/save behavior
and format detection/selection.
Load()
- Load(filename, X)
  - Load X from the given file filename with default options:
    - the format of the file is auto-detected based on the extension of the file, and
    - an exception is not thrown on an error.
  - Returns a bool indicating whether the load was a success.
  - X can be any supported load type.
- Load(filename, X, Option1 + Option2 + ...)
  - Load X from the given file filename with the given options.
  - Returns a bool indicating whether the load was a success.
  - X can be any supported load type.
  - The given options must be from the list of standalone operators and be appropriate for the type of X.
- Load(filename, X, opts)
  - Load X from the given file filename with the options specified in opts.
  - Returns a bool indicating whether the load was a success.
  - X can be any supported load type.
  - opts is a DataOptions object whose subtype matches the type of X.
For some types of data, it is also possible to load from a set of files at once:
- Load(filenames, X), Load(filenames, X, Option1 + Option2 + ...), or Load(filenames, X, opts)
  - Load data from filenames (a std::vector<std::string>) into the matrix X.
    - For numeric data, data loaded from each file is concatenated into X.
    - For image data, each image is flattened into one column of X.
  - Metadata (e.g. image size, number of columns, etc.) in all files in filenames must match or loading will fail.
  - Loading options can be specified by either standalone options or an instantiated DataOptions object.
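The concatenation rule for numeric data can be pictured with a small standalone sketch. It does not use mlpack or Armadillo: the Matrix struct and ConcatenateColumns() are hypothetical names introduced only for illustration, and the sketch assumes that "matching metadata" means every file yields the same number of rows (dimensions), with points stored as columns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a column-major matrix (not an Armadillo type).
// Element (r, c) is stored at values[r + c * rows].
struct Matrix
{
  std::size_t rows;
  std::size_t cols;
  std::vector<double> values;
};

// Append the columns of every part to `out`.  Returns false and leaves `out`
// empty if the parts do not all have the same number of rows.
bool ConcatenateColumns(const std::vector<Matrix>& parts, Matrix& out)
{
  out = Matrix();
  if (parts.empty())
    return true;

  out.rows = parts[0].rows;
  for (const Matrix& part : parts)
  {
    assert(part.values.size() == part.rows * part.cols);
    if (part.rows != out.rows)
    {
      out = Matrix();
      return false;
    }

    // Columns are contiguous in column-major storage, so appending the raw
    // values appends whole columns.
    out.values.insert(out.values.end(), part.values.begin(),
        part.values.end());
    out.cols += part.cols;
  }

  return true;
}
```

Under these assumptions, combining a 2 x 2 part with a 2 x 1 part gives a 2 x 3 result, while combining a 2 x 2 part with a 3 x 1 part fails.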
Simple example:
// See https://datasets.mlpack.org/iris.csv.
arma::mat x;
mlpack::Load("iris.csv", x);
std::cout << "Loaded iris.csv; size " << x.n_rows << " x " << x.n_cols << "."
<< std::endl;
Among other things, the file format can be easily specified:
// See https://datasets.mlpack.org/iris.csv.
arma::mat x;
mlpack::Load("iris.csv", x, mlpack::CSV);
std::cout << "Loaded iris.csv; size " << x.n_rows << " x " << x.n_cols << "."
<< std::endl;
See also the examples for each supported load type in the sections below.
Save()
- Save(filename, X)
  - Save X to the given file filename with default options:
    - the format of the file is auto-detected based on the extension of the file, and
    - an exception is not thrown on an error.
  - Returns a bool indicating whether the save was a success.
  - X can be any supported save type.
- Save(filename, X, Option1 + Option2 + ...)
  - Save X to the given file filename with the given options.
  - Returns a bool indicating whether the save was a success.
  - X can be any supported save type.
  - The given options must be from the list of standalone options and be appropriate for the type of X.
- Save(filename, X, opts)
  - Save X to the given file filename with the options specified in opts.
  - Returns a bool indicating whether the save was a success.
  - X can be any supported save type.
  - opts is a DataOptions object whose subtype matches the type of X.
Note: when saving images, it is possible to save one matrix X into
multiple image files. See image data for more
details.
Simple example:
// Generate a 5-dimensional matrix of random data.
arma::mat dataset(5, 1000, arma::fill::randu);
mlpack::Save("dataset.csv", dataset);
std::cout << "Saved random data to 'dataset.csv'." << std::endl;
Among other things, the file format can be easily specified manually:
// Generate a 5-dimensional matrix of random data.
arma::mat dataset(5, 1000, arma::fill::randu);
mlpack::Save("dataset.csv", dataset, mlpack::CSV);
std::cout << "Saved random data to 'dataset.csv'." << std::endl;
See also the examples for each supported save type in the sections below.
Types
Support is available for loading and saving several kinds of data. Given an
object X to be loaded or saved:
- For numeric data, X should have type arma::mat or any supported matrix type (e.g. arma::fmat, arma::umat, etc.).
  - Supported formats are CSV, TSV, text, binary, ARFF, and others; see the table of format options.
  - Additional options can be specified with a DataOptions, MatrixOptions, or TextOptions object.
  - See numeric data examples for example usage.
- For mixed categorical data (data where not all columns are numeric), X should have type arma::mat or any supported matrix type (e.g. arma::fmat, arma::umat, etc.).
  - Columns of X that are categorical are represented as integer values starting from 0.
  - Information about categorical dimensions is stored in a DatasetInfo object, which is held inside of a TextOptions object.
  - Supported formats are CSV, TSV, text, and ARFF; see the table of format options.
  - See categorical data examples for example usage.
- For image data, X should have type arma::mat or any supported matrix type (e.g. arma::fmat, arma::umat, etc.).
  - Images are represented in a vectorized form; see image data for details.
  - An ImageOptions object is used for representing metadata specific to image formats.
  - Supported formats are PNG, JPEG, TGA, BMP, PSD, GIF, PIC, and PNM; see the table of format options.
  - See image data examples for example usage.
- For mlpack models and objects, X can have type equivalent to any mlpack class or type (e.g. mlpack::RandomForest, mlpack::KDTree, mlpack::Range, etc.).
  - Supported formats for model/object serialization are binary, JSON, and XML; see the table of format options.
  - See mlpack model and object examples for example usage.
DataOptions
The Load() and Save() functions allow
specifying options in a standalone manner or with an instantiated DataOptions
object. Standalone options provide convenience:
// Individual standalone options can be combined with the + operator.
mlpack::Load("filename.csv", X, mlpack::CSV + mlpack::Fatal);
The use of an instantiated DataOptions (or a child class relevant to the type
of data being loaded) allows more complex options to be configured and for
metadata resulting from a load or save operation to be stored:
// Different data types will use DataOptions, MatrixOptions, TextOptions,
// ModelOptions, or other types. See the documentation for each class below.
mlpack::ImageOptions opts;
opts.Channels() = 1; // Force loading in grayscale.
mlpack::Load("filename.png", X, opts);
// Now, `opts.Width()` and `opts.Height()` will store the size of the loaded
// image.
The set of allowed standalone options differs depending on the
type of data being loaded or saved; if using an instantiated options
object, so does the type of opts:
- Numeric data: MatrixOptions and its standalone options, or TextOptions and its standalone options for plaintext formats;
- Mixed categorical data: TextOptions and its standalone options;
- Image data: ImageOptions and its standalone options;
- mlpack models and objects: ModelOptions and its standalone options.
DataOptions
The DataOptions class is the base class from which all options classes
specific to data types are derived. It is default-constructible and
provides the .Fatal() and .Format() members.
Any members or standalone operators available in DataOptions are also
available when using other options types (e.g. TextOptions,
ImageOptions, etc.).
DataOptions standalone operators and members
The options below can be used as standalone operators to the
Load() and Save() functions, or as
calls to set members of an instantiated DataOptions object.
| Standalone operator | Member function | Available for: | Description |
|---|---|---|---|
| Load/save behavior. | | | |
| Fatal | opts.Fatal() = true; | All data types. | A std::runtime_error will be thrown on failure. |
| NoFatal (default) | opts.Fatal() = false; | All data types. | false will be returned on failure. A warning will also be printed if MLPACK_PRINT_WARN is defined. |
| Formats. | | | |
| AutoDetect (default) | opts.Format() = mlpack::FileType::AutoDetect; | All data types. | The format of the file is autodetected using the extension of the filename and (if loading) by inspecting the file contents. |
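The two failure behaviors in the table can be illustrated with a standalone sketch. OpenForReading() is a hypothetical helper written only for this illustration; it is not part of mlpack and only mimics the documented contract: throw std::runtime_error when the fatal flag is set, otherwise return false.

```cpp
#include <cassert>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of mlpack): try to open `filename` for
// reading.  With `fatal` set, a failure throws std::runtime_error; otherwise
// the function returns false and the caller must check the result.
bool OpenForReading(const std::string& filename, const bool fatal)
{
  assert(!filename.empty());

  std::ifstream file(filename);
  if (file.is_open())
    return true;

  if (fatal)
    throw std::runtime_error("cannot open '" + filename + "'");

  return false;
}
```

The practical consequence for callers is the same as with Load() and Save(): with the default non-fatal behavior the bool result should be checked, whereas with the fatal behavior a try/catch block (or letting the exception propagate) takes its place.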
MatrixOptions
The MatrixOptions class represents options specific to matrix types
(numeric and categorical data).
MatrixOptions is derived from DataOptions and thus any
standalone operators or member functions from DataOptions
(e.g. Fatal, NoFatal, and AutoDetect) can also be used with
MatrixOptions.
Note: closely related is the TextOptions class,
specifically for loading numeric or categorical data from plaintext formats.
MatrixOptions is used for non-plaintext numeric data formats.
MatrixOptions standalone operators and members
The options below can be used as standalone operators to the
Load() and Save() functions, or as
calls to set members of an instantiated MatrixOptions object.
If an option is given that does not match the type of data being loaded or
saved, an exception will be thrown if Fatal() is set; otherwise, a warning
will be printed if MLPACK_PRINT_WARN is defined.
| Standalone operator | Member function | Available for: | Description |
|---|---|---|---|
| Load/save behavior. | | | |
| Transpose (default) | opts.Transpose() = true; | Numeric and categorical data. | The matrix will be transposed to/from column-major form on load/save. |
| NoTranspose | opts.Transpose() = false; | Numeric and categorical data. | The matrix will not be transposed to column-major form on load/save. |
| Formats. | | | |
| PGM | opts.Format() = mlpack::FileType::PGMBinary; | Numeric data. | Load/save in the PGM image format; data should have values in the range [0, 255]. The size of the image will be the same as the size of the matrix (after any transpose is applied). |
| PPM | opts.Format() = mlpack::FileType::PPMBinary; | Numeric data. | Load/save in the PPM image format; data should have values in the range [0, 255]. The size of the image will be the same as the size of the matrix (after any transpose is applied). |
| HDF5 | opts.Format() = mlpack::FileType::HDF5Binary; | Numeric data. | Load/save in the HDF5 binary format; only available if Armadillo is configured with HDF5 support. |
| ArmaBin | opts.Format() = mlpack::FileType::ArmaBinary; | Numeric data. | Load/save in the space-efficient arma_binary format (packed binary data). |
| RawBinary | opts.Format() = mlpack::FileType::RawBinary; | Numeric data. | Load/save as packed binary data with no header and no size information; data will be loaded as a single column vector (not recommended). |
TextOptions
The TextOptions class represents options specific to matrix types stored in
plaintext formats (numeric and categorical
data). TextOptions is a child class, and thus any standalone operators or members from its parent classes are also available:
- DataOptions provides:
  - Fatal, NoFatal, and AutoDetect standalone operators
  - opts.Fatal() and opts.Format() members
  - See the DataOptions operator and member documentation
- MatrixOptions provides:
  - Transpose, NoTranspose, PGM, PPM, HDF5, ArmaBin, and RawBinary standalone operators
  - opts.Transpose() member
  - See the MatrixOptions operator and member documentation
TextOptions standalone operators and members
The options below can be used as standalone operators to the
Load() and Save() functions, or as
calls to set members of an instantiated TextOptions object.
If an option is given that does not match the type of data being loaded or
saved, an exception will be thrown if Fatal() is set; otherwise, a warning
will be printed if MLPACK_PRINT_WARN is defined.
| Standalone operator | Member function | Available for: | Description |
|---|---|---|---|
| Load/save behavior. | | | |
| HasHeaders | opts.HasHeaders() = true; | Numeric and categorical data, only for the CSV format. | If true, the first row of the file contains column names instead of data. See note below. |
| Categorical | opts.Categorical() = true; | Categorical data, only for the CSV or ARFF formats. | If true, the data to be loaded or saved is mixed categorical data. See note below. |
| Semicolon | opts.Semicolon() = true; | Numeric and categorical data, only for the CSV, CoordAscii, and RawAscii formats. | If true, the field separator in the file is a semicolon instead of a comma. |
| MissingToNan | opts.MissingToNan() = true; | Numeric and categorical data. | If true, any missing data elements will be represented as NaN instead of 0. |
| Formats. | | | |
| CSV | opts.Format() = mlpack::FileType::CSVASCII; | Numeric and categorical data. | CSV or TSV format. If loading a sparse matrix and the CSV has three columns, the data is interpreted as a coordinate list. |
| ArmaAscii | opts.Format() = mlpack::FileType::ArmaASCII; | Numeric data. | Space-separated values as saved by Armadillo with the arma_ascii format. |
| RawAscii | opts.Format() = mlpack::FileType::RawASCII; | Numeric data. | Space-separated values or tab-separated values (TSV) with no header. |
| CoordAscii | opts.Format() = mlpack::FileType::CoordAscii; | Numeric data where X is a sparse matrix (e.g. arma::sp_mat). | Coordinate list format for sparse data (see coord_ascii). |
| ARFF | opts.Format() = mlpack::FileType::ARFFAscii; | Categorical data. | ARFF filetype. Used specifically to load mixed categorical datasets. See ARFF documentation. Only for loading. |
| Metadata. | | | |
| (n/a) | opts.Headers() | Numeric and categorical data. | Returns a std::vector<std::string> with headers detected after loading a CSV. |
| (n/a) | opts.DatasetInfo() | Categorical data. | Returns a DatasetInfo with dimension information after loading, or that will be used for dimension information during saving. |
Notes:
- When opts.HasHeaders() is true while loading, the parsed headers from the CSV file are stored into the opts.Headers() member, which has type std::vector<std::string>. In order to access the headers after loading, an instantiated TextOptions must be passed to Load(); if HasHeaders is passed as a standalone option, the parsed headers will not be accessible after loading.
- When opts.Categorical() is true while loading with the CSV format, any fields where a value cannot be interpreted as numeric will be automatically converted to a categorical dimension with values between 0 and the number of unique values in the field/dimension. See categorical data for more information on this representation.
- When opts.Categorical() is true while loading, a DatasetInfo object is populated with information about each of the dimensions in the dataset and stored in opts.DatasetInfo(). In order to access this after loading, an instantiated TextOptions must be passed to Load(); if Categorical is passed as a standalone option, the DatasetInfo object will not be accessible after loading.
- When opts.Categorical() is true while saving, the values in opts.DatasetInfo() (which has type DatasetInfo) will be used to map any categorical dimensions back to their original values. If Categorical was passed as a standalone option, then no DatasetInfo can be set before saving, and all dimensions of the data will be saved as numeric data.
ImageOptions
The ImageOptions class represents options specific to images.
ImageOptions is a child class of DataOptions and thus any
standalone operators or member functions from DataOptions
(e.g. Fatal, NoFatal, and AutoDetect) can also be used with
ImageOptions.
ImageOptions standalone operators and members
The options below can be used as standalone operators to the
Load() and Save() functions, or as
calls to set members of an instantiated ImageOptions object.
If an option is given that does not match the type of data being loaded or
saved, an exception will be thrown if Fatal() is set; otherwise, a warning
will be printed if MLPACK_PRINT_WARN is defined.
| Standalone operator | Member function | Available for: | Description |
|---|---|---|---|
| Formats. | | | |
| Image | opts.Format() = mlpack::FileType::ImageType; | Image data. | Load in the image format detected by the header of the file; save in the image format specified by the filename's extension. |
| PNG | opts.Format() = mlpack::FileType::PNG; | Image data. | Load/save as a PNG image. |
| JPG | opts.Format() = mlpack::FileType::JPG; | Image data. | Load/save as a JPEG image. |
| TGA | opts.Format() = mlpack::FileType::TGA; | Image data. | Load/save as a TGA image. |
| BMP | opts.Format() = mlpack::FileType::BMP; | Image data. | Load/save as a BMP image. |
| PSD | opts.Format() = mlpack::FileType::PSD; | Image data. | Load as a PSD (Photoshop) image. Only for loading. |
| GIF | opts.Format() = mlpack::FileType::GIF; | Image data. | Load as a GIF image. Only for loading. |
| PIC | opts.Format() = mlpack::FileType::PIC; | Image data. | Load as a PIC (PICtor) image. Only for loading. |
| PNM | opts.Format() = mlpack::FileType::PNM; | Image data. | Load as a PNM (Portable Anymap) image. Only for loading. |
| Save behavior. | | | |
| (n/a) | opts.Quality() | Image data with JPEG format. | Desired JPEG quality level for saving (a size_t in the range from 0 to 100). |
| Metadata. | | | |
| (n/a) | opts.Height() | Image data. | Returns a size_t representing the height in pixels of the loaded image(s), or the desired height in pixels for saving. |
| (n/a) | opts.Width() | Image data. | Returns a size_t representing the width in pixels of the loaded image(s), or the desired width in pixels for saving. |
| (n/a) | opts.Channels() | Image data. | Returns a size_t representing the number of color channels of the loaded image(s), or the desired number of color channels for saving. |
Notes:
- After a call to Load(), if an instantiated ImageOptions was passed, the opts.Height(), opts.Width(), and opts.Channels() members will be set with the values found during loading.
- Before calling Load(), the value of opts.Channels() can be set to the desired number of channels (1/3/4) to force loading with that many color channels.
- The opts.Quality() option is only relevant when calling Save() with the JPG format.
ModelOptions
The ModelOptions class represents options specific to
mlpack models and objects. ModelOptions is a
child class of DataOptions and thus any
standalone operators or member functions from DataOptions
(e.g. Fatal, NoFatal, and AutoDetect) can also be used with
ModelOptions.
ModelOptions standalone operators and members
The options below can be used as standalone operators to the
Load() and Save() functions, or as
calls to set members of an instantiated ModelOptions object.
If an option is given that does not match the type of data being loaded or
saved, an exception will be thrown if Fatal() is set; otherwise, a warning
will be printed if MLPACK_PRINT_WARN is defined.
| Standalone operator | Member function | Available for: | Description |
|---|---|---|---|
| Formats. | | | |
| BIN | opts.Format() = mlpack::FileType::BIN; | mlpack models and objects. | Load/save the object using an efficient packed binary format. |
| JSON | opts.Format() = mlpack::FileType::JSON; | mlpack models and objects. | Load/save the object using human- and machine-readable JSON. |
| XML | opts.Format() = mlpack::FileType::XML; | mlpack models and objects. | Load/save the object using XML (warning: may be very large). |
Notes:
- FileType::BIN (.bin) is recommended for the sake of size; objects in binary format may be an order of magnitude or more smaller than JSON!
- FileType::JSON (.json) and FileType::XML (.xml) produce human-readable files, but they may be quite large.
Formats
The Load() and Save() functions
support numerous different formats for loading and saving. Not all formats are
relevant for all types of data. The table below lists standalone options that
can be used to specify the format, as well as member functions for a
DataOptions object.
When AutoDetect (the default) is specified as the format, the actual file
format is auto-detected using the filename's extension and (if loading)
by inspecting the file contents. Accepted filename extensions for each type are
given in the table.
| Standalone operator | Member function | Filename extensions | Available for: | Description |
|---|---|---|---|---|
| AutoDetect (default) | opts.Format() = FileType::AutoDetect; | (n/a) | All data types. | The format of the file is autodetected as one of the formats below. |
| CSV | opts.Format() = FileType::CSVASCII; | .csv, .tsv | Numeric and categorical data | CSV or TSV format. If loading a sparse matrix and the CSV has three columns, the data is interpreted as a coordinate list. |
| ArmaAscii | opts.Format() = FileType::ArmaASCII; | .txt, .csv | Numeric data | Space-separated values as saved by Armadillo with the arma_ascii format. |
| RawAscii | opts.Format() = FileType::RawASCII; | .txt | Numeric data | Space-separated values or tab-separated values (TSV) with no header. |
| CoordAscii | opts.Format() = FileType::CoordAscii; | .txt (if X is sparse) | Numeric data where X is a sparse matrix (e.g. arma::sp_mat). | Coordinate list format for sparse data (see coord_ascii). |
| ARFF | opts.Format() = FileType::ARFFAscii; | .arff | Categorical data | ARFF filetype. Used specifically to load mixed categorical datasets. See ARFF documentation. Only for loading. |
| PGM | opts.Format() = FileType::PGMBinary; | .pgm | Numeric data | Load/save in the PGM image format; data should have values in the range [0, 255]. The size of the image will be the same as the size of the matrix (after any transpose is applied). |
| PPM | opts.Format() = FileType::PPMBinary; | .ppm | Numeric data | Load/save in the PPM image format; data should have values in the range [0, 255]. The size of the image will be the same as the size of the matrix (after any transpose is applied). |
| HDF5 | opts.Format() = FileType::HDF5Binary; | .h5, .hdf5, .hdf, .he5 | Numeric data | Load/save in the HDF5 binary format; only available if Armadillo is configured with HDF5 support. |
| ArmaBin | opts.Format() = FileType::ArmaBinary; | .bin (if X is an Armadillo type) | Numeric data | Load/save in the space-efficient arma_binary format (packed binary data). |
| RawBinary | opts.Format() = FileType::RawBinary; | | Numeric data | Load/save as packed binary data with no header and no size information; data will be loaded as a single column vector (not recommended). |
| Image | opts.Format() = FileType::ImageType; | (n/a) | Image data | Load in the image format detected by the header of the file; save in the image format specified by the filename's extension. |
| PNG | opts.Format() = FileType::PNG; | .png | Image data | Load/save as a PNG image. |
| JPG | opts.Format() = FileType::JPG; | .jpg, .jpeg | Image data | Load/save as a JPEG image. |
| TGA | opts.Format() = FileType::TGA; | .tga | Image data | Load/save as a TGA image. |
| BMP | opts.Format() = FileType::BMP; | .bmp | Image data | Load/save as a BMP image. |
| PSD | opts.Format() = FileType::PSD; | .psd | Image data | Load as a PSD (Photoshop) image. Only for loading. |
| GIF | opts.Format() = FileType::GIF; | .gif | Image data | Load as a GIF image. Only for loading. |
| PIC | opts.Format() = FileType::PIC; | .pic | Image data | Load as a PIC (PICtor) image. Only for loading. |
| PNM | opts.Format() = FileType::PNM; | .pnm | Image data | Load as a PNM (Portable Anymap) image. Only for loading. |
| BIN | opts.Format() = FileType::BIN; | .bin | mlpack models and objects | Load/save the object using an efficient packed binary format. |
| JSON | opts.Format() = FileType::JSON; | .json | mlpack models and objects | Load/save the object using human- and machine-readable JSON. |
| XML | opts.Format() = FileType::XML; | .xml | mlpack models and objects | Load/save the object using XML (warning: may be very large). |
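Several extensions in the table map to more than one format (for example .csv, .txt, and .bin), which is why the type of X and the file contents also matter during auto-detection. The standalone sketch below only restates the "Filename extensions" column as a lookup; CandidateFormats() is a hypothetical helper, not mlpack's detection code, and it assumes lowercase extensions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not mlpack's detection code): list the format options
// from the table whose filename extensions match the extension of `filename`.
std::vector<std::string> CandidateFormats(const std::string& filename)
{
  assert(!filename.empty());

  static const std::map<std::string, std::vector<std::string>> byExtension = {
    { ".csv",  { "CSV", "ArmaAscii" } },
    { ".tsv",  { "CSV" } },
    { ".txt",  { "ArmaAscii", "RawAscii", "CoordAscii" } },
    { ".arff", { "ARFF" } },
    { ".pgm",  { "PGM" } },
    { ".ppm",  { "PPM" } },
    { ".h5",   { "HDF5" } },
    { ".hdf5", { "HDF5" } },
    { ".hdf",  { "HDF5" } },
    { ".he5",  { "HDF5" } },
    { ".bin",  { "ArmaBin", "BIN" } },
    { ".png",  { "PNG" } },
    { ".jpg",  { "JPG" } },
    { ".jpeg", { "JPG" } },
    { ".tga",  { "TGA" } },
    { ".bmp",  { "BMP" } },
    { ".psd",  { "PSD" } },
    { ".gif",  { "GIF" } },
    { ".pic",  { "PIC" } },
    { ".pnm",  { "PNM" } },
    { ".json", { "JSON" } },
    { ".xml",  { "XML" } }
  };

  // Everything from the last '.' onwards is treated as the extension.
  const std::size_t dot = filename.rfind('.');
  if (dot == std::string::npos)
    return {};

  const auto it = byExtension.find(filename.substr(dot));
  return (it == byExtension.end()) ? std::vector<std::string>() : it->second;
}
```

When more than one candidate is returned, passing an explicit format option to Load() or Save() removes any ambiguity.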
Numeric data
Standard numeric data is represented in mlpack as a column-major matrix and a variety of formats for loading and saving are supported.
- When calling Load() and Save(), X should have type arma::mat or any other supported matrix type (e.g. arma::fmat, arma::umat, and so forth).
- When calling Load() with a vector filenames, all files must have the same number of dimensions and header names (if using CSVs with headers). All files will be concatenated into the output matrix X.
- When loading and saving with an instantiated DataOptions object, the MatrixOptions and TextOptions subtypes can be used.
- Supported formats are CSV, TSV, text, binary, ARFF, and others; see the table of format options.
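To make the column-major representation concrete, the standalone sketch below shows where each value of a text file ends up when every line of the file holds one data point and the loaded matrix holds one point per column. ToColumnMajor() is a hypothetical helper, not an mlpack or Armadillo function, and the flat std::vector stands in for the matrix storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: `fileRows[p][d]` is value d of point p as read from
// line p of a file.  The result is column-major storage for a matrix with
// one point per column, so element (d, p) lives at index d + p * dims.
std::vector<double> ToColumnMajor(
    const std::vector<std::vector<double>>& fileRows)
{
  const std::size_t points = fileRows.size();
  const std::size_t dims = (points == 0) ? 0 : fileRows[0].size();

  std::vector<double> storage(dims * points);
  for (std::size_t p = 0; p < points; ++p)
  {
    // Every point must have the same number of dimensions.
    assert(fileRows[p].size() == dims);
    for (std::size_t d = 0; d < dims; ++d)
      storage[d + p * dims] = fileRows[p][d];
  }

  return storage;
}
```

In this layout the values of each point stay contiguous in memory, and a file with P lines of D values corresponds to a matrix with D rows and P columns.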
Numeric data load/save examples
Load two datasets, print information about them, modify them, and save them back to disk.
// Throw an exception if loading fails with the Fatal option.
// See https://datasets.mlpack.org/satellite.train.csv.
arma::mat dataset;
mlpack::Load("satellite.train.csv", dataset, mlpack::Fatal);
// See https://datasets.mlpack.org/satellite.train.labels.csv.
arma::Row<size_t> labels;
mlpack::Load("satellite.train.labels.csv", labels, mlpack::Fatal);
// Print information about the data.
std::cout << "The data in 'satellite.train.csv' has: " << std::endl;
std::cout << " - " << dataset.n_cols << " points." << std::endl;
std::cout << " - " << dataset.n_rows << " dimensions." << std::endl;
std::cout << "The labels in 'satellite.train.labels.csv' have: " << std::endl;
std::cout << " - " << labels.n_elem << " labels." << std::endl;
std::cout << " - A maximum label of " << labels.max() << "." << std::endl;
std::cout << " - A minimum label of " << labels.min() << "." << std::endl;
// Modify and save the data. Add 2 to the data and drop the last column.
dataset += 2;
dataset.shed_col(dataset.n_cols - 1);
labels.shed_col(labels.n_cols - 1);
// Don't throw an exception if saving fails. Technically there is no need to
// explicitly specify NoFatal---it is the default.
mlpack::Save("satellite.train.mod.csv", dataset, mlpack::NoFatal);
mlpack::Save("satellite.train.labels.mod.csv", labels, mlpack::NoFatal);
Load a dataset stored in a binary format and save it as a CSV.
// See https://datasets.mlpack.org/iris.bin.
arma::mat dataset;
mlpack::Load("iris.bin",
dataset, mlpack::Fatal + mlpack::ArmaBin);
// Save it back to disk as a CSV.
mlpack::Save("iris.converted.csv", dataset, mlpack::CSV);
Load a dataset that has a semicolon as a separator instead of a comma.
// First write the semicolon file to disk.
std::fstream f;
f.open("semicolon.csv", std::fstream::out);
f << "1; 2; 3; 4" << std::endl;
f << "5; 6; 7; 8" << std::endl;
f << "9; 10; 11; 12" << std::endl;
// Now create a TextOptions and specify that the separator is a semicolon.
// Since all of the elements are integers, we load into an `arma::umat` (a
// matrix that holds unsigned integers) instead of an `arma::mat`.
arma::umat dataset;
mlpack::TextOptions opts;
opts.Semicolon() = true;
// Note that instead of `opts` we could just specify `Semicolon` instead!
mlpack::Load("semicolon.csv", dataset, opts);
std::cout << "The data in 'semicolon.csv' has: " << std::endl;
std::cout << " - " << dataset.n_cols << " points." << std::endl;
std::cout << " - " << dataset.n_rows << " dimensions." << std::endl;
Load a dataset with missing elements, and replace the missing elements with NaN
using the MissingToNan option.
// First write a CSV file to disk with some missing values.
std::fstream f;
f.open("missing_to_nan.csv", std::fstream::out);
// The value 2 is missing from the first row.
f << "1, , 3, 4" << std::endl;
f << "5, 6, 7, 8" << std::endl;
f << "9, 10, 11, 12" << std::endl;
arma::mat dataset;
mlpack::TextOptions opts;
opts.MissingToNan() = true;
// Note that instead of `opts` we could just specify `MissingToNan` instead!
mlpack::Load("missing_to_nan.csv", dataset, opts);
// Print information about the data.
std::cout << "Loaded data:" << std::endl;
std::cout << dataset;
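The effect of the separator and missing-value options on a single line of text can be sketched without mlpack. ParseLine() below is a hypothetical parser written only for illustration (it is not how mlpack parses files); it treats an empty or whitespace-only field as missing and stores either NaN or 0 for it.

```cpp
#include <cassert>
#include <cctype>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical parser for one line of separated values (not mlpack's parser).
// Missing fields become NaN if `missingToNan` is true, and 0 otherwise.
std::vector<double> ParseLine(const std::string& line,
                              const char separator,
                              const bool missingToNan)
{
  // A digit cannot act as a separator between numbers.
  assert(!std::isdigit(static_cast<unsigned char>(separator)));

  const double missing =
      missingToNan ? std::numeric_limits<double>::quiet_NaN() : 0.0;

  std::vector<double> values;
  std::istringstream fields(line);
  std::string field;
  while (std::getline(fields, field, separator))
  {
    // An empty or whitespace-only field counts as missing.
    const bool blank = field.find_first_not_of(" \t") == std::string::npos;
    values.push_back(blank ? missing : std::stod(field));
  }

  // std::getline() does not report an empty field after a trailing separator.
  if (!line.empty() && line.back() == separator)
    values.push_back(missing);

  return values;
}
```

For instance, ParseLine("1, , 3, 4", ',', true) yields four values whose second element is NaN, and ParseLine("5; 6; 7; 8", ';', false) handles a semicolon-separated line.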
Load a CSV into a 32-bit floating point matrix and print the headers (column names).
// See https://datasets.mlpack.org/Admission_Predict.csv.
arma::fmat dataset;
// We have to make a TextOptions object so that we can recover the headers.
mlpack::TextOptions opts;
opts.Format() = mlpack::FileType::CSVASCII;
opts.HasHeaders() = true;
mlpack::Load("Admission_Predict.csv", dataset, opts);
std::cout << "Found " << opts.Headers().size() << " columns." << std::endl;
for (size_t i = 0; i < opts.Headers().size(); ++i)
{
  std::cout << " - Column " << i << ": '" << opts.Headers()[i] << "'."
      << std::endl;
}
Load a CSV containing a coordinate list into a sparse matrix and print the overall size of the loaded matrix.
// See https://datasets.mlpack.org/movielens-100k.csv.
arma::sp_mat dataset;
// A 3-column CSV into a sparse matrix is interpreted as a coordinate list.
mlpack::Load("movielens-100k.csv", dataset, mlpack::CSV);
std::cout << "Loaded data from movielens-100k.csv; matrix size: "
<< dataset.n_rows << " x " << dataset.n_cols << "." << std::endl;
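How a three-column file can describe a sparse matrix is sketched below without mlpack or Armadillo. The sketch assumes each line holds a zero-based (row, column, value) triplet, as in Armadillo's coord_ascii format, and that the matrix is just large enough to hold the largest indices seen; it ignores any transposition applied during loading. SparseMatrix and FromCoordinateList() are hypothetical names used only here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical stand-in for a sparse matrix (not arma::sp_mat).
struct SparseMatrix
{
  std::size_t rows;
  std::size_t cols;
  std::map<std::pair<std::size_t, std::size_t>, double> values;

  // Return element (r, c); entries that were never listed are zero.
  double At(const std::size_t r, const std::size_t c) const
  {
    assert(r < rows && c < cols);
    const auto it = values.find(std::make_pair(r, c));
    return (it == values.end()) ? 0.0 : it->second;
  }
};

// Build a sparse matrix from text where each line is "row,column,value".
// Assumption: indices are zero-based; malformed lines are skipped.
SparseMatrix FromCoordinateList(const std::string& csv)
{
  SparseMatrix m = SparseMatrix();

  std::istringstream lines(csv);
  std::string line;
  while (std::getline(lines, line))
  {
    // Turn commas into spaces so that stream extraction can split the fields.
    std::replace(line.begin(), line.end(), ',', ' ');

    std::istringstream fields(line);
    std::size_t r, c;
    double v;
    if (!(fields >> r >> c >> v))
      continue;

    m.values[std::make_pair(r, c)] = v;
    m.rows = std::max(m.rows, r + 1);
    m.cols = std::max(m.cols, c + 1);
  }

  return m;
}
```

Only the listed entries are stored, which is what makes this representation practical for datasets such as rating matrices where most elements are zero.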
Mixed categorical data
mlpack supports mixed categorical data, i.e., data where some dimensions take
only categorical values (e.g. 0, 1, 2, etc.). When using mlpack, string
data and other non-numerical data must be mapped to categorical values and
represented as part of an arma::mat or other matrix type. Category metadata
is stored in an auxiliary DatasetInfo object.
- When calling Load() and Save(), X should have type arma::mat or any other supported matrix type (e.g. arma::fmat, arma::umat, and so forth).
- To load categorical data, either the Categorical standalone option must be passed, or an instantiated TextOptions opts must be passed with opts.Categorical() = true.
- Supported formats are CSV, TSV, text, and ARFF; see the table of format options.
- When loading, each unique non-numeric value is mapped (sequentially) to non-negative integers, starting from 0. Any columns with non-numeric values are marked as categorical.
- To access mappings from each categorical value to its original value after load, as well as which dimensions are categorical, an instantiated TextOptions opts must be passed to Load(); then, the associated DatasetInfo is accessible via opts.DatasetInfo().
- When saving, reverse mappings from integers to the original unique non-numeric values in opts.DatasetInfo() are applied. To set these mappings, as well as which dimensions are categorical, an instantiated TextOptions opts must be passed to Save() with opts.DatasetInfo() set accordingly.
Categorical data is supported by a number of mlpack algorithms, including
DecisionTree,
HoeffdingTree, and
RandomForest.
DatasetInfo
mlpack represents categorical data via the use of the auxiliary
DatasetInfo object, which stores information about which dimensions are
numeric or categorical and allows conversion from the original category values
to the numeric values used to represent those categories.
For loading and saving categorical data, an instantiated
TextOptions must be passed to Load() or
Save(); this object contains a DatasetInfo object,
accessible via the
.DatasetInfo() method; e.g.,
opts.DatasetInfo().
Accessing and setting properties
This documentation uses info as the name of the DatasetInfo object,
but if a categorical dataset has been loaded with Load(),
it is instead suggested to use opts.DatasetInfo() in place of info.
- info = DatasetInfo(dimensionality)
  - Create a DatasetInfo object with the given dimensionality.
  - All dimensions are assumed to be numeric (not categorical).
- info.Type(d)
  - Get the type (categorical or numeric) of dimension d.
  - Returns a Datatype, either Datatype::numeric or Datatype::categorical.
  - Calling info.Type(d) = t will set a dimension to type t, but this should only be done before info is used with Load() or Save().
- info.NumMappings(d)
  - Get the number of categories in dimension d as a size_t.
  - Returns 0 if dimension d is numeric.
- info.Dimensionality()
  - Return the dimensionality of the object as a size_t.
Map to and from numeric values
info.MapString<double>(value, d)
  - Given value (a std::string), return the double representing the categorical mapping (an integer value) of value in dimension d.
  - If a mapping for value does not exist in dimension d, a new mapping is created, and info.NumMappings(d) is increased by one.
  - If dimension d is numeric and value cannot be parsed as a numeric value, then dimension d is changed to categorical and a new mapping is returned.
info.UnmapString(mappedValue, d)
  - Given mappedValue (a size_t), return the std::string containing the original category that mapped to the value mappedValue in dimension d.
  - If dimension d is not categorical, a std::invalid_argument is thrown.
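The mapping behavior described above (new categories receive the next unused value, in the order they are first seen) can be illustrated with a small standalone sketch. ToyCategoryMap is a hypothetical class written only for illustration; it handles a single dimension, it is not mlpack's DatasetInfo, and it does not reproduce the numeric-to-categorical switch described for MapString().

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the mapping of ONE categorical dimension.
// This is NOT mlpack's DatasetInfo; it only mirrors the documented semantics.
class ToyCategoryMap
{
 public:
  // Return the numeric value for `value`, creating a new mapping if needed.
  // New categories receive the next unused index, in the order first seen.
  double MapString(const std::string& value)
  {
    const auto it = forward.find(value);
    if (it != forward.end())
      return static_cast<double>(it->second);

    const std::size_t index = reverse.size();
    forward.emplace(value, index);
    reverse.push_back(value);
    // Both directions of the mapping must always stay in sync.
    assert(forward.size() == reverse.size());
    return static_cast<double>(index);
  }

  // Return the original category string for a mapped value.
  const std::string& UnmapString(const std::size_t mappedValue) const
  {
    if (mappedValue >= reverse.size())
      throw std::invalid_argument("ToyCategoryMap: unknown mapped value");
    return reverse[mappedValue];
  }

  // Number of distinct categories seen so far.
  std::size_t NumMappings() const { return reverse.size(); }

 private:
  std::unordered_map<std::string, std::size_t> forward;
  std::vector<std::string> reverse;
};
```

With this sketch, mapping "4", then "3", then "4" again yields 0, 1, and 0: the string "4" is treated as a category label, not as the number 4.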
Categorical data load/save examples
Load and manipulate an ARFF file.
// Load a categorical dataset.
arma::mat dataset;
// Define a TextOptions to load categorical data.
mlpack::TextOptions opts;
opts.Fatal() = true;
opts.Categorical() = true;
// See https://datasets.mlpack.org/covertype.train.arff.
mlpack::Load("covertype.train.arff", dataset, opts);
// Print information about the data.
std::cout << "The data in 'covertype.train.arff' has: " << std::endl;
std::cout << " - " << dataset.n_cols << " points." << std::endl;
std::cout << " - " << opts.DatasetInfo().Dimensionality() << " dimensions."
<< std::endl;
arma::Row<size_t> labels;
// A second options object is needed, since the labels have a different data
// type and file extension.
mlpack::TextOptions labelOpts;
labelOpts.Fatal() = true;
// See https://datasets.mlpack.org/covertype.train.labels.csv.
mlpack::Load("covertype.train.labels.csv", labels, labelOpts);
// Print information about each dimension.
for (size_t d = 0; d < opts.DatasetInfo().Dimensionality(); ++d)
{
  if (opts.DatasetInfo().Type(d) == mlpack::Datatype::categorical)
  {
    std::cout << " - Dimension " << d << " is categorical with "
        << opts.DatasetInfo().NumMappings(d) << " categories." << std::endl;
  }
  else
  {
    std::cout << " - Dimension " << d << " is numeric." << std::endl;
  }
}
// Modify the 5th point. Increment any numeric values, and set any categorical
// values to the string "hooray!".
for (size_t d = 0; d < opts.DatasetInfo().Dimensionality(); ++d)
{
  if (opts.DatasetInfo().Type(d) == mlpack::Datatype::categorical)
  {
    // This will create a new mapping if the string "hooray!" does not already
    // exist as a category for dimension d.
    dataset(d, 4) = opts.DatasetInfo().MapString<double>("hooray!", d);
  }
  else
  {
    dataset(d, 4) += 1.0;
  }
}
Manually create a DatasetInfo object and use it to
save a categorical dataset.
// This will manually create the following data matrix (shown as it would appear
// in a CSV):
//
// 1, TRUE, "good", 7.0, 4
// 2, FALSE, "good", 5.6, 3
// 3, FALSE, "bad", 6.1, 4
// 4, TRUE, "bad", 6.1, 1
// 5, TRUE, "unknown", 6.3, 0
// 6, FALSE, "unknown", 5.1, 2
//
// Although the last dimension is numeric, we will take it as a categorical
// dimension.
arma::mat dataset(5, 6); // 6 data points in 5 dimensions.
mlpack::DatasetInfo info(5);
// Set types of dimensions. By default they are numeric so we only set
// categorical dimensions.
info.Type(1) = mlpack::Datatype::categorical;
info.Type(2) = mlpack::Datatype::categorical;
info.Type(4) = mlpack::Datatype::categorical;
// The first dimension is numeric.
dataset(0, 0) = 1;
dataset(0, 1) = 2;
dataset(0, 2) = 3;
dataset(0, 3) = 4;
dataset(0, 4) = 5;
dataset(0, 5) = 6;
// The second dimension is categorical.
dataset(1, 0) = info.MapString<double>("TRUE", 1);
dataset(1, 1) = info.MapString<double>("FALSE", 1);
dataset(1, 2) = info.MapString<double>("FALSE", 1);
dataset(1, 3) = info.MapString<double>("TRUE", 1);
dataset(1, 4) = info.MapString<double>("TRUE", 1);
dataset(1, 5) = info.MapString<double>("FALSE", 1);
// The third dimension is categorical.
dataset(2, 0) = info.MapString<double>("good", 2);
dataset(2, 1) = info.MapString<double>("good", 2);
dataset(2, 2) = info.MapString<double>("bad", 2);
dataset(2, 3) = info.MapString<double>("bad", 2);
dataset(2, 4) = info.MapString<double>("unknown", 2);
dataset(2, 5) = info.MapString<double>("unknown", 2);
// The fourth dimension is numeric.
dataset(3, 0) = 7.0;
dataset(3, 1) = 5.6;
dataset(3, 2) = 6.1;
dataset(3, 3) = 6.1;
dataset(3, 4) = 6.3;
dataset(3, 5) = 5.1;
// The fifth dimension is categorical. Note that `info` will choose to assign
// category values in the order they are seen, even if the category can be
// parsed as a number. So, here, the value '4' will be assigned category '0',
// since it is seen first.
dataset(4, 0) = info.MapString<double>("4", 4);
dataset(4, 1) = info.MapString<double>("3", 4);
dataset(4, 2) = info.MapString<double>("4", 4);
dataset(4, 3) = info.MapString<double>("1", 4);
dataset(4, 4) = info.MapString<double>("0", 4);
dataset(4, 5) = info.MapString<double>("2", 4);
// Print the dataset with mapped categories.
dataset.print("Dataset with mapped categories");
// Print the mappings for the third dimension.
std::cout << "Mappings for dimension 3: " << std::endl;
for (size_t i = 0; i < info.NumMappings(2); ++i)
{
  std::cout << " - \"" << info.UnmapString(i, 2) << "\" maps to " << i << "."
      << std::endl;
}
// Now `dataset` is ready for use with an mlpack algorithm that supports
// categorical data. We will save it to `categorical-data.csv`.
mlpack::TextOptions opts;
opts.Categorical() = true;
opts.DatasetInfo() = std::move(info);
mlpack::Save("categorical-data.csv", dataset, opts);
Image data
mlpack loads, saves, and modifies image data using the STB library. STB is a header-only library that is bundled with mlpack, but it is also possible to use a version of STB available on the system.
When loading images, each image is represented as a flattened single column
vector in a data matrix; each row of the resulting vector will correspond to a
single pixel value (between 0 and 255) in a single channel. If an
ImageOptions was passed to Load(), it
will be populated with the metadata of the image.
Images are flattened along rows, with channel values interleaved, starting from
the top left. Thus, the value of the pixel at position (x, y) in channel c
will be contained in element/row y * (width * channels) + x * (channels) + c
of the flattened vector.
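The row-by-row, channel-interleaved layout can be written as a small index helper; it performs the same computation as the index used in the single-image example further down. FlatPixelIndex is a hypothetical name for illustration and is not an mlpack function.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: position of pixel (x, y), channel c, inside one
// flattened image column, for an image that is `width` pixels wide and has
// `channels` interleaved channels. Image rows are stored one after another.
inline std::size_t FlatPixelIndex(const std::size_t x,
                                  const std::size_t y,
                                  const std::size_t c,
                                  const std::size_t width,
                                  const std::size_t channels)
{
  assert(x < width);
  assert(c < channels);
  // Skip y full rows, then x full pixels within the row, then c channels.
  return y * (width * channels) + x * channels + c;
}
```

For a 10-pixel-wide image with 3 channels, the first channel of the pixel at (x=3, y=4) sits at element 4 * 30 + 3 * 3 + 0 = 129.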
- Supported image loading formats are JPEG, PNG, TGA, BMP, PSD, GIF, PIC, and PNM; see the table of formats for more details.
- Multiple images can be loaded into the columns of a single matrix using the overload of Load() that takes a vector of filenames.
- Supported image saving formats are JPEG, PNG, TGA, and BMP.
- Accessing the metadata of an image after loading can be done with opts.Width(), opts.Height(), and opts.Channels(). See the ImageOptions member documentation for more details.
- mlpack offers several utility functions for image modification and preprocessing, documented in Image preprocessing.
When working with images, the following overload for Save() is also available:
Save(filenames, X, opts)
  - Save each column in X (an arma::mat or other matrix type) as a separate image.
  - filenames is a std::vector<std::string> representing all the images that should be saved.
  - opts is an ImageOptions that contains image metadata.
  - opts.Width(), opts.Height(), opts.Channels(), and opts.Quality() should be set to the desired parameters before calling; see ImageOptions members for more details.
  - The ith column of X will be saved to the ith filename in filenames.
  - If all images are saved successfully, true will be returned.
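Because the ith column is written to the ith filename, the filenames vector needs one entry per column, in the same order as the columns. As a rough sketch, output names could be derived from the input names with a small standard-library helper. MakeOutputName and MakeOutputNames are hypothetical names used for illustration only; they are not part of mlpack.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: drop the extension of `name` (if it has one) and append
// `suffix` followed by `newExt`, e.g. ("a.png", "-inv", ".jpeg") -> "a-inv.jpeg".
inline std::string MakeOutputName(const std::string& name,
                                  const std::string& suffix,
                                  const std::string& newExt)
{
  const std::string::size_type slash = name.find_last_of("/\\");
  const std::string::size_type dot = name.rfind('.');
  // Only treat the dot as an extension separator if it is in the last path
  // component.
  const bool hasExt = (dot != std::string::npos) &&
      (slash == std::string::npos || dot > slash);
  const std::string stem = hasExt ? name.substr(0, dot) : name;
  return stem + suffix + newExt;
}

// Build one output name per input name, preserving order, so that the ith
// output name corresponds to the ith input (and the ith matrix column).
inline std::vector<std::string> MakeOutputNames(
    const std::vector<std::string>& inputs,
    const std::string& suffix,
    const std::string& newExt)
{
  std::vector<std::string> outputs;
  outputs.reserve(inputs.size());
  for (const std::string& name : inputs)
    outputs.push_back(MakeOutputName(name, suffix, newExt));
  assert(outputs.size() == inputs.size());
  return outputs;
}
```

The resulting vector could then be passed as filenames, with the format chosen from the new extension.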
Note: when loading and saving images, if the element type of X is not
unsigned char (e.g. if X is not an arma::Mat<unsigned char>), then when
loading, the data will be temporarily loaded as unsigned chars and then
converted, and when saving, X will be converted to unsigned chars before
saving.
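Since values are converted to unsigned char on save, it can help to bring them into the range [0, 255] first, as the single-image example below does with clamp(). The following is a minimal standalone sketch of that preparation for one value. ToPixelValue is a hypothetical helper; the rounding and clamping shown here are assumptions about what a caller might want, not a description of mlpack's internal conversion.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical caller-side helper: clamp a floating-point value to [0, 255]
// and round it to the nearest integer before it is stored as a pixel value.
// This is an illustration only, not mlpack's internal conversion.
inline unsigned char ToPixelValue(const double v)
{
  const double clamped = std::min(255.0, std::max(0.0, v));
  const long rounded = std::lround(clamped);
  assert(rounded >= 0 && rounded <= 255);
  return static_cast<unsigned char>(rounded);
}
```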
Image data load/save examples
Load a single image, but don't store the metadata (so, e.g., height, width, and number of channels are unavailable after loading!).
// See https://www.mlpack.org/static/img/numfocus-logo.png.
arma::mat image;
mlpack::Load("numfocus-logo.png", image, mlpack::PNG);
// If we wanted image metadata, we would need to pass an ImageOptions. See the
// next example.
//
// We could also specify `Image` instead of `PNG` if we did not care which image
// format was used, but just that *some* image format was used.
std::cout << "The image in 'numfocus-logo.png' has " << image.n_rows
<< " pixels." << std::endl;
Load and save a single image:
// See https://www.mlpack.org/static/img/numfocus-logo.png.
mlpack::ImageOptions opts;
opts.Fatal() = true;
arma::mat matrix;
mlpack::Load("numfocus-logo.png", matrix, opts /* format autodetected */);
// `matrix` should now contain one column.
// Print information about the image.
std::cout << "Information about the image in 'numfocus-logo.png': "
<< std::endl;
std::cout << " - " << opts.Width() << " pixels in width." << std::endl;
std::cout << " - " << opts.Height() << " pixels in height." << std::endl;
std::cout << " - " << opts.Channels() << " color channels." << std::endl;
std::cout << "Value at pixel (x=3, y=4) in the first channel: ";
const size_t index = (4 * opts.Width() * opts.Channels()) +
(3 * opts.Channels());
std::cout << matrix[index] << "." << std::endl;
// Increment each pixel value, but make sure they are still within the bounds.
matrix += 1;
matrix.clamp(0, 255);
mlpack::Save("numfocus-logo-mod.png", matrix, opts);
Load and save multiple images:
// Load some favicons from websites associated with mlpack.
std::vector<std::string> images;
// See the following files:
// - https://datasets.mlpack.org/images/mlpack-favicon.png
// - https://datasets.mlpack.org/images/ensmallen-favicon.png
// - https://datasets.mlpack.org/images/armadillo-favicon.png
// - https://datasets.mlpack.org/images/bandicoot-favicon.png
images.push_back("mlpack-favicon.png");
images.push_back("ensmallen-favicon.png");
images.push_back("armadillo-favicon.png");
images.push_back("bandicoot-favicon.png");
mlpack::ImageOptions opts;
opts.Channels() = 1; // Force loading in grayscale.
opts.Fatal() = true;
arma::mat matrix;
mlpack::Load(images, matrix, opts);
// Print information about what we loaded.
std::cout << "Loaded " << matrix.n_cols << " images. Images are of size "
<< opts.Width() << " x " << opts.Height() << " with " << opts.Channels()
<< " color channel." << std::endl;
// Invert images.
matrix = (255.0 - matrix);
// Save as compressed JPEGs with low quality.
opts.Quality() = 75;
std::vector<std::string> outImages;
outImages.push_back("mlpack-favicon-inv.jpeg");
outImages.push_back("ensmallen-favicon-inv.jpeg");
outImages.push_back("armadillo-favicon-inv.jpeg");
outImages.push_back("bandicoot-favicon-inv.jpeg");
mlpack::Save(outImages, matrix, opts);
mlpack models and objects
Machine learning models and any mlpack object (i.e. anything in the mlpack::
namespace) can be saved with Save() and loaded with
Load(). Serialization is performed using the
cereal serialization toolkit.
- When calling Load() and Save(), X should be the desired mlpack model or object type.
- When loading and saving with an instantiated DataOptions object, the ModelOptions subtype can be used.
- Supported formats are binary, JSON, and XML; see the table of format options.
Note: when loading an object that was saved in the binary format
(BIN), the C++ type of the
object must be exactly the same (including template parameters) as the
type used to save the object. If not, undefined behavior will occur, most
likely a crash.
mlpack models and objects load/save examples
Simple example: create a math::Range object, then save and load it.
mlpack::math::Range r(3.0, 6.0);
// DataOptions can also be used when loading and saving objects.
mlpack::DataOptions opts;
opts.Fatal() = true;
opts.Format() = mlpack::FileType::BIN;
// Save the Range to 'range.bin', using the name "range".
mlpack::Save("range.bin", r, opts);
// Load the range into a new object.
mlpack::math::Range r2;
mlpack::Load("range.bin", r2, mlpack::BIN + mlpack::Fatal);
std::cout << "Loaded range: [" << r2.Lo() << ", " << r2.Hi() << "]."
<< std::endl;
// Modify and save the range as JSON.
r2.Lo() = 4.0;
mlpack::Save("range.json", r2, mlpack::JSON + mlpack::Fatal);
// Now 'range.json' will contain the following:
//
// {
// "range": {
// "cereal_class_version": 0,
// "hi": 6.0,
// "lo": 4.0
// }
// }
Train a LinearRegression model and save it to
disk, then reload it.
// See https://datasets.mlpack.org/admission_predict.csv.
arma::mat data;
mlpack::Load("admission_predict.csv", data, mlpack::Fatal);
// See https://datasets.mlpack.org/admission_predict.responses.csv.
arma::rowvec responses;
mlpack::Load("admission_predict.responses.csv", responses, mlpack::Fatal);
// Train a linear regression model, fitting an intercept term and using an L2
// regularization parameter of 0.3.
mlpack::LinearRegression lr(data, responses, 0.3, true);
// Save the model using the binary format as a standalone parameter, throwing an
// exception on failure.
mlpack::Save("lr-model.bin", lr, mlpack::Fatal + mlpack::BIN);
std::cout << "Saved model to lr-model.bin." << std::endl;
// Now load the model back, using format autodetection on the filename
// extension.
mlpack::LinearRegression loadedModel;
if (!mlpack::Load("lr-model.bin", loadedModel))
{
  std::cout << "Model not loaded successfully from 'lr-model.bin'!"
      << std::endl;
}
else
{
  std::cout << "Model loaded successfully from 'lr-model.bin' with "
      << "intercept value of " << loadedModel.Parameters()[0] << "."
      << std::endl;
}