mlpack command-line quickstart guide

This page describes how to get started quickly with mlpack from the command line, gives a few usage examples, and points to more detailed documentation.

This quickstart guide is also available for C++, Python, R, Julia, and Go.

Installing mlpack

Installing mlpack is straightforward and can be done with your system’s package manager. For instance, on Ubuntu or Debian the command is:

sudo apt-get install mlpack-bin

On Fedora or Red Hat:

sudo dnf install mlpack

If you use a different distribution, mlpack may be packaged under a different name. If it is not packaged at all, you can use a Docker image from Docker Hub:

docker run -it mlpack/mlpack /bin/bash

This Docker image has mlpack’s command-line bindings already built and installed.

If you prefer to build mlpack from scratch, see the main README.
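After installing, it can be worth confirming that the command-line programs are actually on your PATH. The snippet below is a minimal sketch: `missing_tools` is a hypothetical helper written for this guide (it is not part of mlpack), and the three program names are the ones used in the examples on this page.

```shell
# Print the name of every listed program that is not found on the PATH.
# (missing_tools is a hypothetical helper, not something mlpack provides.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# Check the three programs used in this guide.
missing=$(missing_tools mlpack_preprocess_split mlpack_random_forest mlpack_cf)
if [ -z "$missing" ]; then
    echo "All mlpack programs used in this guide were found."
else
    echo "Not found on PATH:"
    echo "$missing"
fi
```

If anything is reported as missing, the package for your distribution may install the programs under a different name or location.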

Simple quickstart example

As a simple example of how to use mlpack from the command line, let’s do some classification on a subset of the standard covertype machine learning dataset. We’ll first split the dataset into a training set and a test set, then train an mlpack random forest on the training data, and finally print the accuracy of the random forest on the test set.

You can copy-paste this code directly into your shell to run it.

# Unpack the dataset (the compressed file must already be in the current directory).
gunzip covertype-small.labels.csv.gz

# Split the dataset; 70% into a training set and 30% into a test set.
# Each option also has a single-character shorthand, but here we spell them all
# out for clarity.
mlpack_preprocess_split                                       \
    --input_file                     \
    --input_labels_file covertype-small.labels.csv            \
    --training_file covertype-small.train.csv                 \
    --training_labels_file covertype-small.train.labels.csv   \
    --test_file covertype-small.test.csv                      \
    --test_labels_file covertype-small.test.labels.csv        \
    --test_ratio 0.3

# Train a random forest.
mlpack_random_forest                                  \
    --training_file covertype-small.train.csv         \
    --labels_file covertype-small.train.labels.csv    \
    --num_trees 10                                    \
    --minimum_leaf_size 3                             \
    --print_training_accuracy                         \
    --output_model_file rf-model.bin                  \
    --verbose

# Now predict the labels of the test points and print the accuracy.
# Also, save the test set predictions to the file 'predictions.csv'.
mlpack_random_forest                                    \
    --input_model_file rf-model.bin                     \
    --test_file covertype-small.test.csv                \
    --test_labels_file covertype-small.test.labels.csv  \
    --predictions_file predictions.csv                  \
    --verbose

We can see by looking at the output that we achieve reasonably good accuracy on the test dataset (80%+). The file predictions.csv could also be used by other tools; for instance, we can easily calculate the number of points that were predicted incorrectly:

$ diff -U 0 predictions.csv covertype-small.test.labels.csv | grep '^@@' | wc -l
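One caveat about the `diff` pipeline: it counts hunks, and `diff` merges adjacent differing lines into a single hunk, so the result can be lower than the true number of mispredicted points. A per-line comparison avoids this. The following is a minimal sketch, assuming both files hold one label per line in the same order; `count_mismatches` and `accuracy` are hypothetical helper names, not mlpack commands.

```shell
# Count lines where the prediction differs from the true label.
# (Hypothetical helper; assumes one label per line, same order in both files.)
# awk compares numeric-looking fields numerically, so "1" and "1.0" match.
count_mismatches() {
    paste -d',' "$1" "$2" | awk -F',' '$1 != $2 { n++ } END { print n + 0 }'
}

# Print the fraction of lines where prediction and label agree.
accuracy() {
    paste -d',' "$1" "$2" | \
        awk -F',' '$1 == $2 { ok++ } END { if (NR > 0) printf "%.4f\n", ok / NR }'
}

# Only run on the real files if the earlier steps have produced them.
if [ -f predictions.csv ] && [ -f covertype-small.test.labels.csv ]; then
    count_mismatches predictions.csv covertype-small.test.labels.csv
    accuracy predictions.csv covertype-small.test.labels.csv
fi
```

The second number should roughly agree with the test accuracy that mlpack_random_forest reports.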

It’s easy to modify the code above to do more complex things, or to use different mlpack learners, or to interface with other machine learning toolkits.

Using mlpack for movie recommendations

In this example, we’ll train a collaborative filtering model using mlpack’s mlpack_cf program. We’ll train this on the MovieLens dataset, and then we’ll use the model that we train to give recommendations.

You can copy-paste this code directly into the command line to run it.

gunzip ratings-only.csv.gz
gunzip movies.csv.gz

# Hold out 10% of the dataset into a test set so we can evaluate performance.
mlpack_preprocess_split                 \
    --input_file ratings-only.csv       \
    --training_file ratings-train.csv   \
    --test_file ratings-test.csv        \
    --test_ratio 0.1

# Train the model.  Change the rank to increase/decrease the complexity of the
# model.
mlpack_cf                             \
    --training_file ratings-train.csv \
    --test_file ratings-test.csv      \
    --rank 10                         \
    --algorithm RegSVD                \
    --output_model_file cf-model.bin  \
    --verbose

# Now query the top 10 movies for user 1.
echo "1" > query.csv;
mlpack_cf                             \
    --input_model_file cf-model.bin   \
    --query_file query.csv            \
    --recommendations 10              \
    --output_file recommendations.csv

# Get the names of the movies for user 1.
echo "Recommendations for user 1:"
for i in $(seq 1 10); do
    # Take the i-th recommended item, then print the matching title.
    item=$(awk -F',' -v col="$i" '{ print $col }' recommendations.csv);
    head -n $(($item + 2)) movies.csv | tail -1 | \
        sed 's/^[^,]*,[^,]*,//' | \
        sed 's/\(.*\),.*$/\1/' | sed 's/"//g';
done

Here is some example output, showing that user 1 seems to have good taste in movies:

Recommendations for user 1:
Casablanca (1942)
Pan's Labyrinth (Laberinto del fauno, El) (2006)
Godfather, The (1972)
Answer This! (2010)
Life Is Beautiful (La Vita è bella) (1997)
Adventures of Tintin, The (2011)
Dark Knight, The (2008)
Out for Justice (1991)
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)
Schindler's List (1993)
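The `sed` pipeline in the loop above can be tried on a single row in isolation, which makes each step easier to follow. This is only a sketch: `extract_title` is a hypothetical helper wrapping the same three `sed` commands, and the sample row is made up for illustration. It assumes, as the pipeline implies, that each row of movies.csv has two leading ID columns, then the title (quoted when it contains a comma), then one final column.

```shell
# Extract the title from one movies.csv-style row read from stdin.
# (extract_title is a hypothetical helper, not part of mlpack.)
extract_title() {
    # Step 1 drops the two leading ID columns, step 2 drops the last
    # column, and step 3 strips any double quotes around the title.
    sed 's/^[^,]*,[^,]*,//' | sed 's/\(.*\),.*$/\1/' | sed 's/"//g'
}

# Made-up sample row; the IDs and the last column are illustrative only.
echo '0,1,"Godfather, The (1972)",Crime|Drama' | extract_title
# prints: Godfather, The (1972)
```

The `+ 2` in the loop's `head -n` call presumably accounts for a header line in movies.csv together with zero-based item indices, so item `k` sits on line `k + 2`.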

Next steps with mlpack

Now that you have done some simple work with mlpack, you have seen how it can plug into a command-line data science workflow. These two examples have shown only a small part of mlpack's functionality; many other commands are available. A full list of commands, with documentation for each, can be found in the mlpack command-line program documentation.

mlpack is also much more flexible from C++ and exposes more functionality there, so more complicated tasks are possible if you are willing to write C++. The C++ quickstart is a good place to begin.