Automatic Benchmark - Google Summer of Code 2013

Student: Marcus Edel

E-Mail: marcus.edel@…

Project Overview: This page contains notes on the automatic benchmark system developed for GSoC 2013. The project consists of support scripts that run mlpack methods on a variety of datasets and produce runtime numbers. The benchmarking scripts also run the same machine learning methods from other machine learning libraries and then produce runtime graphs.

The Config File

The benchmark script uses the config file to identify the methods that are available to run. The benchmark script is modular: each library needs a few lines in the config file. These lines specify:

  • Where the particular script/method is.
  • The datasets to benchmark the method.
  • Supported formats.

The benchmark script runs each script on the basis of these lines.

I've picked YAML as the configuration file format for the project because it has a clean syntax and was designed from the start to be a data serialization language that is both powerful and human readable.

PyYAML

PyYAML is a YAML parser and emitter for Python. The core of the module is written in pure Python, but as of version 3.0.4, it also supports binding to the high-speed LibYAML implementation written in C. YAML is widely used in all sorts of places, such as the configuration settings for Google's App Engine.

Download and Installation

PyYAML requires Python 2.5 or higher.

wget http://pyyaml.org/download/pyyaml/PyYAML-3.10.tar.gz
tar xfvz PyYAML-3.10.tar.gz
cd PyYAML-3.10
python setup.py install

or

sudo pip install pyyaml

If you want to install PyYAML system-wide on Linux, you can also use a package manager.

sudo apt-get install python-yaml

Example

# General datasets.
Datasets: &pca_datasets
    - 'wine.csv'
    - 'iris.csv'

# MLPACK:
# A Scalable C++  Machine Learning Library
library: mlpack
methods:
    PCA:
        script: methods/mlpack/pca.py
        format: [csv, txt]
        run: true
        plot: ['cities.csv']
        datasets:
            - files: [*pca_datasets, 'cities.csv']
              options: '-d 2'

            - files: ['cities.csv']
              options: '-d 6'
    NMF:
        script: methods/mlpack/nmf.py
        format: [csv, txt]
        datasets:
            - files: ['piano.csv']
              options: '-r 6 -s 42 -u multdist'

This sample document defines an associative array with three top-level keys: Datasets, library, and methods. The methods entry has two block mappings, PCA and NMF. Each of them contains a datasets list, and each element of that list is itself an associative array with its own keys. To avoid repetition in the config file, a value can be marked with an anchor (the & operator) and reused through an alias (the * operator), as shown with pca_datasets. Notice that strings do not have to be enclosed in quotation marks.

A nice feature in YAML is the concept of documents. A document is not just a separate file in this case. You can have multiple documents in a single stream of YAML, if each one is separated by ---, like this:

# MLPACK:
# A Scalable C++  Machine Learning Library
library: mlpack
methods:
    PCA:
        ...
---
# Weka:
# Data Mining Software in Java
library: weka
methods:
    PCA:
        ...
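
As a rough illustration of how such a stream could be read, the sketch below uses PyYAML's safe_load_all() to iterate over the documents. The inline configuration text is a shortened stand-in for a real config.yaml. The anchor name pca_files, the weka script path, and the helper load_libraries() are hypothetical and only serve the example.

```python
import yaml  # PyYAML

# Shortened stand-in for a config.yaml with two documents separated by '---'.
# The weka script path and the anchor name are hypothetical.
CONFIG_TEXT = """
library: mlpack
methods:
    PCA:
        script: methods/mlpack/pca.py
        format: [csv, txt]
        datasets:
            - files: &pca_files ['wine.csv', 'iris.csv']
              options: '-d 2'
            - files: *pca_files
              options: '-d 6'
---
library: weka
methods:
    PCA:
        script: path/to/a/weka/pca/script
        format: [arff]
        datasets:
            - files: ['wine.arff']
"""


def load_libraries(text):
    """Return one dictionary per YAML document in the stream."""
    return [doc for doc in yaml.safe_load_all(text) if doc is not None]


libraries = load_libraries(CONFIG_TEXT)
for lib in libraries:
    print(lib["library"], sorted(lib["methods"]))
```

The alias *pca_files expands to the same list that the anchor &pca_files marks, so both dataset blocks of the first document refer to the same two files.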

Directives

library

  • Description: A name that identifies the library. The name is also used in the output, so avoid names longer than 23 characters.
  • Syntax: library: name
  • Required: Yes

script

  • Description: Path to the script for the method to be tested. You can use a path relative to the benchmark root folder, an absolute path, or a symlink.
  • Syntax: script: name
  • Required: Yes

files

  • Description: An array of datasets for this method. You can use a path relative to the benchmark root folder, an absolute path, or a symlink. If a method requires more than one data set, put those data sets in a nested list.
  • Syntax: files: [...] or [ [...] ]
  • Required: Yes

run

  • Description: A flag that indicates whether the benchmark will be executed.
  • Syntax: run: True | False
  • Default: True
  • Required: No

iterations

  • Description: The number of executions for this method. A value higher than one is recommended in order to obtain meaningful results.
  • Syntax: iterations: number
  • Default: 3
  • Required: No

formats

  • Description: An array of file formats supported by this method. If a data set isn't available in one of these formats, the benchmark script tries to convert the data set.
  • Syntax: formats: [...]
  • Required: No

options

  • Description: A string with options for this method. The string is passed to the script when it is started.
  • Syntax: options: String
  • Default: None
  • Required: No
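
The directives and their defaults can be pictured with the following minimal sketch. It works on a plain dictionary, as a YAML parser would return it, fills in the documented defaults, and rejects entries that lack a required directive. The function name normalize_method() and the shape of the returned dictionaries are illustrative assumptions, not the benchmark script's real interface.

```python
# Defaults taken from the directive descriptions above.
DEFAULTS = {"run": True, "iterations": 3, "options": None}


def normalize_method(library, name, method):
    """Return one settings dictionary per dataset block of a method.

    Hypothetical helper: the real benchmark script may organize this
    differently.
    """
    if not library:
        raise ValueError("missing required directive: library")
    if "script" not in method:
        raise ValueError("%s: missing required directive: script" % name)

    entries = []
    for block in method.get("datasets", []):
        if "files" not in block:
            raise ValueError("%s: missing required directive: files" % name)
        entries.append({
            "library": library,
            "method": name,
            "script": method["script"],
            "files": block["files"],
            "run": method.get("run", DEFAULTS["run"]),
            "iterations": method.get("iterations", DEFAULTS["iterations"]),
            "options": block.get("options", DEFAULTS["options"]),
        })
    return entries


# A method entry that only sets the required directives.
pca = {
    "script": "methods/mlpack/pca.py",
    "format": ["csv", "txt"],
    "datasets": [{"files": ["isolet.csv"]}],
}
settings = normalize_method("mlpack", "PCA", pca)
print(settings[0]["iterations"], settings[0]["run"], settings[0]["options"])
```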

Minimal Configuration

The configuration described here is the smallest possible configuration. The configuration combines all required options to benchmark a method.

# MLPACK:
# A Scalable C++  Machine Learning Library
library: mlpack
methods:
    PCA:
        script: methods/mlpack/pca.py
        format: [csv, txt, hdf5, bin]
        datasets:
            - files: ['isolet.csv']

In this case we benchmark the pca method located in methods/mlpack/pca.py on the isolet dataset. The pca method supports the formats txt, csv, hdf5 and bin. The benchmark script uses the default values for all directives that are not specified.

Full Configuration

Combining all the elements discussed above results in the following configuration, which is typically saved as config.yaml.

# MLPACK:
# A Scalable C++  Machine Learning Library
library: mlpack
methods:
    PCA:
        script: methods/mlpack/pca.py
        format: [csv, txt, hdf5, bin]
        run: true
        iterations: 2
        datasets:
            - files: ['isolet.csv', 'cities']
              options: '-s'

In this case we benchmark the pca method located in methods/mlpack/pca.py with the isolet and the cities datasets. The -s option makes the pca method scale the data before performing PCA. The benchmark is run twice for each dataset. Additionally, the pca.py script supports the file formats txt, csv, hdf5 and bin. If a data set isn't available in one of these formats, the benchmark script tries to convert it.
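
To make the role of the iterations directive concrete, here is a small sketch that calls a timing function several times and summarizes the results as a mean and a variance. How the real benchmark script aggregates repeated runs is not described on this page, so the use of the mean and the population variance is an assumption. The names benchmark() and dummy_method() are hypothetical.

```python
import statistics
import time


def benchmark(run_once, iterations=3):
    """Call run_once() `iterations` times and summarize the returned times."""
    times = [run_once() for _ in range(iterations)]
    # Assumption: report the mean and the population variance of the runs.
    return statistics.mean(times), statistics.pvariance(times)


def dummy_method():
    """Stand-in for a RunMethod() call; returns the elapsed seconds."""
    start = time.perf_counter()
    sum(i * i for i in range(10000))
    return time.perf_counter() - start


mean_time, variance = benchmark(dummy_method, iterations=2)
print("mean: %.6fs, variance: %.6g" % (mean_time, variance))
```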

Test Config

To test the configuration file, use the following command.

make test config.yaml

The command checks the configuration for correct syntax and then tries to open the files referred to in the configuration.

Packages

Required Packages

  • Python 3.2+
  • Python-yaml (Complete YAML 1.1 parser and emitter for Python.)

The main benchmark script is written in Python and uses YAML as the configuration file format.

Optional Packages

  • Valgrind (Suite of tools for debugging and profiling.)

The benchmark script uses Massif, a heap profiler from the Valgrind suite, to measure how much heap memory an mlpack method uses. By default, the benchmark script doesn't use Massif, so installing Valgrind is only necessary for memory profiling.

  • matplotlib (2D plotting library for python.)

The benchmark script uses the matplotlib Python library to create the plots. By default, the benchmark script doesn't create any plots, so installing matplotlib is only necessary for plotting.

Libraries

The benchmark package already comes with predefined scripts to benchmark the different machine learning libraries:

  •  MLPACK (ALLKFN, ALLKNN, ALLKRANN, DET, FASTMKS, GMM, HMM Generate, HMM Loglik, HMM Train, HMM Viterbi, ICA, KPCA, K-Means, LARS, Linear Regression, Local Coordinate Coding, LSH, NBC, NCA, NMF, PCA, Range Search, Sparse Coding)
  •  WEKA (ALLKNN, K-Means, Linear Regression, NBC, PCA)
  •  MATLAB (ALLKNN, HMM Generate, HMM Viterbi, K-Means, Linear Regression, NBC, NMF, PCA, Range Search)
  •  Shogun (ALLKNN, GMM, KPCA, K-Means, LARS, Linear Regression, NBC, PCA)
  •  Scikit (ALLKNN, GMM, ICA, KPCA, K-Means, LARS, Linear Regression, NBC, NMF, PCA, Sparse Coding)
  •  MLPy (ALLKNN, KPCA, K-Means, LARS, Linear Regression, PCA)

To run one of the predefined benchmark scripts, you need to install the corresponding library yourself.

The Script

The script specifies how to benchmark the specified method. It has to provide a Python class with two methods: a constructor and a RunMethod() function.

An example based on the MLPACK principal component analysis (PCA) method:

class PCA:
  def __init__(self, dataset):
    # Code here.

  def RunMethod(self, options):
    # Code here.
    # return time

In this case we define a class with the name PCA. The name of the class is important because the class name must be listed in the configuration file to benchmark the script.

The first method __init__() is special and is often called the constructor. This function is automatically invoked for the newly-created class instance and in our case this function is invoked at the beginning of the benchmark. One parameter is handed over when the main benchmark script invokes this function. The dataset parameter can contain the path to a data set, or a list of data sets. The constructor should be used to initialize values or to load data sets, for things you only have to do once.

In the second method, RunMethod(), the benchmark should be performed. One parameter is handed over when the main benchmark script invokes this function. The options parameter can contain additional parameters which are important for the method, e.g.: the desired dimensionality of the output data set. At the end of this method the benchmark time should be returned.

Note that the user can also write code in a language other than Python. To achieve this, RunMethod() can call a function in a different language or invoke, for example, a bash script and return the result. To call a bash script you can use the following code sample:

import shlex
import subprocess

cmd = shlex.split("ls -l")
s = subprocess.check_output(cmd, shell=False)
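
Because the class name links the configuration file to the script, the main benchmark script has to locate the class by name inside the script file. One way this could work is sketched below with importlib. The helper load_method_class() and the temporary stand-in script are hypothetical; the actual loading mechanism of the benchmark script is not shown on this page.

```python
import importlib.util
import os
import tempfile


def load_method_class(script_path, class_name):
    """Import the file at script_path and return the class named class_name."""
    spec = importlib.util.spec_from_file_location("benchmark_method", script_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, class_name)


# A tiny stand-in script, written to a temporary file for the demonstration.
SCRIPT_SOURCE = '''
import time

class PCA:
  def __init__(self, dataset):
    self.dataset = dataset

  def RunMethod(self, options):
    start = time.time()
    # The real work would happen here.
    return time.time() - start
'''

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "pca.py")
    with open(path, "w") as handle:
        handle.write(SCRIPT_SOURCE)

    # "PCA" plays the role of the method name from the configuration file.
    method_class = load_method_class(path, "PCA")
    instance = method_class("cities.csv")
    elapsed = instance.RunMethod("-d 2")

print("elapsed seconds:", elapsed)
```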

Usage

To run the benchmarks, follow these steps:

  1. Check out the current sources from subversion. This may take some time because some data sets in the datasets folder are larger than 100 MB.
    $ svn co http://svn.cc.gatech.edu/fastlab/mlpack/conf/jenkins-conf/benchmark/
    
  2. Edit the configuration file config.yaml and set the run variable from False to True for the desired method.
  3. Set the correct path for the environment variables, depending on which library you would like to benchmark. There are two possibilities: edit the Makefile and set the correct path there, or pass the correct path when starting the benchmark.

Edit the Makefile:

export MLPACK_BIN=/path/to/the/mlpack/bin/

Pass the correct path at the benchmark start:

$ make MLPACK_BIN=/path/to/the/mlpack/bin/ run
  4. This step is optional. If you want to benchmark one of the predefined Weka scripts or the Shogun K-Means script, build the source files with this command:
    $ make scripts
    

To benchmark a method, run make with one of the following targets from the root folder:

$ make run                   # Perform the benchmark with the given config. Default config.yaml.
$ make memory                # Get memory profiling information with the given config. Default config.yaml.

It is also possible to benchmark only specified libraries. You can specify the libraries with the BLOCK parameter. The following command benchmarks only the mlpack and the weka libraries.

$ make BLOCK=mlpack,weka run
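
The effect of the BLOCK parameter can be pictured as a simple filter over the documents of the configuration file. The following sketch assumes that each document has been loaded into a dictionary with a library key. The function select_blocks() is a hypothetical name, not part of the benchmark script.

```python
def select_blocks(libraries, block=None):
    """Keep only the documents whose library name is listed in block.

    block is a comma-separated string such as "mlpack,weka"; an empty or
    missing value selects every library.
    """
    if not block:
        return list(libraries)
    wanted = {name.strip() for name in block.split(",") if name.strip()}
    return [lib for lib in libraries if lib["library"] in wanted]


libraries = [{"library": "mlpack"}, {"library": "weka"}, {"library": "shogun"}]
selected = select_blocks(libraries, "mlpack,weka")
print([lib["library"] for lib in selected])
```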

Notes: If necessary you have to set the PYTHONPATH and LD_LIBRARY_PATH to start the benchmark script.

Benchmark

MLPACK

To benchmark the mlpack methods, the scripts use the mlpack executables. The MLPACK methods already have a built-in timer, so there is no need to provide a new timing function. To get the runtime information, the script starts the executables with the right arguments and parses the timing data.

Example:

Here we run the PCA method with the cities data set and the verbose option displays the necessary information at the end of execution.

$ pca -i cities.csv -o output.csv -v

Output:

[INFO ] Loading 'cities.csv' as CSV data.  Size is 9 x 329.
[INFO ] Performing PCA on dataset...
[INFO ] Saving CSV data to 'output.csv'.
[INFO ] 
[INFO ] Execution parameters:
[INFO ]   help: false
[INFO ]   info: ""
[INFO ]   input_file: cities.csv
[INFO ]   new_dimensionality: 0
[INFO ]   output_file: output.csv
[INFO ]   scale: false
[INFO ]   verbose: true
[INFO ] 
[INFO ] Program timers:
[INFO ]   loading_data: 0.013257s
[INFO ]   saving_data: 0.004750s
[INFO ]   total_time: 0.343844s

To get the runtime information we parse the three program timers and calculate the elapsed time. This has the advantage that we get the real execution time.
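
Parsing the timers could look roughly like the sketch below. It extracts the name and value of each timer line with a regular expression. The final step, subtracting the loading and saving timers from total_time, is an assumption about how the elapsed time is calculated; the page does not spell out the formula. The names parse_timers() and method_time() are hypothetical.

```python
import re

# The timer lines from the sample output above.
SAMPLE_OUTPUT = """\
[INFO ] Program timers:
[INFO ]   loading_data: 0.013257s
[INFO ]   saving_data: 0.004750s
[INFO ]   total_time: 0.343844s
"""

# Matches lines such as "[INFO ]   total_time: 0.343844s".
TIMER_PATTERN = re.compile(
    r"^\[INFO \][ \t]+(\w+): ([0-9]*\.?[0-9]+)s[ \t]*$", re.MULTILINE)


def parse_timers(output):
    """Return a dictionary that maps timer names to seconds."""
    return {name: float(value) for name, value in TIMER_PATTERN.findall(output)}


def method_time(timers):
    """Assumed formula: total time without loading and saving the data."""
    return timers["total_time"] - timers["loading_data"] - timers["saving_data"]


timers = parse_timers(SAMPLE_OUTPUT)
print("method time: %.6fs" % method_time(timers))
```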


WEKA

The scripts for the WEKA library are written in Java and use the built-in Java timer to measure the time. To get the runtime information, the script starts the executables with the right arguments and parses the timing data.

Example:

Here we run the PCA method with the cities data set. The necessary information is shown at the end of the execution.

$ java -classpath ".:/path/to/weka/weka.jar" PCA -i cities.arff

Output:

[INFO ]   total_time: 0.83215s

Notes:

  • The WEKA methods only support files with a header. The benchmark script can convert csv and txt files without a header into the arff format with header information.
  • You can use the provided timer class, to measure the elapsed execution time.
  • You have to build the Java source code to benchmark the code. You can use the command make scripts from the benchmark root folder to build all Java files in the methods/weka/src folder. Afterwards, the byte code is located in the methods/weka/ folder.
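
The conversion of a headerless csv file into the arff format, mentioned in the first note, could be sketched as follows. The sketch assumes that every column is numeric and invents the attribute names attr0, attr1, and so on. The function csv_to_arff() is hypothetical, and the converter in the benchmark script may behave differently.

```python
import csv
import io


def csv_to_arff(csv_text, relation="dataset"):
    """Convert headerless, purely numeric CSV text into ARFF text."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        raise ValueError("no data rows found")

    width = len(rows[0])
    lines = ["@relation %s" % relation, ""]
    # Assumption: every column is numeric; the attribute names are generated.
    lines += ["@attribute attr%d numeric" % i for i in range(width)]
    lines += ["", "@data"]
    lines += [",".join(value.strip() for value in row) for row in rows]
    return "\n".join(lines) + "\n"


arff = csv_to_arff("1.0,2.0\n3.0,4.0\n", relation="cities")
print(arff)
```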

MATLAB

The scripts for the MATLAB library are written in MATLAB and use the built-in MATLAB timer to measure the time. To get the runtime information, the script starts the executables with the right arguments and parses the timing data.

Example:

Here we run the PCA method with the cities data set. The necessary information is shown at the end of the execution.

$ matlab -nodisplay -nosplash -r "try, PCA('-i cities.csv'), catch, exit(1), end, exit(0)"

Output:

[INFO ]   total_time: 1.21523s

Notes:

  • The script must be on your MATLAB path. By default the benchmark script adds the methods/matlab/ folder to the MATLAB path.
  • The predefined MATLAB scripts only support files in the csv and txt format.

Shogun

The scripts for the Shogun library are written in Python and C++. The scripts written in Python use the built-in Python timer to measure the time and the Shogun Python interface to invoke the functions. The script written in C++ defines its own timer code to measure the time.

Example:

Here we run the K-Means method with the iris data set and set initial centroids. The necessary information is shown at the end of the execution.

$ ./KMEANS -i iris.csv -I centroids_iris.csv

Output:

[INFO ]   total_time: 0.012535s

Notes:

  • You have to build the K-Means source code to benchmark the code. You can use the command make scripts from the benchmark root folder to build the K-Means method.

Scikit

The scripts for the Scikit library are written in python and use the built-in python timer to measure the time and the Scikit python interface to invoke the functions. To measure the elapsed execution time we don’t have to parse the runtime information.


MLPy

The scripts for the MLPy library are written in python and use the built-in python timer to measure the time and the MLPy python interface to invoke the functions.

Benchmark - Start Parameters and Options

Parameters

  • CONFIG [string] - The path to the configuration file to perform the benchmark on. Default 'config.yaml'.
  • BLOCK [string] - Run only the specified blocks defined in the configuration file. Default run all blocks.
  • LOG [boolean] - If set, the reports will be saved in the database. Default 'False'.
  • UPDATE [boolean] - If set, the latest reports in the database are updated. Default 'False'.
  • METHODBLOCK [string] - Run only the specified methods defined in the configuration file. Default run all methods.

Options

  • test [parameters] - Test the configuration file. Check for correct syntax and then try to open files referred in the configuration file.
  • run [parameters] - Perform the benchmark with the given config.
  • memory [parameters] - Get memory profiling information with the given config.
  • scripts - Compile the java files for the weka methods.
  • reports [parameters] - Create the reports.

Database

To save the results we use SQLite through Python's built-in sqlite3 module. SQLite is a C library that provides a disk-based database that doesn't require a separate server. Individual method scripts do not need any extra function to store their results: the main benchmark script collects all data and stores it in the database.

Schema

http://trac.research.cc.gatech.edu/fastlab/raw-attachment/wiki/AutomaticBenchmark/GSOC_Benchmark_Database.jpg

BUILDS

  • id (INTEGER PRIMARY KEY AUTOINCREMENT): Sequential number that identifies the build. Other tables reference this number.
  • build (TIMESTAMP NOT NULL): Timestamp that identifies the build. It is mainly used to sort the builds and to determine when a build was made.
  • libary_id (INTEGER NOT NULL, FOREIGN KEY(libary_id) REFERENCES libraries(id)): Each build belongs to a single library. This number references the associated library.

LIBRARIES

  • id (INTEGER PRIMARY KEY AUTOINCREMENT): Sequential number that identifies the library. Other tables reference this number.
  • name (TEXT NOT NULL): Name of the library. The name is taken from the config file.

DATASETS

  • id (INTEGER PRIMARY KEY AUTOINCREMENT): Sequential number that identifies the data set. Other tables reference this number.
  • name (TEXT NOT NULL UNIQUE): Name of the data set. The name is taken from the config file.
  • size (INTEGER NOT NULL): Size of the data set, specified in megabytes.
  • attributes (INTEGER NOT NULL): Number of attributes of the data set.
  • instances (INTEGER NOT NULL): Number of instances of the data set.
  • type (TEXT NOT NULL): Type of the data set, e.g. "Real".

METHODS

  • id (INTEGER PRIMARY KEY AUTOINCREMENT): Sequential number that identifies the method. Other tables reference this number.
  • name (TEXT NOT NULL): Name of the method. The name is taken from the config file.
  • parameters (TEXT NOT NULL): Parameters/options specified for the method. They are taken from the config file.

RESULTS

  • id (INTEGER PRIMARY KEY AUTOINCREMENT): Sequential number that identifies the result.
  • build_id (INTEGER NOT NULL, FOREIGN KEY(build_id) REFERENCES builds(id)): Reference to the id in the builds table.
  • libary_id (INTEGER NOT NULL, FOREIGN KEY(libary_id) REFERENCES libraries(id)): Reference to the id in the libraries table.
  • time (REAL NOT NULL): Measured time of the specified method.
  • var (REAL NOT NULL): Measured variance of the specified method.
  • dataset_id (INTEGER NOT NULL, FOREIGN KEY(dataset_id) REFERENCES datasets(id)): Reference to the id in the datasets table.
  • method_id (INTEGER NOT NULL, FOREIGN KEY(method_id) REFERENCES methods(id)): Reference to the id in the methods table.

MEMORY

  • id (INTEGER PRIMARY KEY AUTOINCREMENT): Sequential number that identifies the memory result.
  • build_id (INTEGER NOT NULL, FOREIGN KEY(build_id) REFERENCES builds(id)): Reference to the id in the builds table.
  • libary_id (INTEGER NOT NULL, FOREIGN KEY(libary_id) REFERENCES libraries(id)): Reference to the id in the libraries table.
  • dataset_id (INTEGER NOT NULL, FOREIGN KEY(dataset_id) REFERENCES datasets(id)): Reference to the id in the datasets table.
  • method_id (INTEGER NOT NULL, FOREIGN KEY(method_id) REFERENCES methods(id)): Reference to the id in the methods table.
  • memory_info (INTEGER NOT NULL, FOREIGN KEY(method_id) REFERENCES methods(id)): Path of the Massif log file.

METHOD_INFO

  • id (INTEGER PRIMARY KEY AUTOINCREMENT): Sequential number that identifies the method_info entry.
  • method_id (INTEGER NOT NULL, FOREIGN KEY(method_id) REFERENCES methods(id)): Reference to the id in the methods table.
  • info (TEXT NOT NULL): Info text for the specified method.
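
To show how the tables relate to each other, the following sketch creates a subset of the schema in an in-memory SQLite database, stores one result, and reads it back with a join. The column definitions follow the schema above, including the libary_id spelling. The inserted values, the insert() helper, and the choice of an in-memory database are illustrative assumptions.

```python
import sqlite3

# Subset of the schema described above (memory and method_info are omitted).
SCHEMA = """
CREATE TABLE libraries (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name TEXT NOT NULL
);
CREATE TABLE builds (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  build TIMESTAMP NOT NULL,
  libary_id INTEGER NOT NULL,
  FOREIGN KEY(libary_id) REFERENCES libraries(id)
);
CREATE TABLE datasets (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name TEXT NOT NULL UNIQUE,
  size INTEGER NOT NULL,
  attributes INTEGER NOT NULL,
  instances INTEGER NOT NULL,
  type TEXT NOT NULL
);
CREATE TABLE methods (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name TEXT NOT NULL,
  parameters TEXT NOT NULL
);
CREATE TABLE results (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  build_id INTEGER NOT NULL,
  libary_id INTEGER NOT NULL,
  time REAL NOT NULL,
  var REAL NOT NULL,
  dataset_id INTEGER NOT NULL,
  method_id INTEGER NOT NULL,
  FOREIGN KEY(build_id) REFERENCES builds(id),
  FOREIGN KEY(libary_id) REFERENCES libraries(id),
  FOREIGN KEY(dataset_id) REFERENCES datasets(id),
  FOREIGN KEY(method_id) REFERENCES methods(id)
);
"""


def insert(connection, sql, values):
    """Run an INSERT statement and return the id of the new row."""
    return connection.execute(sql, values).lastrowid


connection = sqlite3.connect(":memory:")
connection.executescript(SCHEMA)

# Illustrative values; the data set size of 1 MB is made up.
library_id = insert(connection,
    "INSERT INTO libraries (name) VALUES (?)", ("mlpack",))
build_id = insert(connection,
    "INSERT INTO builds (build, libary_id) VALUES (datetime('now'), ?)",
    (library_id,))
dataset_id = insert(connection,
    "INSERT INTO datasets (name, size, attributes, instances, type) "
    "VALUES (?, ?, ?, ?, ?)", ("cities.csv", 1, 9, 329, "Real"))
method_id = insert(connection,
    "INSERT INTO methods (name, parameters) VALUES (?, ?)", ("PCA", "-d 2"))
insert(connection,
    "INSERT INTO results (build_id, libary_id, time, var, dataset_id, method_id) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    (build_id, library_id, 0.325837, 0.0, dataset_id, method_id))

rows = connection.execute("""
SELECT libraries.name, methods.name, datasets.name, results.time
FROM results
JOIN libraries ON libraries.id = results.libary_id
JOIN methods ON methods.id = results.method_id
JOIN datasets ON datasets.id = results.dataset_id
""").fetchall()
print(rows)
```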

Reports

I've picked the matplotlib python library to create the plots for the project because with matplotlib we have full control of the plot properties like styles and fonts.

The data for the graphs is loaded from the database, so individual method scripts do not need any extra function. The main benchmark script collects all data and stores it in the database for later processing.

To create the reports, the values must first be written to the database. To store the measured values in the database, run the following command from the benchmark root folder:

make run LOG=True

As with the normal benchmark, you can add parameters to specify the methods and libraries. See the parameter section for more details.

To create the reports page, run make with the following target from the root folder:

make reports

With this command the reports are saved in the reports folder. To view the reports, open the index.html file in a browser. For the design of the reports we have used the Twitter Bootstrap framework, which contains HTML and CSS-based design templates to create websites. Most of the templates are designed to be backward compatible, so the reports are available for almost all devices and browsers.

Description

Top-Plot

The top plot shows the development of mlpack over time (all time values are summed), which gives a first impression of how mlpack's performance has evolved.

Note that the plot changes sharply whenever a method or a dataset is added or removed, because the sum then covers a different set of measurements.

Progress-bar

The progress bar shows the percentage of data sets on which mlpack has the best time.
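
The percentage behind the progress bar could be computed along the lines of the sketch below. It assumes that "best" means the lowest measured time and that a data set without an mlpack measurement does not count as a win. The timing values are made up, and best_percentage() is a hypothetical helper.

```python
def best_percentage(timings, library="mlpack"):
    """Return the share of data sets, in percent, on which library is fastest.

    timings maps a data set name to a dictionary {library name: seconds}.
    """
    if not timings:
        return 0.0
    wins = sum(
        1 for per_library in timings.values()
        if library in per_library
        and min(per_library, key=per_library.get) == library)
    return 100.0 * wins / len(timings)


# Made-up timings in seconds, for illustration only.
timings = {
    "cities.csv": {"mlpack": 0.33, "weka": 0.83, "matlab": 1.22},
    "iris.csv": {"mlpack": 0.05, "weka": 0.04},
    "wine.csv": {"mlpack": 0.10, "weka": 0.20},
    "isolet.csv": {"mlpack": 2.00, "scikit": 1.50},
}
print("%.1f%%" % best_percentage(timings))
```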

Bar-Chart

The bar chart shows the timing data for the different methods and data sets. The bars are grouped by the specified library taken from the configuration file.

Line-Chart

The line chart shows the development of a given mlpack method over time (all time values for the specified method are summed), which gives a first impression of whether changes have caused speedups or slowdowns.

Memory-Chart

The memory chart shows how much heap memory the method uses with the given dataset. The Massif logs are attached below the memory chart for closer inspection.

How To Write A Script

This section is a step by step guide which shows how to write a new script.

  1. Use the following template.
    class ScriptName(object):
      def __init__(self, dataset, timeout=0, verbose=True):
        # Code here.    
     
      def RunMethod(self, options):
        # Code here.
        # return time
    
  2. Open the template and edit the required sections.
    editor script_template.py
    

2.1 Edit the __init__() function.

The __init__() method is the constructor. It is invoked automatically for each newly created class instance, which in our case happens at the beginning of the benchmark. The main benchmark script passes two parameters of interest here: dataset, which contains the path to a data set or a list of data sets, and timeout, which contains the timeout in seconds.

In this example we do nothing else in the __init__() function. However, we want to use the parameters in another function, so we add the following lines to make them available there.

self.dataset = dataset
self.timeout = timeout

2.2 Edit the RunMethod() function.

The RunMethod() function is invoked automatically after the __init__() function. The main benchmark script passes one parameter, options, which can contain additional parameters. RunMethod() is where the code that benchmarks the method belongs.

In this example we benchmark the following simple command, which stresses the CPU.

$ echo '2^2^20' | time bc > /dev/null

To achieve that, we use the python subprocess module.

To use the subprocess module we have to import the module with the following line.

import subprocess

If you do not want to deal with lexical analysis yourself, import the Python shlex module.

import shlex

Now we can pass the command to the shlex.split function, which handles the lexical analysis, and pass the result to the subprocess function. The subprocess module also provides a built-in timeout option, which we can use in our script.

# With shell=False the pipe and redirection are not interpreted, so run bc
# directly and feed the expression through standard input.
cmd = shlex.split("bc")
s = subprocess.check_output(cmd, input=b"2^2^20\n", shell=False, timeout=self.timeout)

Note: "Executing shell commands that incorporate unsanitized input from an untrusted source makes a program vulnerable to shell injection, a serious security flaw which can result in arbitrary command execution. For this reason, the use of shell=True is strongly discouraged. The parameter shell=False disables all shell based features, but does not suffer from this vulnerability".
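
When the timeout expires, subprocess.check_output() raises subprocess.TimeoutExpired, so a script probably wants to catch that exception. The sketch below shows one possible way. Returning None on a timeout is an arbitrary choice for this example, and run_with_timeout() is a hypothetical helper. The child processes are started through sys.executable only to keep the example portable.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout):
    """Run cmd without a shell; return its output, or None on a timeout."""
    try:
        return subprocess.check_output(cmd, shell=False, timeout=timeout)
    except subprocess.TimeoutExpired:
        # check_output() has already terminated the child at this point.
        return None


# A command that finishes quickly and one that exceeds its timeout.
fast = run_with_timeout([sys.executable, "-c", "print('done')"], timeout=30)
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"],
                        timeout=0.5)
print(fast, slow)
```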

2.3 Measure the time.

To measure the time we can use the provided timer function from the util folder. To use this timer function we have to add the util folder to the import path and import the timer module.

To add the util folder to the import path, add the following lines:

import os
import sys
import inspect

cmd_subfolder = os.path.realpath(os.path.abspath(os.path.join(
os.path.split(inspect.getfile(inspect.currentframe()))[0], "path/to/the/util/folder")))
if cmd_subfolder not in sys.path:
  sys.path.insert(0, cmd_subfolder)

To import the timer module add the following line:

from timer import *

To measure the time with the timer module we have to create a timer object and wrap the code we would like to benchmark with the following command:

totalTimer = Timer()

with totalTimer:
  # code here

To return the time to the benchmark script we use the following command:

return totalTimer.ElapsedTime()
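
The timer module from the util folder is not reproduced on this page. As a minimal sketch, a class with the interface used above (a context manager with an ElapsedTime() method) could be written as follows. This is an assumption about how such a timer might work, not the actual implementation.

```python
import time


class Timer(object):
    """Context manager that records the wall-clock time of a code block."""

    def __init__(self):
        self.start = None
        self.elapsed = 0.0

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self.start
        return False  # Do not swallow exceptions raised inside the block.

    def ElapsedTime(self):
        """Return the duration of the last timed block in seconds."""
        return self.elapsed


totalTimer = Timer()

with totalTimer:
    sum(i * i for i in range(100000))

print(totalTimer.ElapsedTime())
```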

If you follow the steps, the script should look like:

import os
import sys
import inspect
import shlex
import subprocess

# Add the util folder to the import path so the timer module can be found.
cmd_subfolder = os.path.realpath(os.path.abspath(os.path.join(
  os.path.split(inspect.getfile(inspect.currentframe()))[0], "path/to/the/util/folder")))
if cmd_subfolder not in sys.path:
  sys.path.insert(0, cmd_subfolder)

from timer import *

class ScriptName(object):
  def __init__(self, dataset, timeout=0, verbose=True):
    self.dataset = dataset
    self.timeout = timeout

  def RunMethod(self, options):
    totalTimer = Timer()

    with totalTimer:
      # With shell=False the pipe and redirection are not interpreted, so run
      # bc directly and feed the expression through standard input.
      cmd = shlex.split("bc")
      s = subprocess.check_output(cmd, input=b"2^2^20\n", shell=False, timeout=self.timeout)

    return totalTimer.ElapsedTime()
  3. Add the new script to the configuration file located in the benchmark root folder.

To benchmark the new script we have to specify the run-time parameters in the config.yaml file. You can use the following lines to achieve this:

library: newLibrary
methods:
  ScriptName:
        run: true
        script: path/to/the/new/script/new_script.py
        format: ['']
        datasets:
            - files: ['']
