mlpack.nca

nca(...): Neighborhood Components Analysis (NCA)

>>> from mlpack import nca

This program implements Neighborhood Components Analysis, which is both a linear dimensionality reduction technique and a distance learning technique. The method seeks to improve k-nearest-neighbor classification on a dataset by learning a linear transformation that rescales the dimensions. The method is nonparametric and does not require a value of k. It works by using stochastic ("soft") neighbor assignments and applying gradient-based optimization to the expected accuracy of those neighbor assignments.
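The stochastic neighbor assignments described above can be illustrated with a small NumPy sketch. This is not mlpack's implementation; it assumes the standard NCA formulation, in which point i picks point j as its neighbor with probability proportional to exp(-||A x_i - A x_j||^2). The names `soft_neighbor_probabilities` and `nca_objective` are hypothetical helpers introduced for illustration only.

```python
import numpy as np

def soft_neighbor_probabilities(X, A):
    """Return P where P[i, j] is the probability that point i picks j as its neighbor."""
    Z = X @ A.T                                    # transformed points, one per row
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq, np.inf)                   # a point never picks itself
    logits = -sq
    logits -= logits.max(axis=1, keepdims=True)    # stabilise the softmax numerically
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def nca_objective(X, y, A):
    """Expected number of points whose stochastic neighbor shares their label."""
    P = soft_neighbor_probabilities(X, A)
    same_label = y[:, None] == y[None, :]
    return float(np.sum(P * same_label))

# Toy data: two well-separated clusters, one per class (points are rows).
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
y = np.array([0, 0, 1, 1])
P = soft_neighbor_probabilities(X, np.eye(2))
score = nca_objective(X, y, np.eye(2))
```

With the identity as the transformation, nearly all probability mass in each row of `P` falls on the point's same-class neighbor, so `score` is close to the number of points. An optimizer would adjust `A` to push this value up on harder data.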

To work, this algorithm needs labeled data. It can be given as the last row of the input dataset (specified with 'input'), or alternatively as a separate matrix (specified with 'labels').
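As a rough sketch of the separate-labels route, the helper below converts arbitrary class labels into an integer vector suitable for the 'labels' parameter. The helper names `encode_labels` and `run_nca` are hypothetical. The keyword names `input` and `labels` are taken from the parameter list on this page and may differ in other mlpack releases, and using contiguous codes 0..k-1 is an assumption about what is safe rather than a documented requirement.

```python
import numpy as np

def encode_labels(raw_labels):
    """Map arbitrary class labels to contiguous int64 codes 0..k-1."""
    classes, codes = np.unique(np.asarray(raw_labels), return_inverse=True)
    return codes.ravel().astype(np.int64), classes

def run_nca(X, labels, **options):
    """Call the binding with labels passed as a separate vector (assumed keyword names)."""
    from mlpack import nca  # imported lazily so encode_labels works without mlpack
    result = nca(input=np.asarray(X, dtype=np.float64), labels=labels, **options)
    return result["output"]

# np.unique sorts the classes, so the codes follow alphabetical order here.
codes, classes = encode_labels(["cat", "dog", "cat", "bird"])

# Example call (requires mlpack to be installed):
# learned = run_nca(X, codes)
```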

This implementation of NCA uses stochastic gradient descent, mini-batch stochastic gradient descent, or the L-BFGS optimizer. These optimizers do not guarantee global convergence for a nonconvex objective function (NCA's objective function is nonconvex), so the final results could depend on the random seed or other optimizer parameters.

Stochastic gradient descent, specified by the value 'sgd' for the parameter 'optimizer', depends primarily on three parameters: the step size (specified with 'step_size'), the batch size (specified with 'batch_size'), and the maximum number of iterations (specified with 'max_iterations'). In addition, a normalized starting point can be used by specifying the 'normalize' parameter, which is necessary if many warnings of the form 'Denominator of p_i is 0!' are given.

Tuning the step size can be tedious. In general, the step size is too large if the objective is not decreasing fairly steadily, or if zero-valued denominator warnings are being issued. The step size is too small if the objective is changing very slowly.

Once a good step size is found, setting the termination condition is straightforward: either increase the maximum iterations to a large number and allow SGD to find a minimum, or set the maximum iterations to 0 (allowing infinite iterations) and set the tolerance (specified by 'tolerance') to define the maximum allowed difference between objectives for SGD to terminate. Be careful: relying on the tolerance instead of the maximum iterations can take a very long time, and the optimization may never converge due to the properties of the SGD optimizer.

Note that a single iteration of SGD refers to a single point, so to take a single pass over the dataset, set 'max_iterations' equal to the number of points in the dataset.
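The relationship between passes over the data and 'max_iterations' can be captured in a small helper. This is only a sketch: `sgd_options` is a hypothetical function that builds keyword arguments from the parameter names documented on this page, and the specific values chosen below are illustrative rather than recommended settings.

```python
def sgd_options(n_points, passes=10, step_size=0.01, batch_size=50,
                normalize=False, seed=0):
    """Build keyword arguments for an SGD run that makes `passes` passes over the data."""
    if n_points <= 0 or passes <= 0:
        raise ValueError("n_points and passes must be positive")
    return {
        "optimizer": "sgd",
        "step_size": step_size,
        "batch_size": batch_size,
        # One SGD iteration visits one point, so one pass needs n_points iterations.
        "max_iterations": n_points * passes,
        "normalize": normalize,
        # A nonzero seed makes runs repeatable; 0 falls back to the current time.
        "seed": seed,
    }

opts = sgd_options(n_points=150, passes=20, step_size=0.001, normalize=True, seed=42)

# Example call (requires mlpack; keyword names assumed from this page):
# result = nca(input=X, labels=y, **opts)
```

If the objective oscillates or denominator warnings appear, lowering `step_size` or setting `normalize=True` are the first adjustments suggested by the text above.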

The L-BFGS optimizer, specified by the value 'lbfgs' for the parameter 'optimizer', uses a backtracking line search algorithm to minimize a function. The following parameters are used by L-BFGS: 'num_basis' (specifies the number of memory points used by L-BFGS), 'max_iterations', 'armijo_constant', 'wolfe', 'tolerance' (the optimization is terminated when the gradient norm is below this value), 'max_line_search_trials', 'min_step', and 'max_step' (the last two both refer to the line search routine). For more details on the L-BFGS optimizer, consult either the mlpack L-BFGS documentation (in lbfgs.hpp) or the extensive published literature on L-BFGS.
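One way to keep the L-BFGS settings together is a dictionary of the defaults listed under "input options", with selected values overridden per run. The names `LBFGS_DEFAULTS` and `lbfgs_options` are hypothetical and introduced only for this sketch; the default values are copied from the parameter list on this page.

```python
# Defaults as documented in the input options list on this page.
LBFGS_DEFAULTS = {
    "num_basis": 5,
    "max_iterations": 500000,
    "armijo_constant": 1e-4,
    "wolfe": 0.9,
    "tolerance": 1e-7,
    "max_line_search_trials": 50,
    "min_step": 1e-20,
    "max_step": 1e20,
}

def lbfgs_options(**overrides):
    """Return keyword arguments for an L-BFGS run, rejecting unknown parameter names."""
    unknown = set(overrides) - set(LBFGS_DEFAULTS)
    if unknown:
        raise ValueError(f"not an L-BFGS parameter: {sorted(unknown)}")
    return {"optimizer": "lbfgs", **LBFGS_DEFAULTS, **overrides}

opts = lbfgs_options(max_iterations=1000, num_basis=10)

# Example call (requires mlpack; keyword names assumed from this page):
# result = nca(input=X, labels=y, **opts)
```

Rejecting unknown names guards against passing an SGD-only setting such as 'step_size' to an L-BFGS run by mistake.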

By default, the SGD optimizer is used.

## input options

- input (numpy matrix or arraylike, float dtype): [required] Input dataset to run NCA on.
- armijo_constant (float): Armijo constant for L-BFGS. Default value 0.0001.
- batch_size (int): Batch size for mini-batch SGD. Default value 50.
- copy_all_inputs (bool): If specified, all input parameters will be deep copied before the method is run. This is useful for debugging problems where the input parameters are being modified by the algorithm, but can slow down the code.
- labels (numpy vector or array, int/long dtype): Labels for input dataset.
- linear_scan (bool): Don't shuffle the order in which data points are visited for SGD or mini-batch SGD.
- max_iterations (int): Maximum number of iterations for SGD or L-BFGS (0 indicates no limit). Default value 500000.
- max_line_search_trials (int): Maximum number of line search trials for L-BFGS. Default value 50.
- max_step (float): Maximum step of line search for L-BFGS. Default value 1e+20.
- min_step (float): Minimum step of line search for L-BFGS. Default value 1e-20.
- normalize (bool): Use a normalized starting point for optimization. This is useful when points are far apart, or when SGD is returning NaN.
- num_basis (int): Number of memory points to be stored for L-BFGS. Default value 5.
- optimizer (string): Optimizer to use; 'sgd' or 'lbfgs'. Default value 'sgd'.
- seed (int): Random seed. If 0, 'std::time(NULL)' is used. Default value 0.
- step_size (float): Step size for stochastic gradient descent (alpha). Default value 0.01.
- tolerance (float): Maximum tolerance for termination of SGD or L-BFGS. Default value 1e-07.
- verbose (bool): Display informational messages and the full list of parameters and timers at the end of execution.
- wolfe (float): Wolfe condition parameter for L-BFGS. Default value 0.9.

## output options

The return value from the binding is a dict containing the following elements:

- output (numpy matrix, float dtype): Output matrix containing the learned distance matrix.
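A learned distance matrix is typically used to transform points or to measure distances in the transformed space. The sketch below shows that usage with plain NumPy under a stated assumption: `A` is oriented so that `A @ x` maps a single point `x`. The orientation of the matrix returned by the binding is not stated on this page and should be checked before applying it. The matrix `A` here is a hand-written stand-in, not output from mlpack, and `transform` and `learned_distance` are hypothetical helpers.

```python
import numpy as np

def transform(X, A):
    """Apply the linear map A to every row of X (points are rows)."""
    return np.asarray(X) @ np.asarray(A).T

def learned_distance(x, y, A):
    """Euclidean distance between x and y after both are mapped by A."""
    diff = np.asarray(A) @ (np.asarray(x) - np.asarray(y))
    return float(np.linalg.norm(diff))

# Stand-in matrix: stretches the first dimension and shrinks the second.
A = np.diag([2.0, 0.5])

Z = transform([[1.0, 1.0]], A)
d = learned_distance([1.0, 0.0], [0.0, 0.0], A)
```

Running a k-nearest-neighbor classifier on `transform(X, A)` with ordinary Euclidean distance is equivalent to running it on `X` with `learned_distance`.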