Introduction

The field of optimization is vast and complex. In optimization problems, we have to find solutions which are optimal or near-optimal with respect to some goals. Usually, we are not able to solve problems in one step, but we follow some process which guides us through problem-solving. mlpack implements multiple strategies for optimizing different objective functions and provides different strategies for selecting and customizing the appropriate optimization algorithm. This tutorial discusses how to use each of the techniques that mlpack implements.

This visualization allows us to see how many popular optimizers perform on different optimization problems. Select a problem to optimize, then select an optimizer and tune its parameters, and see the steps that the optimizer takes plotted in pink. Note that you can zoom in and rotate the drawing. Also, you can compare how different optimizers perform on a given problem in the second graph. As you try a given problem with more optimizers, the objective function vs. the number of iterations is plotted for each optimizer you have tried.

Step-size: 0.01

Iterations: 4000

The defaults here are not necessarily good for the given problem, so it is suggested that the values used be tailored to the task at hand. (Use the mouse to zoom/drag the function.)

First moment coefficient: 0.9

Second moment coefficient: 0.999

As intuition says, system has higher probability of staying in the states with a smaller stepsize. As the stepsize goes up, imbalance becomes stronger. When the stepsize is close to zero, the system stays in the state(s) with the highest cost.

Evaluations: Dynamic

Unique Optimizer
Restrict Evaluations

A plot of the cost reveals distinct properties for each optimizer with its own style of convergence.

Optimizer

AdaDelta - AdaGrad - Adam - AdaGrad - Adam - AdaMax - AMSGrad - Nadam - CNE - IQN - L_BFGS - RMSProp - SMORMS3 - SPALeRASGD - SVRG - SVRG (Barzilai-Borwein) - SARAH - SARAH+ - Katyusha - CMAES

FunctionType API

In order to facilitate consistent implementations, we have defined a FunctionType API that describes all the methods that an objective function may implement. mlpack offers a few variations of this API to cover different function characteristics. This leads to several different APIs for different function types:

FunctionType: a normal, differentiable objective function.
DecomposableFunctionType: a differentiable objective function that can be decomposed into the sum of many objective functions (for SGD-like optimizers).
SparseFunctionType: a decomposable, differentiable objective function with a sparse gradient.
NonDifferentiableFunctionType: a non-differentiable objective function that can only be evaluated.
ConstrainedFunctionType: an objective function with constraints on the allowable inputs.
NonDifferentiableDecomposableFunctionType: a decomposable non-differentiable objective function that can only be evaluated.
ResolvableFunctionType: a differentiable objective function where calculating the gradient with respect to only one parameter is possible.

Each of these types of objective functions require slightly different methods to be implemented. In some cases, methods will be automatically deduced by the optimizers using template metaprogramming and this allows the user to not need to implement every method for a given type of objective function. Each type described above is detailed in the following sections.

The FunctionType API

A function satisfying the FunctionType API is a general differentiable objective function. It must implement an Evaluate() and a Gradient() method. The interface used for that can be the following two methods:

// For non-separable objectives.  This should return the objective value for the
// given parameters.
double Evaluate(const arma::mat& parameters);
// For non-separable differentiable objectives.  This should store the gradient
// for the given parameters in the 'gradient' matrix.
void Gradient(const arma::mat& parameters, arma::mat& gradient);

However, there are some objective functions (like logistic regression) for which it is computationally convenient to calculate both the objective and the gradient at the same time. Therefore, optionally, the following function can be implemented:

// For non-separable differentiable objectives.  This should store the gradient
// fro the given parameters in the 'gradient' matrix and return the objective
// value.
double EvaluateWithGradient(const arma::mat& parameters, arma::mat& gradient);

It is not a problem to implement all three of these methods, but it is not obligatory to. EvaluateWithGradient() will automatically be inferred if it is not written from the Evaluate() and Gradient() functions; similarly, the Evaluate() and Gradient() functions will be inferred from EvaluateWithGradient() if they are not available. However, any automatically inferred method may be slightly slower.

The following optimizers use the FunctionType API:

The DecomposableFunctionType API

A function satisfying the DecomposableFunctionType API is a differentiable objective function that can be decomposed into a number of separable objective functions. Examples of these types of objective functions include those that are a sum of loss on individual data points; so, common machine learning tasks like logistic regression or training a neural network can be expressed as optimizing a decomposable objective function.

Any function implementing the DecomposableFunctionAPI must implement the following four methods:

// For decomposable functions: return the number of parts the optimization
// problem can be decomposed into.
size_t NumFunctions();
// For decomposable objectives.  This should calculate the partial objective
// starting at the decomposable function indexed by 'start' and calculate
// 'batchSize' partial objectives and return the sum.
double Evaluate(const arma::mat& parameters,
                const size_t start,
                const size_t batchSize);
// For separable differentiable objective functions.  This should calculate the
// gradient starting at the decomposable function indexed by 'start' and
// calculate 'batchSize' decomposable gradients and store the sum in the
// 'gradient' matrix.
void Gradient(const arma::mat& parameters,
              const size_t start,
              arma::mat& gradient,
              const size_t batchSize);
// Shuffle the ordering of the functions.
void Shuffle();

Note that the decomposable objective functions should support batch computation—this can allow significant speedups. The Shuffle() method shuffles the ordering of the functions. For some optimizers, randomness is an important component, so it is important that it is possible to shuffle the ordering of the decomposable functions.

As with the regular FunctionType API, it is optional to implement an EvaluateWithGradient() method in place of, or in addition to, the Evaluate() and Gradient() methods. The interface used for that should be the following:

// For decomposable objectives.  This should calculate the partial objective
// starting at the decomposable function indexed by 'start' and calculate
// 'batchSize' partial objectives and return the sum.  This should also
// calculate the gradient starting at the decomposable function indexed by
// 'start' and calculate 'batchSize' decomposable gradients and store the sum in
// the 'gradient' matrix.
double EvaluateWithGradient(const arma::mat& parameters,
                            const size_t start,
                            arma::mat& gradient,
                            const size_t batchSize);

The following mlpack optimizers require functions implementing the DecomposableFunctionType API:

The SparseFunctionType API

A function satisfying the SparseFunctionType API is a decomposable differentiable objective function with the condition that a single individual gradient is sparse. The API is slightly different but similar to the DecomposableFunctionType API; the following methods are necessary:

// For decomposable functions: return the number of parts the optimization
// problem can be decomposed into.
size_t NumFunctions();
// For decomposable objectives.  This should calculate the partial objective
// starting at the decomposable function indexed by 'start' and calculate
// 'batchSize' partial objectives and return the sum.
double Evaluate(const arma::mat& parameters,
                const size_t start,
                const size_t batchSize);
// For separable differentiable objective functions.  This should calculate the
// gradient starting at the decomposable function indexed by 'start' and
// calculate 'batchSize' decomposable gradients and store the sum in the
// 'gradient' matrix, which is a sparse matrix.
void Gradient(const arma::mat& parameters,
              const size_t start,
              arma::sp_mat& gradient,
              const size_t batchSize);

The Shuffle() method is not needed for the SparseFunctionType API.

Note that it is possible to write a Gradient() method that accepts a template parameter GradType, which may be arma::mat (dense Armadillo matrix) or arma::sp_mat (sparse Armadillo matrix). This allows support for both the DecomposableFunctionType API and the SparseFunctionType API, as below:

template<typename GradType>
void Gradient(const arma::mat& parameters,
              const size_t start,
              GradType& gradient,
              const size_t batchSize);

The following mlpack optimizers require an objective function satisfying the SparseFunctionType API:

ParallelSGD

The NonDifferentiableFunctionType API

A function satisfying the NonDifferentiableFunctionType API is a general non-differentiable objective function. Only an Evaluate() method must be implemented, with the following signature:

// For non-separable objectives.  This should return the objective value for the
// given parameters.
double Evaluate(const arma::mat& parameters);

The following mlpack optimizers require an objective function satisfying the NonDifferentiableFunctionType API:

Simulated Annealing

The ConstrainedFunctionType API

A function satisfying the ConstrainedFunctionType API is a general differentiable objective function that has differentiable constraints for the parameters. This API is more complex than the others and requires five methods to be implemented:

// For non-separable objectives.  This should return the objective value for the
// given parameters.
double Evaluate(const arma::mat& parameters);
// For non-separable differentiable objectives.  This should store the gradient
// for the given parameters in the 'gradient' matrix.
void Gradient(const arma::mat& parameters, arma::mat& gradient);
// Return the number of constraints.
size_t NumConstraints();
// Evaluate the constraint with the given index.
double EvaluateConstraint(const size_t index, const arma::mat& parameters);
// Store the gradient of the constraint with the given index in the 'gradient'
// matrix.
void GradientConstraint(const size_t index,
                        const arma::mat& parameters,
                        arma::mat& gradient);

The following mlpack optimizers require a ConstrainedFunctionType:

AugLagrangian

The NonDifferentiableDecomposableFunctionType API

A function satisfying the NonDifferentiableDecomposableFunctionType API is a decomposable non-differentiable objective function. Only an Evaluate() and a NumFunctions() method must be implemented, with the following signatures:

// For decomposable functions: return the number of parts the optimization
// problem can be decomposed into.
size_t NumFunctions();
// For decomposable objectives.  This should calculate the partial objective
// starting at the decomposable function indexed by 'start' and calculate
// 'batchSize' partial objectives and return the sum.
double Evaluate(const arma::mat& parameters,
                const size_t start,
                const size_t batchSize);

The following mlpack optimizers require a NonDifferentiableDecomposableFunctionType:

The ResolvableFunctionType API

A function satisfying the ResolvableFunctionType API is a partially differentiable objective function. For this API, three methods must be implemented, with the following signatures:

// For partially differentiable functions: return the number of partial
// derivatives.
size_t NumFeatures();
// For non-separable objectives.  This should return the objective value for the
// given parameters.
double Evaluate(const arma::mat& parameters);
// For partially differentiable sparse and non-sparse functions.  Store the
// given partial gradient for the parameter index 'j' in the 'gradient' matrix.
template<typename GradType>
void PartialGradient(const arma::mat& parameters,
                     const size_t j,
                     GradType& gradient);

It is not required to templatize so that both sparse and dense gradients can be used, but it can be helpful.

The following mlpack optimizers require a ResolvableFunctionType:

SCD

Optimizer API

An optimizer must implement only the method:

template<typename FunctionType>

double Optimize(FunctionType& function, arma::mat& parameters);

The Optimize() method optimizes the given function function, and stores the best set of parameters in the matrix parameters and returns the best objective value.

If the optimizer requires a given API from above, the following functions from src/mlpack/optimizers/function/static_checks.hpp can be helpful:

Simple Function and Optimizer example

The example below constructs a simple function, where each dimension has a parabola with a distinct minimum. Note, in this example we maintain an ordering with the vector order; in other situations, such as training neural networks, we could simply shuffle the columns of the data matrix in Shuffle().

class ObjectiveFunction
{
 public:
  // A separable function consisting of four quadratics.
  ObjectiveFunction()
  {
    in = arma::vec("20 12 15 100");
    bi = arma::vec("-4 -2 -3 -8");
  }
  size_t NumFunctions() { return 4; }
  void Shuffle() { ord = arma::shuffle(arma::uvec("0 1 2 3")); }
  double Evaluate(const arma::mat& para, const size_t s, const size_t bs)
  {
    double cost = 0;
    for (size_t i = s; i < s + bs; i++)
      cost += para(ord[i]) * para(ord[i]) + bi(ord[i]) * para(ord[i]) + in(ord[i]);
    return cost;
  }
  void Gradient(const arma::mat& para, const size_t s, arma::mat& g, const size_t bs)
  {
    g.zeros(para.n_rows, para.n_cols);
    for (size_t i = s; i < s + bs; i++)
      g(ord[i]) += (1.0 / bs) * 2 * para(ord[i]) + bi(ord[i]);
  }
 private:
  // Intercepts.
  arma::vec in;
  // Coefficient.
  arma::vec bi;
  // Function order.
  arma::uvec ord;
};

For the optimization of the defined ObjectiveFunction and other mlpack objective functions, we must implement only an Optimize() method, and a constructor to set some parameters. The code is given below.

class SimpleOptimizer
{
 public:
  SimpleOptimizer(const size_t bs = 1, const double lr = 0.02) : bs(bs), lr(lr) { }
  template<typename FunctionType>
  double Optimize(FunctionType& function, arma::mat& parameter)
  {
    arma::mat gradient;
    for (size_t i = 0; i < 5000; i += bs)
    {
      if (i % function.NumFunctions() == 0)
      {
        function.Shuffle();
      }
      function.Gradient(parameter, i % function.NumFunctions(), gradient, bs);
      parameter -= lr * gradient;
    }
    return function.Evaluate(parameter, 0, function.NumFunctions());
  }
 private:
  size_t bs;
  double lr;
};

Note for the sake of simplicity we omitted checks on the batch size (bs). This optimizer assumes that function.NumFunctions() is a multiple of the batch size, we also omitted other typical parts of real implementations, a more detailed example may be found in the complete API documentation". Still, SimpleOptimizer works with any mlpack objective function which implements Evaluate that can compute a partial objective function, and Gradient that can compute a part of the gradient starting with a separable function.

Finding the minimum of ObjectiveFunction or any other mlpack function can be done as shown below.

ObjectiveFunction function;
arma::mat parameter("0 0 0 0;");
SimpleOptimizer optimizer;
double objective = optimizer.Optimize(function, parameter);
std::cout << "Objective: " << objective << std::endl;
std::cout << "Optimized function parameter: " << parameter << std::endl;

The final value of the objective function should be close to the optimal value, which is the sum of values at the vertices of the parabolas.

Further documentation

Further documentation for each Optimizer may be found in the complete API documentation. In addition, more information on the testing functions may be found in its complete API documentation.