\section{Introduction}\label{cv_cvintro}
{\bfseries mlpack} implements cross-\/validation support for its learning algorithms, for a variety of performance measures. Cross-\/validation is useful for estimating how well a learner will generalize to unseen test data, and it is a commonly used part of the data science pipeline.

In short, given some learner and some performance measure, we wish to get an average of the performance measure given different splits of the dataset into training data and validation data. The learner is trained on the training data, and the performance measure is evaluated on the validation data.
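
As a sketch in symbols (the notation is introduced here only for illustration and does not appear in the mlpack A\+PI)\+: suppose the dataset is split $k$ different ways, with $T_i$ the training data and $V_i$ the validation data of the $i$-th split, $f_{T_i}$ the learner trained on $T_i$, and $M$ the performance measure. The cross-\/validation estimate is then

\[
\mathrm{CV} = \frac{1}{k} \sum_{i=1}^{k} M\left(f_{T_i}, V_i\right).
\]

With a single split ($k = 1$) this reduces to the performance measure evaluated on one validation set.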

mlpack currently implements two easy-\/to-\/use forms of cross-\/validation\+:


\begin{DoxyItemize}
\item {\bfseries simple} {\bfseries cross-\/validation}, where we evaluate the performance measure on a single split of the data into a training set and a validation set
\item {\bfseries k-\/fold} {\bfseries cross-\/validation}, where we split the data k ways and average the performance measure over the k splits of the data
\end{DoxyItemize}
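
To make the k-\/fold split concrete, the following minimal sketch partitions point indices into k contiguous validation folds using only the C++ standard library. {\ttfamily Fold\+Ranges} is a hypothetical helper written for this illustration, not an mlpack function, and the actual {\ttfamily K\+Fold\+CV} implementation may divide the data differently (for example, in how leftover points are handled).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of mlpack): for each of the k folds, returns
// the half-open range [begin, end) of point indices held out for validation.
// All points outside that range would be used for training in that fold.
std::vector<std::pair<std::size_t, std::size_t>> FoldRanges(
    const std::size_t numPoints,
    const std::size_t k)
{
  assert(k >= 2 && k <= numPoints);

  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  const std::size_t baseSize = numPoints / k;
  const std::size_t extra = numPoints % k; // First `extra` folds get one more.

  std::size_t begin = 0;
  for (std::size_t i = 0; i < k; ++i)
  {
    const std::size_t size = baseSize + (i < extra ? 1 : 0);
    ranges.emplace_back(begin, begin + size);
    begin += size;
  }

  return ranges;
}
```

In this sketch every point appears in exactly one validation fold, so each point is used for validation once and for training in the remaining folds.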

This tutorial covers usage examples and details of the cross-\/validation module. Because the cross-\/validation code is generic and can be used with any learner and performance measure, any use of the cross-\/validation code in mlpack has to be in C++.

This tutorial is split into the following sections\+:


\begin{DoxyItemize}
\item \doxyref{Simple cross-\/validation examples}{p.}{cv_cvbasic} Simple cross-\/validation examples
\begin{DoxyItemize}
\item \doxyref{10-\/fold cross-\/validation on softmax regression}{p.}{cv_cvbasic_ex_1} 10-\/fold cross-\/validation on softmax regression
\item \doxyref{10-\/fold cross-\/validation on weighted decision trees}{p.}{cv_cvbasic_ex_2} 10-\/fold cross-\/validation on weighted decision trees
\item \doxyref{10-\/fold cross-\/validation with categorical decision trees}{p.}{cv_cvbasic_ex_3} 10-\/fold cross-\/validation with categorical decision trees
\item \doxyref{Simple cross-\/validation for linear regression}{p.}{cv_cvbasic_ex_4} Simple cross-\/validation for linear regression
\end{DoxyItemize}
\item \doxyref{Performance measures}{p.}{cv_cvbasic_metrics} Performance measures
\item \doxyref{The K\+Fold\+CV and Simple\+CV classes}{p.}{cv_cvbasic_api} The {\ttfamily \doxyref{K\+Fold\+CV}{p.}{classmlpack_1_1cv_1_1KFoldCV}} and {\ttfamily \doxyref{Simple\+CV}{p.}{classmlpack_1_1cv_1_1SimpleCV}} classes
\item \doxyref{Further references}{p.}{cv_cvbasic_further} Further references
\end{DoxyItemize}\section{Simple cross-\/validation examples}\label{cv_cvbasic}
\subsection{10-\/fold cross-\/validation on softmax regression}\label{cv_cvbasic_ex_1}
Suppose we have some data to train and validate on, as defined below\+:


\begin{DoxyCode}
\textcolor{comment}{// 100-point 6-dimensional random dataset.}
arma::mat data = arma::randu<arma::mat>(6, 100);
\textcolor{comment}{// Random labels in the [0, 4] interval.}
arma::Row<size\_t> labels =
    arma::randi<arma::Row<size\_t>>(100, arma::distr\_param(0, 4));
\textcolor{keywordtype}{size\_t} numClasses = 5;
\end{DoxyCode}


The code above generates a 100-\/point random 6-\/dimensional dataset with 5 classes.

To run 10-\/fold cross-\/validation for softmax regression with accuracy as a performance measure, we can write the following piece of code.


\begin{DoxyCode}
KFoldCV<SoftmaxRegression, Accuracy> cv(10, data, labels, numClasses);
\textcolor{keywordtype}{double} lambda = 0.1;
\textcolor{keywordtype}{double} softmaxAccuracy = cv.Evaluate(lambda);
\end{DoxyCode}


Note that the {\ttfamily Evaluate} method of {\ttfamily \doxyref{K\+Fold\+CV}{p.}{classmlpack_1_1cv_1_1KFoldCV}} takes any hyperparameters of an algorithm---that is, anything that is not {\ttfamily data}, {\ttfamily labels}, {\ttfamily num\+Classes}, {\ttfamily dataset\+Info}, or {\ttfamily weights} (those last three may not be present for every algorithm type). To be more specific, in this example the {\ttfamily Evaluate} method relies on the following \doxyref{Softmax\+Regression}{p.}{classmlpack_1_1regression_1_1SoftmaxRegression} constructor\+:


\begin{DoxyCode}
\textcolor{keyword}{template}<\textcolor{keyword}{typename} OptimizerType = mlpack::optimization::L\_BFGS>
SoftmaxRegression(\textcolor{keyword}{const} arma::mat& data,
                  \textcolor{keyword}{const} arma::Row<size\_t>& labels,
                  \textcolor{keyword}{const} \textcolor{keywordtype}{size\_t} numClasses,
                  \textcolor{keyword}{const} \textcolor{keywordtype}{double} lambda = 0.0001,
                  \textcolor{keyword}{const} \textcolor{keywordtype}{bool} fitIntercept = \textcolor{keyword}{false},
                  OptimizerType optimizer = OptimizerType());
\end{DoxyCode}


which has the parameter {\ttfamily lambda} after three conventional arguments ({\ttfamily data}, {\ttfamily labels} and {\ttfamily num\+Classes}). We can skip passing {\ttfamily fit\+Intercept} and {\ttfamily optimizer} since they have default values. (Technically, we don\textquotesingle{}t even need to pass {\ttfamily lambda}, since it also has a default value.)

In general, to cross-\/validate you need to specify which machine learning algorithm and metric you are going to use, then pass the conventional data-\/related parameters into one of the cross-\/validation constructors and all other parameters (which are generally hyperparameters) into the {\ttfamily Evaluate} method.\subsection{10-\/fold cross-\/validation on weighted decision trees}\label{cv_cvbasic_ex_2}
In the following example we will cross-\/validate \doxyref{Decision\+Tree}{p.}{classmlpack_1_1tree_1_1DecisionTree} with weights. This is very similar to the previous example, except that we also have instance weights for each point in the dataset. We can generate weights for the dataset from the previous example with the code below\+:


\begin{DoxyCode}
\textcolor{comment}{// Random weights for every point from the code snippet above.}
arma::rowvec weights = arma::randu<arma::rowvec>(100);
\end{DoxyCode}


Given those weights for each point, we can now perform cross-\/validation by also passing the weights to the constructor of {\ttfamily \doxyref{K\+Fold\+CV}{p.}{classmlpack_1_1cv_1_1KFoldCV}\+:} 


\begin{DoxyCode}
KFoldCV<DecisionTree<>, Accuracy> cv2(10, data, labels, numClasses, weights);
\textcolor{keywordtype}{size\_t} minimumLeafSize = 8;
\textcolor{keywordtype}{double} weightedDecisionTreeAccuracy = cv2.Evaluate(minimumLeafSize);
\end{DoxyCode}


As with the previous example, internally this call to {\ttfamily cv2.\+Evaluate()} relies on the following \doxyref{Decision\+Tree}{p.}{classmlpack_1_1tree_1_1DecisionTree} constructor\+:


\begin{DoxyCode}
\textcolor{keyword}{template}<\textcolor{keyword}{typename} MatType, \textcolor{keyword}{typename} LabelsType, \textcolor{keyword}{typename} WeightsType>
DecisionTree(MatType&& data,
             LabelsType&& labels,
             \textcolor{keyword}{const} \textcolor{keywordtype}{size\_t} numClasses,
             WeightsType&& weights,
             \textcolor{keyword}{const} \textcolor{keywordtype}{size\_t} minimumLeafSize = 10,
             \textcolor{keyword}{const} std::enable\_if\_t<arma::is\_arma\_type<
                 \textcolor{keyword}{typename} std::remove\_reference<WeightsType>::type>::value>*
                  = 0);
\end{DoxyCode}
\subsection{10-\/fold cross-\/validation with categorical decision trees}\label{cv_cvbasic_ex_3}
\doxyref{Decision\+Tree}{p.}{classmlpack_1_1tree_1_1DecisionTree} models can be constructed in several other ways. For example, if we have a dataset with both categorical and numerical features, we can also perform cross-\/validation by using the associated {\ttfamily \doxyref{data\+::\+Dataset\+Info}{p.}{namespacemlpack_1_1data_aa243ad7e4d29363b858bbc92b732921d}} object. Thus, given some {\ttfamily \doxyref{data\+::\+Dataset\+Info}{p.}{namespacemlpack_1_1data_aa243ad7e4d29363b858bbc92b732921d}} object called {\ttfamily dataset\+Info} (perhaps produced by a call to {\ttfamily \doxyref{data\+::\+Load()}{p.}{namespacemlpack_1_1data_a19805d6585ac8b0be7c4e4b7f081977c}}), we can perform k-\/fold cross-\/validation in a similar manner to the other examples\+:


\begin{DoxyCode}
KFoldCV<DecisionTree<>, Accuracy> cv3(10, data, datasetInfo, labels,
    numClasses);
\textcolor{keywordtype}{double} decisionTreeWithDIAccuracy = cv3.Evaluate(minimumLeafSize);
\end{DoxyCode}


This particular call to {\ttfamily cv3.\+Evaluate()} relies on the following \doxyref{Decision\+Tree}{p.}{classmlpack_1_1tree_1_1DecisionTree} constructor\+:


\begin{DoxyCode}
\textcolor{keyword}{template}<\textcolor{keyword}{typename} MatType, \textcolor{keyword}{typename} LabelsType>
DecisionTree(MatType&& data,
             \textcolor{keyword}{const} data::DatasetInfo& datasetInfo,
             LabelsType&& labels,
             \textcolor{keyword}{const} \textcolor{keywordtype}{size\_t} numClasses,
             \textcolor{keyword}{const} \textcolor{keywordtype}{size\_t} minimumLeafSize = 10);
\end{DoxyCode}
\subsection{Simple cross-\/validation for linear regression}\label{cv_cvbasic_ex_4}
{\ttfamily \doxyref{Simple\+CV}{p.}{classmlpack_1_1cv_1_1SimpleCV}} has the same interface as {\ttfamily \doxyref{K\+Fold\+CV}{p.}{classmlpack_1_1cv_1_1KFoldCV}}, except it takes as one of its arguments a proportion (from 0 to 1) of data used as a validation set. For example, to validate \doxyref{Linear\+Regression}{p.}{classmlpack_1_1regression_1_1LinearRegression} with 20\% of the data used in the validation set we can write the following code.


\begin{DoxyCode}
\textcolor{comment}{// Random responses for every point from the code snippet in the beginning of}
\textcolor{comment}{// the tutorial.}
arma::rowvec responses = arma::randu<arma::rowvec>(100);

SimpleCV<LinearRegression, MSE> cv4(0.2, data, responses);
\textcolor{keywordtype}{double} lrLambda = 0.05;
\textcolor{keywordtype}{double} lrMSE = cv4.Evaluate(lrLambda);
\end{DoxyCode}
\section{Performance measures}\label{cv_cvbasic_metrics}
The cross-\/validation classes require a performance measure to be specified. {\bfseries mlpack} has a number of performance measures implemented; below is a list\+:


\begin{DoxyItemize}
\item \doxyref{mlpack\+::cv\+::\+Accuracy}{p.}{classmlpack_1_1cv_1_1Accuracy}\+: a simple measure of accuracy
\item \doxyref{mlpack\+::cv\+::\+F1}{p.}{classmlpack_1_1cv_1_1F1}\+: the \doxyref{F1}{p.}{classmlpack_1_1cv_1_1F1} score; depends on an averaging strategy
\item \doxyref{mlpack\+::cv\+::\+M\+SE}{p.}{classmlpack_1_1cv_1_1MSE}\+: mean squared error (for regression problems)
\item \doxyref{mlpack\+::cv\+::\+Precision}{p.}{classmlpack_1_1cv_1_1Precision}\+: the precision, for classification problems
\item \doxyref{mlpack\+::cv\+::\+Recall}{p.}{classmlpack_1_1cv_1_1Recall}\+: the recall, for classification problems
\end{DoxyItemize}
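
As a rough illustration of what two of these measures compute, the sketch below implements accuracy and mean squared error on plain {\ttfamily std\+::vector} inputs. The function names {\ttfamily Accuracy\+Of} and {\ttfamily Mean\+Squared\+Error\+Of} are hypothetical and chosen for this example; the mlpack classes listed above operate on a trained model and Armadillo types instead, so this is only a sketch of the underlying arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of points whose predicted label equals the true label.
// (Hypothetical helper for illustration; not the mlpack Accuracy class.)
double AccuracyOf(const std::vector<std::size_t>& predicted,
                  const std::vector<std::size_t>& actual)
{
  assert(predicted.size() == actual.size() && !actual.empty());

  std::size_t correct = 0;
  for (std::size_t i = 0; i < actual.size(); ++i)
  {
    if (predicted[i] == actual[i])
      ++correct;
  }

  return static_cast<double>(correct) / actual.size();
}

// Average of the squared differences between predictions and responses.
// (Hypothetical helper for illustration; not the mlpack MSE class.)
double MeanSquaredErrorOf(const std::vector<double>& predicted,
                          const std::vector<double>& actual)
{
  assert(predicted.size() == actual.size() && !actual.empty());

  double sum = 0.0;
  for (std::size_t i = 0; i < actual.size(); ++i)
  {
    const double diff = predicted[i] - actual[i];
    sum += diff * diff;
  }

  return sum / actual.size();
}
```

Note that a higher accuracy is better while a lower mean squared error is better, which matters when comparing results across hyperparameter settings.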

In addition, it is not difficult to implement a custom performance measure. A class following the structure below can be used\+:


\begin{DoxyCode}
\textcolor{keyword}{class }CustomMeasure
\{
  \textcolor{comment}{//}
  \textcolor{comment}{// This evaluates the metric given a trained model and a set of data (with}
  \textcolor{comment}{// labels or responses) to evaluate on.  The data parameter will be a type of}
  \textcolor{comment}{// Armadillo matrix, and the labels will be the labels that go with the data.}
  \textcolor{comment}{//}
  \textcolor{comment}{// If you know that your model is a classification model (and thus that}
  \textcolor{comment}{// ResponsesType will be arma::Row<size\_t>), it is ok to replace the}
  \textcolor{comment}{// ResponsesType template parameter with arma::Row<size\_t>.}
  \textcolor{comment}{//}
  \textcolor{keyword}{template}<\textcolor{keyword}{typename} MLAlgorithm, \textcolor{keyword}{typename} DataType, \textcolor{keyword}{typename} ResponsesType>
  \textcolor{keyword}{static} \textcolor{keywordtype}{double} Evaluate(MLAlgorithm& model,
                         \textcolor{keyword}{const} DataType& data,
                         \textcolor{keyword}{const} ResponsesType& labels)
  \{
    \textcolor{comment}{// Inside the method you should call model.Predict() and compare the}
    \textcolor{comment}{// values with the labels, in order to get the desired performance measure}
    \textcolor{comment}{// and return it.}
  \}
\};
\end{DoxyCode}
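
For a filled-\/in instance of this structure, the sketch below implements mean absolute error as a measure with a static {\ttfamily Evaluate()} method. To keep it self-\/contained it uses {\ttfamily std\+::vector} in place of Armadillo types, and {\ttfamily Constant\+Model} is a hypothetical stand-\/in for a trained learner; both are assumptions made for illustration and are not part of mlpack.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a trained model (not an mlpack class): it
// predicts the same value for every point.
struct ConstantModel
{
  double value;

  void Predict(const std::vector<double>& data,
               std::vector<double>& predictions) const
  {
    predictions.assign(data.size(), value);
  }
};

// A custom measure following the structure above: a static Evaluate() that
// calls model.Predict() and compares the predictions with the responses.
class MeanAbsoluteError
{
 public:
  template<typename MLAlgorithm, typename DataType, typename ResponsesType>
  static double Evaluate(MLAlgorithm& model,
                         const DataType& data,
                         const ResponsesType& responses)
  {
    assert(data.size() == responses.size() && !responses.empty());

    ResponsesType predictions;
    model.Predict(data, predictions);

    double sum = 0.0;
    for (std::size_t i = 0; i < responses.size(); ++i)
      sum += std::abs(predictions[i] - responses[i]);

    return sum / responses.size();
  }
};
```

A measure written against Armadillo types would have the same shape, with the loop typically replaced by vectorized Armadillo expressions.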


Once such a class is implemented, {\ttfamily Custom\+Measure} (or whatever the class is called) can be used as a custom performance measure with {\ttfamily \doxyref{K\+Fold\+CV}{p.}{classmlpack_1_1cv_1_1KFoldCV}} or {\ttfamily \doxyref{Simple\+CV}{p.}{classmlpack_1_1cv_1_1SimpleCV}}.\section{The K\+Fold\+C\+V and Simple\+C\+V classes}\label{cv_cvbasic_api}
This section provides details about the {\ttfamily \doxyref{K\+Fold\+CV}{p.}{classmlpack_1_1cv_1_1KFoldCV}} and {\ttfamily \doxyref{Simple\+CV}{p.}{classmlpack_1_1cv_1_1SimpleCV}} classes. The cross-\/validation infrastructure relies heavily on template metaprogramming, so that any {\bfseries mlpack} learner and any performance measure can be used. Both classes have two required template parameters and one optional template parameter\+:


\begin{DoxyItemize}
\item {\ttfamily M\+L\+Algorithm\+:} the type of learner to be used
\item {\ttfamily Metric\+:} the performance measure to be evaluated
\item {\ttfamily Mat\+Type\+:} the type of matrix used to store the data
\end{DoxyItemize}

In addition, there are two more template parameters, but these are automatically extracted from the given {\ttfamily M\+L\+Algorithm} class, and users should not need to specify these parameters except when using an unconventional type like {\ttfamily arma\+::fmat} for data points.

The general structure of the {\ttfamily \doxyref{K\+Fold\+CV}{p.}{classmlpack_1_1cv_1_1KFoldCV}} and {\ttfamily \doxyref{Simple\+CV}{p.}{classmlpack_1_1cv_1_1SimpleCV}} classes is split into two parts\+:


\begin{DoxyItemize}
\item The constructor\+: create the object, and store the data for the {\ttfamily M\+L\+Algorithm} training.
\item The {\ttfamily Evaluate()} method\+: take any non-\/data parameters for the {\ttfamily M\+L\+Algorithm} and calculate the desired performance measure.
\end{DoxyItemize}
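
The two-\/part structure can be sketched with a variadic template\+: the constructor stores and splits the data, and {\ttfamily Evaluate()} forwards whatever hyperparameters it receives to the learner\textquotesingle{}s constructor after the data arguments. Everything in the block below ({\ttfamily Hold\+Out\+CV}, {\ttfamily Shrunken\+Mean}, {\ttfamily Toy\+M\+SE}, and the use of {\ttfamily std\+::vector}) is a hypothetical, simplified stand-\/in written for illustration; it is not the mlpack implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<double>;

// Toy learner (hypothetical): always predicts the sum of the training
// responses divided by (number of training points + lambda).
class ShrunkenMean
{
 public:
  ShrunkenMean(const Vec& /* xs */, const Vec& ys, const double lambda = 0.0)
  {
    double sum = 0.0;
    for (const double y : ys)
      sum += y;
    prediction = sum / (ys.size() + lambda);
  }

  void Predict(const Vec& xs, Vec& out) const
  {
    out.assign(xs.size(), prediction);
  }

 private:
  double prediction;
};

// Toy metric (hypothetical): mean squared error on the validation data.
struct ToyMSE
{
  template<typename Model>
  static double Evaluate(Model& model, const Vec& xs, const Vec& ys)
  {
    Vec predictions;
    model.Predict(xs, predictions);
    double sum = 0.0;
    for (std::size_t i = 0; i < ys.size(); ++i)
    {
      const double diff = predictions[i] - ys[i];
      sum += diff * diff;
    }
    return sum / ys.size();
  }
};

template<typename MLAlgorithm, typename Metric>
class HoldOutCV
{
 public:
  // Part one: all data-related arguments go to the constructor.  The last
  // validationProportion of the points is held out for validation.
  HoldOutCV(const double validationProportion, const Vec& xs, const Vec& ys)
  {
    assert(xs.size() == ys.size());
    assert(validationProportion > 0.0 && validationProportion < 1.0);
    const std::size_t numValid =
        static_cast<std::size_t>(validationProportion * xs.size());
    assert(numValid > 0 && numValid < xs.size());
    const std::size_t numTrain = xs.size() - numValid;

    trainXs.assign(xs.begin(), xs.begin() + numTrain);
    trainYs.assign(ys.begin(), ys.begin() + numTrain);
    validXs.assign(xs.begin() + numTrain, xs.end());
    validYs.assign(ys.begin() + numTrain, ys.end());
  }

  // Part two: hyperparameters go to Evaluate(), which forwards them to the
  // learner's constructor after the data arguments.
  template<typename... Hyperparameters>
  double Evaluate(Hyperparameters&&... hyperparameters) const
  {
    MLAlgorithm model(trainXs, trainYs,
        std::forward<Hyperparameters>(hyperparameters)...);
    return Metric::Evaluate(model, validXs, validYs);
  }

 private:
  Vec trainXs, trainYs, validXs, validYs;
};
```

With this design, calling {\ttfamily Evaluate()} with no arguments uses the learner\textquotesingle{}s default hyperparameters, and the same cross-\/validation object can be reused to compare several hyperparameter settings on an identical split.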

This split is important because it defines the A\+PI\+: all data-\/related parameters are passed to the constructor, whereas algorithm hyperparameters are passed to the {\ttfamily Evaluate()} method.\subsection{The K\+Fold\+C\+V and Simple\+C\+V constructors}\label{cv_cvbasic_api_constructor}
There are six constructors available for {\ttfamily \doxyref{K\+Fold\+CV}{p.}{classmlpack_1_1cv_1_1KFoldCV}} and {\ttfamily \doxyref{Simple\+CV}{p.}{classmlpack_1_1cv_1_1SimpleCV}}, each tailored for a different learning situation. Each is given below for the {\ttfamily \doxyref{K\+Fold\+CV}{p.}{classmlpack_1_1cv_1_1KFoldCV}} class, but the same constructors are also available for the {\ttfamily \doxyref{Simple\+CV}{p.}{classmlpack_1_1cv_1_1SimpleCV}} class, with the exception that instead of specifying {\ttfamily k}, the number of folds, the {\ttfamily \doxyref{Simple\+CV}{p.}{classmlpack_1_1cv_1_1SimpleCV}} class takes a parameter between 0 and 1 specifying the proportion of the dataset to use as a validation set.


\begin{DoxyItemize}
\item {\ttfamily K\+Fold\+C\+V(k, xs, ys)}\+: this is for unweighted regression applications and two-\/class classification applications; {\ttfamily xs} is the dataset and {\ttfamily ys} are the responses or labels for each point in the dataset.
\item {\ttfamily K\+Fold\+C\+V(k, xs, ys, num\+Classes)}\+: this is for unweighted classification applications; {\ttfamily xs} is the dataset, {\ttfamily ys} are the class labels for each data point, and {\ttfamily num\+Classes} is the number of classes in the dataset.
\item {\ttfamily K\+Fold\+C\+V(k, xs, dataset\+Info, ys, num\+Classes)}\+: this is for unweighted categorical/numeric classification applications; {\ttfamily xs} is the dataset, {\ttfamily dataset\+Info} is a \doxyref{data\+::\+Dataset\+Info}{p.}{namespacemlpack_1_1data_aa243ad7e4d29363b858bbc92b732921d} object that holds the types of each dimension in the dataset, {\ttfamily ys} are the class labels for each data point, and {\ttfamily num\+Classes} is the number of classes in the dataset.
\item {\ttfamily K\+Fold\+C\+V(k, xs, ys, weights)}\+: this is for weighted regression or two-\/class classification applications; {\ttfamily xs} is the dataset, {\ttfamily ys} are the responses or labels for each point in the dataset, and {\ttfamily weights} are the weights for each point in the dataset.
\item {\ttfamily K\+Fold\+C\+V(k, xs, ys, num\+Classes, weights)}\+: this is for weighted classification applications; {\ttfamily xs} is the dataset, {\ttfamily ys} are the class labels for each point in the dataset, {\ttfamily num\+Classes} is the number of classes in the dataset, and {\ttfamily weights} holds the weights for each point in the dataset.
\item {\ttfamily K\+Fold\+C\+V(k, xs, dataset\+Info, ys, num\+Classes, weights)}\+: this is for weighted categorical/numeric classification applications; {\ttfamily xs} is the dataset, {\ttfamily dataset\+Info} is a \doxyref{data\+::\+Dataset\+Info}{p.}{namespacemlpack_1_1data_aa243ad7e4d29363b858bbc92b732921d} object that holds the types of each dimension in the dataset, {\ttfamily ys} are the class labels for each data point, {\ttfamily num\+Classes} is the number of classes in the dataset, and {\ttfamily weights} holds the weights for each point in the dataset.
\end{DoxyItemize}

Note that the constructor you should use is the one that most closely matches the constructor of the machine learning algorithm whose performance you would like to measure. So, for instance, if you are doing multi-\/class softmax regression, you could call the constructor {\ttfamily \char`\"{}\+Softmax\+Regression(xs, ys, num\+Classes)\char`\"{}}. Therefore, for {\ttfamily \doxyref{K\+Fold\+CV}{p.}{classmlpack_1_1cv_1_1KFoldCV}} you would call the constructor {\ttfamily \char`\"{}\+K\+Fold\+C\+V(k, xs, ys, num\+Classes)\char`\"{}} and for {\ttfamily \doxyref{Simple\+CV}{p.}{classmlpack_1_1cv_1_1SimpleCV}} you would call the constructor {\ttfamily \char`\"{}\+Simple\+C\+V(pct, xs, ys, num\+Classes)\char`\"{}}.\subsection{The Evaluate() method}\label{cv_cvbasic_api_evaluate}
The other method that {\ttfamily \doxyref{K\+Fold\+CV}{p.}{classmlpack_1_1cv_1_1KFoldCV}} and {\ttfamily \doxyref{Simple\+CV}{p.}{classmlpack_1_1cv_1_1SimpleCV}} have is the method to actually calculate the performance measure\+: {\ttfamily Evaluate()}. The {\ttfamily Evaluate()} method takes any hyperparameters that would follow the data arguments to the constructor or {\ttfamily Train()} method of the given {\ttfamily M\+L\+Algorithm}. The {\ttfamily Evaluate()} method takes no more arguments than that, and returns the desired performance measure on the dataset.

Therefore, let us suppose that we are interested in cross-\/validating the performance of a softmax regression model, and that we have constructed the appropriate {\ttfamily \doxyref{K\+Fold\+CV}{p.}{classmlpack_1_1cv_1_1KFoldCV}} object using the code below\+:


\begin{DoxyCode}
KFoldCV<SoftmaxRegression, Precision> cv(k, data, labels, numClasses);
\end{DoxyCode}


The \doxyref{Softmax\+Regression}{p.}{classmlpack_1_1regression_1_1SoftmaxRegression} class has the constructor


\begin{DoxyCode}
\textcolor{keyword}{template}<\textcolor{keyword}{typename} OptimizerType = mlpack::optimization::L\_BFGS>
SoftmaxRegression(\textcolor{keyword}{const} arma::mat& data,
                  \textcolor{keyword}{const} arma::Row<size\_t>& labels,
                  \textcolor{keyword}{const} \textcolor{keywordtype}{size\_t} numClasses,
                  \textcolor{keyword}{const} \textcolor{keywordtype}{double} lambda = 0.0001,
                  \textcolor{keyword}{const} \textcolor{keywordtype}{bool} fitIntercept = \textcolor{keyword}{false},
                  OptimizerType optimizer = OptimizerType());
\end{DoxyCode}


Note that all parameters after {\ttfamily num\+Classes} are optional. This means that we can specify none or any of them in our call to {\ttfamily Evaluate()}. Below is some example code showing three different ways we can call {\ttfamily Evaluate()} with the {\ttfamily cv} object from the code snippet above.


\begin{DoxyCode}
\textcolor{comment}{// First, call with all defaults.}
\textcolor{keywordtype}{double} result1 = cv.Evaluate();

\textcolor{comment}{// Next, call with lambda set to 0.1 and fitIntercept set to true.}
\textcolor{keywordtype}{double} result2 = cv.Evaluate(0.1, \textcolor{keyword}{true});

\textcolor{comment}{// Lastly, create a custom optimizer to use for optimization, and use a lambda}
\textcolor{comment}{// value of 0.5 and fit no intercept.}
optimization::SGD<> sgd(0.05, 50000); \textcolor{comment}{// Step size of 0.05, 50k max iterations.}
\textcolor{keywordtype}{double} result3 = cv.Evaluate(0.5, \textcolor{keyword}{false}, sgd);
\end{DoxyCode}


The same general idea applies to any {\ttfamily M\+L\+Algorithm\+:} all hyperparameters must be passed to the {\ttfamily Evaluate()} method of {\ttfamily \doxyref{K\+Fold\+CV}{p.}{classmlpack_1_1cv_1_1KFoldCV}} or {\ttfamily \doxyref{Simple\+CV}{p.}{classmlpack_1_1cv_1_1SimpleCV}}.\section{Further references}\label{cv_cvbasic_further}
For further documentation, please see the associated Doxygen documentation for each of the relevant classes\+:


\begin{DoxyItemize}
\item \doxyref{mlpack\+::cv\+::\+Simple\+CV}{p.}{classmlpack_1_1cv_1_1SimpleCV}
\item \doxyref{mlpack\+::cv\+::\+K\+Fold\+CV}{p.}{classmlpack_1_1cv_1_1KFoldCV}
\item \doxyref{mlpack\+::cv\+::\+Accuracy}{p.}{classmlpack_1_1cv_1_1Accuracy}
\item \doxyref{mlpack\+::cv\+::\+F1}{p.}{classmlpack_1_1cv_1_1F1}
\item \doxyref{mlpack\+::cv\+::\+M\+SE}{p.}{classmlpack_1_1cv_1_1MSE}
\item \doxyref{mlpack\+::cv\+::\+Precision}{p.}{classmlpack_1_1cv_1_1Precision}
\item \doxyref{mlpack\+::cv\+::\+Recall}{p.}{classmlpack_1_1cv_1_1Recall}
\end{DoxyItemize}

If you are interested in implementing a different cross-\/validation strategy than k-\/fold cross-\/validation or simple cross-\/validation, take a look at the implementations of each of those classes to guide your implementation.

In addition, the \doxyref{hyperparameter tuner}{p.}{namespacemlpack_1_1hpt} documentation may also be relevant.