\section{Introduction}\label{rltutorial_intro_rltut}
Reinforcement Learning is one of the hottest topics right now, with interest surging after Deep\+Mind published their article on training deep neural networks to play Atari games to great success. mlpack implements a complete end-\/to-\/end framework for Reinforcement Learning, featuring multiple environments, policies and methods. Of course, custom environments and policies can be used and plugged into the existing framework with no runtime overhead.

mlpack implements typical benchmark environments (Acrobot, Mountain Car, etc.), commonly used policies and replay methods, and supports asynchronous learning as well. In addition, it can {\tt communicate} with the Open\+AI Gym toolkit for more environments.\section{Table of Contents}\label{rltutorial_toc_rltut}
This tutorial is split into the following sections\+:


\begin{DoxyItemize}
\item \doxyref{Introduction}{p.}{rltutorial_intro_rltut}
\item \doxyref{Table of Contents}{p.}{rltutorial_toc_rltut}
\item \doxyref{Reinforcement Learning Environments}{p.}{rltutorial_environment_rltut}
\item \doxyref{Components of an RL Agent}{p.}{rltutorial_agent_components_rltut}
\item \doxyref{Q-\/\+Learning in mlpack}{p.}{rltutorial_q_learning_rltut}
\item \doxyref{Asynchronous Learning}{p.}{rltutorial_async_learning_rltut}
\item \doxyref{Further documentation}{p.}{rltutorial_further_rltut}
\end{DoxyItemize}\section{Reinforcement Learning Environments}\label{rltutorial_environment_rltut}
mlpack implements a number of the most popular environments used for testing RL agents and algorithms. These include the Cart Pole, Acrobot, Mountain Car and their variations. Of course, as mentioned above, you can communicate with Open\+AI Gym for other environments, like the Atari video games.

A key component of mlpack is its extensibility. It is a simple process to create your own custom environments, specific to your needs, and use them with mlpack\textquotesingle{}s RL framework. All the environments implement a few specific methods and classes which are used by the agents while learning.


\begin{DoxyItemize}
\item {\ttfamily State\+:} The State class is a representation of the environment. For the Cart\+Pole, this would involve storing the position, velocity, angle and angular velocity.
\item {\ttfamily Action\+:} It is an enum naming all the possible actions the agent can take in the environment. Continuing with the Cart\+Pole example, the Action enum would simply contain the two possible actions, backward and forward.
\item {\ttfamily Sample\+:} This method is perhaps the heart of the environment\+: it returns a reward to the agent based on the current state and the action taken, and it updates the state based on that action as well.
\end{DoxyItemize}
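
The three pieces listed above can be sketched without any mlpack dependency. The following is a minimal, hypothetical environment\+: the class name {\ttfamily LineWalk}, the {\ttfamily IsTerminal} helper and the reward values are assumptions made for illustration and are not taken from mlpack. It only shows the general State/Action/Sample shape; the exact signatures expected by the agents should be checked against the environments shipped with mlpack.

```cpp
#include <cassert>

// Hypothetical toy environment (not an mlpack class): the agent walks along
// the integer line and is rewarded for reaching position +3.
class LineWalk
{
 public:
  // State: everything needed to describe the environment at one time step.
  struct State
  {
    int position = 0;
  };

  // Action: every move the agent may take.
  enum class Action
  {
    backward,
    forward
  };

  // Sample: apply `action` to `state`, write the successor into `nextState`,
  // and return the reward for this transition.
  double Sample(const State& state, const Action action, State& nextState) const
  {
    nextState.position = state.position + (action == Action::forward ? 1 : -1);
    return IsTerminal(nextState) ? 1.0 : -0.1;
  }

  // The episode ends once the goal position is reached.
  bool IsTerminal(const State& state) const
  {
    assert(goal > 0);
    return state.position >= goal;
  }

 private:
  static constexpr int goal = 3;
};
```

Keeping the transition logic in a single {\ttfamily Sample} method means an agent never needs to know how the environment evolves internally; it only sees states, actions and rewards.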

Of course, your custom environment will most likely make use of a number of helper methods, depending on your application, such as the {\ttfamily Dsdt} method in the {\ttfamily Acrobot} environment, which is used by the {\ttfamily R\+K4} iterative method (another helper method) to estimate the next state.\section{Components of an R\+L Agent}\label{rltutorial_agent_components_rltut}
A Reinforcement Learning agent, in general, takes actions in an environment in order to maximize a cumulative reward. To that end, it requires a way to choose actions ({\bfseries policy}) and a way to sample previous experiences ({\bfseries replay}).

An example of a simple policy would be an epsilon-\/greedy policy. Using such a policy, the agent chooses a random action with probability epsilon and the greedy action (the one with the highest estimated value) otherwise. Epsilon is slowly decreased over time, shifting the balance from exploration towards exploitation.
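
To make the exploration/exploitation trade-off concrete, here is a small standalone sketch of an epsilon-\/greedy rule with linear annealing. It is not mlpack\textquotesingle{}s {\ttfamily Greedy\+Policy}; the class name {\ttfamily EpsilonGreedy} and its members are hypothetical, and the linear decay schedule is an assumption chosen for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// Illustrative epsilon-greedy policy with linear annealing (hypothetical
// class, not mlpack's GreedyPolicy).
class EpsilonGreedy
{
 public:
  EpsilonGreedy(double initialEpsilon, std::size_t annealInterval,
                double minEpsilon) :
      epsilon(initialEpsilon),
      minEpsilon(minEpsilon),
      delta((initialEpsilon - minEpsilon) / static_cast<double>(annealInterval))
  {
    assert(annealInterval > 0);
  }

  // With probability epsilon pick a uniformly random action (explore);
  // otherwise pick the action with the largest estimated value (exploit).
  template<typename Rng>
  std::size_t Sample(const std::vector<double>& actionValues, Rng& rng) const
  {
    assert(!actionValues.empty());
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < epsilon)
    {
      std::uniform_int_distribution<std::size_t> pick(0,
          actionValues.size() - 1);
      return pick(rng);
    }
    return static_cast<std::size_t>(std::distance(actionValues.begin(),
        std::max_element(actionValues.begin(), actionValues.end())));
  }

  // Lower epsilon by one annealing step, never going below minEpsilon.
  void Anneal() { epsilon = std::max(minEpsilon, epsilon - delta); }

  double Epsilon() const { return epsilon; }

 private:
  double epsilon;
  double minEpsilon;
  double delta;
};
```

With epsilon at 1 every action is random; once epsilon has been annealed to its minimum, most actions are greedy but a small amount of exploration remains.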

Similarly, an example of a simple replay would be a random replay. At each time step, the interactions between the agent and the environment are saved to a memory buffer and previous experiences are sampled from the buffer to train the agent.
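
The idea of a random replay can be sketched in a few lines of standard C++. This is only an illustration, not mlpack\textquotesingle{}s {\ttfamily Random\+Replay}\+: the names {\ttfamily Transition} and {\ttfamily UniformReplay}, the integer state encoding, and sampling with replacement are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// One stored interaction. The field layout is an assumption for illustration.
struct Transition
{
  int state;
  int action;
  double reward;
  int nextState;
  bool isTerminal;
};

// Illustrative uniform replay buffer (hypothetical, not mlpack's class).
class UniformReplay
{
 public:
  UniformReplay(std::size_t batchSize, std::size_t capacity) :
      batchSize(batchSize), capacity(capacity)
  {
    assert(capacity > 0);
    buffer.reserve(capacity);
  }

  // Save a transition; once the buffer is full, overwrite the oldest entry.
  void Store(const Transition& t)
  {
    if (buffer.size() < capacity)
      buffer.push_back(t);
    else
      buffer[position] = t;
    position = (position + 1) % capacity;
  }

  // Draw batchSize transitions uniformly at random, with replacement.
  template<typename Rng>
  std::vector<Transition> Sample(Rng& rng) const
  {
    assert(!buffer.empty());
    std::uniform_int_distribution<std::size_t> pick(0, buffer.size() - 1);
    std::vector<Transition> batch;
    batch.reserve(batchSize);
    for (std::size_t i = 0; i < batchSize; ++i)
      batch.push_back(buffer[pick(rng)]);
    return batch;
  }

  std::size_t Size() const { return buffer.size(); }

 private:
  std::size_t batchSize;
  std::size_t capacity;
  std::size_t position = 0;
  std::vector<Transition> buffer;
};
```

Sampling uniformly from a large buffer breaks the correlation between consecutive time steps, which is the main reason replay is used when training a network online.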

Instantiating the components of an agent can be easily done by passing the Environment as a templated argument and the parameters of the policy/replay to the constructor.

To create a Greedy Policy and Prioritized Replay for the Cart\+Pole environment, we would do the following\+:


\begin{DoxyCode}
GreedyPolicy<CartPole> policy(1.0, 1000, 0.1);
PrioritizedReplay<CartPole> replayMethod(10, 10000, 0.6);
\end{DoxyCode}


The arguments to {\ttfamily policy} are the initial epsilon value, the interval over which its value is decreased, and the value at which epsilon bottoms out and won\textquotesingle{}t be reduced further. The arguments to {\ttfamily replay\+Method} are the size of the batch returned, the number of examples stored in memory, and the degree of prioritization.

In addition to the above components, an RL agent requires many hyperparameters to be tuned during its training period. These parameters include everything from the discount rate of the future reward to whether Double Q-\/learning should be used or not. The {\ttfamily Training\+Config} class can be instantiated and configured as follows\+:


\begin{DoxyCode}
TrainingConfig config;
config.StepSize() = 0.01;
config.Discount() = 0.9;
config.TargetNetworkSyncInterval() = 100;
config.ExplorationSteps() = 100;
config.DoubleQLearning() = \textcolor{keyword}{false};
config.StepLimit() = 200;
\end{DoxyCode}


The object {\ttfamily config} describes an RL agent that uses a step size of 0.\+01 for the optimization process, a discount factor of 0.\+9, and a target network sync interval of 100. This agent only starts learning after storing 100 exploration steps, has a step limit of 200, and does not use Double Q-\/learning.
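
As a rough illustration of what the discount factor controls, the sketch below computes the discounted return of a single episode, i.e. the sum of rewards where the reward at step t is weighted by the discount raised to the power t. The function name {\ttfamily DiscountedReturn} is hypothetical and is not part of mlpack; it is only meant to show why a discount of 0.\+9 makes near-term rewards count more than distant ones.

```cpp
#include <cassert>
#include <vector>

// Discounted return of one episode: G = r_0 + d * r_1 + d^2 * r_2 + ...
// Accumulating from the last reward backwards avoids computing powers.
double DiscountedReturn(const std::vector<double>& rewards, double discount)
{
  assert(discount >= 0.0 && discount <= 1.0);
  double total = 0.0;
  for (auto it = rewards.rbegin(); it != rewards.rend(); ++it)
    total = *it + discount * total;
  return total;
}
```

With a discount of 1 every reward counts equally, while a discount of 0 makes the agent care only about the immediate reward.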

In this way, we can easily configure an RL agent with the desired hyperparameters.\section{Q-\/\+Learning in mlpack}\label{rltutorial_q_learning_rltut}
Here, we demonstrate Q-\/\+Learning in mlpack through the use of a simple example, the training of a Q-\/\+Learning agent on the Cart\+Pole environment. The code has been broken into chunks for easy understanding.


\begin{DoxyCode}
\textcolor{preprocessor}{#include <mlpack/core.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/ann/ffn.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/ann/init_rules/gaussian_init.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/ann/layer/layer.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/ann/loss_functions/mean_squared_error.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/reinforcement_learning/q_learning.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/reinforcement_learning/environment/cart_pole.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/reinforcement_learning/policy/greedy_policy.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/reinforcement_learning/training_config.hpp>}
\textcolor{preprocessor}{#include <ensmallen.hpp>}

\textcolor{keyword}{using namespace }mlpack;
\textcolor{keyword}{using namespace }mlpack::ann;
\textcolor{keyword}{using namespace }ens;
\textcolor{keyword}{using namespace }mlpack::rl;
\end{DoxyCode}


We include all the necessary components of our toy example and declare namespaces for convenience.


\begin{DoxyCode}
\textcolor{keywordtype}{int} main()
\{
  \textcolor{comment}{// Set up the network.}
  FFN<MeanSquaredError<>, GaussianInitialization> model(MeanSquaredError<>(),
      GaussianInitialization(0, 0.001));
  model.Add<Linear<>>(4, 128);
  model.Add<ReLULayer<>>();
  model.Add<Linear<>>(128, 128);
  model.Add<ReLULayer<>>();
  model.Add<Linear<>>(128, 2);
\end{DoxyCode}


The first step in setting up our Q-\/learning agent is to set up the network for it to use. Here, we use mlpack\textquotesingle{}s ann module to build a simple feedforward network (F\+FN) with two hidden layers of 128 units each.

\begin{DoxyNote}{Note}
The network constructed here has an input shape of 4 and output shape of 2. This corresponds to the structure of the Cart\+Pole environment, where each state is represented as a column vector with 4 data members (position, velocity, angle, angular velocity). Similarly, the output shape is determined by the number of possible actions, which in this case is only 2 (forward and backward).
\end{DoxyNote}
The next step would be to set up the other components of the Q-\/learning agent, namely its policy, replay method and hyperparameters.


\begin{DoxyCode}
\textcolor{comment}{// Set up the policy and replay method.}
 GreedyPolicy<CartPole> policy(1.0, 1000, 0.1, 0.99);
 RandomReplay<CartPole> replayMethod(10, 10000);

 TrainingConfig config;
 config.StepSize() = 0.01;
 config.Discount() = 0.9;
 config.TargetNetworkSyncInterval() = 100;
 config.ExplorationSteps() = 100;
 config.DoubleQLearning() = \textcolor{keyword}{false};
 config.StepLimit() = 200;
\end{DoxyCode}


And now, we get to the heart of the program, declaring a Q-\/\+Learning agent.


\begin{DoxyCode}
QLearning<CartPole, decltype(model), AdamUpdate, decltype(policy)>
    agent(std::move(config), std::move(model), std::move(policy),
    std::move(replayMethod));
\end{DoxyCode}


Here, we call the {\ttfamily Q\+Learning} constructor, passing the types of the environment, network, updater and policy as template arguments (the replay type is left to its default template argument). We use {\ttfamily decltype(var)} as shorthand for the type of a variable, saving us the trouble of copying the lengthy templated type.

Similarly, {\ttfamily std\+::move} is used to transfer the components into the agent instead of copying them.

We have our Q-\/\+Learning agent {\ttfamily agent} ready to be trained on the Cart Pole environment.


\begin{DoxyCode}
  arma::running\_stat<double> averageReturn;
  \textcolor{keywordtype}{size\_t} episodes = 0;
  \textcolor{keywordtype}{bool} converged = \textcolor{keyword}{true};
  \textcolor{keywordflow}{while} (\textcolor{keyword}{true})
  \{
    \textcolor{keywordtype}{double} episodeReturn = agent.Episode();
    averageReturn(episodeReturn);
    episodes += 1;

    \textcolor{keywordflow}{if} (episodes > 1000)
    \{
      std::cout << \textcolor{stringliteral}{"Cart Pole with DQN failed."} << std::endl;
      converged = \textcolor{keyword}{false};
      \textcolor{keywordflow}{break};
    \}

    std::cout << \textcolor{stringliteral}{"Average return: "} << averageReturn.mean()
        << \textcolor{stringliteral}{" Episode return: "} << episodeReturn << std::endl;
    \textcolor{keywordflow}{if} (averageReturn.mean() > 35)
      \textcolor{keywordflow}{break};
  \}
  \textcolor{keywordflow}{if} (converged)
    std::cout << \textcolor{stringliteral}{"Hooray! Q-Learning agent successfully trained"} << std::endl;

  \textcolor{keywordflow}{return} 0;
\}
\end{DoxyCode}


We set up a loop to train the agent. The exit condition is determined by the average reward which can be computed with {\ttfamily arma\+::running\+\_\+stat}. It is used for storing running statistics of scalars, which in this case is the reward signal. The agent can be said to have converged when the average return reaches a predetermined value (i.\+e. $>$ 35).
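
The stopping logic described above can be isolated from mlpack and Armadillo in a short sketch. The names {\ttfamily RunningMean} and {\ttfamily TrainUntil} are hypothetical, and the incremental mean shown here is just one common way to keep a running average; it is not claimed to be how {\ttfamily arma\+::running\+\_\+stat} is implemented. The callable {\ttfamily runEpisode} stands in for {\ttfamily agent.\+Episode()}.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative running mean that does not store past samples.
class RunningMean
{
 public:
  // Fold one new sample into the mean.
  void Add(double sample)
  {
    ++count;
    mean += (sample - mean) / static_cast<double>(count);
  }

  double Mean() const { return mean; }
  std::size_t Count() const { return count; }

 private:
  std::size_t count = 0;
  double mean = 0.0;
};

// Training-loop skeleton: return true as soon as the mean return exceeds
// `threshold`, or false once `maxEpisodes` episodes have been played.
template<typename EpisodeFn>
bool TrainUntil(EpisodeFn runEpisode, double threshold, std::size_t maxEpisodes)
{
  assert(maxEpisodes > 0);
  RunningMean averageReturn;
  for (std::size_t episode = 0; episode < maxEpisodes; ++episode)
  {
    averageReturn.Add(runEpisode());
    if (averageReturn.Mean() > threshold)
      return true;
  }
  return false;
}
```

Note that a mean over all episodes reacts slowly, because early low-return episodes keep weighing it down; a windowed average is a common alternative when a more responsive criterion is wanted.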

Conversely, if the average return does not exceed that amount even after a thousand episodes, we conclude that the agent has failed to converge and exit the training loop.\section{Asynchronous Learning}\label{rltutorial_async_learning_rltut}
In 2016, researchers at Deep\+Mind and the University of Montreal published the paper \char`\"{}\+Asynchronous Methods for Deep Reinforcement Learning\char`\"{}. In it, they described asynchronous variants of four standard reinforcement learning algorithms\+:
\begin{DoxyItemize}
\item One-\/\+Step S\+A\+R\+SA
\item One-\/\+Step Q-\/\+Learning
\item N-\/\+Step Q-\/\+Learning
\item Advantage Actor-\/\+Critic (A3C)
\end{DoxyItemize}

Online RL algorithms and Deep Neural Networks make an unstable combination because of the non-\/stationary and correlated nature of online updates. Although this is solved by Experience Replay, it has several drawbacks\+: it uses more memory and computation per real interaction; and it requires off-\/policy learning algorithms.

Asynchronous methods, instead of using experience replay, execute multiple agents in parallel on multiple instances of the environment, which addresses all of the problems above.

Here, we demonstrate Asynchronous Learning methods in mlpack through the training of an async agent. Asynchronous learning involves training several agents simultaneously; each of these agents is referred to as a \char`\"{}worker\char`\"{}. Currently, mlpack provides a One-\/\+Step Q-\/\+Learning worker, an N-\/\+Step Q-\/\+Learning worker and a One-\/\+Step S\+A\+R\+SA worker.

Let\textquotesingle{}s examine the sample code in chunks.

Apart from the includes used for the Q-\/\+Learning example, two more have to be included\+:


\begin{DoxyCode}
\textcolor{preprocessor}{#include <mlpack/methods/reinforcement_learning/async_learning.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/reinforcement_learning/policy/aggregated_policy.hpp>}
\end{DoxyCode}


Here we don\textquotesingle{}t use experience replay, and instead of a single policy, we use three different policies, each corresponding to its own worker. The number of workers created depends on the number of policies given in the Aggregated Policy. The column vector contains the probability distribution over the child policies. We should make sure that its size is the same as the number of policies and that its elements sum to 1.


\begin{DoxyCode}
AggregatedPolicy<GreedyPolicy<CartPole>> policy(\{GreedyPolicy<CartPole>(0.7, 5000, 0.1),
                                                 GreedyPolicy<CartPole>(0.7, 5000, 0.01),
                                                 GreedyPolicy<CartPole>(0.7, 5000, 0.5)\},
                                                 arma::colvec(\textcolor{stringliteral}{"0.4 0.3 0.3"}));
\end{DoxyCode}
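
The two requirements on the probability vector (one entry per child policy, entries summing to 1) are easy to check up front. The helpers below are a hypothetical, mlpack-independent sketch\+: {\ttfamily IsValidDistribution} and {\ttfamily SampleChild} are names invented here, and drawing the child index with {\ttfamily std\+::discrete\+\_\+distribution} is an assumption about how such a distribution might be used, not a description of {\ttfamily Aggregated\+Policy} internals.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Check that `weights` is a usable distribution over `numPolicies` children:
// one non-negative entry per policy, summing to 1 (up to rounding error).
bool IsValidDistribution(const std::vector<double>& weights,
                         std::size_t numPolicies)
{
  if (weights.size() != numPolicies)
    return false;
  for (double w : weights)
  {
    if (w < 0.0)
      return false;
  }
  const double sum = std::accumulate(weights.begin(), weights.end(), 0.0);
  return std::abs(sum - 1.0) < 1e-9;
}

// Pick the index of a child policy at random according to `weights`.
template<typename Rng>
std::size_t SampleChild(const std::vector<double>& weights, Rng& rng)
{
  assert(!weights.empty());
  std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
  return pick(rng);
}
```

The sum is compared with a small tolerance rather than with exact equality because values such as 0.4, 0.3 and 0.3 are not exactly representable in binary floating point.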


Now, we will create the \char`\"{}\+One\+Step\+Q\+Learning\char`\"{} agent. We could have used \char`\"{}\+N\+Step\+Q\+Learning\char`\"{} or \char`\"{}\+One\+Step\+Sarsa\char`\"{} here according to our requirement.


\begin{DoxyCode}
OneStepQLearning<CartPole, decltype(model), ens::AdamUpdate, decltype(policy)>
    agent(std::move(config), std::move(model), std::move(policy));
\end{DoxyCode}


Here, unlike the Q-\/\+Learning example, instead of the entire while loop, we use the Train method of the Asynchronous Learning class inside a for loop. 100 training episodes will take around 50 seconds.


\begin{DoxyCode}
\textcolor{keywordflow}{for} (\textcolor{keywordtype}{int} i = 0; i < 100; i++)
\{
  agent.Train(measure);
\}
\end{DoxyCode}


What is \char`\"{}measure\char`\"{} here? It is a lambda function which returns a boolean value (indicating the end of training) and accepts the episode return (total reward of a deterministic test episode) as parameter. So, let\textquotesingle{}s create that.


\begin{DoxyCode}
arma::vec returns(20, arma::fill::zeros);
\textcolor{keywordtype}{size\_t} position = 0;
\textcolor{keywordtype}{size\_t} episode = 0;

\textcolor{keyword}{auto} measure = [&returns, &position, &episode](\textcolor{keywordtype}{double} episodeReturn)
\{
  \textcolor{keywordflow}{if}(episode > 10000) \textcolor{keywordflow}{return} \textcolor{keyword}{true};

  returns[position++] = episodeReturn;
  position = position % returns.n\_elem;
  episode++;

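  std::cout << \textcolor{stringliteral}{"Episode No.: "} << episode
      << \textcolor{stringliteral}{"; Episode Return: "} << episodeReturn
      << \textcolor{stringliteral}{"; Average Return: "} << arma::mean(returns) << std::endl;

  \textcolor{comment}{// Not finished yet: keep training.}
  \textcolor{keywordflow}{return} \textcolor{keyword}{false};
\};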
\end{DoxyCode}


This will train three different agents on three C\+PU threads asynchronously and use this data to update the action value estimate. Voila, that\textquotesingle{}s all there is to it.
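
The bookkeeping done inside the {\ttfamily measure} callback is a fixed-size circular buffer of recent returns. The sketch below isolates that idea in standard C++; the class name {\ttfamily ReturnWindow} is hypothetical and nothing here is part of mlpack.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Fixed-size window over the most recent episode returns.
class ReturnWindow
{
 public:
  explicit ReturnWindow(std::size_t size) : returns(size, 0.0)
  {
    assert(size > 0);
  }

  // Overwrite the oldest slot with the newest return.
  void Push(double episodeReturn)
  {
    returns[position] = episodeReturn;
    position = (position + 1) % returns.size();
    ++count;
  }

  // Mean over all slots. Slots that have not been written yet still hold
  // their initial 0, just like a mean taken over a zero-filled vector.
  double Mean() const
  {
    return std::accumulate(returns.begin(), returns.end(), 0.0) /
        static_cast<double>(returns.size());
  }

  std::size_t Count() const { return count; }

 private:
  std::vector<double> returns;
  std::size_t position = 0;
  std::size_t count = 0;
};
```

Because unwritten slots count as zero, the reported average is biased low until the window has been filled once; this is worth keeping in mind when reading the first few lines of output.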

Here is the full code to try this right away\+:


\begin{DoxyCode}
\textcolor{preprocessor}{#include <mlpack/core.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/ann/ffn.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/ann/init_rules/gaussian_init.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/ann/layer/layer.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/ann/loss_functions/mean_squared_error.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/reinforcement_learning/async_learning.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/reinforcement_learning/environment/cart_pole.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/reinforcement_learning/policy/greedy_policy.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/reinforcement_learning/policy/aggregated_policy.hpp>}
\textcolor{preprocessor}{#include <mlpack/methods/reinforcement_learning/training_config.hpp>}
\textcolor{preprocessor}{#include <ensmallen.hpp>}

\textcolor{keyword}{using namespace }mlpack;
\textcolor{keyword}{using namespace }mlpack::ann;
\textcolor{keyword}{using namespace }mlpack::rl;
\textcolor{keywordtype}{int} main()
\{
  \textcolor{comment}{// Set up the network.}
  FFN<MeanSquaredError<>, GaussianInitialization> model(MeanSquaredError<>(), 
      GaussianInitialization(0, 0.001));
  model.Add<Linear<>>(4, 128);
  model.Add<ReLULayer<>>();
  model.Add<Linear<>>(128, 128);
  model.Add<ReLULayer<>>();
  model.Add<Linear<>>(128, 2);

  AggregatedPolicy<GreedyPolicy<CartPole>> policy(\{GreedyPolicy<CartPole>(0.7, 5000, 0.1),
                                                   GreedyPolicy<CartPole>(0.7, 5000, 0.01),
                                                   GreedyPolicy<CartPole>(0.7, 5000, 0.5)\},
                                                   arma::colvec(\textcolor{stringliteral}{"0.4 0.3 0.3"}));

  TrainingConfig config;
  config.StepSize() = 0.01;
  config.Discount() = 0.9;
  config.TargetNetworkSyncInterval() = 100;
  config.ExplorationSteps() = 100;
  config.DoubleQLearning() = \textcolor{keyword}{false};
  config.StepLimit() = 200;

  
  OneStepQLearning<CartPole, decltype(model), ens::VanillaUpdate, decltype(policy)>
      agent(std::move(config), std::move(model), std::move(policy));

  arma::vec returns(20, arma::fill::zeros);
  \textcolor{keywordtype}{size\_t} position = 0;
  \textcolor{keywordtype}{size\_t} episode = 0;

  \textcolor{keyword}{auto} measure = [&returns, &position, &episode](\textcolor{keywordtype}{double} episodeReturn)
  \{
    \textcolor{keywordflow}{if}(episode > 10000) \textcolor{keywordflow}{return} \textcolor{keyword}{true};

    returns[position++] = episodeReturn;
    position = position % returns.n\_elem;
    episode++;

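    std::cout << \textcolor{stringliteral}{"Episode No.: "} << episode
        << \textcolor{stringliteral}{"; Episode Return: "} << episodeReturn
        << \textcolor{stringliteral}{"; Average Return: "} << arma::mean(returns) << std::endl;

    \textcolor{comment}{// Not finished yet: keep training.}
    \textcolor{keywordflow}{return} \textcolor{keyword}{false};
  \};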

  \textcolor{keywordflow}{for} (\textcolor{keywordtype}{int} i = 0; i < 100; i++)
  \{
    agent.Train(measure);
  \}
\}
\end{DoxyCode}
\section{Further documentation}\label{rltutorial_further_rltut}
For further documentation on the rl classes, consult the \doxyref{complete A\+PI documentation}{p.}{namespacemlpack_1_1rl}.