{"id":7966,"date":"2020-10-30T16:01:20","date_gmt":"2020-10-30T16:01:20","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/10\/30\/how-to-make-sense-of-the-reinforcement-learning-agents\/"},"modified":"2020-10-30T16:01:20","modified_gmt":"2020-10-30T16:01:20","slug":"how-to-make-sense-of-the-reinforcement-learning-agents","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/10\/30\/how-to-make-sense-of-the-reinforcement-learning-agents\/","title":{"rendered":"How to Make Sense of the Reinforcement Learning Agents?"},"content":{"rendered":"<div id=\"post-\">\n<p><b>By <a href=\"https:\/\/piojanu.netlify.app\/\" target=\"_blank\" rel=\"noopener noreferrer\">Piotr Januszewski<\/a>, Research Software Engineer and PhD Student<\/b><\/p>\n<p><img alt=\"\" class=\"aligncenter\" data-lazy-loaded=\"1\" src=\"https:\/\/i0.wp.com\/neptune.ai\/wp-content\/uploads\/RL-agents-1.jpg?fit=1920%2C1377&amp;ssl=1\" width=\"100%\"><\/p>\n<p>Based on simply watching how an agent acts in the environment it is hard to tell anything about why it behaves this way and how it works internally. That\u2019s why it is crucial to establish metrics that tell WHY the agent performs in a certain way.<\/p>\n<p>This is challenging especially when the\u00a0<strong>agent doesn\u2019t behave the way we would like it to behave, \u2026 which is like always<\/strong>. Every AI practitioner knows that whatever we work on, most of the time it won\u2019t simply work out of the box (they wouldn\u2019t pay us so much for it otherwise).<\/p>\n<p>In this blog post, you\u2019ll learn\u00a0<strong>what to keep track of to inspect\/debug your agent learning trajectory<\/strong>. 
I\u2019ll assume you are already familiar with the Reinforcement Learning (RL) agent-environment setting (see Figure 1) and you\u2019ve heard about at least some of the most common RL\u00a0<a href=\"https:\/\/spinningup.openai.com\/en\/latest\/user\/algorithms.html\" rel=\"noopener noreferrer\" target=\"_blank\">algorithms<\/a>\u00a0and\u00a0<a href=\"https:\/\/gym.openai.com\/envs\/#atari\" rel=\"noopener noreferrer\" target=\"_blank\">environments<\/a>.<\/p>\n<p>Nevertheless, don\u2019t worry if you are just beginning your journey with RL. I\u2019ve tried to not depend too much on readers\u2019 prior knowledge and where I couldn\u2019t omit some details, I\u2019ve put references to useful materials.<\/p>\n<div>\n<img src=\"https:\/\/i0.wp.com\/neptune.ai\/wp-content\/uploads\/RL-framework.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><em>Figure 1: The Reinforcement Learning framework (Sutton &amp; Barto, 2018).<\/em><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>I\u2019ll start by\u00a0<strong>discussing<\/strong>\u00a0useful\u00a0<strong>metrics<\/strong>\u00a0that give us a glimpse into the training and decision processes of the agent.<\/p>\n<p>Then we will focus on the\u00a0<strong>aggregation statistics of these metrics<\/strong>, like average, that will help us analyze them for many episodes played by the agent throughout the training. These will help root cause any issues with the agent.<\/p>\n<p>At each step, I\u2019ll base my suggestions on my own experience in RL research. Let\u2019s jump right into it!<\/p>\n<p>\u00a0<\/p>\n<h3>Metrics I use to inspect RL agent training<\/h3>\n<p>\u00a0<br \/>There are multiple types of metrics to follow and each of them gives you different information about the model\u2019s performance. 
So the researcher can get the information about&#8230;<br \/>\u00a0<\/p>\n<h3><strong>&#8230;how is the agent doing<\/strong><\/h3>\n<p>\u00a0<br \/>Here, we will take a closer look at three metrics that diagnose the overall performance of the agent.<\/p>\n<div>\n<img src=\"https:\/\/i0.wp.com\/neptune.ai\/wp-content\/uploads\/RL-grand-theft.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p><b><strong>Episode return<\/strong><\/b><\/p>\n<p>This is what we care about the most. The whole agent training is all about getting to the\u00a0<strong>highest expected return possible<\/strong>\u00a0(see Figure 2). If this metric goes up throughout the training, it\u2019s a good sign.<\/p>\n<p><img alt=\"\" class=\"aligncenter\" data-recalc-dims=\"1\" src=\"https:\/\/i1.wp.com\/neptune.ai\/wp-content\/uploads\/RL-equation-1.png\" width=\"100%\"><br \/>\u00a0<\/p>\n<div>\n<img src=\"https:\/\/i2.wp.com\/neptune.ai\/wp-content\/uploads\/RL-equation-2.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<div class=\"caption\"><em>Figure 2:\u00a0<a href=\"https:\/\/spinningup.openai.com\/en\/latest\/spinningup\/rl_intro.html#the-rl-problem\" rel=\"noopener noreferrer\" target=\"_blank\">The RL Problem<\/a>. Find a policy \u03c0 that maximizes the objective J. The objective J is an expected return E[R] under the environment dynamics P. 
\u03c4 is the trajectory played by the agent (or its policy \u03c0).<\/em><\/div>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>However, it\u2019s much more useful to us when we know what return to expect, or what counts as a good score.<\/p>\n<p>That\u2019s why you should\u00a0<strong>always look for baselines<\/strong>, i.e. other people\u2019s results in the environment you work on, and compare your results with them.<\/p>\n<p>A random agent baseline is often a good start. It lets you recalibrate and get a feel for the true \u201czero\u201d score in the environment \u2013 the minimal return you can get simply from banging on the controller (see Figure 3).<\/p>\n<p><img alt=\"\" class=\"aligncenter\" data-recalc-dims=\"1\" src=\"https:\/\/i2.wp.com\/neptune.ai\/wp-content\/uploads\/RL-results-1.png\" width=\"100%\"><\/p>\n<div>\n<img src=\"https:\/\/i2.wp.com\/neptune.ai\/wp-content\/uploads\/RL-results-2.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<div class=\"caption\"><em>Figure 3. Table 3 from the\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1903.00374.pdf\" rel=\"noopener noreferrer\" target=\"_blank\">SimPLe<\/a>\u00a0paper with their results on Atari environments compared to many baselines alongside the random agent and human scores.<\/em><\/div>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p><b><strong>Episode length<\/strong><\/b><\/p>\n<p>This is a useful metric to analyze in conjunction with the episode return. It tells us if our agent is able to live for some time before termination. In MuJoCo environments, where diverse creatures learn to walk (see Figure 4), it tells you, for example, whether your agent makes any moves before flipping over and being reset to the beginning of the episode.<\/p>\n<div>\n<img src=\"https:\/\/i1.wp.com\/neptune.ai\/wp-content\/uploads\/RL-humanoid-falling.gif\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p><b><strong>Solve rate<\/strong><\/b><\/p>\n<p>Yet another metric to analyze with episode return. 
If your environment has a\u00a0<strong>notion of being solved<\/strong>, then it\u2019s useful to check how many episodes the agent can solve. For instance, in Sokoban (see Figure 5) there are partial rewards for pushing a box onto a target. However, the room is only solved when all boxes are on targets.<\/p>\n<div>\n<img src=\"https:\/\/i2.wp.com\/neptune.ai\/wp-content\/uploads\/RL-puzzle.gif\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><em>Figure 5. Sokoban is a transportation puzzle, where the player has to push all boxes in the room onto the storage targets.<\/em><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>So, it is possible for the agent to have a positive episode return, but still not finish the task it is required to solve.<\/p>\n<p>Another example is Google Research Football (see Figure 6) with its academies. There are some partial rewards for moving towards the opponents\u2019 goal, but the academy episode (e.g. practicing a counterattack situation in smaller groups) is only considered \u201csolved\u201d when the agent\u2019s team scores a goal.<\/p>\n<div>\n<img src=\"https:\/\/i2.wp.com\/neptune.ai\/wp-content\/uploads\/RL-research-football.gif\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<h3><strong>\u2026progress of training<\/strong><\/h3>\n<p>\u00a0<br \/>There are multiple ways to represent the notion of \u201ctime\u201d, i.e. what to measure progress against, in RL. Here are the top 4 picks.<\/p>\n<p><b><strong>Total environment steps<\/strong><\/b><\/p>\n<p>This simple metric tells you\u00a0<strong>how much experience, in terms of environment steps or timesteps, the agent has already gathered<\/strong>. 
This is often a more informative measure of training progress than wall time, which depends heavily on how fast your machine can simulate the environment and run the neural network computations (see Figure 6).<\/p>\n<p><img alt=\"\" class=\"aligncenter\" data-recalc-dims=\"1\" src=\"https:\/\/i1.wp.com\/neptune.ai\/wp-content\/uploads\/RL-training-results-1.png\" width=\"100%\"><\/p>\n<div>\n<img src=\"https:\/\/i2.wp.com\/neptune.ai\/wp-content\/uploads\/RL-training-results-2.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><em>Figure 6. DDPG training on the MuJoCo Ant environment. Both runs took 24h, but on different machines. One did ~5M steps and the other ~9.5M. For the latter, it was enough time to converge. For the former, it wasn\u2019t, and it scored worse.<\/em><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>Moreover, we report the final agent score together with how many environment steps (often called samples) it took to train it.\u00a0<strong>The higher the score with the fewer samples, the more sample-efficient the agent is.<\/strong><\/p>\n<p><b><strong>Training steps<\/strong><\/b><\/p>\n<p>We train neural networks with the Stochastic Gradient Descent (SGD) algorithm (see\u00a0<a href=\"http:\/\/www.deeplearningbook.org\/\" rel=\"noopener noreferrer\" target=\"_blank\">Deep Learning Book<\/a>).<\/p>\n<p><strong>The training steps metric tells us how many batch updates we did to the network<\/strong>. 
When training from an off-policy replay buffer, we can relate it to the total environment steps in order to better understand how many times, on average, each sample from the environment is shown to the network to learn from:<\/p>\n<p><em>batch size * training steps \/ total environment steps = batch size \/ rollout length<\/em><\/p>\n<p>where\u00a0<em>rollout length<\/em>\u00a0is the number of new timesteps we gather, on average, during the data collection phase in between training steps (when data collection and training are run sequentially).<\/p>\n<p>The above ratio, sometimes called training intensity,<strong>\u00a0shouldn\u2019t be below 1<\/strong>\u00a0as it would mean that some samples aren\u2019t shown even once to the network! In fact, it should be much higher than 1, e.g. 256 (as set e.g. in the RLlib implementation of\u00a0<a href=\"https:\/\/docs.ray.io\/en\/latest\/rllib-algorithms.html#deep-deterministic-policy-gradients-ddpg-td3\" rel=\"noopener noreferrer\" target=\"_blank\">DDPG<\/a>, look for \u201ctraining intensity\u201d).<\/p>\n<p><b><strong>Wall time<\/strong><\/b><\/p>\n<p>This simply tells us\u00a0<strong>how long an experiment has been running<\/strong>.<\/p>\n<p>It is useful for planning how much time each future experiment will need to finish:<\/p>\n<ul>\n<li>2-3 hours?\n<\/li>\n<li>full night??\n<\/li>\n<li>or a couple of days???\n<\/li>\n<li>whole week?!?!?!\n<\/li>\n<\/ul>\n<p>Yes, some experiments might take even a whole week on your PC to fully converge or train to the maximum episode return the method you use can achieve.<\/p>\n<p>Thankfully, in the development phase, shorter experiments (a few hours, up to 24h) are most of the time good enough to tell whether the agent is working or to test some improvement ideas.<\/p>\n<blockquote>\n<p><em>Note that you always want to plan your work in such a way that some experiments are running in the background while you work on something else, e.g. 
code, read, write, think, etc.<\/em><\/p>\n<\/blockquote>\n<p>This is why some dedicated workstations for only running experiments might be useful.<\/p>\n<p><b><strong>Steps per second<\/strong><\/b><\/p>\n<p><strong>How many environment steps an agent does in each second<\/strong>. The average of this value allows you to calculate how much time you need to run\u00a0 some number of environment steps.<\/p>\n<p>\u00a0<\/p>\n<h3><strong>\u2026what is the agent thinking\/doing<\/strong><\/h3>\n<p>\u00a0<br \/>Finally, let\u2019s take a look inside the agent\u2019s brain. In my research \u2013 depending on the project \u2013 I use value function and policy entropy to get a hint of what is going on.<\/p>\n<p><b><strong>State\/Action value function<\/strong><\/b><\/p>\n<p>Q-learning and actor-critic methods make use of\u00a0<a href=\"https:\/\/spinningup.openai.com\/en\/latest\/spinningup\/rl_intro.html#value-functions\" rel=\"noopener noreferrer\" target=\"_blank\">value functions<\/a>\u00a0(VFs).<\/p>\n<p>It\u2019s useful to look at\u00a0<strong>the values they predict to detect some anomalies and see how the agent evaluates its odds in the environment<\/strong>.<\/p>\n<p>In the simplest case, I log the network state value estimate at each episode\u2019s timestep and then average them across the whole episode (more on this in the next section). With more training, this metric should start to match the logged episode return (see Figure 7) or, more often, discounted episode return as it is used to train VF. If it doesn\u2019t, then it is a bad sign.<\/p>\n<div>\n<img src=\"https:\/\/i1.wp.com\/neptune.ai\/wp-content\/uploads\/RL-training-results-3.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><em>Figure 7. An experiment on the Google Research Football environment. 
With time, as the agent trains, the agent\u2019s value function matches the episode return mean.<\/em><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>Moreover, on the VF values chart, we can see if some additional data processing is required.<\/p>\n<p>For instance, in the Cart Pole environment, an agent gets a reward of 1 for every timestep until it falls and dies. The episode return quickly reaches the tens and hundreds. A VF network that is initialized so that it outputs small values around zero at the beginning of training has a hard time catching up with this range of values (see Figure 8).<\/p>\n<p>That\u2019s why some\u00a0<strong>additional return normalization before training with it is required<\/strong>. The easiest approach is simply dividing by the maximum possible return, but sometimes we don\u2019t know the maximum return, or there is none (see e.g. Q-value normalization in the\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1911.08265.pdf\" rel=\"noopener noreferrer\" target=\"_blank\">MuZero<\/a>\u00a0paper, Appendix B \u2013 Backup).<\/p>\n<div>\n<img src=\"https:\/\/i2.wp.com\/neptune.ai\/wp-content\/uploads\/RL-training-results-4.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><em>Figure 8. An experiment on the Cart Pole environment. The value function target isn\u2019t normalized, and the value function has a hard time catching up with it.<\/em><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>In the next section, I\u2019ll discuss an example where this particular metric, combined with minimum\/maximum aggregation, helped me detect a bug in my code.<\/p>\n<p><b><strong>Policy entropy<\/strong><\/b><\/p>\n<p>Because some RL methods make use of stochastic policies, we can calculate their entropy:\u00a0<strong>how random they are<\/strong>. 
Even with deterministic policies, we often use an epsilon-greedy exploratory policy, whose entropy we can still calculate.<\/p>\n<p><img alt=\"\" class=\"aligncenter\" data-recalc-dims=\"1\" src=\"https:\/\/i2.wp.com\/neptune.ai\/wp-content\/uploads\/RL-equation-3.gif\" width=\"100%\"><\/p>\n<p>The equation for the policy entropy, H = \u2212\u2211 p(a) ln p(a) (summed over all actions a), where a is an action and p(a) is that action\u2019s probability.<\/p>\n<p>The maximum entropy value equals ln(N), where N is the number of actions, and it means that the policy chooses actions uniformly at random. The minimum entropy value equals 0, and it means that the policy always chooses the same action (with 100% probability).<\/p>\n<p>If you observe that the\u00a0<strong>entropy of the agent policy drops rapidly, it\u2019s a bad sign<\/strong>. It means that your agent stops exploring very quickly. If you use stochastic policies, you should think of some entropy regularization methods (e.g.\u00a0<a href=\"https:\/\/spinningup.openai.com\/en\/latest\/algorithms\/sac.html\" rel=\"noopener noreferrer\" target=\"_blank\">Soft Actor-Critic<\/a>). 
If you are using deterministic policies with an epsilon-greedy exploratory policy, your epsilon decay schedule is probably too aggressive.<\/p>\n<p>\u00a0<\/p>\n<h3><strong>\u2026how the training goes<\/strong><\/h3>\n<p>\u00a0<br \/>Last, but not least, we have some more standard Deep Learning metrics.<\/p>\n<p><b><strong>KL divergence<\/strong><\/b><\/p>\n<p>On-policy methods like\u00a0<a href=\"https:\/\/spinningup.openai.com\/en\/latest\/algorithms\/vpg.html\" rel=\"noopener noreferrer\" target=\"_blank\">Vanilla Policy Gradient<\/a>\u00a0(VPG) train on batches of experience sampled from the current policy (they don\u2019t use any replay buffer with experience to train on).<\/p>\n<p>It means that\u00a0<strong>what we do has a high impact on what we learn<\/strong>. If you set the learning rate too high, then the approximate gradient update might take too big a step in some seemingly promising direction, which may push the agent right into a worse region of the state space.<\/p>\n<p>As a result, the agent will do worse than before the update (see Figure 9)! This is why\u00a0<strong>we need to monitor KL divergence between the old and the new policy.<\/strong>\u00a0It can help us e.g. set a learning rate.<\/p>\n<p><img alt=\"\" class=\"aligncenter\" data-recalc-dims=\"1\" src=\"https:\/\/i0.wp.com\/neptune.ai\/wp-content\/uploads\/RL-VPG-training-1.png?resize=279%2C241&amp;ssl=1\" width=\"100%\"><br \/>\u00a0<\/p>\n<div>\n<img src=\"https:\/\/i2.wp.com\/neptune.ai\/wp-content\/uploads\/RL-VPG-training-2.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><em>Figure 9. VPG training on the Cart Pole environment. On the y-axis, we have the episode length (it equals the episode return in this environment). 
The orange line is the sliding window average of the score. On the left diagram, the learning rate is too big and the training is unstable. On the right diagram, the learning rate was properly fine-tuned (I found it by hand).<\/em><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p><a href=\"https:\/\/machinelearningmastery.com\/divergence-between-probability-distributions\/\" rel=\"noopener noreferrer\" target=\"_blank\">KL divergence<\/a>\u00a0is a measure of how much one probability distribution differs from another (it is asymmetric, so it is not a true distance). In our case, these are action distributions (policies). We don\u2019t want our policy to differ too much before and after the update. There are methods like\u00a0<a href=\"https:\/\/spinningup.openai.com\/en\/latest\/algorithms\/ppo.html\" rel=\"noopener noreferrer\" target=\"_blank\">PPO<\/a>\u00a0that are designed to keep the new policy close to the old one (by clipping the objective or penalizing the KL divergence) and so discourage overly large updates!<\/p>\n<p><b><strong>Network weights\/gradients\/activations histograms<\/strong><\/b><\/p>\n<p>Logging histograms of the activations, gradients, and weights of each layer can help you monitor the training dynamics of the artificial neural network. You should look for signs of:<\/p>\n<ul>\n<li>Dying ReLUs:<br \/>If a ReLU neuron gets clamped to zero in the forward pass, then it won\u2019t get a gradient signal in the backward pass. It can even happen that\u00a0<strong>some neurons won\u2019t get excited (return a non-zero output) for any input because of an unfortunate initialization or too big an update during training<\/strong>.<br \/>\u201cSometimes you can forward the entire training set &lt;i.e. the replay buffer in RL&gt; through a trained network and find that a large fraction (e.g. 
40%) of your neurons were zero the entire time.\u201d ~\u00a0<a href=\"https:\/\/medium.com\/@karpathy\/yes-you-should-understand-backprop-e2f06eab496b\" rel=\"noopener noreferrer\" target=\"_blank\">Yes you should understand backprop<\/a>\u00a0by Andrej Karpathy\n<\/li>\n<li>Vanishing or Exploding gradients:<br \/><strong>Very large values of gradient updates can indicate exploding gradients<\/strong>. Gradient clipping may help.<br \/>On the other hand, very low values of gradient updates can indicate vanishing gradients. Using\u00a0<a href=\"https:\/\/adventuresinmachinelearning.com\/vanishing-gradient-problem-tensorflow\/\" rel=\"noopener noreferrer\" target=\"_blank\">ReLU activations<\/a>\u00a0and the\u00a0<a href=\"https:\/\/towardsdatascience.com\/weight-initialization-in-neural-networks-a-journey-from-the-basics-to-kaiming-954fb9b47c79\" rel=\"noopener noreferrer\" target=\"_blank\">Glorot uniform initializer<\/a>\u00a0(a.k.a. Xavier uniform initializer) should help with it.\n<\/li>\n<li>Vanishing or Exploding activations:<br \/>A good standard deviation for the activations is on the order of 0.5 to 2.0. Values significantly outside this range may indicate\u00a0<strong>vanishing or exploding activations<\/strong>, which in turn may cause problems with gradients. Try Layer\/Batch normalization to keep the distribution of your activations under control.\n<\/li>\n<\/ul>\n<p>In general, distributions of layer weights (and activations) that are close to a normal distribution (values around zero without many outliers) are a sign of healthy training.<\/p>\n<p>The above tips should help you keep your network healthy through training.<\/p>\n<p><b><strong>Policy\/Value\/Quality\/\u2026 heads losses<\/strong><\/b><\/p>\n<p>Even though we do optimize some loss function to train an agent, you should know that this isn\u2019t a loss function in the typical sense of the word. 
Specifically, it is different from the loss functions used in supervised learning.<\/p>\n<p>We optimize the objective from Figure 2. To do so, in Policy Gradient methods you\u00a0<a href=\"https:\/\/spinningup.openai.com\/en\/latest\/spinningup\/rl_intro3.html#deriving-the-simplest-policy-gradient\" rel=\"noopener noreferrer\" target=\"_blank\">derive the gradient of this objective<\/a>\u00a0(called the Policy Gradient). However, because TensorFlow and other DL frameworks are built around auto-grad, you\u00a0<a href=\"https:\/\/spinningup.openai.com\/en\/latest\/spinningup\/rl_intro3.html#implementing-the-simplest-policy-gradient\" rel=\"noopener noreferrer\" target=\"_blank\">define a surrogate loss function<\/a>\u00a0that, after the auto-grad is run on it, yields a gradient equal to the Policy Gradient.<\/p>\n<p>Note that the data distribution depends on the policy and changes with training. This means that\u00a0<strong>the loss functions don\u2019t have to decrease monotonically for training to proceed<\/strong>. They can sometimes increase when the agent discovers some new area of the state space (see Figure 10).<\/p>\n<div>\n<img src=\"https:\/\/i1.wp.com\/neptune.ai\/wp-content\/uploads\/RL-training-results-5.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><em>Figure 10. SAC training on the MuJoCo Humanoid environment. When the episode return starts to go up (our agent learns successfully), the Q-function loss goes up too! It starts to go down again after some time.<\/em><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>Moreover, the loss doesn\u2019t measure the performance of the agent! 
The<strong>\u00a0true measure of the agent\u2019s performance is the episode return<\/strong>. It\u2019s useful to log losses as a sanity check. However, don\u2019t base your judgment of training progress on them.<\/p>\n<p>\u00a0<\/p>\n<h3>Aggregated statistics<\/h3>\n<p>\u00a0<br \/>Of course, for some metrics (like state\/action-values) it\u2019s infeasible to log them for every environment timestep for each experiment. Typically, you would calculate statistics every episode or every couple of episodes.<\/p>\n<p>For other metrics, we deal with randomness (e.g. the episode return when the environment and\/or the policy are stochastic). Therefore, we have to use sampling to estimate the expected metric value (sample = one agent episode in the episode return case).<\/p>\n<p>In either case, aggregate statistics are the solution!<\/p>\n<p>\u00a0<\/p>\n<h3><strong>Average and standard deviation<\/strong><\/h3>\n<p>\u00a0<br \/>When you deal with\u00a0<strong>a stochastic environment<\/strong>\u00a0(e.g. the ghosts in Pac-Man act randomly) and\/or\u00a0<strong>your policy draws actions at random<\/strong>\u00a0(e.g. 
stochastic policy in VPG) you should:<\/p>\n<ul>\n<li>play multiple episodes (10-20 should be fine),\n<\/li>\n<li>average metrics across them,\n<\/li>\n<li>log this average and standard deviation.\n<\/li>\n<\/ul>\n<p>The average estimates the true expected return better than a single episode does, and the standard deviation gives you a hint of how much the metric changes when playing multiple episodes.<\/p>\n<p>If the variance is too high, you should take more samples into the average (play more episodes) or make use of one of the smoothing techniques like\u00a0<a href=\"https:\/\/www.datacamp.com\/community\/tutorials\/moving-averages-in-pandas\" rel=\"noopener noreferrer\" target=\"_blank\">Exponential Moving Average<\/a>.<\/p>\n<p>\u00a0<\/p>\n<h3><strong>Minimum\/Maximum value<\/strong><\/h3>\n<p>\u00a0<br \/>It\u2019s really useful to\u00a0<strong>inspect extremes when looking for a bug<\/strong>. I\u2019ll illustrate this with an example.<\/p>\n<p>In experiments on Google Research Football with my agent, which used random rollouts from the current timestep to calculate action qualities, I noticed some strange minimum values of these action qualities.<\/p>\n<p>The average statistic made sense, but something was off with the minimum values. They were below the reasonable minimum value (below minus one, see Figure 11).<\/p>\n<p><img alt=\"\" class=\"aligncenter\" data-recalc-dims=\"1\" src=\"https:\/\/i1.wp.com\/neptune.ai\/wp-content\/uploads\/RL-training-results-6.png?resize=512%2C195&amp;ssl=1\" width=\"100%\"><br \/>\u00a0<\/p>\n<div>\n<img src=\"https:\/\/i0.wp.com\/neptune.ai\/wp-content\/uploads\/RL-training-results-7.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><em>Figure 11. The mean qualities are all above zero. 
The minimum qualities are very often below minus one, which is lower than should be possible.<\/em><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>After some digging, it turned out that I had used\u00a0<em>np.empty<\/em>\u00a0to create an array for action qualities.<\/p>\n<p><em>np.empty<\/em>\u00a0looks like a faster\u00a0<em>np.zeros<\/em>: it allocates memory for the NumPy array but doesn\u2019t initialize it, so the array starts out holding whatever values happened to be in that memory.<\/p>\n<p>Because of that, from time to time the scores of some actions were based on leftover values from memory locations that had never been cleared, rather than on a clean zero!<\/p>\n<p>I changed\u00a0<em>np.empty<\/em>\u00a0to\u00a0<em>np.zeros\u00a0<\/em>and it fixed the problem.<\/p>\n<p>\u00a0<\/p>\n<h3><strong>Median<\/strong><\/h3>\n<p>\u00a0<br \/>The same idea that we used with averaging stochastic episodes can be applied to the whole training!<\/p>\n<p>As we know, the algorithm used for deep learning is called\u00a0<em>Stochastic<\/em>\u00a0Gradient Descent. It\u2019s stochastic because we draw training samples at random and pack them into batches. This means that\u00a0<strong>running the same training multiple times will yield different results.<\/strong><\/p>\n<p>You should always run your training multiple times with different seeds (the values that initialize the pseudo-random number generator) and\u00a0<strong>report the median of these runs to be sure that the score is not that high or that low simply by chance<\/strong>.<\/p>\n<div>\n<img src=\"https:\/\/i1.wp.com\/neptune.ai\/wp-content\/uploads\/RL-training-results-8.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><em>Figure 12. SAC training on the MuJoCo Ant environment. All runs have the same hyper-parameters, only different seeds. 
Three runs, three results.<\/em><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p><a href=\"https:\/\/www.alexirpan.com\/2018\/02\/14\/rl-hard.html\" rel=\"noopener noreferrer\" target=\"_blank\">Deep Reinforcement Learning Doesn\u2019t Work Yet<\/a>\u00a0and so your agent might fail to train anything, even if your implementation is correct. It can simply fail by chance e.g. because of unlucky initialization (see Figure 12).<\/p>\n<p>\u00a0<\/p>\n<h3>Conclusions<\/h3>\n<p>\u00a0<br \/>Now you know\u00a0<strong>what and why you should log<\/strong>\u00a0to get the full picture of an agent training process. Moreover, you know what to look for in these logs and even how to deal with the common problems.<\/p>\n<p>Before we finish, please take a look at Figure 12 once again. We see that the training curves, though different, follow similar paths and even two out of three converge to a similar result. Any ideas what that could mean?<\/p>\n<p>Stay tuned for future posts!<\/p>\n<p>\u00a0<br \/><b>Bio: <a href=\"https:\/\/piojanu.netlify.app\/\" target=\"_blank\" rel=\"noopener noreferrer\">Piotr Januszewski<\/a><\/b> is a Research Software Engineer at University of Warsaw and PhD student at Gdansk University of Technology.<\/p>\n<p><a href=\"https:\/\/neptune.ai\/blog\/how-to-make-sense-of-the-reinforcement-learning-agents-what-and-why-i-log-during-training-and-debug\" target=\"_blank\" rel=\"noopener noreferrer\">Original<\/a>. 
Reposted with permission.<\/p>\n<p><b>Related:<\/b><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/10\/make-sense-reinforcement-learning-agents.html<\/p>\n","protected":false},"author":0,"featured_media":7967,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/7966"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=7966"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/7966\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/7967"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=7966"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=7966"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=7966"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}