Random notes mostly on Machine Learning

Not every REINFORCE should be called Reinforcement Learning

Deep RL is hot these days. It’s one of the most popular topics in the submissions at NeurIPS / ICLR / ICML and other ML conferences. And while the definition of RL is pretty general, in this note I’d argue that the famous REINFORCE algorithm alone is not enough to label your method as a Reinforcement Learning one.


REINFORCE is a method introduced by Ronald Williams, commonly cited as coming from “Simple statistical gradient-following algorithms for connectionist reinforcement learning”. Given the long and fruitful history of the method, it’s natural that its definition has evolved and that it might mean somewhat different things to different people, so let me first describe what I mean by REINFORCE in this particular discussion [1].

In this post we’ll assume REINFORCE to be equivalent to the score-function gradient estimator (also known as the log-derivative trick gradient estimator) with a certain (most likely constant [2]) baseline for variance reduction.

I don’t want to re-introduce this method (I believe I already did it quite some time ago). Instead, I refer the interested reader to a great blog post by Shakir Mohamed, where the score-function (gradient) estimator is explained.

What REINFORCE is used for

REINFORCE is used to estimate the gradient with respect to the parameters \(\theta\) of the policy \(\pi_\theta(\tau)\) when dealing with objectives of the following form [3]:

\[ \mathop{\mathbb{E}}_{\pi_\theta(\tau)} R(\tau) \to \max_{\theta} \]

The REINFORCE gradient estimator is then given by (where \(b\in\mathbb{R}\) is a baseline)

\[ \left(R(\tau) - b\right) \nabla_\theta \log \pi_\theta(\tau), \quad\quad \text{where $\tau \sim \pi_\theta(\tau)$} \]

The major benefits of this estimator are:

  • We don’t need to know the reward function \(R(\tau)\); we only need to evaluate it on the sampled trajectories \(\tau\).
  • There are no assumptions on \(R(\tau)\): it can be non-differentiable or even discontinuous.
  • Even \(\tau\) itself could be discrete! We only need the log probability \(\log \pi_\theta(\tau)\) to be differentiable in \(\theta\) (but not in \(\tau\)).
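To make the estimator concrete, here is a minimal sketch under simplifying assumptions that are not from the original discussion: the "trajectory" is a single binary variable \(x \sim \text{Bernoulli}(\sigma(\theta))\), the reward is a made-up lookup table, and all function names are hypothetical. For this toy case the score is \(\nabla_\theta \log \pi_\theta(x) = x - \sigma(\theta)\), and the exact gradient is available in closed form for comparison.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(x):
    # Hypothetical black-box reward: discrete input, never differentiated,
    # only evaluated on sampled values of x.
    return 3.0 if x == 1 else 1.0

def reinforce_gradient(theta, baseline=0.0, num_samples=100_000, seed=0):
    """Monte Carlo average of (R(x) - b) * d/dtheta log pi_theta(x)."""
    rng = random.Random(seed)
    p = sigmoid(theta)
    total = 0.0
    for _ in range(num_samples):
        x = 1 if rng.random() < p else 0
        score = x - p  # d/dtheta log Bernoulli(x; sigmoid(theta))
        total += (reward(x) - baseline) * score
    return total / num_samples

def exact_gradient(theta):
    # d/dtheta [p * R(1) + (1 - p) * R(0)] with p = sigmoid(theta)
    p = sigmoid(theta)
    return p * (1.0 - p) * (reward(1) - reward(0))
```

Calling `reinforce_gradient(0.3)` and `reinforce_gradient(0.3, baseline=2.0)` should both land close to `exact_gradient(0.3)`. The baseline does not change the expectation, only the variance of the individual terms.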

The last two properties make the REINFORCE estimator an appealing choice for the gradient estimation in stochastic computation graphs, which I have written at length about.

There are lots of papers that do use REINFORCE in this exact scenario. For example, in the recent paper “Data Valuation using Reinforcement Learning” (DVRL), researchers from Google do exactly that: they define a certain stochastic computation graph that contains discrete binary random variables. Then a simple REINFORCE gradient estimator is used to train those layers that cannot be reached by standard backpropagation.

Notably, the paper cites only one paper that has Reinforcement Learning in its title: the original one by Williams. Other than that, it seems pretty disconnected from the RL literature. This raises a question: should it even be described as “using RL”?

Communicative Value

Words are used to communicate ideas. When I say “Deep Neural Network”, associations fire up in your brain and, provided you’re well-versed in modern ML, you immediately think of all these modern (well, maybe not all of them) fancy things we call CNNs, ResNets, RNNs, LSTMs, Transformers, GNNs and many-many-many more. But I can also claim that Logistic Regression (LR) is a special case of a fully-connected neural network, especially if you train it with stochastic optimization methods. But what’s the communicative value of this statement? What information does it convey? Does much of the knowledge about LR generalize to Neural Nets? Or does LR benefit hugely from our modern Deep Learning toolkit? When was the last time you used batchnorm to train your Logistic Regression?

What I’m trying to say is that although LR can technically be categorized as a Neural Network, this categorization appears to be useless: it does not open up any interesting transfer of knowledge or expertise. However, stack a logistic classifier on top of a pre-trained neural network and train the whole pipeline end-to-end, and you’re in the #backpropaganda now!

The same goes for REINFORCE: the communicative value of describing methods like the aforementioned DVRL as “using RL” is very small. In my opinion, the distinctive traits of (modern) Reinforcement Learning are:

  • Delayed rewards [4]
  • Unknown environment model
  • A single action at each state

When you say you “use RL”, it should mean you’ve posed the problem at hand such that it benefits from the vast body of research produced by RL people that addresses these traits. It’s this connection that bears communicative value, as now you know that advances in RL would translate to your problem, too.

If your problem lacks these traits and you go for RL methods anyway, you ignore much of the useful structure in your problem and constrain yourself to methods that are designed for a much harder problem. Keep in mind that RL is hard.

Perhaps a large body of RL work is solving a problem you don’t even have to begin with! Speaking of the REINFORCE method, its biggest problem is large variance, for which people have designed clever baselines, but in RL, one might argue, such baselines have limited value. On the other hand, Gumbel-Softmax (and relaxations in general), a method one should almost always consider when thinking of training stochastic computation graphs with REINFORCE, is not applicable in the standard RL setting.
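For readers who haven’t met the relaxation mentioned above, here is a minimal sampling sketch, assuming a categorical variable parameterized by logits (the function name and the example logits are hypothetical). It perturbs the logits with Gumbel noise and applies a temperature-scaled softmax, so the "sample" is a point on the simplex rather than a hard one-hot vector. Pure Python has no autodiff, so this only shows the forward pass; in an autodiff framework the gradient would flow through the softmax into whatever differentiable function consumes the sample. That requirement is exactly what an unknown environment in RL fails to provide.

```python
import math
import random

def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed (continuous) sample of a categorical variable."""
    # Gumbel(0, 1) noise via inverse transform; the clamp avoids log(0).
    gumbels = [-math.log(-math.log(max(rng.random(), 1e-300))) for _ in logits]
    scaled = [(l + g) / temperature for l, g in zip(logits, gumbels)]
    # Numerically stable softmax over the perturbed, rescaled logits.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

rng = random.Random(0)
logits = [math.log(0.2), math.log(0.3), math.log(0.5)]
soft = gumbel_softmax_sample(logits, temperature=1.0, rng=rng)   # smoother
sharp = gumbel_softmax_sample(logits, temperature=0.1, rng=rng)  # closer to one-hot
```

Lower temperatures push samples towards one-hot vectors, and the index of the largest coordinate is distributed according to the original categorical probabilities (the Gumbel-max trick).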

In the particular case of DVRL, the problem has much more useful structure than the RL literature assumes. It has no delay in feedback, has a fully known environment model and allows you to take multiple actions at each state. All of these imply you can do things RL people can’t afford. Unsurprisingly, this departure from the standard RL setting is reflected in the absence of RL works in the bibliography.
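As one illustration of what taking multiple actions at the same state might buy you (my example, not something taken from the DVRL paper): with several samples drawn from the same distribution, each sample can use the average reward of the others as its baseline. This leave-one-out baseline keeps the estimator unbiased because it is independent of the sample it is applied to. The sketch below assumes a toy Bernoulli "policy" and a made-up reward; all names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(x):
    # Hypothetical black-box reward, evaluated only on sampled actions.
    return 3.0 if x == 1 else 1.0

def leave_one_out_gradient(theta, k, rng):
    """One gradient estimate from k >= 2 actions sampled at the same state."""
    p = sigmoid(theta)
    xs = [1 if rng.random() < p else 0 for _ in range(k)]
    rs = [reward(x) for x in xs]
    total_reward = sum(rs)
    grad = 0.0
    for x, r in zip(xs, rs):
        baseline = (total_reward - r) / (k - 1)  # mean reward of the other samples
        grad += (r - baseline) * (x - p)         # (R - b) * score
    return grad / k
```

Averaging `leave_one_out_gradient(theta, k, rng)` over many calls should approach \(\sigma(\theta)(1 - \sigma(\theta))(R(1) - R(0))\), the exact gradient for this toy problem. In a standard RL environment you usually cannot rewind to the same state to try another action, which is why this trick is rarely available there.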


There are other papers just like DVRL that use REINFORCE to perform gradient estimation in models with discrete random variables and claim to be doing Reinforcement Learning. While such papers possibly benefit from all the hype around RL, this framing narrows the selection of methods to those designed for a much more general and harder problem. I hope I have convinced you that in the Venn diagram of RL and REINFORCE, neither should contain the other.

  1. If to you REINFORCE means something different or something more than what I describe, then you’d probably agree with my claim. But anyway, let me know in the comments below!

  2. The original REINFORCE did assume a certain (probably) constant baseline to be employed, but let’s assume that constant could be 0 to include the vanilla score-function estimator as well.

  3. In RL parlance, \(\tau\) is a trajectory (a sequence of state-action pairs) and \(R(\tau)\) is an unknown reward function, which is usually assumed to be composed of individual rewards for each state-action pair: \[ R(\tau) = \sum_{(s_t, a_t) \in \tau} r_t(s_t, a_t) \]

  4. For this reason I don’t think bandits should be called RL either.
