A while back I wrote a blog post on understanding the fundamentals of RL. I’ve spent the past couple of weeks reading through Kevin Murphy’s Reinforcement Learning textbook and Sutton and Barto to review my fundamentals. This post contains some notes on topics I haven’t yet talked about in my first attempt at explaining RL!

What is Reinforcement Learning?

Reinforcement Learning is all about interacting with your environment to learn good behaviors. Given the full state $s_t$, observation $o_t$, some policy $\pi$, action $a_t = \pi(o_t)$, and reward $r_t$, the goal of an agent is to maximize the expected sum of its discounted rewards:

$$V_\pi(s) = \mathbb{E}[\sum_{t=0}^{T}\gamma^tR(s_t, a_t) \mid s_0 = s] $$

Let’s clear up some notation here. $R(s_t, a_t)$ is defined as the reward we get for taking some action $a$ in a state $s$ at timestep $t$. $V_\pi$ is the value function, which we can read as asking: “given some observed state $s$, what is the expected return from being in state $s$ to the end of the episode if we follow policy $\pi$?”

We use an expectation in this equation because the state-transition function and the actions are stochastic, i.e. $a_t \sim \pi(s_t)$. For a continuous-control policy, for example, instead of outputting an action directly, the policy outputs a mean $\mu$ and standard deviation $\sigma$, and we sample the action from the resulting normal distribution, i.e. $a_t \sim \mathcal{N}(\mu, \sigma^2)$.

If your state-transition function is stochastic as well (i.e. the dynamics of your environment are stochastic), then the expectation also averages over transitions. For example, in a simulator like IsaacLab, when you take an action (like moving a finger up/down), you might add noise to your action values, which results in a non-deterministic state-transition function. To make this explicit, we can write the value function as:

$$V_\pi(s) = \mathbb{E}_{\substack{a_t \sim \pi(s_t) \\ s_{t+1} \sim P(s_t, a_t)}}[\sum_{t=0}^{T}\gamma^tR(s_t, a_t) \mid s_0 = s] $$

But what does this mean? Why is the reward at every timestep multiplied by $\gamma^t$, where $\gamma \in [0, 1]$ is the discount factor? The main idea behind discounting is that rewards received earlier count for more than rewards received later. Let’s fix $\gamma = 0.9$ and see how this rolls out for the first 3 timesteps:

$$\begin{aligned} V_\pi(s_0) &= 0.9^{0}\,R(s_0,a_0) + 0.9^{1}\,R(s_1,a_1) + 0.9^{2}\,R(s_2,a_2) \\ &= 1\cdot R(s_0,a_0) + 0.9\cdot R(s_1,a_1) + 0.81\cdot R(s_2,a_2) \end{aligned}$$

In the case where $\gamma = 1$, every timestep would have the same weight with respect to how its reward value impacts the value function:

$$\begin{aligned} V_\pi(s) &= \mathbb{E}[\sum_{t=0}^{T}1^tR(s_t, a_t) \mid s_0 = s] \\ &= \mathbb{E}[\sum_{t=0}^{T}R(s_t, a_t) \mid s_0 = s] \\ &= \mathbb{E}[R(s_0, a_0) + R(s_1, a_1) + ... + R(s_T, a_T) \mid s_0 = s] \\ \end{aligned}$$

And in the case where $\gamma = 0$, our value function is simply the immediate reward (using the convention $0^0 = 1$):

$$\begin{aligned} V_\pi(s) &= \mathbb{E}[\sum_{k=0}^{T}0^kR(s_k, a_k) \mid s_0 = s] \\ &= \mathbb{E}[0^{0}R(s_0, a_0) + 0^{1}R(s_1, a_1) + ... + 0^{T}R(s_T, a_T) \mid s_0 = s] \\ &= R(s_0, a_0) \\ \end{aligned}$$

The reward-to-go at some timestep $t$ is defined as the sum of the expected rewards from that timestep to the end of the episode, where each reward we get in the future is multiplied by a discount factor, $\gamma$:

$$G_t = \sum_{k=0}^{T-t-1} \gamma^kr_{t+k+1}$$

where $G_t$ is defined as the reward-to-go. Building on this definition, we can then say that the value function at some specific state in time is the same as the expected reward-to-go if we start at time $t$ and follow our policy $\pi$ (eq 1.9 of Murphy):

$$V_\pi(s_t) = \mathbb{E}[G_t \mid \pi]$$

Choosing a good value of $\gamma$ usually depends on the type of environment you’re working on. For continual tasks (where your episodes can go on forever), a higher $\gamma$ (like 0.99) gives a good balance between immediate rewards and long-term rewards from being in a certain state. The nice thing about the discount factor is that it keeps the value finite as episode lengths approach infinity: with bounded rewards, the discounted sum becomes a convergent geometric series because

$$\lim_{t \rightarrow \infty} \gamma^t = 0$$

and by prioritizing short-term rewards, we’re able to teach the policy to reach its goal state faster.
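To make the discounting concrete, here’s a minimal Python sketch (purely illustrative, not from either book) that computes the discounted reward-to-go $G_t$ for every timestep of a finished episode using the backward recursion $G_t = r_{t+1} + \gamma G_{t+1}$:

```python
# Minimal sketch: discounted reward-to-go for a finished episode.
# rewards[t] is the reward received after acting at timestep t
# (i.e. r_{t+1} in the text's indexing), so G_t = rewards[t] + gamma * G_{t+1}.
def rewards_to_go(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):  # sweep backwards so G_{t+1} is reused
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(rewards_to_go([1.0, 0.0, 2.0], gamma=0.9))  # approximately [2.62, 1.8, 2.0]
```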

Building on top of the idea of a value function, we also have something known as a Q-function, which is defined as

$$\begin{aligned} Q_\pi(s_t, a_t) &= \mathbb{E}_{\substack{a_k \sim \pi(s_k) \\ s_{k+1} \sim P(s_k, a_k)}}[\sum_{k=0}^{T-t-1} \gamma^kr_{t+k+1} \mid s_t = s, a_t = a] \\ &= \mathbb{E}_{\substack{a_k \sim \pi(s_k) \\ s_{k+1} \sim P(s_k, a_k)}}[G_t \mid s_t = s, a_t = a] \end{aligned}$$

where we take the expected return given some starting state $s_t$ and starting action $a_t$, and then roll out policy $\pi$ afterwards. This gives us some really nice properties; for example, the value function is just the expectation of the Q-function over actions drawn from the policy (each Q-value weighted by the probability of its action):

$$V_\pi(s) = \sum_{a \in A}\pi(a \mid s)Q_\pi(s, a)$$

We can also leverage the definitions of the Q-function and value function to define the Advantage function, which tells us “given some state $s$, how much better is taking action $a$ than acting according to the policy $\pi$ on average?”

$$A_\pi(s_t, a_t) = Q_\pi(s_t, a_t) - V_\pi(s_t)$$
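As a tiny sanity check of these relationships (with made-up numbers), we can compute $V_\pi(s)$ as the policy-weighted average of the Q-values and the advantage as the difference:

```python
import numpy as np

# Illustrative numbers only: 3 actions available in a single state s.
pi_a_given_s = np.array([0.2, 0.5, 0.3])   # pi(a | s)
q_s = np.array([1.0, 0.0, 2.0])            # Q_pi(s, a)

v_s = np.dot(pi_a_given_s, q_s)            # V_pi(s) = sum_a pi(a|s) Q_pi(s,a)
adv_s = q_s - v_s                          # A_pi(s, a) = Q_pi(s, a) - V_pi(s)

print(v_s)    # 0.8
print(adv_s)  # [ 0.2 -0.8  1.2]
```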

Markov Decision Processes

Markov Decision Processes (MDPs) are at the core of Reinforcement Learning. MDPs are essentially sequential decision-making frameworks that we use to model the dynamics of our environment. We define a Markov Decision Process as the following:

Given the current state $s_t$, action $a_t$, reward $r_t$, state-transition model $P(s_{t+1} \mid s_t, a_t)$, and discount factor $\gamma$, a Markov Decision Process satisfies the following (Markov) property:

$$P(s_{t+1} \mid s_t) = P(s_{t+1} \mid s_0, s_1, ..., s_t)$$

What this means is that the future state $s_{t+1}$ is conditionally independent of all past states given the current state $s_t$. In other words, the current state contains all the information needed to predict the future.

Partially Observable Markov Decision Processes (POMDPs) are an extension of MDPs where instead of passing the world state $s_t$ into the policy directly, we feed the policy a subset of that state (called the observation, $o_t$). For example, if the full state includes friction coefficients, position, distance to goal, and joint velocities, I might train my policy on only the position and its distance to goal. This means that our state $s_t$ and policy observation $o_t$ would look like the following:

$$\begin{aligned} s_t &= \text{friction coefficients, position, joint velocities, distance to goal} \\ o_t &= \text{position, distance to goal} \end{aligned}$$

Bellman Equations

Bellman Equations are a set of equations that describe how our value function and policy interact to find the best possible RL agent. Note that we define $V^*(s)$ as the optimal value function, $\pi^*(s)$ as the optimal policy, and $Q^*(s, a)$ as the optimal Q-function that we learn. Solving for $V^*, Q^*, \pi^*$ is known as policy optimization, while solving for $V_\pi, Q_\pi$ given some policy $\pi$ is known as policy evaluation.

There are 2 Bellman Equations that govern how our policy learns in our environment:

$$\begin{aligned} V^*(s_t) &= \max_{a_t} R(s_t, a_t) + \mathbb{E}_{\substack{s_{t+1} \sim P(s_t, a_t)}}[\gamma V^*(s_{t+1})] \\ Q^*(s_t, a_t) &= R(s_t, a_t) + \mathbb{E}_{\substack{s_{t+1} \sim P(s_t, a_t)}}[\max_{a_{t+1}} \gamma Q^*(s_{t+1}, a_{t+1})] \end{aligned}$$

We then use the solutions to these equations to help define the optimal policy we can deploy (eqs 2.7 + 2.8 of Murphy):

$$\begin{aligned} \pi^*(s) &= \argmax_a Q^*(s, a) \\ &= \argmax_a [R(s, a) + \gamma\mathbb{E}_{p_S(s' \mid s, a)}[V^*(s')]] \end{aligned}$$

Notice how both these equations are defined recursively; finding the optimal Value function or Q-function at some timestep is dependent on the value (or the Q-value) at the next time step!

Most RL approaches can essentially be broken down into value-based reinforcement learning and policy-based reinforcement learning. In value-based RL, we are trying to find the optimal value function which we then use to derive the optimal policy. In policy-based RL, we are directly trying to learn the policy as it iterates through episodes in training.

Value-based RL

Revisiting the Bellman equation above, we know

$$\begin{aligned} V^*(s_t) &= \max_{a_t} R(s_t, a_t) + \mathbb{E}_{\substack{s_{t+1} \sim P(s_t, a_t)}}[\gamma V^*(s_{t+1})] \\ Q^*(s_t, a_t) &= R(s_t, a_t) + \mathbb{E}_{\substack{s_{t+1} \sim P(s_t, a_t)}}[\max_{a_{t+1}} \gamma Q^*(s_{t+1}, a_{t+1})] \end{aligned}$$

Value Iteration

One of the most fundamental ideas in Value-Based RL is value iteration. Value iteration leverages Dynamic Programming to solve any MDP with known dynamics. Given some initial value function estimate $V_0$, the update rule, also known as the Bellman Backup, is defined as (eq 2.9 from Murphy):

$$V_{k+1}(s) = \max_a [R(s, a) + \gamma \sum_{s'}p(s' \mid s, a)V_k(s')]$$

What this means is that, for every state, we look for the action that maximizes the immediate reward plus the discounted value of the next state (you can think of this as a breadth-level sweep across all possible states), and we use that maximum as the new value estimate for that state. The greedy policy $\pi$ implied by the value function is deterministic, since it simply takes the action with the maximum backed-up value, so you aren’t training a separate policy network here. We repeat these sweeps until the value function (and therefore the policy) has converged.

This is what the algorithm looks like in pseudo-code (a runnable sketch follows the list):

  1. Initialize our value function estimate $V$, possible states $S$, convergence threshold $\delta$.
  2. While $\text{err} > \delta$:
    1. Set $\text{err} = 0$. For each $s \in S$:
      1. Define $v_\text{old} = V(s)$
      2. Apply the Bellman backup $V(s) = \max_a\{R(s,a) + \gamma \sum_{s'}p(s' \mid s, a)V(s')\}$
      3. Update $\text{err} = \max(\text{err}, |V(s) - v_\text{old}|)$
  3. Define the optimal policy $\pi^*(s) =\argmax_a [R(s, a) + \gamma \sum_{s'}p(s' \mid s, a)V^*(s')]$ from the converged (optimal) value function $V^*$.
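Here’s a runnable sketch of this loop on a made-up two-state, two-action MDP; the transition probabilities and rewards are arbitrary and only for illustration:

```python
import numpy as np

# Made-up 2-state, 2-action MDP (all numbers are arbitrary).
# P[a, s, s2] = p(s2 | s, a), R[s, a] = immediate reward for taking a in s.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions under action 0
    [[0.5, 0.5], [0.1, 0.9]],   # transitions under action 1
])
R = np.array([
    [0.0, 1.0],
    [2.0, 0.0],
])
gamma, tol = 0.9, 1e-8

V = np.zeros(2)
while True:
    # Bellman backup: Q[s, a] = R[s, a] + gamma * sum_s2 p(s2 | s, a) * V[s2]
    Q = R + gamma * (P @ V).T
    V_new = Q.max(axis=1)            # best action at every state
    err = np.max(np.abs(V_new - V))
    V = V_new
    if err < tol:
        break

pi_star = Q.argmax(axis=1)           # greedy policy from the converged values
print(V, pi_star)
```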

Policy Iteration

Policy Iteration is another Dynamic Programming method that we use when trying to find a deterministic policy. It consists of 2 steps: policy evaluation and policy improvement.

Policy evaluation

In the policy evaluation step, we calculate the value function of the current policy. Let $v(s) = V_\pi(s)$ be a vector that stores the expected reward-to-go at each state, $r(s) = \sum_{a \in A}\pi(a \mid s) R(s, a)$ the expected immediate reward under $\pi$, and $P(s' \mid s) = \sum_{a \in A} \pi(a \mid s)p(s' \mid s, a)$ the state-transition matrix under $\pi$. We can then write the Bellman equation for policy evaluation as the following:

$$v = r + \gamma Pv$$

We can find a solution to this equation with some linear algebra:

$$\begin{aligned} v &= r + \gamma Pv \\ \implies r &= v - \gamma Pv \\ &= (I - \gamma P)v \\ \end{aligned}$$

Which then gives us the following solution:

$$v = (I - \gamma P)^{-1}r$$

What does this actually mean? Given a vector containing our expected immediate rewards and our state-transition matrix, we can compute the value function at every state in one shot, i.e. we solve for the value function with respect to a certain policy $\pi$. We then perform policy improvement to ensure that our policy $\pi$ moves in the direction that maximizes the expected discounted sum of rewards.

Note that we can also solve for the value function iteratively, applying $v_{t+1} = r + \gamma Pv_t$ until convergence. A code sketch of both approaches follows.
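Here’s a small numpy sketch of both approaches on an arbitrary 3-state example; in practice you’d use a linear solver rather than forming the inverse explicitly:

```python
import numpy as np

# Arbitrary 3-state example. P[s, s2] = P(s2 | s) under the current policy,
# r[s] = expected immediate reward under the policy.
P = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])
r = np.array([1.0, 0.0, 2.0])
gamma = 0.9

# Direct solve of (I - gamma * P) v = r
v_direct = np.linalg.solve(np.eye(3) - gamma * P, r)

# Iterative policy evaluation: v <- r + gamma * P v until convergence
v_iter = np.zeros(3)
for _ in range(10_000):
    v_next = r + gamma * P @ v_iter
    if np.max(np.abs(v_next - v_iter)) < 1e-10:
        v_iter = v_next
        break
    v_iter = v_next

print(np.allclose(v_direct, v_iter))  # True: both give V_pi
```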

Policy improvement

Once we have solved for the value function, solving for the optimal policy $\pi$ is pretty straightforward: we simply act greedily with respect to the value function. This would mean that our policy $\pi$ would be updated with the following definition (eq. 2.12 of Murphy):

$$\pi'(s) = \argmax_a\{R(s,a) + \gamma\mathbb{E}[V_\pi(s')]\}$$

Hopefully the difference between value iteration and policy iteration is pretty clear now! To recap, in value iteration we repeatedly back up the best action at every state to obtain the optimal value function $V^*$, and only at the end do we derive the optimal policy $\pi^*$. In policy iteration, we alternate between evaluating the current policy (computing $V_\pi$) and improving it (acting greedily with respect to $V_\pi$ to get $\pi'$).

TD Learning

Bellman equations are inherently recursive, which means that we need to find a way to reduce the Bellman error without having to roll out long trajectories. Given some arbitrary state-action-reward-state sample $(s_t, a_t, r_t, s_{t+1})$, we define the value function update step as

$$V(s_{t}) \leftarrow V(s_{t}) + \alpha(R(s_t, a_t) + \gamma V(s_{t+1}) - V(s_{t}))$$

where $\alpha$ is our learning rate. We also define $y_t = R(s_t, a_t) + \gamma V(s_{t+1})$ as the target and $\delta_t = y_t - V(s_{t})$ as the TD error. Note that ideally, we want to have a Bellman residual of 0, which would mean that our value function accurately estimates the reward-to-go of the entire episode (as the sum of the reward of our current timestep and the reward-to-go at the next timestep). In a sense, you can kind of think of the Bellman error as a form of a loss function, where our goal is to optimize $V$ by minimizing the Bellman error.

Our definition of the update step changes if we’re using non-linear approximators (like neural networks). We now update the parameters $\theta$ of our network in the direction of the gradient, which means our parameter update step looks like:

$$\theta \leftarrow \theta + \alpha(R(s_t, a_t) + \gamma V_\theta(s_{t+1}) - V_\theta(s_{t}))\nabla_\theta V_\theta(s_{t})$$

where $\nabla_\theta V_\theta(s_{t})$ refers to the gradient of the value function network with respect to $\theta$. Note that this is called a semi-gradient (Murphy) since we do not take the gradient with respect to the target $y_t$, even though $y_t$ also depends on $\theta$ through $V_\theta(s_{t+1})$; instead, we treat $y_t$ as a constant. Hopefully you can see how the value function is essentially bootstrapping here, since at every timestep we want the value function to approximate the target $y_t$ (which is itself defined by the value function).

TD Learning alone is the equivalent of the policy evaluation part of policy iteration. To improve the policy, you can follow the idea of policy improvement, where we take actions with policy $\pi$ that maximize the expected reward-to-go, or learn the Q-function, using either on-policy methods (SARSA) or off-policy methods (Q-Learning).
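Here’s what the tabular TD(0) update looks like in code; the transition below is a hypothetical sample, and `V` is just a dictionary from states to value estimates:

```python
from collections import defaultdict

V = defaultdict(float)   # tabular value estimates, default 0
alpha, gamma = 0.1, 0.99

def td0_update(V, s, r, s_next, done):
    """One TD(0) update from a single (s, a, r, s') sample."""
    target = r + gamma * (0.0 if done else V[s_next])  # y_t
    td_error = target - V[s]                           # delta_t
    V[s] += alpha * td_error
    return td_error

# Hypothetical transition: from state "A" we received reward 1.0 and landed in "B".
td0_update(V, "A", 1.0, "B", done=False)
print(V["A"])  # 0.1 after the first update, since V started at 0
```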

Monte-Carlo Learning

Monte-Carlo (MC) learning is focused on rolling out a fixed policy and then using the average return from multiple visits to a state (across episodes) to estimate its value. We use this average return (known as the MC estimate) to update the value function as done below (eq 2.16 of Murphy):

$$V(s_t) \leftarrow V(s_t) + \alpha[G_t - V(s_t)]$$

where we define $\alpha$ as the learning rate. Something to note is that even though you are rolling out a single policy, the “average” here comes from averaging the estimated return every time you start from a given state. For example, if I was in some state $s_0$, transitioned to state $s_1$, and then came back to $s_0$, we would get two estimates for the return $G_t$ (since we visit $s_0$ twice). So, we average over $G_t$ to ensure that $V(s_0)$ converges to the expected return under the policy.

Once our value function has converged, we can then use the same approach to solve for the Q function $Q_\pi(s,a) = \mathbb{E}[G_t \mid s_t=s, a_t=a]$, where we start our rollout with some action a. So instead of just exploring all states, we explore all states and take all possible actions in each state, giving us the following update rule:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha[G_t - Q(s_t, a_t)]$$

To learn an optimal policy, we can combine both our MC estimate of Q with the idea of policy iteration. At some iteration $k$, we can define the updated policy as

$$\pi_{k+1} = \argmax_a Q_k(s, a)$$

where we define $Q_k$ to be the MC estimated Q value. To ensure that you can reach every state-action pair, you can use an $\epsilon$-greedy policy, where you define the action you step with in an environment as the following: $a_t = a_\pi$ with probability $1-\epsilon$ and $a_t = \text{random action}$ with probability $\epsilon$.

In practice, we can just estimate $Q(s, a)$ and use the Q-function to estimate the value function (if needed). This is what the training loop would essentially look like (a code sketch of the per-episode update follows the list):

  1. Randomly initialize the Q-function $Q(s, a)$ and initialize the policy $\pi$ to be $\epsilon$-greedy w.r.t. $Q$.
  2. Define $\mathcal{D} = \{\}$ to be our episode buffer containing state-action-reward tuples, where $|\mathcal{D}| = T$.
  3. while $Q(s, a)$ not converged:
    1. Create episode buffer $\mathcal{D} = \{(s_0, a_0, r_1), ..., (s_{T-1}, a_{T-1}, r_{T})\}$ by following $\pi$ for a full episode.
    2. for each $(s_t, a_t, r_{t+1}) \in \mathcal{D}$:
      1. calculate $G_t = \sum_{k=0}^{T-t-1} \gamma^kr_{t+k+1}$.
      2. update $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha[G_t - Q(s_t, a_t)]$
    3. for each unique state $s$ visited in the episode:
      1. update $\pi(s)$ to be $\epsilon$-greedy w.r.t. the new $Q$: take $\argmax_a Q(s, a)$ with probability $1-\epsilon$ and a random action with probability $\epsilon$
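Here’s a minimal sketch of the per-episode pieces above: an every-visit MC update of $Q$ from one completed episode and an $\epsilon$-greedy action selector. The episode itself is a hypothetical list of (state, action, reward) tuples:

```python
from collections import defaultdict
import random

Q = defaultdict(float)               # Q[(s, a)] table, default 0
alpha, gamma, eps = 0.1, 0.99, 0.1
actions = [0, 1]                     # hypothetical discrete action set

def mc_update(Q, episode):
    """Every-visit MC update: episode is [(s, a, r), ...] in time order,
    where r is the reward received after taking a in s."""
    G = 0.0
    for s, a, r in reversed(episode):        # accumulate G_t backwards
        G = r + gamma * G
        Q[(s, a)] += alpha * (G - Q[(s, a)])

def epsilon_greedy(Q, s):
    """Pick a random action with probability eps, else the greedy action."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# Hypothetical 3-step episode.
mc_update(Q, [("s0", 0, 1.0), ("s1", 1, 0.0), ("s0", 0, 2.0)])
print(Q[("s0", 0)], Q[("s1", 1)])
```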

TD-$\lambda$

TD-Learning and Monte-Carlo are pretty different. In TD-Learning, we estimate the return at every timestep and use that bootstrapped estimate to update the value function. In Monte-Carlo learning, we instead roll out the entire episode to compute $G_t$ and then use that value to update the value (or Q) function. What this means is that TD-Learning is very effective for longer episodes because you can update your value function at every step, whereas with MC you have to wait until the end of the episode before you can update your network with respect to $G_t$.

Because TD-Learning bootstraps at every timestep, it introduces bias, whereas MC updates have much more variance. In TD-Learning, if your target ($R(s_t, a_t) + \gamma V(s_{t+1})$) is incorrect, you end up propagating this bootstrapped guess through the rest of your updates. In MC, because each rolled-out episode can result in drastically different behavior, $G_t$ can take on a wide range of values, resulting in high variance.

What we can do is combine the benefits of TD-Learning (fast updates and lower variance) and MC (less bias). Instead of directly estimating $G_t$, we can calculate an $n$-step return and then use the value function to estimate the remainder of the return. This means that we can define the $n$-step return as the following (eq 2.19 of Murphy):

$$G_{t:t+n} = r_t + \gamma r_{t+1} + ... + \gamma^{n-1} r_{t + n -1} + \gamma^n V(s_{t + n})$$

Assume we model the value function as a neural network, $V_\theta$, where $\theta$ denotes the weights of our neural network. At every iteration, we define the n-step TD update of $\theta$ as:

$$\theta \leftarrow \theta + \alpha[G_{t:t+n} - V_\theta(s_t)]\nabla_\theta V_\theta(s_t)$$

where $\alpha$ is our learning rate. But the equation above is specified for a single choice of $n$. What we can do instead is take a weighted average of the $n$-step returns over all values of $n$ for a more accurate estimate, which we define as the lambda return (eq 2.23 of Murphy):

$$G_t^{\lambda} \triangleq (1-\lambda) \sum_{n=1}^\infty \lambda^{n-1}G_{t:t+n}$$

where $\lambda \in [0, 1]$. Something kinda cool here is that setting $\lambda=0$ gives us the standard TD-Learning algorithm:

$$\begin{aligned} G_t^{0} &= (1-0) \sum_{n=1}^\infty 0^{n-1}G_{t:t+n} \\ &= 1[0^{1-1}G_{t:t+1} + \sum_{n=2}^\infty 0^{n-1}G_{t:t+n}]\\ &= 1[G_{t:t+1} + 0]\\ &= G_{t:t+1}\\ \end{aligned}$$

In the case where $\lambda = 1$, we observe that this would be equivalent to taking the standard episodic Monte-Carlo return, where our episode has some fixed length $T$:

$$\begin{aligned} G_t^{1} &= (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1}G_{t:t+n} \\ &= (1-\lambda)[\sum_{n=1}^{T-t-1} \lambda^{n-1}G_{t:t+n} + \sum_{n=T-t}^{\infty} \lambda^{n-1}G_{t:t+n}] \\ &= (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1}G_{t:t+n} + (1-\lambda) \sum_{n=T-t}^{\infty} \lambda^{n-1}G_{t:t+n} \\ &= (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1}G_{t:t+n} + (1-\lambda) \sum_{n=T-t}^{\infty} \lambda^{n-1}G_{t} \\ &= (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1}G_{t:t+n} + G_{t}(1-\lambda) \sum_{n=T-t}^{\infty} \lambda^{n-1} \\ &= (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1}G_{t:t+n} + G_{t}(1-\lambda) \sum_{n=0}^{\infty} \lambda^{n+T-t-1} \\ &= (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1}G_{t:t+n} + G_{t}\lambda^{T-t-1}(1-\lambda) \sum_{n=0}^{\infty} \lambda^{n} \\ &= (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1}G_{t:t+n} + G_{t}\lambda^{T-t-1}(1-\lambda) \frac{1}{1-\lambda} \\ &= (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1}G_{t:t+n} + G_{t}\lambda^{T-t-1} \\ &= (1-1)\sum_{n=1}^{T-t-1} 1^{n-1}G_{t:t+n} + G_{t}1^{T-t-1} \\ &= G_{t} \end{aligned}$$

since we know $\forall n \geq T-t, G_{t:t+n} = G_t$ (once we reach terminal timestep $T$, any $n$-step return past that is the same as $G_t$). Since the manipulation assumes $\lambda \in (0, 1)$, we leverage the identity $\sum_{n=0}^\infty (1-\lambda)\lambda^n = 1$ to separate the returns into 2 intervals, $n \in [1, T-t-1]$ and $n \in [T-t, \infty)$, and only take $\lambda \rightarrow 1$ in the final step.

Just like most of the other algorithms, we can also define $G_t^\lambda$ with a recursive equation, which would look like the following (eq 2.25 of Murphy):

$$G_t^\lambda = r_t + \gamma[(1-\lambda)V_{t+1} + \lambda G_{t+1}^\lambda]$$

One of the issues with TD-$\lambda$ is that calculating $G_t^\lambda$ depends on returns at future timesteps $G_{t:t+n}$, which means that, as written, TD-$\lambda$ cannot be used as an online algorithm. A solution to this is the idea of eligibility traces, where we maintain a decaying weighted sum of past value-function gradients and use it in every update. The reason this ends up equivalent to using the return $G_t^\lambda$ is that, as you progress through your episode, each TD error gets credited back to previously visited states with geometrically decaying weights, mirroring the weighting in $G_t^\lambda$ (eq. 2.26 of Murphy):

$$z_t = \gamma \lambda z_{t-1} + \nabla_\theta V_\theta(s_t)$$

If you roll this equation out, it becomes pretty clear how gradients from previous timesteps accumulate into the current trace:

$$z_t = \nabla_\theta V_\theta(s_t) + \gamma \lambda [\nabla_\theta V_\theta(s_{t-1}) + \gamma \lambda [\nabla_\theta V_\theta(s_{t-2}) + ... ]]$$

This also gives us the following online TD-$\lambda$ update step:

$$\theta \leftarrow \theta + \alpha[r_{t+1} + \gamma V_\theta(s_{t+1}) - V_\theta(s_t)]z_t$$
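Here’s a tabular sketch of online TD($\lambda$) with an accumulating eligibility trace; in the tabular case the “gradient” $\nabla_\theta V_\theta(s_t)$ is a one-hot indicator on the visited state, so the trace is just a per-state decaying credit counter (the 5-state setup and transitions are made up):

```python
import numpy as np

n_states = 5
V = np.zeros(n_states)          # tabular value estimates
z = np.zeros(n_states)          # eligibility trace, one entry per state
alpha, gamma, lam = 0.1, 0.99, 0.9

def td_lambda_step(s, r, s_next, done):
    """One online TD(lambda) update for a single transition."""
    global z
    z *= gamma * lam                                   # decay old credit
    z[s] += 1.0                                        # accumulate trace for the visited state
    target = r + gamma * (0.0 if done else V[s_next])
    delta = target - V[s]                              # one-step TD error
    V[:] += alpha * delta * z                          # credit all recently visited states
    if done:
        z[:] = 0.0                                     # reset the trace at episode end

# Hypothetical transitions from a 5-state chain.
td_lambda_step(0, 0.0, 1, done=False)
td_lambda_step(1, 1.0, 2, done=True)
print(V)
```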

SARSA

TD-Learning is all about policy evaluation. SARSA is an on-policy control algorithm that we can use to find the optimal policy $\pi^*$. Instead of solving for the value function, we solve for the Q-function, which makes it extremely easy to find $\pi^*$ (as we simply act greedily with respect to Q). Given a state-action-reward-state-action tuple $(s, a, r, s', a')$, we define the TD update rule for the Q-function as (eq 2.28 of Murphy):

$$Q(s, a) \leftarrow Q(s, a) + \alpha[r + \gamma Q(s', a') - Q(s, a)]$$

where $a' \sim \pi(s')$. We do this until all states and actions have been visited (which you could do with an $\epsilon$-greedy policy), to find $Q^*$ and $\pi^*$. Notice how the algorithm is called SARSA because of our tuple $s, a, r, s', a'$ :P

Q-Learning

SARSA can be modified to be an off-policy algorithm where instead of sampling $a' \sim \pi(s')$, we define $a'$ to be the action that maximizes the Q value given the next state: $a' = \argmax_a Q(s', a)$. What this then means is that given a state-action-reward-state tuple $(s, a, r, s')$, the Q-function update is defined as (eq 2.32 of Murphy):

$$Q(s, a) \leftarrow Q(s, a) + \alpha[r + \gamma \max_{a'}Q(s', a') - Q(s, a)]$$

This update rule is defined as Q-Learning for the “tabular case.” Note that for terminal states, $Q(s,a) = 0$ for any action $a$. The overall Q-Learning algorithm can be thought of as follows:

  1. Initialize the Q-function $Q(s, a)$ and the policy $\pi$ to be $\epsilon$-greedy w.r.t. $Q$. Note that in the tabular case the Q-function is generally initialized to all zeros.
  2. while $Q(s, a)$ not converged:
    1. Starting from the initial state $s$ of episode $E$, for each step of the episode:
      1. Sample $a = \argmax_a Q(s, a)$ with probability $1-\epsilon$, and $a = \text{random action}$ with probability $\epsilon$.
      2. step with action $a$ in the environment to get $(s', r)$.
      3. update $Q(s, a) \leftarrow Q(s, a) + \alpha[r + \gamma \max_{a'}Q(s', a') - Q(s, a)]$, then set $s \leftarrow s'$.
  3. Define $\pi^* = \argmax_a Q(s, a)$.

Note that the behavior policy $\pi$ only shows up in how we explore: the target $\gamma \max_{a'}Q(s', a')$ is read directly from the Q-table rather than from the action $\pi$ would actually take next, which is what makes Q-Learning off-policy. The goal of tabular Q-Learning is to “fill out” this table so that acting greedily with respect to it gives the optimal action.
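To make the tabular loop concrete, here’s a runnable sketch of Q-Learning on a tiny made-up corridor environment (states $0..3$ in a line, actions left/right, reward 1 for reaching the rightmost state); the environment is invented purely for illustration:

```python
import random
from collections import defaultdict

# Tiny made-up corridor: states 0..3, start at 0, reward 1 for reaching state 3.
N_STATES, GOAL = 4, 3
ACTIONS = [-1, +1]                       # move left / move right

def step(s, a):
    s_next = min(max(s + a, 0), GOAL)
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done

Q = defaultdict(float)
alpha, gamma, eps = 0.5, 0.9, 0.1

for _ in range(500):                     # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy behavior policy
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next, r, done = step(s, a)
        # off-policy target: bootstrap with the max over next actions (0 at terminal states)
        best_next = 0.0 if done else max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

greedy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES)}
print(greedy)   # non-terminal states (0-2) should prefer +1 (move right)
```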

Instead of having to create a Q-table where we store the Q-value for every state and action, we can approximate the Q-value using neural networks as well! What this means is that the Bellman residual becomes the basis of our loss function (eq 2.34 of Murphy):

$$\mathcal{L}(\theta \mid s, a, r, s') = [r + \gamma(1 - \text{done}) \ \text{sg}[\max_{a'}Q_\theta(s',a')] - Q_\theta(s, a)]^2$$

where we can then use a standard gradient descent algorithm (like SGD or Adam) to find the optimal Q-function $Q^*$ (note that $\text{sg}$ denotes the stop-gradient operator). The overall training algorithm looks as follows:

  1. Initialize the Q-function $Q_\theta (s, a)$ and the $\epsilon$-greedy policy $\pi = \epsilon\,\text{Uniform}(a) + (1-\epsilon)\delta (a = \argmax_a Q_\theta(s, a))$. Define the replay buffer $\mathcal{D}$.
  2. for each epoch $n = 0, 1, ..., N$:
    1. Starting from the initial state $s$ of episode $E$, for each step of the episode:
      1. sample $a \sim \pi(a \mid s)$.
      2. step with action $a$ in the environment to get $(s', r, \text{done})$.
      3. update the replay buffer $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s, a, s', r, \text{done})\}.$
    2. sample batch $B \subset \mathcal{D}$.
    3. calculate loss $\mathcal{L} (B, \theta) = \frac{1}{|B|} \sum_{(s,a,s',r, \text{done}) \in B}[r + \gamma(1 - \text{done}) \ \text{sg}[\max_{a'}Q_\theta(s',a')] - Q_\theta(s, a)]^2$.
    4. update parameters $\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(B, \theta)$.

In the case where the action space is discrete, we can simply define $\pi^* = \argmax_a Q^*(s, a)$ and calculate the Q-values for each possible action. In the continuous case, though, we have to learn a separate policy network $\pi (s)$ to approximate $\argmax_a Q^*(s, a)$. Note that we also refer to the $\mathcal{D}$ used in training as the experience replay buffer, where we collect and store the agent’s transitions. Because Q-Learning is off-policy, we can store any transitions in the buffer (they don’t need to come from the current policy or arrive in sequential order).

One of the biggest issues with Q-Learning is in the way that we define $\mathcal{L}(\theta)$; the target that we’re chasing is defined with the same parameters as the network we’re trying to update, which means we end up chasing a target that moves with every update. A very simple solution is to maintain an extra copy of the Q-network parameters, $\theta^-$, and compute the target with it:

$$y(r, s' \mid \theta^-) = r + \gamma \max_{a'}Q_{\theta^-}(s', a')$$

where we can update $\theta^-$ after every set number of episodes. Another popular alternative is using an exponential moving average (EMA) of the weights, which would mean

$$\bar{\theta} = \alpha\bar{\theta} + (1 - \alpha)\text{sg}(\theta)$$

where $\alpha \in (0, 1)$. This would then mean that $\mathcal{L}(\theta \mid s, a, r, s') = [y(r, s' \mid \bar{\theta}) - Q_\theta(s, a)]^2$.
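Here’s a small PyTorch-style sketch of this setup; the network sizes and hyperparameters are placeholders, and the batch tensors `s, a, r, s_next, done` are assumed to come from the replay buffer:

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 8, 4                      # placeholder sizes
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
q_target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
q_target.load_state_dict(q_net.state_dict())   # start the target as a copy
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, ema = 0.99, 0.995

def dqn_update(s, a, r, s_next, done):
    """One update. s, s_next: [B, obs_dim]; a: [B] long; r, done: [B] float."""
    with torch.no_grad():                                   # stop-gradient on the target
        y = r + gamma * (1 - done) * q_target(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q_theta(s, a)
    loss = ((y - q_sa) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # EMA update of the target network parameters
    with torch.no_grad():
        for p_t, p in zip(q_target.parameters(), q_net.parameters()):
            p_t.mul_(ema).add_((1 - ema) * p)
    return loss.item()
```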

Double Q-Learning

Another popular approach for stable Q-Learning is to have two separate Q-functions, $Q_1$ and $Q_2$, where one selects your action and the other estimates the Q-value. Why? Maximization bias. The issue with naive Q-Learning comes down to Jensen’s inequality for convex functions (here, the max): $\mathbb{E}[f(X)] \geq f(\mathbb{E}[X])$, which implies that $\mathbb{E}[\max_a Q(s, a)] \geq \max_a \mathbb{E}[Q(s, a)]$. What this means is that our Q-networks will naively overestimate the expected Q-value for the predicted optimal action, resulting in a network with high bias. Having two separate networks results in much lower bias, as the probability of both networks overestimating the same action is much lower.

We use the following set of equations to update both our Q networks (expanding eq 2.42 and 2.43 of Murphy):

$$\begin{aligned} Q_1(s, a) &\leftarrow Q_1(s, a) + \alpha[y_1(s, a) - Q_1(s, a)]\\ y_1(s, a) &= r + \gamma Q_1[s', \argmax_{a'}Q_2(s', a')] \\ Q_2(s, a) &\leftarrow Q_2(s, a) + \alpha[y_2(s, a) - Q_2(s, a)]\\ y_2(s, a) &= r + \gamma Q_2[s', \argmax_{a'}Q_1(s', a')] \end{aligned}$$

In the case where we’re using neural networks to model our Q-functions, then the training target changes to be the following:

$$y(r, s' \mid \theta, \bar{\theta}) = r + \gamma Q_{\bar{\theta}} [s', \argmax_{a'} Q_\theta (s', a')]$$
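In the deep-network case, that target takes a couple of lines: the online network selects the argmax action and the target network evaluates it. A sketch, where `q_online` and `q_target` are assumed to be the two Q-networks mapping a batch of states to per-action values:

```python
import torch

def double_q_target(q_online, q_target, r, s_next, done, gamma=0.99):
    """Double-Q target: select the action with the online net, evaluate it with the target net."""
    with torch.no_grad():
        next_a = q_online(s_next).argmax(dim=1, keepdim=True)      # action selection
        next_q = q_target(s_next).gather(1, next_a).squeeze(1)     # action evaluation
        return r + gamma * (1 - done) * next_q
```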

Policy-Based Reinforcement Learning

We’ve talked about ways we can solve for the value function and the Q-function, and use these functions to derive our action policy $\pi$. A very obvious drawback of some of these methods (like MC Learning and Policy Iteration) is that applying them in continuous action spaces is extremely difficult and computationally expensive. Policy-Based RL is all about taking gradients with respect to the policy parameters directly to maximize the expected return, where we denote $\pi_\theta(a \mid s)$ to be the policy network used to predict actions in a continuous space.

One of the fundamental ideas of policy gradient methods is the value of a policy, $J(\pi_\theta)$, which we define as the expected return over trajectories induced by the policy and the environment dynamics (eq 3.1 of Murphy):

$$J(\pi_\theta) = J(\theta) = \mathbb{E}_{p_\theta(\tau)}[R(\tau)], \ R(\tau) = \sum_{i=0}^{\infty} \gamma^i r_i$$

where $R(\tau)$ is the discounted sum of rewards along the trajectory. We define $p_\theta(\tau)$, the distribution of trajectories that result from the policy, as the following (eq 3.2 of Murphy):

$$p_\theta(\tau) = p(s_1) \prod_{k=1}^{T} \mathcal{T}(s_{k+1} \vert s_k, a_k) \pi_\theta (a_k\vert s_k)$$

What we’re essentially saying here is that the probability of some specific trajectory $\tau$ induced by $\pi_\theta$ is the probability of the starting state, multiplied at each step by the probability of the policy taking action $a_k$ in state $s_k$ and the probability of transitioning to state $s_{k+1}$ given $(s_k, a_k)$. You can think of a trajectory as the sequence of states visited and actions taken at every timestep (so $\tau = \{(s_1, a_1), (s_2, a_2), ..., (s_T, a_T)\}$).

Now let’s take the gradient of the value of the policy:

$$\begin{aligned} & \nabla_\theta J(\theta) \\ &= \int \nabla_\theta p_\theta(\tau)R(\tau)\,d{\tau} \\ &= \int p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}R(\tau)\,d{\tau} \\ &= \mathbb{E}_{p_\theta(\tau)}[\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} \ R(\tau)] \\ &= \mathbb{E}_{p_\theta(\tau)}[\nabla_\theta \log [p_\theta(\tau)] \ R(\tau)] \end{aligned}$$

What we’re doing here is unravelling the expectation into an integral and then doing some algebraic manipulation to model the gradient of the policy value as an expectation of the normalized gradient. Using the identity $\frac{\nabla \pi}{\pi} =\nabla \log(\pi)$ (the log-derivative trick), we’re able to model the gradient of the value as an expectation over the log-probability of that trajectory occurring multiplied by the reward of the trajectory.

When I first saw this equation, I was really confused with the idea of modelling the gradient of the policy value as an expectation over log-probabilities. Why would we want to do that? In practice, differentiating through the environment is not possible. Recall that $p_\theta(\tau)$ is dependent on the transition model $\mathcal{T}$ and the policy $\pi_\theta$. Although we can differentiate with respect to $\pi_\theta$, we cannot differentiate with respect to the transition model $\mathcal{T}$ as we often don’t know what $\mathcal{T}$ is.

By choosing to model the gradient as an expectation over log-probabilities, we’re able to make the math much simpler (since we know $\log(xy) = \log x + \log y$) to solve for $\nabla_\theta \log [p_\theta(\tau)]$:

$$\begin{aligned} \nabla_\theta \log [p_\theta(\tau)]& = \nabla_\theta \log [p(s_1) \prod_{k=1}^{T} \mathcal{T}(s_{k+1} \vert s_k, a_k) \pi_\theta (a_k\vert s_k)] \\ &= \nabla_\theta \left( \log [p(s_1)] + \log[\prod_{k=1}^{T} \mathcal{T}(s_{k+1} \vert s_k, a_k) \pi_\theta (a_k\vert s_k)] \right) \\ &= \nabla_\theta \left( \log [p(s_1)] + \sum_{k=1}^{T} \log[\mathcal{T}(s_{k+1} \vert s_k, a_k)] + \sum_{k=1}^{T} \log[\pi_\theta (a_k\vert s_k)] \right) \\ &= \sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k\vert s_k)] \end{aligned}$$

Notice how, because we modelled the expectation with logarithms, we’re able to drop $\mathcal{T}$ entirely, since it isn’t dependent on $\theta$ ($\nabla_\theta \mathcal{T} = 0$). A way I like to think about the intuitive meaning behind $\nabla_\theta \log[\pi_\theta (a_k\vert s_k)]$ is “how much does the probability of choosing the same action $a_k$ change if we change the weights $\theta$ a little bit?” This means that if $\nabla_\theta \log[\pi_\theta (a_k\vert s_k)]$ is high, then updating the parameters in that direction would result in a higher probability of picking $a_k$.

Putting this together, we can now define $\nabla_\theta J(\theta)$ as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{p_\theta(\tau)} [\sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k \vert s_k)]R(\tau)] $$

One of the issues of this definition can be seen once we start expanding it out:

$$\begin{aligned} \nabla_\theta J(\theta) &= \mathbb{E}_{p_\theta(\tau)} [\sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k \vert s_k)]R(\tau)]\\ &= \mathbb{E}_{p_\theta(\tau)} \left[\sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k \vert s_k)] \sum_{l=1}^{T} r_l \gamma^{l-1} \right] \\ &= \mathbb{E}_{p_\theta(\tau)} \left[r_1 \sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k \vert s_k)] + r_2 \gamma \sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k \vert s_k)] + ... + r_T \gamma^{T-1} \sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k \vert s_k)] \right] \end{aligned}$$

Let’s just pull apart the first term, $r_1 \sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k \vert s_k)]$ here:

$$\begin{aligned} &r_1 \sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k \vert s_k)]\\ &= \nabla_\theta \log[\pi_\theta (a_1 \vert s_1)] r_1 + \nabla_\theta \log[\pi_\theta (a_2 \vert s_2)] r_1 ... + \nabla_\theta \log[\pi_\theta (a_T \vert s_T)] r_1 \end{aligned}$$

How can it be that the first reward, $r_1$, depends on the gradients of the policy at future timesteps? Recall that the value of a policy, $J(\pi_\theta)$, is defined to be the expected reward from the trajectories we roll out. This means that the gradient of this function tells us how changes in the policy influence the reward. It doesn’t make sense for policy changes at later timesteps to influence a reward that was already collected at timestep 1.

We know that an action can only influence rewards at the current and later timesteps (causality), so each log-probability term should only be weighted by the rewards that come after it. This means that we can redefine the value gradient as the following (eq 3.16 of Murphy):

$$\begin{aligned} \nabla_\theta J(\theta) &= \mathbb{E}_{p_\theta(\tau)} \left[ \sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k \vert s_k)] \sum_{l=k}^{T} r_l \gamma^{l-1} \right] \\ &= \mathbb{E}_{p_\theta(\tau)} \left[ \sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k \vert s_k)] \gamma^{k-1} \sum_{l=k}^{T} r_l \gamma^{l-k} \right] \\ &= \mathbb{E}_{p_\theta(\tau)} \left[ \sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k \vert s_k)] \gamma^{k-1} G_k \right] \end{aligned}$$

I would encourage you to expand the terms in the expectation out to see how $\sum_{l=k}^{T} r_l \gamma^{l-k}$ simplifies to $G_k$.

Policy Gradient Theorem

How would we define $p_\theta$ if our episodes could be infinitely long? Instead of attempting to model the distribution of trajectories over the policy, we instead model the probability of reaching some certain state at time $t$, known as the discounted state visitation measure (eq 3.21 of Murphy):

$$\rho_\theta^\gamma(s) = \sum_{t=0}^{\infty} \gamma^t \sum_{s_0} p_0(s_0) p_\theta(s_0 \rightarrow s, t)$$

where we define $p_\theta(s_0 \rightarrow s, t)$ as the probability of reaching state $s$ at timestep $t$ when we start in state $s_0$ and roll out the policy $\pi$. An important note here is that $\rho_\theta^\gamma(s)$ is not an actual probability, but an (unnormalized) measure. This is mainly because we can see the same state multiple times in the same episode; we’re not asking for the probability of seeing a state in an episode, but rather quantifying the (discounted) occurrences of that state across the episode.

The more obvious reason is that $\sum_{s} \rho_{\theta}^\gamma(s) = \frac{1}{1 - \gamma}$. For the discounted state visitation measure to be a probability, all values must be non-negative (which is true) and they must sum to 1 (which is not true). To turn this measure into a probability, we can simply multiply by $1-\gamma$ (recall the geometric series identity $(1-\gamma)\sum_{t=0}^\infty \gamma^t = 1$, the same trick we used for the $\lambda$-return ;):

$$p_\theta^\gamma(s) = (1-\gamma) \rho_\theta^\gamma(s)$$

Let’s go back to policy gradients for a second. We know that $\nabla_\theta J(\theta)$ is defined with respect to the reward-to-go, i.e. $\mathbb{E}_{p_\theta(\tau)} \left[ \sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k \vert s_k)] \gamma^{k-1} G_k \right]$. If we condition the reward-to-go on $(s_k, a_k)$ (the law of iterated expectations), we get:

$$\begin{aligned} \nabla_\theta J(\theta) &= \mathbb{E}_{p_\theta(\tau)} \left[ \sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k \vert s_k)] \gamma^{k-1} G_k \right] \\ &= \mathbb{E}_{p_\theta(\tau)} \left[ \sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k \vert s_k)] \gamma^{k-1}\, \mathbb{E}\left[G_k \mid s_k, a_k \right] \right] \\ &= \mathbb{E}_{p_\theta(\tau)} \left[ \sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k \vert s_k)] \gamma^{k-1} Q_{\pi_\theta}(s_k, a_k) \right] \end{aligned}$$

Notice how the conditional expectation of the reward-to-go $G_k$ is exactly the Q-function $Q_{\pi_\theta}(s_k, a_k)$. Rewriting the expectation over trajectories in terms of the (unnormalized) discounted state visitation measure, we get:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho_\theta^\gamma(s),\ a \sim \pi_\theta(a\vert s)} \left[ \nabla_\theta \log[\pi_\theta (a \vert s)]\, Q_{\pi_\theta}(s, a) \right]$$

This is the heart of the policy gradient theorem (the sum over timesteps and the $\gamma^{k-1}$ factors have been absorbed into the discounted visitation measure). If we wanted to use $p_\theta^\gamma(s)$ instead of $\rho_\theta^\gamma(s)$, we would simply multiply the right-hand side by $\frac{1}{1-\gamma}$. What the policy gradient theorem tells you is that to optimize the expected return, which is defined with respect to the parameters of your action policy, you should adjust the weights of your policy network so that actions that lead to high returns become more likely and actions that lead to lower returns become less likely (which we can measure with the Q-function).

REINFORCE

REINFORCE is one of the most foundational policy gradient algorithms that directly leverages the policy gradient theorem to train a policy $\pi$. Given the current parameters $\theta_i$, we define the update step as

$$\begin{aligned} \theta_{i+1} &\leftarrow \theta_i + \alpha\nabla_\theta J(\theta) \\ \implies \theta_{i+1} &\leftarrow \theta_i + \alpha\sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k \vert s_k)] \gamma^{k-1}G_k \end{aligned}$$

Note that we use $G_k$ instead of $Q(s, a)$ since we don’t know the true Q-function. This means that we roll out an entire episode to compute $G_k$, Monte-Carlo style (we’re not learning any sort of value function or Q-function here).

This is what the training loop for REINFORCE would look like (a code sketch follows the list):

  1. initialize some stochastic policy $\pi_\theta$ with parameters $\theta$. Define $\tau = \emptyset$ to be the set containing all state-action-reward tuples at each timestep. Define the length of some episode $E = T$.
  2. while $\pi_\theta$ not converged:
    1. roll out some episode $E$ and store $(s_t, a_t, r_t)$ tuples into $\tau$.
    2. for $k = 1, 2, ..., T$:
      1. compute $G_k = \sum_{l=k}^T r_l \gamma^{l-k}$.
      2. update $\theta \leftarrow \theta + \alpha\nabla_\theta \log[\pi_\theta (a_k \vert s_k)] \gamma^{k-1}G_k$.
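Here’s a compact PyTorch sketch of the same loop for a discrete-action policy. The `env` object is an assumption: it’s taken to follow a Gym-style interface where `reset()` returns an observation and `step(a)` returns `(obs, reward, done)`; network sizes are placeholders, and (like most implementations) the extra $\gamma^{k-1}$ weighting in front of each term is dropped:

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2                     # placeholder sizes
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_episode(env):
    """One REINFORCE update from a single rolled-out episode.
    Assumes env.reset() -> obs and env.step(a) -> (obs, reward, done)."""
    log_probs, rewards = [], []
    obs, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done = env.step(action.item())
        rewards.append(reward)

    # reward-to-go G_k for every step, computed backwards
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    returns = torch.as_tensor(returns, dtype=torch.float32)

    # loss = -sum_k log pi(a_k | s_k) * G_k  (gradient ascent on J)
    loss = -(torch.stack(log_probs) * returns).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return sum(rewards)
```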

Advantage Actor Critic (A2C)

Actor-Critic methods are similar to REINFORCE except we estimate the return using TD-Learning with a learned value function instead of doing full Monte-Carlo rollouts. The actor is the action policy, while the critic is the value function. If we use the one-step TD estimate of the return (i.e. $G_t = r_t + \gamma V_{\theta_v}(s_{t+1})$), our update equation becomes

$$ \theta_{i+1} \leftarrow \theta_i + \alpha\sum_{k=1}^{T} \nabla_\theta \log[\pi_\theta (a_k \vert s_k)] \gamma^{k-1} \left( r_k +\gamma V_{\theta_v}(s_{k+1}) - V_{\theta_v}(s_k) \right)$$

Recall that $\delta_t = r_t +\gamma V_{\theta_v}(s_{t+1}) - V_{\theta_v}(s_t)$. Because you can think of $r_t +\gamma V_{\theta_v}(s_{t+1})$ as an estimate of $Q(s_t, a_t)$, where $a_t \sim \pi_{\theta_\pi}$, this means that $\delta_t \approx Q(s_t, a_t) - V_{\theta_v}(s_t) = \text{Adv}(s_t, a_t)$, which is where the name advantage actor critic comes from. Note that unlike REINFORCE, A2C is an online algorithm, as we update $V_{\theta_v}$ at every timestep.

This is what our training loop would look like:

  1. initialize some stochastic actor policy $\pi_{\theta_\pi}$ with parameters $\theta_\pi$ and a critic (value function) $V_{\theta_v}$ with parameters $\theta_v$.
  2. while $\pi_{\theta_\pi}$ and $V_{\theta_v}$ not converged:
    1. start with some initial state $s_0$.
    2. for $t = 0, 1, ...$:
      1. define $a_t \sim \pi_{\theta_\pi}(a_t \mid s_t)$.
      2. step in environment with $(s_t, a_t)$ to get $(s_{t+1}, r_t, \text{done}_t)$.
      3. compute target $y_t = r_t +\gamma (1 - \text{done}_t)V_{\theta_v}(s_{t+1})$
      4. compute $\delta_t = y_t - V_{\theta_v}(s_t)$.
      5. update critic $\theta_v \leftarrow \theta_v + \alpha_v \delta_t \nabla_{\theta_v}V_{\theta_v}(s_t)$.
      6. update actor $\theta_\pi \leftarrow \theta_\pi + \alpha_\pi \delta_t \nabla_{\theta_\pi} \log \pi_{\theta_\pi}(a_t \mid s_t)$.
      7. if $\text{done}_t = 1$ then exit current episode.

Generalized Advantage Estimation (GAE)

Just like how TD-$\lambda$ builds on top of TD-Learning, GAE allows for the idea of $n$-step returns to be introduced to advantage estimation, where we can define the $n$-step estimate as (eq 3.39 of Murphy):

$$G_{t:t+n} = r_t + \gamma r_{t+1} + ... + \gamma^{n} V_{\theta_v}(s_{t+n})$$

This would mean that the $n$-step advantage estimate is $A_{\theta_v} = G_{t:t+n} - V_{\theta_v}(s_t)$. If you think about this formula, you’ll realize that as we increase the number of steps in our advantage estimate, we get more variance but lower bias (and vice versa with a smaller number of steps). A better way to trade off bias and variance is to take a weighted average of the advantage estimates over some fixed number of steps (eq 3.45 of Murphy):

$$A_t = \frac{\sum_{n=1}^{T} \lambda^{n-1} A_t^n}{\sum_{n=1}^{T} \lambda^{n-1}}$$

This means that given $\delta_t = r_t +\gamma V_{\theta_v}(s_{t+1}) - V_{\theta_v}(s_t)$, we get the following recursive calculation for the advantage estimate (eq 3.47 of Murphy):

$$\begin{aligned} A_t &= \delta_t + \gamma \lambda \delta_{t+1} + ... + (\gamma \lambda)^{T-(t+1)}\delta_{T-1} \\ &= \delta_t + \gamma\lambda A_{t+1} \end{aligned}$$

To summarize, this is what the overall definition of GAE looks like:

  1. Define $\text{A}_\text{curr} = 0$.
  2. for $t = T, T-1, ..., 1$:
    1. calculate $\delta_t = r_t + \gamma V_{\theta_v}(s_{t+1}) - V_{\theta_v}(s_t)$.
    2. update $\text{A}_\text{curr} = \delta_t + \gamma \lambda \text{A}_\text{curr}$.
    3. update $A_t = \text{A}_\text{curr}$.
    4. update TD target $y_t = A_t + V_{\theta_v}(s_{t})$.

One thing that wasn’t obvious to me when I first saw this definition of GAE was the fact that instead of computing the advantage from $t = 1$ to $t = T$, we compute it the other way around. This is because of the way the advantage is recursively defined: $A_t$ depends on $A_{t+1}$, so we need future timesteps to compute earlier ones. Computing backwards (from $t=T$ to $t=1$) lets us make a single backward sweep, reusing the advantage from the next timestep directly at each earlier timestep.

The only change needed here is that when we train our policy $\pi$, we substitute $A_t$ into our gradient updates instead of $\delta$. Note that the gradient update for the critic $\theta_v$ can also be defined in terms of a loss objective that we try to minimize, where $\mathcal{L}_v = \frac{1}{2} \left( y_t - V_{\theta_v}(s_t) \right)^2$, and the actor’s loss in terms of a negative log-likelihood weighted by the advantage, $\mathcal{L}_\pi = -\log \pi_{\theta_\pi}(a_t \mid s_t)A_t$.
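Here’s a small numpy sketch of that backward sweep; `rewards`, `values`, and `dones` are assumed to come from a single rollout, with `values` holding $V_{\theta_v}(s_t)$ for $t = 0, ..., T$ (one extra entry so the last step can bootstrap):

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """rewards, dones: length T; values: length T+1 (includes V(s_T) for bootstrapping).
    Returns advantages A_t and TD targets y_t = A_t + V(s_t)."""
    T = len(rewards)
    adv = np.zeros(T)
    a_curr = 0.0
    for t in reversed(range(T)):                              # backward sweep
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        a_curr = delta + gamma * lam * nonterminal * a_curr
        adv[t] = a_curr
    targets = adv + values[:T]
    return adv, targets

# Hypothetical 3-step rollout.
adv, y = gae(np.array([1.0, 0.0, 1.0]), np.array([0.5, 0.4, 0.3, 0.2]),
             np.array([0.0, 0.0, 1.0]))
print(adv, y)
```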

Deep Deterministic Policy Gradients

Actor-Critic methods typically sample their actions from the policy, where $a_t \sim \pi(a_t \mid s_t)$. What about the case where we want to predict a deterministic action at each state, i.e. $a_t = \mu_\theta(s_t)$? We can think of Deterministic Policy Gradient methods as an extension of DQNs to continuous action spaces. Recall the definition of the policy value $J$:

$$J(\mu_\theta) = \mathbb{E}_{\rho_{\mu_\theta}(s)} \left[ R(s, \mu_\theta(s)) \right]$$

which would mean that its gradient would be defined as

$$\begin{aligned} & \nabla_\theta J(\mu_\theta) \\ &= \mathbb{E}_{\rho_{\mu_\theta}(s)} \left[\nabla_\theta Q(s, \mu_\theta(s)) \right] \\ &= \mathbb{E}_{\rho_{\mu_\theta}(s)} \left[\nabla_\theta \mu_\theta(s)\nabla_a Q(s, a)\vert_{a = \mu_\theta(s)} \right] \end{aligned}$$

Note that we simply pull out $\nabla_\theta \mu_\theta(s)$ by the chain rule. The equation above is what we define as the Deterministic Policy Gradient Theorem. One of the bigger consequences of having deterministic actions is that the policy can’t explore on its own. The way we overcome this is by doing off-policy training with a stochastic behavior policy. Let’s define $\pi_b$ as that stochastic behavior policy, with state distribution $\rho_{\pi_b}^\gamma$. Because we now introduce stochasticity, we redefine $J(\mu_\theta)$ to become $J_b(\mu_\theta)$ with the following definition (eq 3.72 of Murphy):

$$\begin{aligned} J(\mu_\theta) &= J_b(\mu_\theta) \\ &= \mathbb{E}_{\rho_{\pi_b}^\gamma(s)} \left[ V_{\mu_\theta(s)}(s) \right] \\ &= \mathbb{E}_{\rho_{\pi_b}^\gamma(s)} \left[ Q_{\mu_\theta}(s, \mu_\theta(s)) \right] \end{aligned}$$

Note that we can directly substitute the Q-function for the value function because $\mu_\theta$ is deterministic. Using this information, we can define the gradient as (eq 3.73 of Murphy):

$$\begin{aligned} \nabla_\theta J_b(\mu_\theta) &= \mathbb{E}_{\rho_{\pi_b}^\gamma(s)} \left[ \nabla_\theta Q_{\mu_\theta}(s, \mu_\theta(s)) \right] \\ &= \mathbb{E}_{\rho_{\pi_b}^\gamma(s)} \left[ \nabla_\theta \mu_\theta(s) \nabla_a Q_{\mu_\theta}(s, a) \vert_{a = \mu_\theta(s)} \right] \end{aligned}$$

The reason we substitute $a = \mu_\theta(s)$ into the Q-function is because we pull $\nabla_\theta \mu_\theta$ out due to the chain rule. In practice, we don’t know what $Q_{\mu_\theta}$ actually is, so instead we approximate it with a neural network with weights $w$. This gives us the following update rules (eq 3.74 - 3.76 of Murphy):

$$\begin{aligned} \delta &= r_t + \gamma Q_w(s_{t+1}, \mu_{\theta}(s_{t+1})) - Q_w(s_{t}, a_t)\\ w_{t+1} &\leftarrow w_t + \alpha_w \delta \nabla_w Q_w(s_t, a_t)\\ \theta_{t+1} &\leftarrow \theta_t + \alpha_\theta \nabla_\theta \mu_\theta(s_t)\nabla_a Q_{w}(s, a) \vert_{a = \mu_\theta(s)} \\ \end{aligned}$$

The idea of modelling the Q-function as a neural network is the key idea behind Deep Deterministic Policy Gradient (DDPG). In practice, this means that we learn the Q function similar to how we learn it in Deep Q Learning, where our loss is defined as (eq 3.78 of Murphy):

$$\mathcal{L}(s, a, r, s') = \left[Q_w(s, a) - (r + \gamma Q_{\bar{w}}(s', \mu_\theta(s'))) \right]^2$$

that we try to minimize. Our actor loss can be defined as $\mathcal{L}_\theta = Q_w(s, \mu_\theta(s))$, which we essentially try to do gradient ascent on (i.e. maximize $\mathcal{L}_\theta$). Let’s walk through the overall algorithm to understand how DDPG trains:

  1. Initialize Q-Function $Q_w$, deterministic policy $\mu_\theta$. Also define $\bar{w}$ to be the target Q-network weights.
  2. Initialize replay buffer $\mathcal{D} = \emptyset$, which we will fill with state-action information throughout the episode.
  3. while $Q_w, \mu_\theta$ not converged:
    1. for state $s$ in episode $E$:
      1. define action $a_t = \mu_\theta(s_t)$ (optionally adding exploration noise) and step with $a_t$ in the environment.
      2. append $(s, a, r, s', \text{done})$ to $\mathcal{D}$.
    2. for each transition batch $\mathcal{B} \subset \mathcal{D}$:
      1. calculate target $y = r + \gamma (1- \text{done}) Q_{\bar{w}}(s', \mu_\theta(s'))$.
      2. update $w \leftarrow w - \alpha \nabla_w \frac{1}{\vert \mathcal{B} \vert} \sum_{(s, a, r, s', \text{done}) \in \mathcal{B}} (Q_w(s, a) - y)^2 $
      3. update $\theta \leftarrow \theta + \alpha \nabla_\theta \frac{1}{\vert \mathcal{B} \vert} \sum_{s \in \mathcal{B}} Q_w(s, \mu_\theta(s))$
    3. update target Q-network with EMA smoothing: $\bar{w} \leftarrow \rho (\bar{w}) + (1-\rho) w$

Twin Delayed DDPG (TD3)

Twin Delayed DDPG extends DDPG by adding 3 different methods to improve training: target policy smoothing, clipped Double Q Learning, and delayed policy updates. Let’s go through these one at a time.

Target policy smoothing means that we add some (clipped) noise to the action used in the Q target, which smooths out the Q estimates and increases policy robustness and generalization:

$$a = \mu_\theta(s) + \epsilon = \pi_\theta(s), \ \epsilon \sim \mathcal{N}(0, \sigma)$$

Just like we talked about with Double Q-Learning earlier, we apply the same concept to the Q-networks here. What this means is that we take the smaller of the two target Q-values (to deal with overestimation) when computing the target:

$$y = r + \gamma (1- \text{done}) \min_{i=1,2}Q_{\bar{w}_i}(s', \pi_\theta(s'))$$

Which would mean that our losses would look like the following:

$$\begin{aligned} \mathcal{L}(w_1, B) &= \mathbb{E}_{(s, a, r, s', \text{done}) \sim B} \left[(Q_{w_1}(s, a) - y) ^2 \right] \\ \mathcal{L}(w_2, B) &= \mathbb{E}_{(s, a, r, s', \text{done}) \sim B} \left[(Q_{w_2}(s, a) - y) ^2 \right] \end{aligned}$$

We then simply define the policy to maximize $Q_{w_1}$, i.e. $\pi_\theta = \argmax_\theta \mathbb{E} \left[Q_{w_1}(s, \mu_\theta(s)) \right]$.
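Putting the first two tricks together, the TD3 target computation looks roughly like the sketch below; `actor_target`, `q1_target`, and `q2_target` are assumed to be the target networks (with the Q-networks taking a state-action pair), and the noise scale, noise clip, and action limit values are placeholders:

```python
import torch

def td3_target(actor_target, q1_target, q2_target, r, s_next, done,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Clipped double-Q target with target policy smoothing."""
    with torch.no_grad():
        a_next = actor_target(s_next)
        # target policy smoothing: add clipped Gaussian noise to the target action
        noise = (torch.randn_like(a_next) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-act_limit, act_limit)
        # clipped double-Q: take the smaller of the two target Q estimates
        q_next = torch.min(q1_target(s_next, a_next), q2_target(s_next, a_next))
        return r + gamma * (1 - done) * q_next
```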

When we refer to delayed policy updates, we mean that we update the policy less frequently than the Q-functions, so that the value estimates have time to settle between policy updates. (Increasing the critic’s learning rate is another way to push its convergence along faster.) The original TD3 paper mentions that doing one policy update for every 2 Q-function updates allows for more stability.

Let $d$ be defined as the number of steps we wait to update our policy $\pi$. The update logic would then be as follows (in step 3.2 of the DDPG algorithm):

  1. Define running count $i = 0$
  2. for transition batch $\mathcal{B} \in \mathcal{D}$:
    1. $i \leftarrow i + 1$.
    2. calculate target $y = r + \gamma (1- \text{done}) Q_{\bar{w}}(s', \mu_\theta(s'))$.
    3. update $w \leftarrow w - \alpha \nabla_w \frac{1}{\vert \mathcal{B} \vert} \sum_{(s, a, r, s', \text{done}) \in \mathcal{B}} (Q_w(s, a) - y)^2 $
    4. if $i \bmod d = 0$:
      1. update $\theta \leftarrow \theta + \alpha \nabla_\theta \frac{1}{\vert \mathcal{B} \vert} \sum_{s \in \mathcal{B}} Q_w(s, \mu_\theta(s))$

Trust Region Policy Optimization (TRPO)

In the standard policy gradient algorithm, we might rely on a small learning rate or some form of smoothing to ensure that our parameters $\theta$ don’t update in large steps. The objective of TRPO is to take the largest gradient step we can to update $\theta$, subject to a KL-divergence constraint. Recall that the KL-divergence measures the “difference” between 2 probability distributions, which we define as the following:

$$\begin{aligned} D_{\mathrm{KL}}(p\Vert q) &= \mathbb{E}_{x \sim p} \left[ \log \frac{p(x)}{q(x)} \right] \\ &= \int p(x) \log\frac{p(x)}{q(x)} \,dx \end{aligned}$$

This means that our surrogate objective depends on the “old” policy $\pi_{\text{old}}$ and the new policy $\pi_\theta$, which we can define as:

$$\mathcal{L}(\pi_{\text{old}}, \pi_\theta)= \mathbb{E}_{s\sim \rho_{\pi_{\text{old}}}^\gamma,\ a\sim \pi_{\text{old}}(a \vert s)} \left[ \frac{\pi_\theta(a \vert s)}{\pi_{\text{old}}(a \vert s)}\, A^{\pi_{\text{old}}}(s, a) \right]$$

So when we perform our $\theta$ updates, we define the $\theta$-update to be:

$$\theta_{i+1} =\argmax_\theta\, \mathcal{L}(\pi_{\text{old}}, \pi_{\theta}),\ \ \text{s.t.}\ \ \mathbb{E}_{s\sim \rho_{\pi_{\text{old}}}}\!\left[ D_{\mathrm{KL}}\big(\pi_{\text{old}}(\cdot\vert s)\,\Vert\,\pi_{\theta}(\cdot\vert s)\big) \right] \leq \delta$$

where $\delta$ is simply the KL budget (usually a very tiny value like 0.001). Since solving this constrained problem exactly is pretty computationally expensive, we can take the first-order Taylor expansion of $\mathcal{L}(\pi_{\text{old}}, \pi_\theta)$ around $\theta_{\text{old}}$ to approximate the objective:

$$\begin{aligned} \mathcal{L}(\pi_{\text{old}}, \pi_\theta) &= \mathbb{E}_{\rho_{\pi_{\text{old}}}^\gamma(s),\ a\sim \pi_{\text{old}}(a\vert s)} \left[ \frac{\pi_\theta(a \vert s)}{\pi_{\text{old}}(a \vert s)}\, A^{\pi_{\text{old}}}(s, a) \right] \\ &\approx \mathcal{L}(\pi_{\text{old}}, \pi_{\text{old}}) + \left[\nabla_{\theta} \mathcal{L}(\pi_{\text{old}}, \pi_\theta) \right]^T \vert_{\theta = \theta_\text{old}}(\theta - \theta_{\text{old}}) \\ &\approx g^T (\theta - \theta_{\text{old}}),\ \ g = \nabla_{\theta} \mathcal{L}(\pi_{\text{old}}, \pi_\theta) \vert_{\theta = \theta_\text{old}} \end{aligned}$$

Let’s also do the same to the KL objective to determine what our constraint would be, but instead take a second order expansion:

$$ \begin{aligned} D_{\mathrm{KL}}(\pi_{\text{old}}(\cdot \vert s)\,\Vert\,\pi_{\theta}(\cdot \vert s)) &= \mathbb{E}_{a\sim \pi_{\text{old}}(a\vert s)}\!\left[\log \frac{\pi_{\text{old}}(a\vert s)}{\pi_{\theta}(a\vert s)}\right] \\ &\approx D_{\mathrm{KL}}(\pi_{\text{old}}, \pi_{\text{old}}) + \left[\nabla_\theta D_{\mathrm{KL}}(\pi_{\text{old}}, \pi_\theta)\right]^T\vert_{\theta = \theta_\text{old}} (\theta - \theta_{\text{old}}) + \tfrac{1}{2} (\theta - \theta_{\text{old}})^T H_{\mathrm{KL}}(\pi_{\text{old}}, \pi_\theta)\, (\theta - \theta_{\text{old}}) \\ &\approx \tfrac{1}{2} (\theta - \theta_{\text{old}})^T H_{\mathrm{KL}}(\pi_{\text{old}}, \pi_\theta)\, (\theta - \theta_{\text{old}}) \end{aligned} $$

We can then solve this approximate problem analytically with Lagrangian methods. Note that we introduce $\alpha \in (0, 1)$ as a backtracking line-search coefficient to make sure we actually satisfy the KL budget: we take the smallest non-negative integer $j$ for which the constraint holds:

$$ \theta_{k+1} = \theta_k + \alpha^{j}\,\sqrt{\frac{2\delta}{g^\top H^{-1} g}}\; H^{-1} g $$

In practice, computing $H$ explicitly and then inverting it is expensive. Since we only need the product $H^{-1} g$, we can instead use the conjugate gradient method to solve $Hx = g$ for $x$ (so that $x = H^{-1}g$) using only Hessian-vector products, and substitute that $x$ into the update.
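Here’s a minimal sketch of that conjugate gradient solve, assuming `hvp(v)` is a callable that returns $Hv$ as a flat tensor (e.g., the Hessian-vector product sketch above) and `g` is the flattened policy gradient:

```python
import torch

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    # Solve H x = g for x using only Hessian-vector products hvp(v) = H v
    x = torch.zeros_like(g)
    r = g.clone()              # residual r = g - H x (x = 0 initially)
    p = r.clone()
    rs_old = r.dot(r)
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / p.dot(Hp)
        x = x + alpha * p
        r = r - alpha * Hp
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```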

This is what the overall training would look like:

  1. Initialize stochastic policy $\pi_\theta$, value function $V_\phi$. Choose KL budget $\delta$, coefficient $\alpha$, discount factor $\gamma \in (0,1)$, $\lambda \in (0,1)$.
  2. for iterations $i = 0, 1, 2, ...$:
    1. roll out $\pi_{\theta_i}$ and collect a set of trajectories $\mathcal{D}_i = \{\tau_k\}$.
    2. compute $(A_{1:T}, y_{1:T}) = \text{GAE}(r_{1:T}, V_{\phi_i}(s_{1:T}), \gamma, \lambda)$.
    3. estimate the policy gradient using samples from $\mathcal{D}_i$: $$g_i = \frac{1}{|\mathcal{D}_i|} \sum_{\tau \in \mathcal{D}_i} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t\vert s_t)\big|_{\theta=\theta_i}\, A_t.$$
    4. use conjugate gradient to compute $x_i \approx H_i^{-1} g_i$.
    5. update policy $\theta_{i+1} = \theta_i + \alpha^{j}\, \sqrt{\frac{2\delta}{x_i^{\top} H_i x_i}}\; x_i$, with $j$ chosen by the backtracking line search described above.
    6. update value function $\phi_{i+1} = \arg\min_{\phi} \, \frac{1}{|\mathcal{D}_i|T} \sum_{\tau \in \mathcal{D}_i} \sum_{t=0}^{T} \big( V_\phi(s_t) - y_t \big)^2.$
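Putting steps 4 and 5 together, here is a rough sketch of the parameter update with the backtracking line search; `surrogate(theta)` and `kl(theta)` are assumed to be callables that evaluate $\mathcal{L}(\pi_{\text{old}}, \pi_\theta)$ and the mean KL on the collected batch at a flat parameter vector, and `hvp` is a Hessian-vector product as before (all names are hypothetical):

```python
import torch

def trpo_update(theta_old, x, hvp, surrogate, kl,
                delta=0.001, alpha=0.8, max_backtracks=10):
    # theta_old: flat parameter vector, x ~= H^{-1} g from conjugate gradient
    full_step = torch.sqrt(2.0 * delta / x.dot(hvp(x))) * x
    L_old = surrogate(theta_old)
    for j in range(max_backtracks):
        theta_new = theta_old + (alpha ** j) * full_step
        # accept the largest step that stays within the KL budget and improves the surrogate
        if kl(theta_new) <= delta and surrogate(theta_new) >= L_old:
            return theta_new
    return theta_old  # reject the update if no backtracked step satisfies the constraint
```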

Proximal Policy Optimization (PPO)

PPO simplifies TRPO by using only first-order methods (instead of second-order). Although PPO has a variant that penalizes the KL-divergence (similar in spirit to TRPO’s constraint), in practice we usually use a clipping objective to ensure that there aren’t large deviations in parameter updates, giving us the following loss definition:

$$\mathcal{L}(\pi_{\text{old}}, \pi_\theta) = \mathbb{E}_{s, a \sim \pi_{\text{old}}} \left[\min \left(\frac{\pi_\theta(a \vert s)}{\pi_{\text{old}}(a \vert s)} A^{\pi_{\text{old}}}(s, a),\ \text{clip} \left(\frac{\pi_\theta(a \vert s)}{\pi_{\text{old}}(a \vert s)},\, 1 - \epsilon,\, 1 + \epsilon \right) A^{\pi_{\text{old}}}(s, a) \right)\right]$$

I’d recommend tracing out the six possible cases the objective can reduce to ($A > 0$ or $A < 0$, combined with $\frac{\pi_\theta(a \vert s)}{\pi_{\text{old}}(a \vert s)} > 1 + \epsilon$, $< 1 - \epsilon$, or in between) for better intuition behind the loss function.
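Here’s a minimal PyTorch sketch of the clipped objective; the tensor names (`logp_new`, `logp_old`, `adv`) are my own, and the result is negated because optimizers minimize while the surrogate is maximized:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    # logp_new: log pi_theta(a|s); logp_old: log pi_old(a|s), detached; adv: advantage estimates
    ratio = torch.exp(logp_new - logp_old)            # pi_theta(a|s) / pi_old(a|s)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.min(unclipped, clipped).mean()      # negate: optimizers minimize
```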

We train this similarly to TRPO, just with the modified loss function:

  1. Initialize stochastic policy $\pi_\theta$, value function $V_\phi$. Define discount factor $\gamma \in (0,1)$, $\lambda \in (0,1)$.
  2. for iterations $i = 0, 1, 2, ...$:
    1. roll out $\pi_{\theta_i}$ and collect a set of trajectories $\mathcal{D}_i = \{\tau_k\}$.
    2. compute $(A_{1:T}, y_{1:T}) = \text{GAE}(r_{1:T}, V_{\phi_i}(s_{1:T}), \gamma, \lambda)$.
    3. compute the loss $$\mathcal{L}(\pi_{\text{old}}, \pi_\theta) = \mathbb{E}_{s, a \sim \pi_{\text{old}}} \left[\min \left(\frac{\pi_\theta(a \vert s)}{\pi_{\text{old}}(a \vert s)} A^{\pi_{\text{old}}}(s, a),\ \text{clip} \left(\frac{\pi_\theta(a \vert s)}{\pi_{\text{old}}(a \vert s)},\, 1 - \epsilon,\, 1 + \epsilon \right) A^{\pi_{\text{old}}}(s, a) \right)\right]$$
    4. update policy weights $\theta \leftarrow \theta + \alpha \nabla_\theta \mathcal{L}(\pi_{\text{old}}, \pi_\theta)$.
    5. update value function $\phi_{i+1} = \arg\min_{\phi} \, \frac{1}{|\mathcal{D}_i|T} \sum_{\tau \in \mathcal{D}_i} \sum_{t=0}^{T} \big( V_\phi(s_t) - y_t \big)^2.$

where $\alpha$ denotes our learning rate.
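And here is a rough sketch of one full update phase following the loop above; `policy.log_prob`, `value_fn`, the batch keys, and the coefficient `vf_coef` are all assumed/hypothetical interfaces rather than a canonical implementation. The value-function term corresponds to step 5 (fitting $V_\phi$ to the GAE targets $y_t$):

```python
import torch

def ppo_update(policy, value_fn, optimizer, batch, epochs=4, eps=0.2, vf_coef=0.5):
    # batch tensors were collected with pi_old; logp_old and adv stay fixed across epochs
    s, a = batch["s"], batch["a"]
    logp_old, adv, ret = batch["logp_old"], batch["adv"], batch["ret"]  # ret = GAE targets y_t
    for _ in range(epochs):
        logp_new = policy.log_prob(s, a)                       # log pi_theta(a|s)
        ratio = torch.exp(logp_new - logp_old)
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
        policy_loss = -torch.min(ratio * adv, clipped * adv).mean()
        value_loss = ((value_fn(s) - ret) ** 2).mean()         # fit V_phi to the targets
        loss = policy_loss + vf_coef * value_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```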

Soft Actor Critic (SAC)

SAC is an off-policy actor-critic algorithm, similar to DDPG but with the idea of entropy-regularized learning. Recall the definition of entropy:

$$\mathbb{H}(p) = \mathbb{E}_{x \sim p} [- \log p(x)]$$
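For some quick intuition (a tiny NumPy example with made-up distributions): a uniform distribution maximizes entropy, while a peaked, near-deterministic one has low entropy.

```python
import numpy as np

def entropy(p):
    # H(p) = -sum_x p(x) log p(x), in nats; zero-probability entries contribute nothing
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # ~1.386 (= log 4), maximally uncertain
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.168, nearly deterministic
```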

Low-probability events are more “surprising” ($-\log p(x)$ is large), so a distribution that spreads its mass over many unlikely outcomes has high entropy, while a near-deterministic one has low entropy. In entropy-regularized RL, we modify the definition of the value of the policy to account for an entropy bonus (eq 3.148 of Murphy):

$$ J^{\text{SAC}}(\theta) = \mathbb{E}_{s\sim p^{\gamma}_{\pi_\theta},\, a\sim \pi_\theta(\cdot\,|\,s)} \left[ R(s,a) + \alpha\ \mathbb{H}\!\big(\pi_\theta(\cdot\,|\,s)\big) \right]$$

where $\mathbb{H}\!\big(\pi_\theta(\cdot\,|\,s)\big)$ is the entropy bonus and $\alpha$ is a scaling factor known as the temperature. Using this definition, we can then define both $V$ and $Q$ as the following:

$$\begin{aligned} V^{\pi}(s) &= \mathbb{E}_{p_\theta(\tau)} \left[\sum_{t=0}^{\infty} (\gamma^{t}(R(s_t,a_t,s_{t+1}) + \alpha\, \mathbb{H}(\pi(\cdot \vert s_t)))) \right] \\ Q^{\pi}(s,a) &= \mathbb{E}_{p_\theta(\tau)} \left[ \sum_{t=0}^{\infty} \gamma^{t}R(s_t,a_t,s_{t+1}) + \alpha \sum_{t=1}^{\infty} \gamma^{t}\, \mathbb{H}(\pi(\cdot \vert s_t)) \right] \end{aligned}$$

The reason why the entropy bonus only appears from the second timestep onwards in $Q^{\pi}(s,a)$ is that the first action $a$ is already given (by definition of $Q$), so there is no entropy contribution at $t = 0$.

Just like in TD3, we learn 2 Q-networks to allow for stable convergence. The key implementation difference compared to TD3 is that we incorporate entropy regularization and drop target policy smoothing (adding noise to the target action), because SAC is already a stochastic policy algorithm (so additional noise would be redundant).

Similarly to TD3, our Q-function losses look like the following:

$$\begin{aligned} \mathcal{L}(w_1, \mathcal{D}) &= \mathbb{E}_{(s, a, r, s', \text{done}) \sim \mathcal{D}} \left[(Q_{w_1}(s, a) - y) ^2 \right] \\ \mathcal{L}(w_2, \mathcal{D}) &= \mathbb{E}_{(s, a, r, s', \text{done}) \sim \mathcal{D}} \left[(Q_{w_2}(s, a) - y) ^2 \right] \end{aligned}$$

And in our target calculation, we account for entropy regularization, where the next action is sampled fresh from the current policy, $a' \sim \pi_\theta(\cdot \vert s')$:

$$y = r + \gamma (1- \text{done})\left( \min_{i=1,2}Q_{\bar{w}_i}(s', a') - \alpha \log \pi_\theta(a'\vert s')\right)$$

In SAC we sample states from the replay buffer and reparameterize the policy so that $a(s, \epsilon_\text{noise}) = \text{tanh}(\mu_\theta(s) + \sigma_\theta(s) \odot \epsilon_\text{noise}), \ \epsilon_\text{noise} \sim \mathcal{N}(0, I)$. Because the randomness now enters only through $\epsilon_\text{noise}$, gradients can flow through the sampled action, which gives a lower-variance, more stable backpropagation chain. In the original SAC paper, the authors squash the outputs of the policy to be between -1 and 1 (following the definition of $\text{tanh}$).
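Here’s a minimal PyTorch sketch of that reparameterized, tanh-squashed sampling; the correction term applied to the log-probability accounts for the change of variables introduced by the $\tanh$ (a detail from the SAC paper’s appendix), and the function/variable names are my own:

```python
import torch
from torch.distributions import Normal

def sample_squashed_action(mu, log_std, eps=1e-6):
    # a = tanh(mu + sigma * noise), noise ~ N(0, I): differentiable w.r.t. mu and sigma
    std = log_std.exp()
    dist = Normal(mu, std)
    pre_tanh = dist.rsample()                    # reparameterized sample
    action = torch.tanh(pre_tanh)                # squash into [-1, 1]
    # log pi(a|s): Gaussian log-density plus the tanh change-of-variables correction
    log_prob = dist.log_prob(pre_tanh) - torch.log(1.0 - action.pow(2) + eps)
    return action, log_prob.sum(dim=-1)          # sum over action dimensions
```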

We define the SAC training loop as follows:

  1. Initialize critics $Q_{w_1}, Q_{w_2}$, stochastic policy $\pi_\theta$. Also initialize the target Q-network weights $\bar{w}_i \leftarrow w_i$. Define replay buffer $\mathcal{D} = \emptyset$, discount factor $\gamma$, entropy temperature $\alpha$, and target-EMA $\rho$.
  2. while $Q_w, \pi_\theta$ not converged:
    1. for state $s$ in episode $E$:
      1. sample an action $a \sim \pi_\theta(\cdot \vert s)$ and step the environment with it.
      2. append $(s, a, r, s', \text{done})$ to $\mathcal{D}$.
    2. for each minibatch $\mathcal{B}$ sampled from $\mathcal{D}$:
      1. sample $a' \sim \pi_\theta(\cdot \vert s')$ and calculate the target $y = r + \gamma(1-\text{done})(\min_{i=1,2} Q_{\bar w_i}(s', a') - \alpha \log \pi_\theta(a' \vert s'))$
      2. compute the critic loss for $i \in \{1,2 \}$: $\mathcal{L}(w_i, \mathcal{B}) = \frac{1}{|\mathcal{B}|}\sum_{(s,a,r,s',\text{done})\in \mathcal{B}}(Q_{w_i}(s,a) - y)^2$
      3. update the critic parameters for $i \in \{1,2 \}$: $w_i \leftarrow w_i - \eta \nabla_{w_i}\mathcal{L}(w_i, \mathcal{B})$
      4. compute the policy objective $$\mathcal{L}_\pi(\theta, \mathcal{B}) = \frac{1}{|\mathcal{B}|}\sum_{s\in \mathcal{B}}\big(\min_{i=1,2} Q_{w_i}(s, a_\theta(s)) - \alpha \log \pi_\theta(a_\theta(s)\mid s)\big)$$ where $a_\theta(s)$ is the reparameterized action.
      5. update the policy parameters $\theta \leftarrow \theta + \eta\, \nabla_\theta \mathcal{L}_\pi(\theta, \mathcal{B})$.
      6. update the Q targets for $i \in \{1,2 \}$: $\bar{w}_i \leftarrow \rho\,\bar{w}_i + (1-\rho)\,w_i$.
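To make the inner loop concrete, here’s a compact PyTorch sketch of one minibatch update (steps 1-6), assuming `policy(s)` returns a reparameterized action together with its log-probability and that the Q modules take `(s, a)`; all names and hyperparameter defaults are placeholders, not the canonical implementation:

```python
import torch

def sac_update(batch, policy, q1, q2, q1_targ, q2_targ, q_opt, pi_opt,
               gamma=0.99, alpha=0.2, rho=0.995):
    s, a, r, s2, done = batch["s"], batch["a"], batch["r"], batch["s2"], batch["done"]

    # steps 1-3: critic targets and updates
    with torch.no_grad():
        a2, logp_a2 = policy(s2)                               # a' ~ pi_theta(. | s')
        q_min = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        y = r + gamma * (1.0 - done) * (q_min - alpha * logp_a2)
    q_loss = ((q1(s, a) - y) ** 2).mean() + ((q2(s, a) - y) ** 2).mean()
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # steps 4-5: policy update (ascend the entropy-regularized Q, i.e. minimize its negation)
    a_new, logp_new = policy(s)
    pi_objective = (torch.min(q1(s, a_new), q2(s, a_new)) - alpha * logp_new).mean()
    pi_opt.zero_grad()
    (-pi_objective).backward()
    pi_opt.step()

    # step 6: EMA (Polyak) update of the target critics
    with torch.no_grad():
        for targ, src in ((q1_targ, q1), (q2_targ, q2)):
            for p_t, p in zip(targ.parameters(), src.parameters()):
                p_t.mul_(rho).add_((1.0 - rho) * p)
```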

The chance that there is a mistake that I haven’t caught yet is pretty high 😅. If you do find one, whether it’s a typo or a mistake in an explanation, contact me at srianumakonda@cmu.edu!

Citation

If you found this blog post helpful, you can cite it as:

@article{anumakonda2025rlnotes,
  title   = {An Updated Introduction to Reinforcement Learning},
  author  = {Anumakonda, Sri},
  journal = {srianumakonda.github.io},
  year    = {2025},
  month   = {October},
  url     = {https://srianumakonda.github.io/posts/rl_notes/}
}

References

  1. Murphy, K. (2024). Probabilistic Machine Learning: Reinforcement Learning. MIT Press. https://arxiv.org/abs/2412.05265

  2. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html

  3. Achiam, J. (2018). Spinning Up in Deep Reinforcement Learning. OpenAI. https://spinningup.openai.com