note · February 25, 2026

Assignment 2: DQN, Double DQN, DDPG, and SAC

RL

The assignment starts from value-based RL with DQN, moves to continuous control with DDPG, and then extends that to SAC with stochastic policies, twin critics, and entropy regularization.

DQN

DQN is an off-policy RL algorithm that extends Q-learning using deep neural networks. It is designed for environments with discrete action spaces. It was introduced for Atari games in a seminal 2013 paper, and the 2015 follow-up reported human-level performance on many of them. Key innovations relative to naive neural fitted Q iteration include replay buffers (which decorrelate samples) and target networks (which give Q-learning a more stationary target to converge to).

DQN learns a value function: $Q(s,a)$. For each state, it estimates how good each discrete action is, and then chooses the best one: $a^* = \arg\max_a Q(s,a)$


Bellman target

The DQN update is based on the TD target: $q_\text{target} = r + \gamma \max_{a'} Q(s', a')$

Then the Q-network regresses toward that target: $\mathcal{L}_Q = \mathbb{E}\left[(Q(s,a) - q_{\text{target}})^2\right]$

Conceptually, this means:

  • reward now
  • plus discounted estimate of best future value
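As a worked example of the TD target, here is a minimal sketch in plain Python (no PyTorch, so it runs standalone). The function name td_target and the value gamma = 0.99 are assumptions for illustration; the terminal-state handling mirrors the not_dones mask used in the snippets later in this note.

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step TD target: r + gamma * max_a' Q(s', a')."""
    if done:
        # Terminal transition: there is no future value to bootstrap from.
        return reward
    return reward + gamma * max(next_q_values)


# Reward 1.0 now, and the best next action is estimated to be worth 3.5.
target = td_target(reward=1.0, next_q_values=[2.0, 3.5], gamma=0.99)
terminal_target = td_target(reward=1.0, next_q_values=[2.0, 3.5], done=True)
```

Here target works out to 1.0 + 0.99 * 3.5 = 4.465, while the terminal version is just the reward.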

The expectation in the Q-loss is taken over random mini-batches of transitions $(s, a, r, s')$ drawn from the replay buffer. The buffer continuously stores experiences generated by the behavior policy. To ensure adequate exploration, this behavior policy typically uses an $\epsilon$-greedy approach, where the probability of taking a random action ($\epsilon$) gradually decays as training progresses.
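A minimal sketch of an epsilon-greedy behavior policy with decay, in plain Python. The helper names epsilon_greedy and decay are hypothetical; the decay factor 0.985 and the floor 0.05 are taken from the update code shown later in this note.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay(epsilon, rate=0.985, floor=0.05):
    """Multiplicative decay with a floor, so exploration never stops entirely."""
    return max(epsilon * rate, floor)


epsilon = 1.0
for _ in range(500):
    epsilon = decay(epsilon)
```

With these constants the floor is reached after roughly 200 decay steps, so the agent still takes a random action 5% of the time late in training.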


Implementation

In a standard DQN implementation:

  • input: state
  • output: one Q-value per discrete action

That means the network does not take the action as input. The action space in the assignment environment is continuous ($[-3, 3]$), but for DQN it is discretized to two action indices, 0 and 1:

  • action index 0 → environment action -3
  • action index 1 → environment action 3

So the Q-network outputs two values per state, which is why the network shape is:

  • input: (batch_size, n_obs)
  • output: (batch_size, n_actions)

Even though the real environment action is continuous, DQN is only solving a 2-action discrete approximation.
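The index-to-action mapping can be sketched as a small lookup table. The names INDEX_TO_ENV_ACTION, to_env_action, and greedy_env_action are hypothetical and not from the assignment code; only the two endpoint values come from the text above.

```python
# Mapping described in the text: index 0 -> -3, index 1 -> +3.
INDEX_TO_ENV_ACTION = (-3.0, 3.0)


def to_env_action(action_index):
    """Translate the Q-network's discrete choice into the continuous action."""
    return INDEX_TO_ENV_ACTION[action_index]


def greedy_env_action(q_values):
    """Pick the index with the largest Q-value and convert it."""
    best_index = max(range(len(q_values)), key=lambda i: q_values[i])
    return to_env_action(best_index)
```

Keeping the conversion in one place means the replay buffer can store integer indices (which is what gather needs) while the environment only ever sees -3 or 3.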

Example Q-network in DQN:

self.q_net = nn.Sequential(
    nn.Linear(n_obs, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, n_actions)
).to(device)

To extract the Q-value for the action that was actually taken:

all_Q = self.q_net(states)                 # shape: (bsz, n_actions)
Q = all_Q.gather(1, actions).squeeze(1)   # shape: (bsz,)

That gather step is the bridge between “Q-values for all actions” and “the specific action taken in this transition.”
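To make the indexing concrete, here is a plain-list equivalent of the gather call. This is an illustration of the semantics only, not how PyTorch implements it, and gather_taken is a hypothetical name.

```python
def gather_taken(all_q, actions):
    """Plain-list equivalent of all_Q.gather(1, actions).squeeze(1).

    all_q:   one row of Q-values per transition, shape (bsz, n_actions)
    actions: one single-element row per transition, shape (bsz, 1)
    """
    return [row[index[0]] for row, index in zip(all_q, actions)]


all_q = [[0.5, 1.5],   # transition 0
         [2.0, -1.0],  # transition 1
         [0.0, 0.3]]   # transition 2
actions = [[1], [0], [1]]
taken = gather_taken(all_q, actions)  # [1.5, 2.0, 0.3]
```

In the PyTorch version, actions has to be an integer (int64) tensor with the same number of dimensions as all_Q, which is why it has shape (bsz, 1) rather than (bsz,).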


Why target networks help

Without a target network, the model is trained against a target that depends on its own current predictions:

$Q_\theta(s,a) \quad \text{vs} \quad r + \gamma \max_{a'} Q_\theta(s', a')$

The Bellman target changes at the same time as the network trying to fit it, which can cause oscillation or divergence. A target network stabilizes learning by making the TD target move more slowly.

In the implementation, the target network starts as a deep copy of the Q-network, and its parameters are periodically overwritten with the Q-network's current parameters.

def update(self, replay_buffer, i):
    for _ in range(64):
        loss = self.get_q_loss(*replay_buffer.sample())
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
    # Periodic hard update: copy the Q-network weights into the target network
    if i % 16 == 0:
        self.q_target_net.load_state_dict(self.q_net.state_dict())
    # Decay the exploration rate on every call, down to a floor of 0.05
    self.exploration_rate = max(self.exploration_rate * 0.985, 0.05)

Extension: Double DQN

Standard DQN uses: $\max_{a'} Q_{\text{target}}(s', a')$

The problem is that the max over noisy estimates tends to be too optimistic.

Double DQN fixes this by separating action selection (online network) from action evaluation (target network).

So the target becomes: $q_\text{target} = r + \gamma Q_\text{target}(s', \arg\max_{a'} Q_\text{online}(s', a'))$

The online network chooses the best next action, while the target network evaluates that action. An action whose value one network overestimates by chance is less likely to be overestimated by the other as well, which reduces the overestimation bias from the max operator.

with torch.no_grad():
    next_actions = self.q_net(next_states).argmax(dim=-1, keepdim=True)
    next_Q = self.q_target_net(next_states).gather(1, next_actions).squeeze(1)
    q_target = rewards + gamma * next_Q * not_dones
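The effect of decoupling selection from evaluation can be checked with a small simulation. This is an idealized sketch: it assumes every action has true value 0 and that the online and target estimates carry independent Gaussian noise, which is more favorable than real training, where the two networks are correlated. The function name simulate and all its parameters are made up for the illustration.

```python
import random


def simulate(n_actions=10, noise=1.0, trials=20000, seed=0):
    """Average next-state value under DQN-style and Double-DQN-style targets.

    Every action has true value 0, so any nonzero average is pure bias.
    """
    rng = random.Random(seed)
    dqn_total, double_total = 0.0, 0.0
    for _ in range(trials):
        # Two independent noisy estimates of the same (all-zero) Q-values.
        q_online = [rng.gauss(0.0, noise) for _ in range(n_actions)]
        q_target = [rng.gauss(0.0, noise) for _ in range(n_actions)]
        # DQN: the target network both selects and evaluates.
        dqn_total += max(q_target)
        # Double DQN: online network selects, target network evaluates.
        best = max(range(n_actions), key=lambda a: q_online[a])
        double_total += q_target[best]
    return dqn_total / trials, double_total / trials


dqn_bias, double_bias = simulate()
```

Under these assumptions the DQN-style average sits well above zero, while the Double-DQN-style average stays close to zero.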

DDPG

DDPG is an off-policy RL algorithm that extends DQN to continuous action spaces. It builds on the theoretical work on Deterministic Policy Gradients, and it solved many simulated robotics and control tasks in a seminal 2015 publication. Key innovations relative to DQN are (1) a policy network trained to produce deterministic, continuous actions that maximize the Q-function, and (2) soft target updates.

Why DDPG over DQN?

DQN works when the action set is small and discrete, because we can evaluate every allowed action and compute $\arg\max_a Q(s,a)$. That breaks in continuous action spaces.

If the action can be any real number in $[-3, 3]$, then there are infinitely many possible actions. DDPG solves this by introducing an actor network that directly outputs an action: $a = \pi(s)$

So instead of explicitly maximizing over actions, DDPG learns a policy that tries to approximate the maximizing action.

DQN explicitly maximizes over a small discrete action set, while DDPG learns an actor that approximates the maximizing action in a continuous action space.


Theory

DDPG has two networks.

  • Critic: estimates $Q(s,a)$
  • Actor: $\pi(s)$, outputs a continuous action

DDPG Bellman target

In DQN, we have $r + \gamma \max_{a'} Q(s', a')$

In DDPG, we have $r + \gamma Q_{\text{target}}(s', \pi_{\text{target}}(s'))$

Temporal Difference Q-loss: Because the action space is continuous, DDPG cannot cheaply compute $\max_a Q(s,a)$ by enumeration. It keeps the TD target structure, but replaces the max with the action proposed by the target policy, $a' = \pi_{\text{target}}(s')$.

$\mathcal{L}(\theta) = \mathbb{E}\left[\left(Q_\theta(s, a) - (r + \gamma Q_{\theta_\text{target}}(s', a'))\right)^2\right]$

Policy Loss: the actor is trained to choose actions that the critic scores highly. Here the action is the actor's own output, not the action stored in the replay buffer.

$\mathcal{L}(\theta_p) = -\mathbb{E}[Q_\theta(s, \pi_{\theta_p}(s))]$

Why DDPG “drops the max”

DDPG does not actually abandon the goal of maximizing Q. It replaces explicit maximization with a learned actor.

  • DQN: “evaluate all actions, then pick the best”
  • DDPG: “learn a network that directly outputs a high-value action”

So the max is not gone conceptually. It is handled by the actor. This is the main benefit of separating actor and critic:

  • the critic tells us how good a state-action pair is
  • the actor learns to output actions that make the critic happy
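A toy version of "the actor learns to make the critic happy", in plain Python. Everything here is an assumption for illustration: the critic is a fixed quadratic with its peak at action 1.0, the state is ignored, and the actor has a single parameter theta that is also its action. Real DDPG trains both networks at once and gets the gradient through autograd.

```python
def critic(action):
    """Toy critic with a single peak at action = 1.0 (state is ignored)."""
    return -(action - 1.0) ** 2


def critic_grad(action):
    """Analytic dQ/da for the toy critic above."""
    return -2.0 * (action - 1.0)


def train_actor(theta=-2.0, lr=0.1, steps=100, low=-3.0, high=3.0):
    """Gradient ascent on Q(pi(s)) for a one-parameter actor pi(s) = theta."""
    for _ in range(steps):
        # Chain rule: dQ/dtheta = dQ/da * da/dtheta, and da/dtheta = 1 here.
        theta += lr * critic_grad(theta)
        theta = min(max(theta, low), high)  # stay inside the action range
    return theta


# What DQN-style enumeration would need instead: a grid search over actions.
grid = [-3.0 + 0.01 * k for k in range(601)]
best_on_grid = max(grid, key=critic)
learned_action = train_actor()
```

Both approaches land near action 1.0, but the grid needs 601 critic evaluations per state, while the trained actor produces its answer in one forward pass.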

Soft Update

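Instead of the periodic hard update that DQN applies to its Q-target network, DDPG soft-updates its target networks on every update call: $\theta_{\text{target}} \leftarrow \tau \theta + (1 - \tau)\,\theta_{\text{target}}$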

def update(self, replay_buffer, i):
    for _ in range(64):
        loss = self.get_q_loss(*replay_buffer.sample())
        self.q_optimizer.zero_grad()
        loss.backward()
        self.q_optimizer.step()

    for _ in range(4):
        states, _, _, _, _ = replay_buffer.sample()
        loss = self.get_policy_loss(states)
        self.policy_optimizer.zero_grad()
        loss.backward()
        self.policy_optimizer.step()

    tau = 0.1  # Continual soft target update
    for target_param, param in zip(self.q_target_net.parameters(), self.q_net.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

    for target_param, param in zip(self.policy_target_net.parameters(), self.policy.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

    self.exploration_rate = max(self.exploration_rate * 0.985, 0.05)
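To get a feel for how quickly a soft-updated target follows the online network with tau = 0.1, here is a small sketch on plain lists. It assumes the online weights stay fixed, which they do not during real training, so it only shows the tracking speed. The name soft_update is hypothetical.

```python
def soft_update(target, online, tau=0.1):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target, online)]


online = [1.0, -2.0]  # pretend these weights stay fixed
target = [0.0, 0.0]
for _ in range(20):
    target = soft_update(target, online)

# With fixed online weights, the remaining gap shrinks by (1 - tau) per step.
remaining_fraction = (1.0 - 0.1) ** 20
```

After 20 updates about 12% of the original gap remains, so the target lags the online network by a few dozen updates rather than jumping every 16 iterations as in the DQN hard update.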

Implementation

Critic loss

with torch.no_grad():
    next_actions = self.get_target_action(next_states)
    next_sa = torch.cat([next_states, next_actions], dim=-1)
    next_Q = self.q_target_net(next_sa).squeeze(-1)
    q_target = rewards + gamma * next_Q * not_dones

sa = torch.cat([states, actions], dim=-1)
Q = self.q_net(sa).squeeze(-1)

q_loss = ((Q - q_target) ** 2).mean()

Policy loss

actions = self.get_action(states)
sa = torch.cat([states, actions], dim=-1)
Q = self.q_net(sa).squeeze(-1)

policy_loss = -Q.mean()

Replay buffer and data diversity

Using much less rollout data and a tiny replay buffer would usually make training less stable, because updates would rely on fewer and more correlated transitions. A larger replay buffer improves sample diversity and helps off-policy learning be more reliable.
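For reference, a minimal replay buffer sketch in plain Python. The assignment's actual buffer is not shown in this note, so the class below is an assumption: the field order (state, action, reward, next_state, not_done) is guessed from the five values that replay_buffer.sample() returns in the update code, and a real implementation would return tensors rather than lists.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s', not_done) transitions."""

    def __init__(self, capacity, batch_size, seed=None):
        self.storage = deque(maxlen=capacity)  # oldest transitions drop out
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, not_done):
        self.storage.append((state, action, reward, next_state, not_done))

    def sample(self):
        """Uniformly sample a batch (needs at least one stored transition)."""
        size = min(self.batch_size, len(self.storage))
        batch = self.rng.sample(list(self.storage), size)
        states, actions, rewards, next_states, not_dones = zip(*batch)
        return (list(states), list(actions), list(rewards),
                list(next_states), list(not_dones))


buffer = ReplayBuffer(capacity=3, batch_size=2, seed=0)
for t in range(5):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, not_done=1.0)
states, actions, rewards, next_states, not_dones = buffer.sample()
```

With capacity 3, only the three most recent transitions survive, which is the "tiny buffer" failure mode in miniature: every batch is drawn from nearly identical, highly correlated data.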


SAC (soft actor-critic)

SAC is a model-free off-policy RL algorithm that improves on DDPG with better stability and exploration. It was introduced in a seminal 2018 publication, and is often considered a go-to method for continuous control. Its key innovations relative to DDPG are a stochastic policy, twin critics (clipped double Q-learning), and entropy regularization.


Theory

Stochastic policy

DDPG uses a deterministic actor and adds exploration noise externally.

def get_action(self, states, noisy=False):
    actions = self.policy(states)
    if noisy:
      actions += torch.normal(0, self.exploration_rate, size=actions.shape).to(device)
    return actions.clamp(-3, 3)

SAC instead learns a distribution over actions. Rather than outputting a single action, the policy outputs Gaussian parameters (mean and log standard deviation).

So the policy represents: $a \sim \mathcal{N}(\mu(s), \sigma(s)^2)$


Twin critics

Twin critics aim to reduce overestimation bias. Key inspiration: $\mathbb{E}[\max(C_1, C_2)] \ge \max(\mathbb{E}[C_1], \mathbb{E}[C_2])$

If $C_1$ and $C_2$ are noisy estimates, taking the maximum tends to produce a value that is too large on average. This is why max-based methods can become overoptimistic.

SAC mitigates this by learning two critics, $Q_1$ and $Q_2$, and using the smaller one when building the target. This makes the target more conservative and reduces overestimation.
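The inequality can be checked numerically with two noisy estimates of a quantity whose true value is 0. This is a sketch under the assumption of independent, zero-mean Gaussian noise; average_estimates is a hypothetical helper.

```python
import random


def average_estimates(noise=1.0, trials=20000, seed=0):
    """Compare max and min of two noisy estimates whose true value is 0."""
    rng = random.Random(seed)
    max_total, min_total = 0.0, 0.0
    for _ in range(trials):
        c1 = rng.gauss(0.0, noise)
        c2 = rng.gauss(0.0, noise)
        max_total += max(c1, c2)
        min_total += min(c1, c2)
    return max_total / trials, min_total / trials


avg_max, avg_min = average_estimates()
```

The max averages clearly above the true value and the min clearly below it. Taking the min therefore does not remove the bias, it flips its sign toward pessimism, which is the conservative direction described above.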

💡 For the actor update, one critic is enough to provide a policy gradient signal, so the assignment uses only $Q_1$ in the policy loss.


Entropy regularization

SAC does not only want the policy to choose high-value actions. It also wants the policy to remain sufficiently stochastic.

This is encoded using entropy: $H(\pi(s))$

High entropy means the action distribution is broader. Low entropy means it is narrow and nearly deterministic. Entropy helps because it encourages exploration and avoids collapsing too early to a narrow policy.
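For a one-dimensional Gaussian policy the entropy has a closed form that depends only on the standard deviation. The sketch below uses that textbook formula; whether the assignment's get_entropy computes exactly this is an assumption, and clamping actions to $[-3, 3]$ would make the true entropy slightly different.

```python
import math


def gaussian_entropy(std_dev):
    """Differential entropy of a 1-D Gaussian: 0.5 * ln(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std_dev ** 2)


narrow = gaussian_entropy(0.2)  # nearly deterministic policy
broad = gaussian_entropy(2.0)   # broad, exploratory policy
```

The two standard deviations are the clamp bounds used in the implementation section of this note. Multiplying the standard deviation by 10 adds ln(10) to the entropy, and the narrow case is negative, which is normal for differential entropy.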


Mathematical form

Critic target:

$q_{\text{target}} = r + \gamma \left(\min(Q_1,Q_2) + \alpha H(\pi(s'))\right)$

Q-loss:

$\mathcal{L}(\theta) = \mathbb{E}[\{Q_{\theta_1}(s, a) - q_{\text{target}}\}^2] + \mathbb{E}[\{Q_{\theta_2}(s, a) - q_{\text{target}}\}^2]$

Policy loss:

$\mathcal{L}_{policy} = -\mathbb{E}[Q_1(s,a) + \alpha H(\pi(s))]$

So the policy is rewarded for:

  • choosing actions that look valuable
  • keeping enough entropy to continue exploring

$\alpha$ then controls the tradeoff between reward maximization and entropy maximization.
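The critic target, written out as arithmetic for a single transition. This is a sketch: sac_target is a hypothetical name, gamma = 0.99 is an assumed value, and alpha = 0.002 matches the constant used in the implementation snippets of this note.

```python
def sac_target(reward, next_q1, next_q2, next_entropy,
               gamma=0.99, alpha=0.002, not_done=1.0):
    """r + gamma * (min(Q1, Q2) + alpha * H), zeroed out after terminal steps."""
    soft_value = min(next_q1, next_q2) + alpha * next_entropy
    return reward + gamma * soft_value * not_done


target = sac_target(reward=1.0, next_q1=5.0, next_q2=4.0, next_entropy=1.5)
terminal = sac_target(reward=1.0, next_q1=5.0, next_q2=4.0,
                      next_entropy=1.5, not_done=0.0)
```

Here the smaller critic value 4.0 is used, and the entropy bonus contributes only 0.002 * 1.5 = 0.003, so with such a small alpha the target is dominated by the value term.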


Implementation

The policy network's final layer is

nn.Linear(128, 2 * n_actions)

and we split its output into the two Gaussian parameters using:

mean, log_std_dev = self.policy(states).chunk(2, dim=-1)

rsample() and the reparameterization trick:

If we simply sample from a Gaussian, the action looks like an opaque random draw. That makes it hard for gradients to flow back into the policy parameters.

With reparameterization: $a = \mu + \sigma \epsilon,\; \epsilon \sim \mathcal{N}(0,1)$

the randomness is isolated in $\epsilon$, while $\mu$ and $\sigma$ stay inside a differentiable expression. This matters because the policy loss depends on the sampled action through the critic.

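from torch.distributions import Normal

std_dev = log_std_dev.exp().clamp(.2, 2)   # keep the std dev in [0.2, 2]
action = Normal(mean, std_dev).rsample()   # reparameterized, differentiable sample
action = action.clamp(-3, 3)               # clamp to allowed continuous range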
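The reason rsample() lets gradients through can be seen without autograd. In the plain-Python sketch below the noise is drawn once and then held fixed, and finite differences show that the sampled action responds smoothly to both parameters. The names and the specific numbers are illustrative assumptions.

```python
import random


def reparameterized_sample(mean, std_dev, eps):
    """a = mean + std_dev * eps, with all randomness carried by eps."""
    return mean + std_dev * eps


rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)  # drawn once, then treated as a constant
mean, std_dev, h = 0.5, 1.2, 1e-6

# Finite-difference derivatives of the sampled action w.r.t. each parameter.
base = reparameterized_sample(mean, std_dev, eps)
d_mean = (reparameterized_sample(mean + h, std_dev, eps) - base) / h
d_std = (reparameterized_sample(mean, std_dev + h, eps) - base) / h
```

The derivative with respect to the mean is 1 and the derivative with respect to the standard deviation is the noise value itself, which is what autograd propagates back into the policy network.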

Putting it all together

Q-loss:

alpha = 0.002

# the target is a fixed regression label, so build it without tracking gradients
with torch.no_grad():
    next_actions = self.get_target_action(next_states, noisy=True)
    next_sa = torch.cat([next_states, next_actions], dim=-1)

    next_Q1 = self.q1_target_net(next_sa).squeeze(-1)
    next_Q2 = self.q2_target_net(next_sa).squeeze(-1)
    # take the minimum of Q_1 and Q_2 to avoid overestimation
    next_Q = torch.min(next_Q1, next_Q2)

    # entropy term to encourage exploration
    H = self.get_entropy(next_states).squeeze(-1)

    q_target = rewards + gamma * (next_Q + alpha * H) * not_dones

# evaluate the online critics outside no_grad so the loss can backpropagate
sa = torch.cat([states, actions], dim=-1)

Q1 = self.q1_net(sa).squeeze(-1)
Q2 = self.q2_net(sa).squeeze(-1)

# regress both Q1 and Q2 on the target Q
loss = ((Q1 - q_target) ** 2).mean() + ((Q2 - q_target) ** 2).mean()
return loss

Policy loss:

alpha = 0.002

actions = self.get_action(states, noisy=True)
sa = torch.cat([states, actions], dim=-1)
Q1 = self.q1_net(sa).squeeze(-1)
H = self.get_entropy(states).squeeze(-1)

policy_loss = -(Q1 + alpha * H).mean()

Final comparison

DQN

$Q(s) \rightarrow [Q(s,a_1), \dots, Q(s,a_n)]$
$q_{\text{target}} = r + \gamma \max_{a'} Q_{\text{target}}(s', a')$
  • discrete actions
  • one Q-network outputs all action values
  • action chosen with argmax
  • uses target network

Double DQN

$q_{\text{target}} = r + \gamma Q_{\text{target}}\Big(s', \arg\max_{a'} Q_{\text{online}}(s', a')\Big)$

DDPG

$q_{\text{target}} = r + \gamma Q_{\text{target}}(s', \pi_{\text{target}}(s'))$
$\mathcal{L}_{policy} = -\mathbb{E}[Q(s,\pi(s))]$
  • continuous actions
  • critic learns Q(s,a)Q(s,a)
  • actor outputs continuous action π(s)\pi(s)
  • replaces explicit max with actor
  • uses replay buffer and target networks

SAC

$q_{\text{target}} = r + \gamma \left(\min(Q_1,Q_2) + \alpha H(\pi(s'))\right)$
$\mathcal{L}_{policy} = -\mathbb{E}[Q_1(s,a) + \alpha H(\pi(s))]$
  • continuous actions
  • stochastic actor
  • twin critics
  • entropy regularization
  • more robust exploration and less overestimation