Assignment 2: DQN, Double DQN, DDPG, and SAC
The assignment starts from value-based RL with DQN, moves to continuous control with DDPG, and then extends that to SAC with stochastic policies, twin critics, and entropy regularization.
DQN
DQN is an off-policy RL algorithm that extends Q-learning using deep neural networks. It is designed for environments with discrete action spaces. It was introduced for Atari games in a seminal 2013 paper, and a 2015 follow-up reported human-level performance on many of them. Key innovations relative to naive neural fitted Q iteration include replay buffers (which decorrelate samples) and target networks (which give Q-learning a near-stationary target to converge to).
DQN learns an action-value function $Q_\theta(s, a)$. For each state, it estimates how good each discrete action is, and then chooses the best one:
$$a^* = \arg\max_a Q_\theta(s, a)$$
Bellman target
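The DQN update is based on the TD target:
$$y = r + \gamma (1 - d) \max_{a'} Q_{\theta^-}(s', a')$$
where $d$ is 1 if the transition ended the episode and 0 otherwise, and $\theta^-$ denotes the target-network parameters.
Then the Q-network regresses toward that target:
$$L(\theta) = \mathbb{E}_{(s, a, r, s', d)}\left[\left(Q_\theta(s, a) - y\right)^2\right]$$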
Conceptually, this means:
- reward now
- plus discounted estimate of best future value
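As a quick arithmetic check of the two bullets above, here is a minimal sketch of the TD target for a single transition. The function name `td_target` and all numbers are illustrative assumptions, not taken from the assignment code.

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """TD target: reward now plus the discounted best next-state value.

    next_q_values holds the target network's estimates for every
    discrete action in the next state (hypothetical numbers below).
    """
    if done:
        # No future value once the episode has ended
        return reward
    return reward + gamma * max(next_q_values)


# One transition with reward 1.0 and two next-state action values
y = td_target(1.0, [0.5, 2.0], gamma=0.9)
y_terminal = td_target(1.0, [0.5, 2.0], gamma=0.9, done=True)
```

Here `y` is 1.0 + 0.9 × 2.0 = 2.8, while the terminal transition keeps only the immediate reward.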
The expectation in the Q-network's regression loss is taken over random mini-batches of transitions drawn from the replay buffer. The buffer continuously stores experiences generated by the behavior policy. To ensure adequate exploration, this behavior policy typically uses an $\epsilon$-greedy approach, where the probability of taking a random action ($\epsilon$) gradually decays as training progresses.
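The exploration rule described above can be sketched in a few lines. This is a minimal standard-library illustration; the function name `epsilon_greedy` and the Q-values are assumptions made for the example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy action: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.1, 0.7]  # hypothetical Q-values for two actions
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)

# With epsilon = 0.5, half of the picks are random, so the
# non-greedy action 0 should appear about a quarter of the time.
picks = [epsilon_greedy(q_values, epsilon=0.5, rng=rng) for _ in range(10_000)]
fraction_action_0 = picks.count(0) / len(picks)
```

Note that a random pick can still land on the greedy action, which is why action 0 shows up with probability $\epsilon / 2$ rather than $\epsilon$ in the two-action case.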
Implementation
In a standard DQN implementation:
- input: state
- output: one Q-value per discrete action
That means the network does not take the action as input. The action space in the assignment environment is continuous ($[-3, 3]$), but for DQN it is discretized into two action indices, 0 and 1:
- action index 0 → environment action -3
- action index 1 → environment action 3
So the Q-network outputs two values per state, which is why the network shape is:
- input: (batch_size, n_obs)
- output: (batch_size, n_actions)
Even though the real environment action is continuous, DQN is only solving a 2-action discrete approximation.
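One way to picture the discretization is as a lookup table from action index to environment action. The names `ACTION_VALUES`, `to_env_action`, and `greedy_env_action` are hypothetical; the assignment may implement the mapping differently.

```python
# Hypothetical lookup table: discrete action index -> continuous environment action
ACTION_VALUES = [-3.0, 3.0]


def to_env_action(action_index):
    """Translate the Q-network's action index into the value the environment expects."""
    return ACTION_VALUES[action_index]


def greedy_env_action(q_values):
    """Choose the best discrete action, then map it into the continuous range."""
    best_index = max(range(len(q_values)), key=lambda a: q_values[a])
    return to_env_action(best_index)
```

For example, `greedy_env_action([0.2, 0.9])` selects index 1 and therefore sends 3.0 to the environment. Every action strictly between -3 and 3 is unreachable under this scheme.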
Example Q-network in DQN:
```python
self.q_net = nn.Sequential(
    nn.Linear(n_obs, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, n_actions),  # one Q-value per discrete action
).to(device)
```
To extract the Q-value for the action that was actually taken:
```python
all_Q = self.q_net(states)               # shape: (bsz, n_actions)
# actions must hold integer indices with shape (bsz, 1)
Q = all_Q.gather(1, actions).squeeze(1)  # shape: (bsz,)
```
That gather step is the bridge between “Q-values for all actions” and “the specific action taken in this transition.”
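To make the gather step concrete, here is a small standalone sketch with made-up numbers. It assumes PyTorch is installed and that actions are stored as integer indices of shape (bsz, 1).

```python
import torch

# Q-values for a batch of 3 states and 2 actions (made-up numbers)
all_Q = torch.tensor([[1.0, 2.0],
                      [3.0, 4.0],
                      [5.0, 6.0]])
# Integer action indices, shape (bsz, 1), as gather along dim 1 requires
actions = torch.tensor([[1], [0], [1]])

# Row 0 takes column 1, row 1 takes column 0, row 2 takes column 1
Q = all_Q.gather(1, actions).squeeze(1)
```

The result has shape (3,) and contains 2.0, 3.0, and 6.0: one Q-value per transition, matching the action actually taken.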
Why target networks help
Without a target network, the model is trained against a target that depends on its own current predictions:
$$y = r + \gamma (1 - d) \max_{a'} Q_\theta(s', a')$$
The Bellman target changes at the same time as the network trying to fit it, which can cause oscillation or divergence. A target network stabilizes learning by making the TD target move more slowly.
In the implementation, the target network starts as a deep copy of the Q-network, and its parameters are periodically overwritten with the Q-network's current parameters.
```python
def update(self, replay_buffer, i):
    # Several gradient steps on the Q-network per call
    for _ in range(64):
        loss = self.get_q_loss(*replay_buffer.sample())
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
    # Periodic hard update: copy the Q-network weights into the target network
    if i % 16 == 0:
        self.q_target_net.load_state_dict(self.q_net.state_dict())
    # Decay the exploration rate, with a floor of 0.05
    self.exploration_rate = max(self.exploration_rate * 0.985, 0.05)
```
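To get a feel for the decay schedule, the sketch below counts how many `update` calls it takes for the exploration rate to reach its floor. The starting value of 1.0 is an assumption; the source does not state the initial rate.

```python
def steps_until_floor(start=1.0, decay=0.985, floor=0.05):
    """Count update calls until the exploration rate first hits the floor."""
    rate, steps = start, 0
    while rate > floor:
        rate = max(rate * decay, floor)
        steps += 1
    return steps


n_steps = steps_until_floor()  # assumed start of 1.0
```

Under that assumption the rate reaches 0.05 after roughly 200 calls, so exploration is mostly random early on and mostly greedy for the rest of training.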
Extension: Double DQN
Standard DQN uses:
$$y = r + \gamma (1 - d) \max_{a'} Q_{\theta^-}(s', a')$$
The problem is that the max over noisy estimates tends to be too optimistic.
Double DQN fixes this by separating **action selection** (online network) from **action evaluation** (target network).
So the target becomes:
$$y = r + \gamma (1 - d)\, Q_{\theta^-}\left(s', \arg\max_{a'} Q_\theta(s', a')\right)$$
The online network chooses the best next action, while the target network evaluates that action. This reduces the overestimation bias introduced by the max operator, which is why Double DQN tends to improve performance.
```python
with torch.no_grad():
    # Online network selects the next action...
    next_actions = self.q_net(next_states).argmax(dim=-1, keepdim=True)
    # ...and the target network evaluates it
    next_Q = self.q_target_net(next_states).gather(1, next_actions).squeeze(1)
    q_target = rewards + gamma * next_Q * not_dones
```
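The overestimation effect can be seen in a small Monte Carlo sketch. It assumes every true Q-value is 0 and that the two estimates have fully independent noise, which is an idealization: in practice the online and target networks are correlated, so the real reduction in bias is smaller.

```python
import random


def simulate_bias(n_actions=5, noise_std=1.0, trials=20_000, seed=0):
    """Compare max-based and double-estimator targets when every true Q-value is 0."""
    rng = random.Random(seed)
    single_total, double_total = 0.0, 0.0
    for _ in range(trials):
        # Two independent noisy estimates of the same (all-zero) Q-values
        q_a = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        q_b = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        # Standard DQN style: select and evaluate with the same estimate
        single_total += max(q_a)
        # Double DQN style: select with one estimate, evaluate with the other
        best = max(range(n_actions), key=lambda a: q_a[a])
        double_total += q_b[best]
    return single_total / trials, double_total / trials


single_bias, double_bias = simulate_bias()
```

The first average comes out clearly positive even though the true value is 0, while the second stays close to 0, because the noise that drives the selection is independent of the noise in the evaluation.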
DDPG
DDPG is an off-policy RL algorithm that extends DQN to continuous action spaces. It builds on a theoretical publication called Deterministic Policy Gradients. It solved many simulated robotics tasks in a seminal 2015 publication. Key innovations relative to DQN are (1) a policy network which is trained to produce deterministic, continuous actions that maximize the Q function, and (2) soft target updates.
Why DDPG over DQN?
DQN works when the action set is small and discrete, because we can evaluate every allowed action and compute $\max_a Q(s, a)$. That breaks in continuous action spaces.
If the action can be any real number in $[-3, 3]$, then there are infinitely many possible actions. DDPG solves this by introducing an actor network that directly outputs an action:
$$a = \mu_\phi(s)$$
So instead of explicitly maximizing over actions, DDPG learns a policy that tries to approximate the maximizing action.
DQN explicitly maximizes over a small discrete action set, while DDPG learns an actor that approximates the maximizing action in a continuous action space.
Theory
DDPG has two networks.
- Critic: estimates $Q_\theta(s, a)$
- Actor: $\mu_\phi(s)$, outputs a continuous action
DDPG Bellman target
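In DQN, we have
$$y = r + \gamma (1 - d) \max_{a'} Q_{\theta^-}(s', a')$$
In DDPG, we have
$$y = r + \gamma (1 - d)\, Q_{\theta^-}\left(s', \mu_{\phi^-}(s')\right)$$
Temporal Difference Q-loss: Because the action space is continuous, DDPG cannot cheaply compute $\max_{a'} Q(s', a')$ by enumeration. It keeps the TD target structure, but replaces the max with the action proposed by the target policy.
$$L_Q(\theta) = \mathbb{E}\left[\left(Q_\theta(s, a) - y\right)^2\right]$$
Policy Loss: the actor is trained to choose actions that the critic scores highly.
$$L_\pi(\phi) = -\mathbb{E}_s\left[Q_\theta\left(s, \mu_\phi(s)\right)\right]$$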
Why DDPG “drops the max”
DDPG does not actually abandon the goal of maximizing Q. It replaces explicit maximization with a learned actor.
- DQN: “evaluate all actions, then pick the best”
- DDPG: “learn a network that directly outputs a high-value action”
So the max is not gone conceptually. It is handled by the actor. This is the main benefit of separating actor and critic:
- the critic tells us how good a state-action pair is
- the actor learns to output actions that make the critic happy
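The actor-critic split can be illustrated with a toy problem in which the critic is known in closed form. Everything here is an assumption made for illustration: a one-parameter linear actor $a = w s$ and a fixed critic whose best action is $2s$. In DDPG itself the critic is learned and both parts are neural networks.

```python
def critic(state, action):
    """Toy critic (assumed): the best action for any state is 2 * state."""
    return -(action - 2.0 * state) ** 2


def train_actor(states, lr=0.05, epochs=200):
    """Gradient ascent on Q(s, actor(s)) for a one-parameter actor a = w * s."""
    w = 0.0
    for _ in range(epochs):
        for s in states:
            action = w * s
            # Chain rule: dQ/dw = dQ/da * da/dw, with da/dw = s
            dq_da = -2.0 * (action - 2.0 * s)
            w += lr * dq_da * s
    return w


w = train_actor([-1.0, 0.5, 1.0])
```

The actor never enumerates actions. It only follows the critic's gradient with respect to the action, and `w` converges to 2, the slope that maximizes the toy critic for every state.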
Soft Update
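Instead of the periodic hard update that DQN applies to its Q-target network, DDPG soft-updates its target networks, moving them a small step toward the online networks:
$$\theta^- \leftarrow \tau \theta + (1 - \tau)\, \theta^-$$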
```python
def update(self, replay_buffer, i):
    # Critic updates
    for _ in range(64):
        loss = self.get_q_loss(*replay_buffer.sample())
        self.q_optimizer.zero_grad()
        loss.backward()
        self.q_optimizer.step()
    # Actor updates (only the states are needed)
    for _ in range(4):
        states, _, _, _, _ = replay_buffer.sample()
        loss = self.get_policy_loss(states)
        self.policy_optimizer.zero_grad()
        loss.backward()
        self.policy_optimizer.step()
    tau = 0.1  # Continual soft target update
    for target_param, param in zip(self.q_target_net.parameters(), self.q_net.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
    for target_param, param in zip(self.policy_target_net.parameters(), self.policy.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
    # Decay the exploration noise scale, with a floor of 0.05
    self.exploration_rate = max(self.exploration_rate * 0.985, 0.05)
```
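To see how quickly a soft update with `tau = 0.1` tracks the online network, the sketch below applies the same rule to plain Python lists. The helper `soft_update` and the weights are illustrative assumptions, and the online weights are held fixed, which is not the case during real training.

```python
def soft_update(target, online, tau=0.1):
    """One soft-update step: move each target weight a fraction tau toward the online weight."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]


online = [1.0, -2.0]  # pretend online weights, held fixed for illustration
target = [0.0, 0.0]
for _ in range(10):
    target = soft_update(target, online)

# The remaining gap after k steps is (1 - tau) ** k of the initial gap
remaining_fraction = (1.0 - 0.1) ** 10
```

After 10 steps about 35% of the initial gap remains, so the target lags the online network by a few dozen updates rather than jumping every 16 calls as in the hard-update scheme.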
Implementation
Critic loss
```python
with torch.no_grad():
    # Target actor proposes the next action; target critic evaluates it
    next_actions = self.get_target_action(next_states)
    next_sa = torch.cat([next_states, next_actions], dim=-1)
    next_Q = self.q_target_net(next_sa).squeeze(-1)
    q_target = rewards + gamma * next_Q * not_dones
# The critic takes the concatenated (state, action) pair as input
sa = torch.cat([states, actions], dim=-1)
Q = self.q_net(sa).squeeze(-1)
q_loss = ((Q - q_target) ** 2).mean()
```
Policy loss
```python
actions = self.get_action(states)
sa = torch.cat([states, actions], dim=-1)
Q = self.q_net(sa).squeeze(-1)
# Minimizing -Q is gradient ascent on the critic's score
policy_loss = -Q.mean()
```
Replay buffer and data diversity
Using much less rollout data and a tiny replay buffer would usually make training less stable, because updates would rely on fewer and more correlated transitions. A larger replay buffer improves sample diversity and helps off-policy learning be more reliable.
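For reference, a replay buffer with a fixed capacity can be sketched with the standard library alone. This is not the assignment's buffer: the class below, its `add` and `sample(batch_size)` methods, and the tuple layout are assumptions, and the assignment's `replay_buffer.sample()` evidently takes no arguments and returns batched tensors.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.storage = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks up the temporal ordering of the rollout
        batch = self.rng.sample(list(self.storage), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=100, seed=0)
for t in range(250):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
states, actions, rewards, next_states, dones = buffer.sample(batch_size=8)
```

With a capacity of 100, only the most recent 100 of the 250 transitions survive. A tiny capacity therefore means every mini-batch is drawn from a narrow, highly correlated slice of recent experience.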
SAC (soft actor-critic)
SAC is a model-free off-policy RL algorithm that improves on DDPG with better stability and exploration. It was introduced in a seminal publication in 2018, and is often considered the go-to method for continuous control. Its key innovations relative to DDPG are a stochastic policy, double Q-learning with twin critics, and entropy regularization.
Theory
Stochastic policy
DDPG uses a deterministic actor and adds exploration noise externally.
```python
def get_action(self, states, noisy=False):
    actions = self.policy(states)
    if noisy:
        # Gaussian exploration noise, added outside the policy network
        actions += torch.normal(0, self.exploration_rate, size=actions.shape).to(device)
    return actions.clamp(-3, 3)
```
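SAC instead learns a distribution over actions:
$$a \sim \pi_\phi(\cdot \mid s)$$
Instead of outputting just one action, the policy outputs Gaussian parameters (mean, log standard deviation).
So the policy represents:
$$\pi_\phi(a \mid s) = \mathcal{N}\left(\mu_\phi(s), \sigma_\phi(s)^2\right)$$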
Twin critics
Twin critics aim to reduce overestimation bias. Key inspiration:
if $Q(s, a_1)$ and $Q(s, a_2)$ are noisy estimates, taking the maximum tends to produce a value that is too large on average. This is why max-based methods can become overoptimistic.
SAC mitigates this by learning two critics, $Q_{\theta_1}$ and $Q_{\theta_2}$, and using the smaller one when building the target. This makes the target more conservative and reduces overestimation.
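The effect of taking a max or a min over two noisy estimates can be checked numerically. The sketch assumes a single state-action pair with true value 1.0 and two critics whose errors are independent standard Gaussians, which is a simplification of how real critics behave.

```python
import random

rng = random.Random(0)
true_q = 1.0  # assumed true value of the state-action pair
trials = 20_000

max_total, min_total = 0.0, 0.0
for _ in range(trials):
    # Two critics with independent, zero-mean estimation noise
    q1 = true_q + rng.gauss(0.0, 1.0)
    q2 = true_q + rng.gauss(0.0, 1.0)
    max_total += max(q1, q2)
    min_total += min(q1, q2)

mean_of_max = max_total / trials  # lands above true_q
mean_of_min = min_total / trials  # lands below true_q
```

The min is biased downward by about as much as the max is biased upward. SAC accepts that pessimism deliberately, since underestimates are less likely than overestimates to be amplified through bootstrapping.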
💡 For the actor update, one critic is enough to provide a policy gradient signal, so the assignment uses only $Q_{\theta_1}$ in the policy loss.
Entropy regularization
SAC does not only want the policy to choose high-value actions. It also wants the policy to remain sufficiently stochastic.
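This is encoded using entropy:
$$\mathcal{H}\left(\pi_\phi(\cdot \mid s)\right) = -\mathbb{E}_{a \sim \pi_\phi}\left[\log \pi_\phi(a \mid s)\right]$$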
High entropy means the action distribution is broader. Low entropy means it is narrow and nearly deterministic. Entropy helps because it encourages exploration and avoids collapsing too early to a narrow policy.
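For a Gaussian policy the entropy has a closed form that depends only on the standard deviation. The sketch below evaluates it for a one-dimensional action. The helper name is an assumption, and the values 0.2 and 2.0 are chosen to match the standard-deviation bounds used in this document's sampling code.

```python
import math


def gaussian_entropy(std_dev):
    """Differential entropy of a 1-D Gaussian: 0.5 * log(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std_dev ** 2)


narrow = gaussian_entropy(0.2)  # nearly deterministic policy
broad = gaussian_entropy(2.0)   # exploratory policy
```

Entropy grows with the log of the standard deviation, so `broad` is larger than `narrow`. Differential entropy can be negative for small standard deviations, which is expected and does not affect its use as a regularizer.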
Mathematical form
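Critic target:
$$y = r + \gamma (1 - d)\left(\min_{i=1,2} Q_{\theta_i^-}(s', a') + \alpha\, \mathcal{H}\left(\pi_\phi(\cdot \mid s')\right)\right)$$
where $a'$ is sampled from the policy at $s'$.
Q-loss:
$$L_Q = \mathbb{E}\left[\left(Q_{\theta_1}(s, a) - y\right)^2\right] + \mathbb{E}\left[\left(Q_{\theta_2}(s, a) - y\right)^2\right]$$
Policy loss:
$$L_\pi(\phi) = -\mathbb{E}_{s,\, a \sim \pi_\phi}\left[Q_{\theta_1}(s, a) + \alpha\, \mathcal{H}\left(\pi_\phi(\cdot \mid s)\right)\right]$$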
So the policy is rewarded for:
- choosing actions that look valuable
- keeping enough entropy to continue exploring
The temperature $\alpha$ then controls the tradeoff between reward maximization and entropy maximization.
Implementation
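The policy network's final layer outputs two values per action dimension:
```python
nn.Linear(128, 2 * n_actions)
```
and we split that output into a mean and a log standard deviation using:
```python
mean, log_std_dev = self.policy(states).chunk(2, dim=-1)
```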
rsample() and the reparameterization trick:
If we simply sample from a Gaussian, the action looks like an opaque random draw. That makes it hard for gradients to flow back into the policy parameters.
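With reparameterization:
$$a = \mu_\phi(s) + \sigma_\phi(s) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
the randomness is isolated in $\epsilon$, while $\mu_\phi$ and $\sigma_\phi$ stay inside a differentiable expression. This matters because the policy loss depends on the sampled action through the critic.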
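```python
from torch.distributions import Normal

std_dev = log_std_dev.exp().clamp(.2, 2)   # keep the std in a sane range
action = Normal(mean, std_dev).rsample()   # differentiable sample
action = action.clamp(-3, 3)  # clamp to allowed continuous range
```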
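The difference between `sample()` and `rsample()` can be checked directly. The sketch assumes PyTorch is installed, uses made-up parameters, and substitutes a simple quadratic for the critic.

```python
import torch
from torch.distributions import Normal

mean = torch.tensor([0.5], requires_grad=True)
log_std_dev = torch.tensor([0.0], requires_grad=True)
std_dev = log_std_dev.exp()
dist = Normal(mean, std_dev)

plain = dist.sample()     # drawn without gradient tracking: detached from mean and std_dev
reparam = dist.rsample()  # computed as mean + std_dev * eps, so gradients flow

# Stand-in for the critic: any differentiable function of the action
(-(reparam - 1.0) ** 2).sum().backward()
```

After `backward()`, both `mean.grad` and `log_std_dev.grad` are populated. The same call on `plain` would fail, because that tensor carries no graph back to the policy parameters.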
Putting it all together
Q-loss:
```python
alpha = 0.002
with torch.no_grad():
    next_actions = self.get_target_action(next_states, noisy=True)
    next_sa = torch.cat([next_states, next_actions], dim=-1)
    next_Q1 = self.q1_target_net(next_sa).squeeze(-1)
    next_Q2 = self.q2_target_net(next_sa).squeeze(-1)
    # take the minimum of Q_1 and Q_2 to avoid overestimation
    next_Q = torch.min(next_Q1, next_Q2)
    # entropy term to encourage exploration
    H = self.get_entropy(next_states).squeeze(-1)
    q_target = rewards + gamma * (next_Q + alpha * H) * not_dones
sa = torch.cat([states, actions], dim=-1)
Q1 = self.q1_net(sa).squeeze(-1)
Q2 = self.q2_net(sa).squeeze(-1)
# regress both Q1 and Q2 on the target Q
loss = ((Q1 - q_target) ** 2).mean() + ((Q2 - q_target) ** 2).mean()
return loss
```
Policy loss:
```python
alpha = 0.002
actions = self.get_action(states, noisy=True)
sa = torch.cat([states, actions], dim=-1)
# only the first critic is used for the actor update
Q1 = self.q1_net(sa).squeeze(-1)
H = self.get_entropy(states).squeeze(-1)
policy_loss = -(Q1 + alpha * H).mean()
```
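The losses above call `self.get_entropy`, which is not shown in this document. A minimal sketch of what such a helper could compute is given below as a standalone function. The name `gaussian_policy_entropy`, its arguments, and the (bsz, 1) output shape are assumptions chosen to fit the `.squeeze(-1)` calls, and the sketch ignores the effect of clamping actions to $[-3, 3]$.

```python
import torch
from torch.distributions import Normal


def gaussian_policy_entropy(mean, log_std_dev):
    """Entropy of a diagonal Gaussian policy, one value per state.

    Returns shape (bsz, 1) so that a later .squeeze(-1) yields (bsz,).
    """
    std_dev = log_std_dev.exp().clamp(0.2, 2)
    # Independent action dimensions: total entropy is the sum over dimensions
    return Normal(mean, std_dev).entropy().sum(dim=-1, keepdim=True)


mean = torch.zeros(4, 1)
log_std_dev = torch.zeros(4, 1)  # std_dev = 1 for every state
H = gaussian_policy_entropy(mean, log_std_dev)
```

Because the entropy is a differentiable function of `log_std_dev`, the $\alpha \mathcal{H}$ term in the policy loss pushes the standard deviation up, counteracting the critic's pull toward a narrow, greedy policy.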
Final comparison
DQN
- discrete actions
- one Q-network outputs all action values
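- action chosen with argmax
- uses target network
Double DQN
- same setup as DQN
- online network selects the next action, target network evaluates it
- reduces overestimation bias from the max operator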
DDPG
- continuous actions
- critic learns $Q(s, a)$
- actor outputs continuous action
- replaces explicit max with actor
- uses replay buffer and target networks
SAC
- continuous actions
- stochastic actor
- twin critics
- entropy regularization
- more robust exploration and less overestimation