DQN 學習筆記

這是論文要研究ㄉ主題，想說都寫好了那就先放上來好了

Basic Components

Env, Actor, Reward Funciton, Episode & Trajectory
Trajectory: \(\tau = \{s_i, a_i,\dots\}\) in 1 episode
可以計算特定 (action, state) pair 出現的機率\(P_\theta(\tau) = p(s_1)p_\theta(a_1\|s_1)p(s_2\|s_1, a_1)p_\theta(a_2\|s_2)p(s_3\|s_2, a_2)\dots\\=P(s_1)\prod_{t=1}^{T}p_\theta(a_t\|s_t)p(s_{t+1}\|s_t, a_t)\)
更新 \(\theta\) 找到最大的 reward: \(\max R(\tau)=\sum_t^Tr_t\)
- Env & Reward 為隨機變數，因此只能算期望值： \(\bar{R}_\theta = \sum R(\tau)p_\theta(\tau) = E_{\tau\sim p_\theta(\tau)}[R(\tau)]\)

Gradient Ascent: Maximize expected reward

目標：如果某個 (state, action) pair 能讓 trajectory reward 為正，則增加這項出現的機率 \(\nabla \bar{R_\theta}=\frac{1}{N}\sum_n^N\sum_t^{T_n} R(\tau^n)\nabla\log p_\theta(a_t^n\|s_t^n)\)
需要不斷從 trajectory set sample data
Tips for implementation
- 將 state(input) & action(output) 視為分類問題
  - 一般分類問題 objective function：min cross entropy = max log likelihood
  - 轉換成 RL 時要多乘上 reward \(\frac{1}{N}\sum_n^N\sum_t^{T_n} \nabla\log p_\theta(a_t^n\|s_t^n)\Rightarrow\frac{1}{N}\sum_n^N\sum_t^{T_n} R(\tau^n)\nabla\log p_\theta(a_t^n\|s_t^n)\)
- Add a baseline
  - 情境：當遊戲規則裡不會得到負分時；且有些 action 不會被 sample 到，導致 agent 低估這個action 能帶來的 reward
  - 解法：讓 reward 不要總是正的 \(\rightarrow\) Reward - baseline(ex. \(\bar{\tau^n}\))
- Assign Suitable Credit
  - 問題：就算某個 pair 的表現是不好的，但在總 reward 為正的情況下，還是會被提升出現的機率
  - 解法：只計算執行此 action 後的 reward
  - 新的更新方式： \(\frac{1}{N}\sum_n^N\sum_t^{T_n} (R(\tau^n)-b)\nabla\log p_\theta(a_t^n\|s_t^n)\)
    - b 取決於 state
    - \((R(\tau^n)-b)\) 為 advantage function \(A^\theta(s_t, a_t)\)

Proximal Policy Optimization, PPO

Off-Policy
- On-policy vs. Off-policy
  - On: agent 和環境互動＋學習
  - Off: 學習的 agent \(\neq\) 互動的 agent，aka agent 在旁邊看別人玩（並學）
- On-policy 的問題：更新 model 後需要重新 sample data
- Off-policy 如何解決：sample \(\pi_{\theta'}\) 來訓練 \(\pi_\theta \rightarrow\) reuse sampled data
  
  Importance Sampling
  無法直接從 p(x) 抽樣的情況下，可藉由另一個分部 q(x) 抽樣且 sample q 次數夠多則可越趨近於 p(x) \(E_{x\sim p}[f(x)]=E_{x\sim q}[f(x)\frac{p(x)}{q(x)}]\) \(\frac{p(x)}{q(x)}\)為修正從 q 抽樣的調整項
- Off-policy 做 gradient ascent 時的調整 \(\nabla \bar{R_\theta}=E_{\tau\sim p_\theta(\tau)}[R(\tau)\nabla\log p_\theta(\tau)]\rightarrow E_{\tau\sim p_{\theta'}(\tau)}[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}R(\tau)\nabla\log p_\theta(\tau)]\)
- New Objective Function \(J^{\theta'}(\theta)=E_{(s_t,a_t)\sim \pi_{\theta'}(\tau)}[\frac{p_\theta(a_t\|s_t)}{p_{\theta'}(a_t\|s_t)}A^{\theta'}(s_t, a_t)]\)
  - \(\theta'\): demo
  - \(\theta\): update
PPO: 限制 \(\pi\) 與 \(\pi'\) 間的相似度
- \[J_{PPO}^{\theta'}=J^{\theta'}(\theta)-\beta\cdot KL(\theta, \theta')\]
PPO2:

Q Learning(Value-based)

Learn Critic: 評價 actor \(\pi\) 做得多好（而不是 state）

State value function \(V^\pi(s)\)
- 定義：given s，到結束的累積期望 reward
- Estimate \(V^\pi(s)\)
  - Monte-Carlo based
    - 看到 state a 回傳 cumulated gain a
    - 不可能掃過所有 state \(\rightarrow V^\pi(s)\) 視為 regression
  - Temporal-difference appoarch \(\because s_t, a_t, r_t \rightarrow s_{t+1}, a_{t+1}, r_{t+1}\)
  - MC vs. TD: 不同手段估出來的結果可能不同
    - MC: larger variance，因為每次得到的 state & gain 有抽樣的隨機性，且是計算累積 reward
    - TD: smaller variance，因為每次只計算一個 r，但 \(V^\pi\) 可能估的不準
State-action value function \(Q^\pi(s, a)\)
- 定義：在 state s 強制採取 action a，後續採用 \(\pi\) 的累積期望 reward （agent 不一定會採用 a）
- 使用 Q function 更新 \(\pi\) 的流程
Tips:
- Target Network
- Exploration
  - 原因：Q function 有點類似 regression（無隨機性），不適合用於 sample data
  - 解法：
    - Epsilon Greedy:
    - Boltzmann Exploration
- Replay Buffer
  - 減少和環境互動的次數
  - 可提升 batch data diversity
  - Off-policy: 因為過去的 exp 不一定來自 \(\pi\)
Advanced Tips
- Double DQN
  - 問題：DQN 會高估 Q value
    - 估計誤差 \(\rightarrow\) 選到高估的 action \(\rightarrow\) target值會被設太高
  - 解法：選 action 的 Q function \(\neq\) 算 Q value 的 Q function
- Dueling DQN
  - 改變 network 架構，\(Q(s, a)\rightarrow Q(s, a)=V(s)+A(s, a)\)
  - 提升估計 Q value 的效率，不用每個 pair 都被 sample
  - \(\because\) 更新 \(V(s)\) 比 sample 有效率
  - A 的 column 合為 0，確保更新 V
- Priority Reply
  - 從 buffer 裡選擇 TD error 較大的 experience set
  - 改變 sample data 的 distribution & training process