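PPO is motivated by the same question as TRPO: how can we take the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause a collapse in performance? Where TRPO tries to solve this problem with a complex second-order method, PPO is a family of first-order methods that use a few other tricks to keep new policies close to old ones. PPO methods are much simpler to implement, and empirically they seem to perform at least as well as TRPO.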

There are two primary variants of PPO: PPO-Penalty and PPO-Clip.

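PPO-Penalty approximately solves a KL-constrained update like TRPO, but penalizes the KL divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient over the course of training so that it is scaled appropriately.
PPO-Clip has no KL-divergence term in the objective and no constraint at all. Instead, it relies on specialized clipping in the objective function to remove incentives for the new policy to move far away from the old policy.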
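To make the "automatically adjusted penalty coefficient" of PPO-Penalty concrete, here is a minimal sketch of one common adjustment rule: double or halve the coefficient when the measured KL leaves a band around the target. The function name `adapt_kl_penalty`, the band width `tolerance=1.5`, and the scaling `factor=2.0` are illustrative assumptions, not something specified in this article.

```python
def adapt_kl_penalty(beta, measured_kl, target_kl, tolerance=1.5, factor=2.0):
    """Return an updated KL penalty coefficient (hypothetical helper).

    beta        -- current penalty coefficient
    measured_kl -- mean KL between old and new policy after the last update
    target_kl   -- the KL we would like each update to land near
    """
    if measured_kl > target_kl * tolerance:
        return beta * factor  # policy moved too far: penalize KL more strongly
    if measured_kl < target_kl / tolerance:
        return beta / factor  # policy barely moved: relax the penalty
    return beta               # inside the band: leave the coefficient alone


beta = 1.0
for kl in [0.05, 0.03, 0.004, 0.01]:
    beta = adapt_kl_penalty(beta, kl, target_kl=0.01)
    print(f"measured KL={kl:.3f} -> beta={beta}")
```

The rest of this article uses PPO-Clip, which needs no such coefficient.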

Here we focus on PPO-Clip, the variant used at OpenAI.

Quick Facts

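PPO is an on-policy algorithm.
PPO can be used for environments with either discrete or continuous action spaces.
The Spinning Up implementation of PPO supports parallelization with MPI.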

Key Equations

PPO-Clip updates policies via $\theta_{k+1} = \arg \max_{\theta} \underset{s,a \sim \pi_{\theta_k}}{{\mathrm E}}\left[ L(s,a,\theta_k, \theta)\right],$ typically taking multiple steps of (usually minibatch) SGD to maximize the objective. Here $L$ is given by $L(s,a,\theta_k,\theta) = \min\left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a), \;\; \text{clip}\left(\frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, 1 - \epsilon, 1+\epsilon \right) A^{\pi_{\theta_k}}(s,a) \right),$ where $\epsilon$ is a small hyperparameter that roughly says how far the new policy is allowed to move from the old one.
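The objective above can be sketched in a few lines of NumPy. This is an illustrative version only: the function name `ppo_clip_objective` and its arguments are assumptions, and it works on log-probabilities because the ratio $\pi_{\theta}(a|s)/\pi_{\theta_k}(a|s)$ is usually computed as the exponential of a log-probability difference.

```python
import numpy as np


def ppo_clip_objective(logp_new, logp_old, adv, epsilon=0.2):
    """Sample average of the PPO-Clip objective L(s, a, theta_k, theta).

    logp_new -- log pi_theta(a|s) for each sampled (s, a)
    logp_old -- log pi_theta_k(a|s) for the same samples
    adv      -- advantage estimates A^{pi_theta_k}(s, a)
    """
    ratio = np.exp(logp_new - logp_old)                     # pi_theta / pi_theta_k
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)  # clip(ratio, 1-eps, 1+eps)
    return np.minimum(ratio * adv, clipped * adv).mean()


# Two samples: the first action became more likely (ratio 1.6) with positive
# advantage, the second became less likely (ratio 0.4) with negative advantage.
logp_old = np.log(np.array([0.5, 0.5]))
logp_new = np.log(np.array([0.8, 0.2]))
adv = np.array([1.0, -1.0])
value = ppo_clip_objective(logp_new, logp_old, adv)
print(value)
```

Both samples have moved past the clip range here, so each contributes its clipped value rather than `ratio * adv`.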

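This is a fairly complex expression, and it is hard to tell at first glance what it does or how it helps keep the new policy close to the old one. As it turns out, there is a considerably simplified version [1] of this objective that is easier to work with (and is also the version we implement in our code): $L(s,a,\theta_k,\theta) = \min\left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a), \;\; g(\epsilon, A^{\pi_{\theta_k}}(s,a)) \right),$ where $g(\epsilon, A) = \left\{ \begin{array}{ll} (1 + \epsilon) A & A \geq 0 \\ (1 - \epsilon) A & A < 0. \end{array} \right.$ To build intuition for this form, let's look at a single state-action pair $(s,a)$ and consider the two cases.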
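As a quick sanity check, the clipped form and the simplified form with $g(\epsilon, A)$ can be compared numerically on random ratios and advantages. The helper names below are illustrative assumptions, and this is a spot check rather than a proof.

```python
import numpy as np


def clip_form(ratio, adv, eps):
    # min(r * A, clip(r, 1 - eps, 1 + eps) * A)
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)


def g(eps, adv):
    # (1 + eps) * A when A >= 0, (1 - eps) * A when A < 0
    return np.where(adv >= 0, (1 + eps) * adv, (1 - eps) * adv)


def simplified_form(ratio, adv, eps):
    # min(r * A, g(eps, A))
    return np.minimum(ratio * adv, g(eps, adv))


rng = np.random.default_rng(0)
ratio = rng.uniform(0.0, 3.0, size=10_000)  # probability ratios are non-negative
adv = rng.normal(size=10_000)               # advantages of either sign
same = np.allclose(clip_form(ratio, adv, 0.2), simplified_form(ratio, adv, 0.2))
print("forms agree on all samples:", same)
```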

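Advantage is positive: Suppose the advantage for this state-action pair is positive. Its contribution to the objective then reduces to $L(s,a,\theta_k,\theta) = \min\left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, (1 + \epsilon) \right) A^{\pi_{\theta_k}}(s,a).$ Because the advantage is positive, the objective increases if the action becomes more likely, that is, if $\pi_\theta(a|s)$ increases. But the min in this term limits how much the objective can increase. Once $\pi_\theta(a|s)>(1+\epsilon)\pi_{\theta_k}(a|s)$, the term hits a ceiling of $(1+\epsilon)A^{\pi_{\theta_k}}(s,a)$. Thus the new policy does not benefit from moving far away from the old policy.
Advantage is negative: Suppose the advantage for this state-action pair is negative. Its contribution to the objective then reduces to $L(s,a,\theta_k,\theta) = \max\left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, (1 - \epsilon) \right) A^{\pi_{\theta_k}}(s,a).$ Because the advantage is negative, the objective increases if the action becomes less likely, that is, if $\pi_\theta(a|s)$ decreases. But the max in this term limits how much the objective can increase. Once $\pi_\theta(a|s)<(1-\epsilon)\pi_{\theta_k}(a|s)$, the term hits a ceiling of $(1-\epsilon)A^{\pi_{\theta_k}}(s,a)$. Thus, again, the new policy does not benefit from moving far away from the old policy.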
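The step from the simplified objective to the two per-case expressions is worth spelling out. Writing $r = \pi_{\theta}(a|s)/\pi_{\theta_k}(a|s)$ and $A = A^{\pi_{\theta_k}}(s,a)$ as shorthand (introduced here only for this derivation), a positive factor can be pulled out of a min unchanged, while pulling out a negative factor reverses the ordering and turns the min into a max:

$$
\begin{aligned}
A > 0:&\quad \min\big(rA,\ (1+\epsilon)A\big) = \min\big(r,\ 1+\epsilon\big)\,A, \\
A < 0:&\quad \min\big(rA,\ (1-\epsilon)A\big) = \max\big(r,\ 1-\epsilon\big)\,A.
\end{aligned}
$$

For example, with $\epsilon = 0.2$: if $A = 2$ and $r = 1.5$, the contribution is $\min(3,\ 2.4) = 2.4 = (1+\epsilon)A$, and it stays there for any larger $r$. If $A = -2$ and $r = 0.5$, the contribution is $\min(-1,\ -1.6) = -1.6 = (1-\epsilon)A$, and it stays there for any smaller $r$.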

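What we have seen so far is that clipping acts as a regularizer by removing incentives for the policy to change dramatically, and the hyperparameter $\epsilon$ corresponds to how far the new policy can move from the old one while still improving the objective.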

[1]https://drive.google.com/file/d/1PDzn9RPvaXjJFZkGeapMHbHGiWWW20Ey/view?usp=sharing

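While this kind of clipping goes a long way toward ensuring reasonable policy updates, it is still possible to end up with a new policy that is too far from the old policy, and different PPO implementations use a variety of tricks to prevent this. In our implementation here, we use a particularly simple method: early stopping. If the mean KL divergence of the new policy from the old grows beyond a threshold, we stop taking gradient steps.
Once you feel comfortable with the basic math and implementation details, it is worth checking out other implementations to see how they handle this issue!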
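The early-stopping idea can be sketched on a toy problem: a single softmax policy over three actions, updated by gradient ascent on the clipped objective until a sample estimate of the KL exceeds `target_kl`. Everything here is an illustrative assumption rather than the code referred to in this article: the function names, the hand-derived softmax gradient, the learning rate, and the use of `mean(logp_old - logp)` as the KL estimate.

```python
import numpy as np


def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())


def update_with_early_stopping(logits_old, actions, adv, lr=0.5, eps=0.2,
                               target_kl=0.01, max_iters=80):
    """Gradient ascent on the clipped objective; stop once the KL estimate is too large."""
    logits = logits_old.astype(float).copy()
    logp_old = log_softmax(logits)[actions]
    onehot = np.eye(len(logits))[actions]
    steps = 0
    for _ in range(max_iters):
        logp_all = log_softmax(logits)
        logp = logp_all[actions]
        approx_kl = np.mean(logp_old - logp)  # sample estimate of KL(old || new)
        if approx_kl > target_kl:
            break                             # early stopping: take no more steps
        ratio = np.exp(logp - logp_old)
        bound = np.where(adv >= 0, (1 + eps) * adv, (1 - eps) * adv)
        active = ratio * adv <= bound         # samples whose gradient is not clipped away
        weights = active * ratio * adv
        # d(ratio * adv)/d(logits) = ratio * adv * (onehot(a) - softmax(logits))
        grad = (weights[:, None] * (onehot - np.exp(logp_all))).mean(axis=0)
        logits += lr * grad
        steps += 1
    return logits, steps


# Toy batch: actions appear with the old policy's (uniform) frequencies,
# and action 0 has positive advantage while the others are negative.
actions = np.tile([0, 1, 2], 100)
adv = np.where(actions == 0, 1.0, -0.5)
new_logits, steps = update_with_early_stopping(np.zeros(3), actions, adv)
new_probs = np.exp(log_softmax(new_logits))
print("gradient steps taken:", steps, "new action probabilities:", new_probs)
```

On this batch the loop stops well short of `max_iters`, which is the point of the check: clipping alone does not bound how far repeated gradient steps can push the policy.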

Exploration vs. Exploitation

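PPO trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.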
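To illustrate what "less random" means for a continuous action space, the sketch below samples actions from a diagonal Gaussian policy at several values of the log standard deviation and reports the policy entropy. The diagonal Gaussian parameterization, the function names, and the specific `log_std` values are assumptions made for illustration.

```python
import numpy as np


def sample_gaussian_action(mean, log_std, rng):
    """Draw one action from a diagonal Gaussian policy."""
    return mean + np.exp(log_std) * rng.standard_normal(mean.shape)


def gaussian_entropy(log_std):
    """Entropy of a diagonal Gaussian: sum over dimensions of log_std + 0.5 * log(2*pi*e)."""
    return float(np.sum(log_std + 0.5 * np.log(2 * np.pi * np.e)))


rng = np.random.default_rng(0)
mean = np.zeros(2)
for value in (-0.5, -1.5, -3.0):  # a smaller log_std stands in for a later stage of training
    log_std = np.full(2, value)
    actions = np.stack([sample_gaussian_action(mean, log_std, rng) for _ in range(1000)])
    print(f"log_std={value}: entropy={gaussian_entropy(log_std):.3f}, "
          f"empirical action std={actions.std():.3f}")
```

As `log_std` shrinks, the sampled actions cluster around the mean, so the agent tries fewer alternatives to what it already believes is best.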

Pseudocode

Documentation

spinup.ppo(env_fn, actor_critic=, ac_kwargs={}, seed=0, steps_per_epoch=4000, epochs=50, gamma=0.99, clip_ratio=0.2, pi_lr=0.0003, vf_lr=0.001, train_pi_iters=80, train_v_iters=80, lam=0.97, max_ep_len=1000, target_kl=0.01, logger_kwargs={}, save_freq=10)
Parameters:

env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
actor_critic – A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s Tensorflow computation graph:
ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to PPO.
seed (int) – Seed for random number generators.
steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
epochs (int) – Number of epochs of interaction (equivalent to number of policy updates) to perform.
gamma (float) – Discount factor. (Always between 0 and 1.)
clip_ratio (float) – Hyperparameter for clipping in the policy objective. Roughly: how far can the new policy go from the old policy while still profiting (improving the objective function)? The new policy can still go farther than the clip_ratio says, but it doesn’t help on the objective anymore. (Usually small, 0.1 to 0.3.)
pi_lr (float) – Learning rate for policy optimizer.
vf_lr (float) – Learning rate for value function optimizer.
train_pi_iters (int) – Maximum number of gradient descent steps to take on policy loss per epoch. (Early stopping may cause optimizer to take fewer than this.)
train_v_iters (int) – Number of gradient descent steps to take on value function per epoch.
lam (float) – Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
max_ep_len (int) – Maximum length of trajectory / episode / rollout.
target_kl (float) – Roughly what KL divergence we think is appropriate between new and old policies after an update. This will get used for early stopping. (Usually small, 0.01 or 0.05.)
logger_kwargs (dict) – Keyword args for EpochLogger.
save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
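A call using these parameters might look like the sketch below. It assumes the `spinup` package exposes `ppo` with the signature documented above and that `gym` is installed. The environment name `CartPole-v0` and the wrapper `run_ppo` are illustrative choices, not part of the documented API.

```python
# Settings copied from the defaults in the signature above.
ppo_kwargs = dict(
    seed=0,
    steps_per_epoch=4000,
    epochs=50,
    gamma=0.99,
    clip_ratio=0.2,
    pi_lr=3e-4,
    vf_lr=1e-3,
    train_pi_iters=80,
    train_v_iters=80,
    lam=0.97,
    max_ep_len=1000,
    target_kl=0.01,
    save_freq=10,
)


def run_ppo(env_name="CartPole-v0", **overrides):
    """Hypothetical wrapper: build env_fn and call spinup.ppo with the settings above."""
    # Imported here so the settings can be inspected without spinup or gym installed.
    import gym
    import spinup

    env_fn = lambda: gym.make(env_name)  # env_fn must create a fresh copy of the environment
    return spinup.ppo(env_fn, **{**ppo_kwargs, **overrides})


# Example (not executed here): a shorter run with a tighter clip range.
# run_ppo(epochs=10, clip_ratio=0.1)
```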

Mystery_zu
