DQN Improvements
The DQN algorithm suffers from an overestimation problem, which the Double DQN method compensates for. The two methods differ only in the formula below for the q_target value; everywhere else they are identical.

DQN: q_target = r + γ · max_{a'} Q_target(s', a')

Double DQN: q_target = r + γ · Q_target(s', argmax_{a'} Q_eval(s', a'))
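The difference between the two targets can be made concrete with a small numerical sketch. The values below are random toy numbers, and the variable names (q_eval_next, q_target_next) are illustrative, not from the original post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Q-values for the next state s' from the two networks
# (random numbers, for illustration only):
q_eval_next = rng.normal(size=4)    # Q_eval(s', a') for 4 actions
q_target_next = rng.normal(size=4)  # Q_target(s', a') for 4 actions
r, gamma = 1.0, 0.9

# DQN: the target network both selects and evaluates the next action,
# so noise in its estimates is maximized over -> overestimation.
dqn_target = r + gamma * q_target_next.max()

# Double DQN: the online (eval) network selects the action,
# the target network only evaluates it.
a_star = q_eval_next.argmax()
ddqn_target = r + gamma * q_target_next[a_star]

print(dqn_target, ddqn_target)
```

Because the Double DQN target evaluates a possibly sub-optimal action under the target network, it is never larger than the DQN target computed from the same Q-values, which is exactly the compensation effect described above.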
Policy Gradient
Policy gradient is a policy-based reinforcement learning method: it stores the s, a, r values of each episode and uses them to compute the gradient. The per-step loss is -log π_θ(a_t|s_t) · v_t, where π_θ(a_t|s_t) is the probability of choosing the corresponding action, and v_t is the reward at time t plus the discounted future rewards. A PyTorch program based on policy gradient is as follows:
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical
from itertools import count

env = gym.make('CartPole-v1')

class PG_network(nn.Module):
    def __init__(self):
        super(PG_network, self).__init__()
        self.linear1 = nn.Linear(4, 128)
        self.dropout = nn.Dropout(p=0.6)
        self.linear2 = nn.Linear(128, 2)

    def forward(self, x):
        x = self.linear1(x)
        x = self.dropout(x)
        x = F.relu(x)
        action_scores = self.linear2(x)
        return F.softmax(action_scores, dim=1)

policyG_object = PG_network()
optimizer = optim.Adam(policyG_object.parameters(), lr=1e-2)
possibility_store = []   # log-probabilities of the chosen actions
r_store = []             # rewards of the current episode

def choose_action(s):
    s = torch.from_numpy(s).float().unsqueeze(0)
    probs = policyG_object(s)
    m = Categorical(probs)
    action = m.sample()
    possibility_store.append(m.log_prob(action))
    return action.item()

reward_delay = 0.9       # discount factor
eps = np.finfo(np.float64).eps.item()

def policy_gradient_learn():
    R = 0
    R_store = []
    delta_store = []
    # compute the discounted return v_t for every step, from the end backwards
    for r in r_store[::-1]:
        R = r + reward_delay * R
        R_store.insert(0, R)
    R_store = torch.tensor(R_store)
    # normalize the returns to reduce the variance of the gradient
    R_store = (R_store - R_store.mean()) / (R_store.std() + eps)
    # loss = sum over t of -log pi(a_t|s_t) * v_t
    for p, v in zip(possibility_store, R_store):
        delta_store.append(-p * v)
    optimizer.zero_grad()
    loss = torch.cat(delta_store).sum()
    loss.backward()
    optimizer.step()
    del possibility_store[:]
    del r_store[:]

def main():
    running_reward = 10
    for i_episode in count(1):
        s, ep_reward = env.reset(), 0
        for t in range(1, 10000):
            # env.render()
            a = choose_action(s)
            s, r, done, info = env.step(a)
            r_store.append(r)
            ep_reward += r
            if done:
                break
        running_reward = 0.05 * ep_reward + (1 - 0.05) * running_reward
        policy_gradient_learn()
        if i_episode % 10 == 0:
            print('Episode {}\tLast reward: {:.2f}\tAverage reward: {:.2f}'.format(
                i_episode, ep_reward, running_reward))
        if running_reward > env.spec.reward_threshold:
            print("Solved! Running reward is now {} and "
                  "the last episode runs to {} time steps!".format(running_reward, t))
            break

if __name__ == '__main__':
    main()
Actor-Critic
Q-learning cannot be used in continuous spaces, but policy gradient can. However, policy gradient updates once per episode, which greatly lowers learning efficiency (as an aside, policy gradient is an on-policy method, which wastes sample data; its learning also suffers from high variance). This motivated a combination of the two: the Actor-Critic architecture. In Actor-Critic, the Actor learns with the policy-gradient strategy, while the Critic uses a value-based single-step update (TD error), which in turn is used to update the policy-gradient network.
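The single-step update described above can be sketched on made-up tensors. This is a minimal illustration, not a complete training loop; the network sizes, names, and learning rates are assumptions, not from the original post:

```python
import torch
import torch.nn as nn

# Tiny actor (policy) and critic (state-value) networks for a 4-dim state
actor = nn.Sequential(nn.Linear(4, 2), nn.Softmax(dim=-1))
critic = nn.Linear(4, 1)   # V(s)
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.9

# One fake transition (s, a, r, s')
s, s_next, r = torch.randn(4), torch.randn(4), 1.0

probs = actor(s)
dist = torch.distributions.Categorical(probs)
a = dist.sample()

# Critic: single-step TD error, delta = r + gamma * V(s') - V(s)
td_error = (r + gamma * critic(s_next).detach() - critic(s)).squeeze()

# Critic minimizes the squared TD error
critic_loss = td_error.pow(2)
opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

# Actor: policy-gradient step weighted by the TD error instead of the
# whole-episode return, so it can update at every step
actor_loss = -dist.log_prob(a) * td_error.detach()
opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
```

Replacing the full discounted return v_t with the critic's TD error is what turns the episodic policy-gradient update into a per-step one.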
However, Actor-Critic has convergence problems, so the DDPG strategy was proposed on top of it. DDPG, i.e. deep deterministic policy gradient, is a network that fuses DQN and Actor-Critic.
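Two of the borrowed ideas can be sketched in isolation: a deterministic actor trained by pushing its action through the critic, and DQN-style target networks that are softly updated. The shapes, names, and the soft-update rate tau below are illustrative assumptions, not a complete DDPG implementation:

```python
import copy
import torch
import torch.nn as nn

# Deterministic actor a = mu(s) for a 3-dim state and 1-dim continuous action,
# plus a critic Q(s, a) that takes the state-action pair concatenated
actor = nn.Sequential(nn.Linear(3, 1), nn.Tanh())
critic = nn.Linear(3 + 1, 1)
actor_targ = copy.deepcopy(actor)      # target networks, as in DQN
critic_targ = copy.deepcopy(critic)
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-3)
gamma, tau = 0.99, 0.005

s = torch.randn(1, 3)  # fake state batch

# Actor update: maximize Q(s, mu(s)), i.e. minimize its negative.
# Gradients flow through the critic into the actor's parameters.
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

# Soft target update (the DQN ingredient):
# theta_targ <- tau * theta + (1 - tau) * theta_targ
with torch.no_grad():
    for p, p_t in zip(actor.parameters(), actor_targ.parameters()):
        p_t.mul_(1 - tau).add_(tau * p)
```

The slowly-moving targets stabilize the bootstrapped critic update, which is what addresses the non-convergence mentioned above, while the deterministic actor is what lets the method work in continuous action spaces.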