HandsOn Reinforcement Learning With Python——Temporal Difference Learning

Hands-On Reinforcement Learning With Python——Temporal Difference Learning

作者：凯鲁嘎吉 - 博客园 http://www.cnblogs.com/kailugaji/

更多请看：Reinforcement Learning - 随笔分类 - 凯鲁嘎吉 - 博客园 https://www.cnblogs.com/kailugaji/category/2038931.html

$\epsilon $-贪心策略：

1. SARSA

1.1 算法流程

1.2 Python程序

# -*- coding: UTF-8 -*-
# Solving the Taxi Problem using SARSA
# From: https://github.com/AndyYue1893/Hands-On-Reinforcement-Learning-With-Python
# https://www.cnblogs.com/kailugaji/ - 凯鲁嘎吉 - 博客园
'''
出租车调度
这里有 4 个地点，分别用 4 个字母表示，任务是要从一个地点接上乘客，送到另外 3 个中的一个放下乘客，越快越好。
颜色：蓝色：乘客，红色：乘客的目的地，黄色：空出租车，绿色：出租车满座，其中 “:” 栅栏可以穿越，"|" 栅栏不能穿越
Reward:  成功运送一个客人获得 20 分奖励
         每走一步损失 1 分（希望尽快送到目的地）
         没有把客人放到指定的位置，损失 10 分
Action: 0：向南移动，1：向北移动，2：向东移动，3：向西移动，4：乘客上车，5：乘客下车
State:  500维，（出租车行、出租车列、乘客位置、目的地）
'''
import random
import gym
from time import sleep

env = gym.make('Taxi-v3') #创建出租车游戏环境
env.render() # 用于渲染出当前的智能体以及环境的状态

# 将Q表初始化为一个字典，它存储指定在状态s中执行动作a的值的状态-动作对。
Q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        Q[(s,a)] = 0.0

# epsilon贪心策略函数
def epsilon_greedy(state, epsilon):
    if random.uniform(0,1) < epsilon:
        return env.action_space.sample() # 随机，用epsilon概率探索新动作
    else:
        return max(list(range(env.action_space.n)), key = lambda x: Q[(state,x)]) # 用1-epsilon的概率选择Q表最佳动作

# 初始化变量
alpha = 0.4 # TD学习率
gamma = 0.999 # 折扣率
epsilon = 0.017 # 贪心策略中epsilon的值
num_episodes = 1000 # 玩几局游戏

# 执行SARSA
for episode in range(num_episodes): # 玩几局游戏
    steps, r = 0, 0  # 每局走多少步，总体奖励
    state = env.reset() # 用于重置环境
    # select the action using epsilon-greedy policy
    action = epsilon_greedy(state, epsilon)
    while True:
        steps += 1 # 每局走多少步
        env.render() # 用于渲染出当前的智能体以及环境的状态
        # then we perform the action and move to the next state, and receive the reward
        nextstate, reward, done, _ = env.step(action)
        # again, we select the next action using epsilon greedy policy
        nextaction = epsilon_greedy(nextstate, epsilon)
        # we calculate the Q value of previous state using our update rule
        Q[(state, action)] += alpha * (reward + gamma * Q[(nextstate, nextaction)] - Q[(state, action)])
        # Q(s, a) <- Q(s, a) + alpha (r + gamma Q(s', a') - Q(s, a))
        # finally we update our state and action with next action and next state
        action = nextaction # a <- a'
        state = nextstate # s <- s'
        # store the rewards
        r += reward # reward: 即时奖励, r: total reward
        # we will break the loop, if we are at the terminal state of the episode
        if done:
            break
    print(f"Episode: {episode + 1}")  # 玩几局游戏
    print(f"Epochs: {steps}")  # 每局走多少步
    print(f"State: {state}")
    print(f"Action: {action}")
    print(f"Reward: {reward}")
    print("Total Reward: ", r)
    # sleep(0.01) # 为了让显示变慢，否则画面会非常快
env.close()

1.3 结果

2. Q-Learning

2.1 算法流程

2.2 Python程序

# -*- coding: UTF-8 -*-
# Solving the Taxi Problem using Q Learning
# From: https://github.com/AndyYue1893/Hands-On-Reinforcement-Learning-With-Python
# https://www.cnblogs.com/kailugaji/ - 凯鲁嘎吉 - 博客园
'''
出租车调度
这里有 4 个地点，分别用 4 个字母表示，任务是要从一个地点接上乘客，送到另外 3 个中的一个放下乘客，越快越好。
颜色：蓝色：乘客，红色：乘客的目的地，黄色：空出租车，绿色：出租车满座，其中 “:” 栅栏可以穿越，"|" 栅栏不能穿越
Reward:  成功运送一个客人获得 20 分奖励
         每走一步损失 1 分（希望尽快送到目的地）
         没有把客人放到指定的位置，损失 10 分
Action: 0：向南移动，1：向北移动，2：向东移动，3：向西移动，4：乘客上车，5：乘客下车
State:  500维，（出租车行、出租车列、乘客位置、目的地）
'''
import random
import gym
from time import sleep

env = gym.make('Taxi-v3') #创建出租车游戏环境
env.render() # 用于渲染出当前的智能体以及环境的状态

# 将Q表初始化为一个字典，它存储指定在状态s中执行动作a的值的状态-动作对。
q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        q[(s,a)] = 0.0

# 定义一个名为update_q_table的函数，根据q学习更新规则更新q值
def update_q_table(prev_state, action, reward, nextstate, alpha, gamma):
    qa = max([q[(nextstate, a)] for a in range(env.action_space.n)]) # 取一个状态-动作对的最大值，并将其存储在一个名为qa的变量中
    # max Q(s', a')
    q[(prev_state, action)] += alpha * (reward + gamma * qa - q[(prev_state, action)]) # 用更新规则更新前一个状态的Q值
    # Q(s, a) <- Q(s, a) + alpha (r + gamma max Q(s', a') - Q(s, a))

# epsilon贪心策略函数
def epsilon_greedy_policy(state, epsilon):
    if random.uniform(0,1) < epsilon:
        return env.action_space.sample() # 随机，用epsilon概率探索新动作
    else:
        return max(list(range(env.action_space.n)), key = lambda x: q[(state,x)]) # 用1-epsilon的概率选择Q表最佳动作

# 初始化变量
alpha = 0.4  # TD学习率
gamma = 0.999 # 折扣率
epsilon = 0.017 # 贪心策略中epsilon的值
num_episodes = 1000 # 玩几局游戏

# 执行Q-Learning
for episode in range(num_episodes): # 玩几局游戏
    steps, r = 0, 0 # 每局走多少步，总体奖励
    prev_state = env.reset() # 用于重置环境
    while True:
        steps += 1 # 每局走多少步
        env.render() # 用于渲染出当前的智能体以及环境的状态
        # In each state, we select the action by epsilon-greedy policy
        action = epsilon_greedy_policy(prev_state, epsilon)
        # then we perform the action and move to the next state, and receive the reward
        nextstate, reward, done, _ = env.step(action)
        # Next we update the Q value using our update_q_table function
        # which updates the Q value by Q learning update rule
        update_q_table(prev_state, action, reward, nextstate, alpha, gamma)
        # Finally we update the previous state as next state
        prev_state = nextstate # s <- s'
        # Store all the rewards obtained
        r += reward # reward: 即时奖励, r: total reward
        # we will break the loop, if we are at the terminal state of the episode
        if done:
            break
    print(f"Episode: {episode + 1}") # 玩几局游戏
    print(f"Epochs: {steps}") # 每局走多少步
    print(f"State: {prev_state}")
    print(f"Action: {action}")
    print(f"Reward: {reward}")
    print("Total Reward: ", r)
    # sleep(0.01) # 为了让显示变慢，否则画面会非常快
env.close()

2.3 结果

3. 参考文献

[1] https://github.com/sudharsan13296/Hands-On-Reinforcement-Learning-With-Python

[2] 邱锡鹏，神经网络与深度学习，机械工业出版社，https://nndl.github.io/, 2020.

作者：凯鲁嘎吉

出处：http://www.cnblogs.com/kailugaji/

本文版权归作者和博客园共有，欢迎转载，但未经作者同意必须在文章页面给出原文链接，否则保留追究法律责任的权利。