A Code Supplement for Markov Decision Processes

  This post is a code supplement to https://www.cnblogs.com/chester-cs/p/12928649.html. Here I only implement value iteration; readers can try implementing policy iteration themselves.

  After learning about MDPs, i.e. Markov decision processes, I wanted to implement one in code and watch how the machine keeps optimizing itself.

  Consider a world like this: there are rewards, traps, and obstacles, and after the agent is given a command its behavior is non-deterministic.

  How does this problem differ from the more common "reinforcement learning problem"?

  1. The state space is finite and discrete, whereas many reinforcement learning problems have continuous, infinite state spaces.

  2. The action space is finite and discrete, whereas many reinforcement learning problems have continuous action spaces.

  3. Because of (1) and (2), we can store the $Q$-value of every action in every state, $Q(s,a)$, as a matrix. When the state or action space is very large this is impractical, and we can instead fit $Q$ with a function approximator such as an artificial neural network.

  With those considerations out of the way, let's not worry about the general case and just see what this typical MDP looks like after optimization. I wrote it up in an ipynb file.

## Use value iteration to find the optimal policy; changing the reward function makes the agent adopt a strategy that never loses
import numpy as np

P = np.zeros((11,4,11))  # P[s][a] is the distribution over next states
V = np.zeros(11)         # state values
Q = np.zeros((11,4))     # state-action values
R = np.zeros((11,4))     # expected immediate reward for (s, a)

## Define the transition probability matrix
P[0] = np.array([[0.15,0.8,0,0.05,0,0,0,0,0,0,0],[0.95,0,0,0.05,0,0,0,0,0,0,0],[0.95,0.05,0,0,0,0,0,0,0,0,0],[0.15,0.05,0,0.8,0,0,0,0,0,0,0]])
P[1] = np.array([[0,0.2,0.8,0,0,0,0,0,0,0,0],[0.2,0.8,0,0,0,0,0,0,0,0,0],[0.05,0.9,0.05,0,0,0,0,0,0,0,0],[0.05,0.9,0.05,0,0,0,0,0,0,0,0]])
P[2] = np.array([[0,0,0.95,0,0.05,0,0,0,0,0,0],[0,0.8,0.15,0,0.05,0,0,0,0,0,0],[0,0.05,0.95,0,0,0,0,0,0,0,0],[0,0.05,0.15,0,0.8,0,0,0,0,0,0]])
P[3] = np.array([[0.05,0,0,0.9,0,0.05,0,0,0,0,0],[0.05,0,0,0.9,0,0.05,0,0,0,0,0],[0.8,0,0,0.2,0,0,0,0,0,0,0],[0,0,0,0.2,0,0.8,0,0,0,0,0]])
P[4] = np.array([[0,0,0.05,0,0.9,0,0,0.05,0,0,0],[0,0,0.05,0,0.9,0,0,0.05,0,0,0],[0,0,0.8,0,0.2,0,0,0,0,0,0],[0,0,0,0,0.2,0,0,0.8,0,0,0]])
P[5] = np.array([[0,0,0,0.05,0,0.1,0.8,0,0.05,0,0],[0,0,0,0.05,0,0.9,0,0,0.05,0,0],[0,0,0,0.8,0,0.15,0.05,0,0,0,0],[0,0,0,0,0,0.15,0.05,0,0.8,0,0]])
P[6] = np.array([[0,0,0,0,0,0,0.15,0.8,0,0.05,0],[0,0,0,0,0,0.8,0.15,0,0,0.05,0],[0,0,0,0,0,0.05,0.9,0.05,0,0,0],[0,0,0,0,0,0.05,0.1,0.05,0,0.8,0]])
P[7] = np.array([[0,0,0,0,0.05,0,0,0.9,0,0,0.05],[0,0,0,0,0.05,0,0.8,0.1,0,0,0.05],[0,0,0,0,0.8,0,0.05,0.15,0,0,0],[0,0,0,0,0,0,0.05,0.15,0,0,0.8]])
P[8] = np.array([[0,0,0,0,0,0.05,0,0,0.15,0.8,0],[0,0,0,0,0,0.05,0,0,0.95,0,0],[0,0,0,0,0,0.8,0,0,0.15,0.05,0],[0,0,0,0,0,0,0,0,0.95,0.05,0]])
P[9] = np.array([[0,0,0,0,0,0,0,0,0,1,0],[0,0,0,0,0,0,0,0,0,1,0],[0,0,0,0,0,0,0,0,0,1,0],[0,0,0,0,0,0,0,0,0,1,0]])
P[10] = np.array([[0,0,0,0,0,0,0,0,0,0,1],[0,0,0,0,0,0,0,0,0,0,1],[0,0,0,0,0,0,0,0,0,0,1],[0,0,0,0,0,0,0,0,0,0,1]])

## Define the reward function (each entry is already weighted by the
## transition probability, so R[s][a] is an expected immediate reward)
punishment = -100
reward = 100
R[0] = np.array([0,0,0,0])
R[1] = np.array([0,0,0,0])
R[2] = np.array([0,0,0,0])
R[3] = np.array([0,0,0,0])
R[4] = np.array([0,0,0,0])
R[5] = np.array([0,0,0,0])
R[6] = np.array([0.05*punishment,0.05*punishment,0,0.8*punishment])
R[7] = np.array([0.05*reward,0.05*reward,0,0.8*reward])
R[8] = np.array([0.8*punishment,0,0.05*punishment,0.05*punishment])
R[9] = np.array([0,0,0,0])
R[10] = np.array([0,0,0,0])

for j in range(100):        # 100 sweeps is plenty for this small world
    for s in range(9):      # states 9 and 10 are absorbing, so skip them
        for a in range(4):
            Q[s][a] = R[s][a]+0.9*P[s][a]@V   # Bellman backup with gamma = 0.9
        V[s] = np.max(Q[s])
print('Print policy(With punish:-100,reward:100): ')
for l in range(9):
    print('State' + str(l+1) +': '+ str(np.argmax(Q[l])))
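The loop above always runs a fixed 100 sweeps, which is enough here. As an illustration only (the `value_iteration` helper and the two-state toy MDP below are my own sketch, not part of the original notebook), the same backup can be vectorized and stopped once $V$ converges:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Run value iteration until V stops changing.

    P: (S, A, S) transition tensor, R: (S, A) expected immediate rewards.
    Returns the converged V and the greedy policy argmax_a Q(s, a).
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    for _ in range(max_iter):
        Q = R + gamma * (P @ V)      # Bellman optimality backup, shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)

# Two-state toy: in state 0, action 1 reaches the absorbing goal (reward 1)
P_toy = np.zeros((2, 2, 2))
P_toy[0, 0, 0] = 1.0   # action 0: stay in state 0
P_toy[0, 1, 1] = 1.0   # action 1: move to goal state 1
P_toy[1, :, 1] = 1.0   # state 1 is absorbing
R_toy = np.zeros((2, 2))
R_toy[0, 1] = 1.0
V_toy, policy_toy = value_iteration(P_toy, R_toy)
print(V_toy, policy_toy)   # V ≈ [1, 0]; the policy picks action 1 in state 0
```

Note that `P @ V` contracts the last axis of the (S, A, S) tensor with the value vector, which replaces the two inner Python loops in one step.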

  There are 11 states in total, numbered 0-10. I define each state's value $V$, the state-action value matrix $Q(s,a)$, and the transition probability matrix $P$. Actions 0-3 mean moving up, down, left, and right respectively.
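As a small aside, the numeric policy printout can be made more readable by mapping action indices to names. A minimal sketch; `describe_policy` and the English action names are my own additions, following the up/down/left/right convention just stated:

```python
import numpy as np

ACTION_NAMES = ['up', 'down', 'left', 'right']  # actions 0-3 per the text

def describe_policy(Q, n_states=9):
    """Return the greedy action name for each non-terminal state (1-indexed)."""
    return {s + 1: ACTION_NAMES[int(np.argmax(Q[s]))] for s in range(n_states)}

# tiny demo with a hand-made Q table for 2 states
Q_demo = np.array([[0.1, 0.0, 0.0, 0.9],
                   [0.0, 0.8, 0.2, 0.0]])
print(describe_policy(Q_demo, n_states=2))  # {1: 'right', 2: 'down'}
```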

  The value matrix is a vector of length 11, and the state-action matrix is 11×4; both are easy to understand, right? The transition matrix, though, I defined as 11×4×11, used like this: there are 11 possible current states, each with 4 actions to choose from, and after choosing an action the next state is again "non-deterministic" (because of the rules of my game!), so the post-transition state must itself be a probability distribution stored as a vector.
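Since every `P[s][a]` row must be a probability distribution, typos in those long literals are easy to make. A quick sanity check can be sketched like this (`check_stochastic` is a hypothetical helper of my own, not from the notebook):

```python
import numpy as np

def check_stochastic(P, atol=1e-9):
    """Verify that every P[s][a] row is a valid probability distribution."""
    row_sums = P.sum(axis=-1)                       # shape (S, A)
    ok = np.all(P >= 0) and np.all(np.isclose(row_sums, 1.0, atol=atol))
    if not ok:
        for s, a in np.argwhere(~np.isclose(row_sums, 1.0, atol=atol)):
            print(f"P[{s}][{a}] sums to {row_sums[s, a]:.3f}, expected 1")
    return bool(ok)
```

Running `check_stochastic(P)` on the matrix defined above should print nothing and return `True`.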

  I ran value iteration with two reward settings: a reward of 100 for grabbing the trophy and a punishment of 100 for being killed by the dragon, versus the same reward of 100 with the punishment raised to 10000. The resulting policies differ.
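The qualitative effect, a larger penalty flipping the greedy choice, can be reproduced on a stripped-down toy of my own (not the grid above): a single decision between a risky shortcut with a 5% chance of hitting the trap and a safe detour that reaches the trophy two discounted steps later:

```python
def greedy_action(punishment, reward=100, gamma=0.9):
    q_risky = 0.95 * reward + 0.05 * punishment  # trophy now, 5% chance of trap
    q_safe = gamma ** 2 * reward                 # trophy for sure, 2 steps later
    return 'risky' if q_risky > q_safe else 'safe'

print(greedy_action(punishment=-100))    # risky: 90 > 81
print(greedy_action(punishment=-10000))  # safe: -405 < 81
```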

Print policy(With punish:-100,reward:100): 
State1: 0
State2: 0
State3: 3
State4: 3
State5: 3
State6: 0
State7: 0
State8: 3
State9: 2

Print policy(With punish:-10000,reward:100): 
State1: 0
State2: 0
State3: 3
State4: 2
State5: 3
State6: 2
State7: 2
State8: 3
State9: 1

  Take state 9: when the punishment is small, the agent boldly heads straight left, since that reaches the trophy faster and the chance of being killed is low. When the punishment is large, the agent has learned caution and keeps going down instead: it has learned that moving left carries a 0.05 probability of a ruinous penalty, while going down is slower but steady and safe.

  Take state 4: when the punishment is small, the agent goes straight right, since that is closer to the trophy. When the punishment is large, the agent clearly understands that the right side is dangerous and turns around to go left. States 6 and 7 are even more striking: the agent would rather keep walking into a wall than risk the penalty.

  Isn't this iteration exactly where the agent's intelligence comes from~~~

Original post: https://www.cnblogs.com/chester-cs/p/13403879.html