Study Notes | Morvan

Q-learning

Auxiliary Material

  1. A Painless Q-Learning Tutorial

  2. Simple Reinforcement Learning with Tensorflow Part 0: Q-Learning

  3. 6.5 Q-Learning: Off-Policy TD Control (Sutton and Barto's Reinforcement Learning ebook)

Note

  1. tabular

    Flat, in table form (here: Q values stored as a table over states and actions).

  2. Q-learning by Morvan

    Q-learning is a method that records the value of actions (the Q value): every action taken in a given state has a value Q(s, a), i.e. the value of taking action a in state s is Q(s, a).

    In the explorer game above, s is the position of o, and at each position the explorer can take two actions, left/right; these make up all of the explorer's available actions a.
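
    A minimal sketch of such a Q table, assuming the pandas DataFrame layout that Morvan's tutorial uses (the state count and action names below are illustrative):

    import pandas as pd

    N_STATES = 6                  # positions of o on the one-dimensional track
    ACTIONS = ['left', 'right']   # all actions the explorer can take

    # Q table: one row per state s, one column per action a, all values start at 0;
    # Q(s, a) is then read as q_table.loc[s, a]
    q_table = pd.DataFrame(
        data=[[0.0] * len(ACTIONS) for _ in range(N_STATES)],
        columns=ACTIONS,
    )
    print(q_table)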

  3. Q-learning by Wikipedia

    Q-learning is a model-free reinforcement learning technique. Specifically, Q-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP). It works by learning an action-value function that ultimately gives the expected utility of taking a given action in a given state and following the optimal policy thereafter.

    When such an action-value function is learned, the optimal policy can be constructed by simply selecting the action with the highest value in each state. One of the strengths of Q-learning is that it is able to compare the expected utility of the available actions without requiring a model of the environment. Additionally, Q-learning can handle problems with stochastic transitions and rewards, without requiring any adaptations.
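
    As a hedged illustration of the last point, once Q(s, a) has been learned, the greedy policy just picks the highest-valued action in each state (toy numbers, made up for illustration):

    import pandas as pd

    # a toy learned Q table: 3 states, 2 actions
    q_table = pd.DataFrame({'left': [0.0, 0.1, 0.0], 'right': [0.3, 0.5, 0.9]})

    # optimal policy: in each state, select the action with the highest Q value
    policy = q_table.idxmax(axis=1)
    print(policy)   # -> right, right, right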

  4. Pseudocode
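
    The original page probably showed the pseudocode as an image; as a hedged reconstruction, here is the standard tabular Q-learning loop written as a Python sketch (the env object with gym-style reset()/step() methods is an assumption, not part of the original note):

    import numpy as np

    def q_learning(env, n_states, n_actions,
                   episodes=500, alpha=0.1, gamma=0.9, epsilon=0.9):
        # initialize Q(s, a) arbitrarily (here: all zeros)
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s = env.reset()                      # initialize s
            done = False
            while not done:
                # choose a from s with an epsilon-greedy policy derived from Q
                if np.random.uniform() < epsilon:
                    a = int(np.argmax(Q[s]))     # greedy action
                else:
                    a = np.random.randint(n_actions)
                s_next, r, done = env.step(a)    # take a, observe r and s'
                # off-policy TD update: bootstrap on max_a' Q(s', a')
                target = r if done else r + gamma * np.max(Q[s_next])
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next                       # s <- s'
        return Q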

  5. The transition rule of Q-learning is a very simple formula (source):

    Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
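
    A worked example with made-up numbers: with Gamma = 0.8, if R(1, right) = 0 and the best value in the next state is Max[Q(2, all actions)] = 100, then

    Q(1, right) = 0 + 0.8 * 100 = 80

    Note this is the simplified rule with learning rate 1; the version with a learning rate alpha (used in the pseudocode sketch above) is

    Q(s, a) <- Q(s, a) + alpha * [R + Gamma * Max[Q(s', all actions)] - Q(s, a)]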

  6. epsilon greedy

    EPSILON is the value that controls the degree of greediness. EPSILON can be raised as exploration time goes on (the agent becomes greedier and greedier).
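
    A minimal sketch of epsilon-greedy action selection, assuming Morvan's convention that EPSILON is the probability of acting greedily (function and argument names are illustrative):

    import numpy as np

    def choose_action(q_row, epsilon=0.9):
        """q_row: the Q values of every action in the current state."""
        if np.random.uniform() < epsilon:
            return int(np.argmax(q_row))        # exploit: greedy with prob. EPSILON
        return np.random.randint(len(q_row))    # explore: random action otherwise

    Raising epsilon over time, as the note suggests, shifts the agent from exploring toward exploiting what it has learned.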

  7. Why is Q-learning considered an off-policy control method? (Exercise 6.9 of Sutton and Barto's book)

    If the algorithm estimates the value function of the policy generating the data, the method is called on-policy. Otherwise it is called off-policy.

    If the samples used in the TD update are not generated according to the behavior policy (the policy that the agent is following), then it is called off-policy learning; you can also say learning from off-policy data. (source)

    Q-learning is an off-policy algorithm, because the max over actions means the Q table can be updated without relying on the experience currently being generated (it can learn from experience gathered long ago, or even from someone else's experience).
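
    To make the contrast concrete, here are the two TD updates side by side (a sketch; SARSA is the on-policy counterpart, not mentioned in the original note):

    SARSA (on-policy):       Q(s, a) <- Q(s, a) + alpha * [r + Gamma * Q(s', a') - Q(s, a)]
    Q-learning (off-policy): Q(s, a) <- Q(s, a) + alpha * [r + Gamma * Max[Q(s', all actions)] - Q(s, a)]

    SARSA bootstraps on the action a' that the behavior policy actually takes next, so it evaluates the policy generating the data; Q-learning bootstraps on the max regardless of what the agent will actually do, so it learns about the greedy policy while following another one.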

Original source: https://www.cnblogs.com/casperwin/p/6305351.html