What is Q Learning?

Estimating action values from the Q table

 

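|              | a1 (value of action 1) | a2 (value of action 2) |
|--------------|------------------------|------------------------|
| s1 (state 1) | -2                     | 1                      |
| s2 (state 2) | -4                     | 2                      |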
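As a minimal sketch of how such a table is read, the Python snippet below stores the four values above in a nested dictionary and picks the highest-valued action in each state. The names `q_table` and `greedy_action` are illustrative assumptions, not part of the original article.

```python
# Q table from the text: outer keys are states, inner keys are actions.
q_table = {
    "s1": {"a1": -2, "a2": 1},
    "s2": {"a1": -4, "a2": 2},
}

def greedy_action(q, state):
    """Return the action with the largest Q value in the given state."""
    return max(q[state], key=q[state].get)

print(greedy_action(q_table, "s1"))  # a2, since 1 > -2
print(greedy_action(q_table, "s2"))  # a2, since 2 > -4
```

In both states a2 has the larger estimate, so a purely greedy agent would always choose a2.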

Q Learning algorithm:

# The following is pseudocode

Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using policy derived from Q (e.g., ε-greedy)
        Take action a, observe r, s'
        Q(s, a) <- Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]
        s <- s'
    until s is terminal
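The pseudocode can be made concrete with a short Python sketch. The environment here is a hypothetical 1-D corridor invented for illustration (it does not come from the original article): the agent starts at the left end, can move left or right, and receives a reward of 1 on reaching the rightmost, terminal state. The parameter names `alpha` (learning rate), `gamma` (discount factor) and `epsilon` (exploration rate), along with their default values, are likewise assumptions.

```python
import random

def q_learning(n_states=5, episodes=200, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D corridor.

    States are 0..n_states-1; the last state is terminal and entering it
    yields reward 1. Actions: 0 = move left, 1 = move right.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # Initialize Q(s, a)
    for _ in range(episodes):
        s = 0  # Initialize s
        while s != n_states - 1:  # until s is terminal
            # epsilon-greedy choice; ties are broken at random
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            # Take action a, observe r and s'
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == n_states - 1 else 0.0
            # Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
            q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])
            s = s_next
    return q

q = q_learning()
for s, row in enumerate(q):
    print(s, [round(v, 3) for v in row])
```

After training, the value of moving right exceeds the value of moving left in every non-terminal state, so the greedy policy walks straight to the goal. The terminal state's row is never updated and stays at zero, which is what makes the bootstrapped target reduce to r on the final step.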

Recurrence relation
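One way to write the recurrence implied by the update rule is sketched below. It assumes a deterministic environment in which the Q values have converged and the agent acts greedily; the time indices r_{t+1}, s_{t+1} are a notational assumption rather than something taken from the original article.

```latex
\begin{aligned}
Q(s_t, a_t) &= r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') \\
            &= r_{t+1} + \gamma \left[ r_{t+2} + \gamma \max_{a''} Q(s_{t+2}, a'') \right] \\
            &= r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots
\end{aligned}
```

Read this way, a Q value is the discounted sum of all future rewards along the greedy path, with rewards further in the future weighted by higher powers of γ.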

Original article: https://www.cnblogs.com/ljmjjy0820/p/7896212.html