TensorLayer Official Chinese Documentation 1.7.4: API – Reinforcement Learning

API - Reinforcement Learning¶

Functions related to reinforcement learning.

`discount_episode_rewards`([rewards, gamma, mode])	Take a 1D float array of rewards and compute the discounted rewards for an episode.
`cross_entropy_reward_loss`(logits, actions, ...)	Calculate the loss for a Policy Gradient Network.
`log_weight`(probs, weights[, name])	Log weight.
`choice_action_by_probs`([probs, action_list])	Choose and return an action according to the given action probability distribution.

Reward functions¶

tensorlayer.rein.discount_episode_rewards(rewards=[], gamma=0.99, mode=0)

Take a 1D float array of rewards and compute the discounted rewards for an
episode. In the default mode (mode == 0), a non-zero reward is treated as the end of an episode.

Parameters:

rewards : numpy list

a list of rewards

gamma : float

The discount factor.

mode : int

If mode == 0, reset the discount process whenever a non-zero reward is encountered (as in the Ping-pong game).
If mode == 1, do not reset the discount process.

Examples

>>> import numpy as np
>>> import tensorlayer as tl
>>> rewards = np.asarray([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1])
>>> gamma = 0.9
>>> discount_rewards = tl.rein.discount_episode_rewards(rewards, gamma)
>>> print(discount_rewards)
[ 0.72899997  0.81        0.89999998  1.          0.72899997  0.81
  0.89999998  1.          0.72899997  0.81        0.89999998  1.        ]
>>> discount_rewards = tl.rein.discount_episode_rewards(rewards, gamma, mode=1)
>>> print(discount_rewards)
[ 1.52110755  1.69011939  1.87791049  2.08656716  1.20729685  1.34144104
  1.49048996  1.65610003  0.72899997  0.81        0.89999998  1.        ]
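
The discounting described above can be sketched in plain NumPy. This is a minimal illustration written from the documented behaviour, not the library source; the function name `discount_rewards_sketch` is hypothetical.

```python
import numpy as np

def discount_rewards_sketch(rewards, gamma=0.99, mode=0):
    """Hypothetical re-implementation for illustration; not the library source."""
    rewards = np.asarray(rewards, dtype=np.float32)
    discounted = np.zeros_like(rewards)
    running_add = 0.0
    # Walk backwards so each step accumulates the discounted future reward.
    for t in reversed(range(rewards.size)):
        if mode == 0 and rewards[t] != 0:
            # In mode 0 a non-zero reward marks an episode boundary,
            # so rewards from later episodes do not leak backwards.
            running_add = 0.0
        running_add = running_add * gamma + rewards[t]
        discounted[t] = running_add
    return discounted

rewards = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]
mode0 = discount_rewards_sketch(rewards, gamma=0.9)
mode1 = discount_rewards_sketch(rewards, gamma=0.9, mode=1)
print(mode0)
print(mode1)
```

With mode=0 each block of four steps is discounted independently, while with mode=1 the reward at the end of the sequence keeps propagating back to the first step.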

Loss functions¶

Weighted Cross Entropy¶

tensorlayer.rein.cross_entropy_reward_loss(logits, actions, rewards, name=None)

Calculate the loss for a Policy Gradient Network.

Parameters:

logits : tensor

The network outputs before softmax. This function applies softmax
internally.

actions : tensor/ placeholder

The agent actions.

rewards : tensor/ placeholder

The rewards.

Examples

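>>> import tensorflow as tf
>>> import tensorlayer as tl
>>> from tensorlayer.layers import InputLayer, DenseLayer
>>> D, H = 80 * 80, 200  # input size and hidden units; illustrative values, not from the original
>>> learning_rate, decay_rate = 1e-4, 0.99  # illustrative values, not from the original
>>> states_batch_pl = tf.placeholder(tf.float32, shape=[None, D])
>>> network = InputLayer(states_batch_pl, name='input')
>>> network = DenseLayer(network, n_units=H, act=tf.nn.relu, name='relu1')
>>> network = DenseLayer(network, n_units=3, name='out')
>>> probs = network.outputs  # raw logits; despite the name, no softmax has been applied yet
>>> sampling_prob = tf.nn.softmax(probs)  # action probabilities used for sampling
>>> actions_batch_pl = tf.placeholder(tf.int32, shape=[None])
>>> discount_rewards_batch_pl = tf.placeholder(tf.float32, shape=[None])
>>> loss = tl.rein.cross_entropy_reward_loss(probs, actions_batch_pl, discount_rewards_batch_pl)
>>> train_op = tf.train.RMSPropOptimizer(learning_rate, decay_rate).minimize(loss)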
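
To make the quantity being minimised concrete, the following NumPy sketch computes a reward-weighted softmax cross entropy for a small batch. It assumes the loss is the per-sample cross entropy of the taken action multiplied by its reward and summed over the batch; the sum reduction and the function name `reward_weighted_cross_entropy` are assumptions for illustration, not taken from the library source.

```python
import numpy as np

def reward_weighted_cross_entropy(logits, actions, rewards):
    """Hypothetical NumPy sketch; the sum reduction is an assumption."""
    logits = np.asarray(logits, dtype=np.float64)
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=np.float64)
    # Numerically stable log-softmax over the action axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Cross entropy of each taken action: -log pi(a_t | s_t).
    cross_entropy = -log_probs[np.arange(len(actions)), actions]
    # Weight each term by its (discounted) reward and reduce over the batch.
    return float(np.sum(cross_entropy * rewards))

logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
loss = reward_weighted_cross_entropy(logits, actions=[1, 0], rewards=[1.0, 0.5])
print(loss)
```

Actions followed by large positive rewards contribute more to the loss, so minimising it raises the probability of those actions.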

Log weight¶

tensorlayer.rein.log_weight(probs, weights, name='log_weight')

Log weight.

Parameters:

probs : tensor

If it is a network output, usually we should scale it to [0, 1] via softmax.

weights : tensor
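
The description above does not spell out the formula. One plausible reading, stated here purely as an assumption, is the mean of log(probs) multiplied element-wise by weights; the sketch below illustrates that reading in NumPy, and the function name `log_weight_sketch` is hypothetical. Check the library source before relying on it.

```python
import numpy as np

def log_weight_sketch(probs, weights):
    """Hypothetical sketch; the formula mean(log(probs) * weights) is an assumption."""
    probs = np.asarray(probs, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    # probs must lie in (0, 1], which is why a softmax output is the usual input.
    return float(np.mean(np.log(probs) * weights))

value = log_weight_sketch([0.5, 0.25], [1.0, 2.0])
print(value)
```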

Sampling functions¶

tensorlayer.rein.choice_action_by_probs(probs=[0.5, 0.5], action_list=None)

Choose and return an action according to the given action probability distribution.

Parameters:

probs : a list of floats

The probability distribution over all actions.

action_list : None or a list of actions (integers, strings, or other types)

If None, the returned action is an integer in the range 0 to len(probs)-1.

Examples

>>> from tensorlayer.rein import choice_action_by_probs
>>> for _ in range(5):  # sampled output; it varies from run to run
...     a = choice_action_by_probs([0.2, 0.4, 0.4])
...     print(a)
...
0
1
1
2
1
>>> for _ in range(3):
...     a = choice_action_by_probs([0.5, 0.5], ['a', 'b'])
...     print(a)
...
a
b
b
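
The sampling behaviour shown above can be approximated with NumPy's random generator. This is a minimal sketch under the documented semantics, not the library source; the name `sample_action` and the optional `rng` argument are hypothetical additions for illustration.

```python
import numpy as np

def sample_action(probs, action_list=None, rng=None):
    """Hypothetical sketch of sampling an action from a probability list."""
    rng = np.random.default_rng() if rng is None else rng
    # Draw an index according to the given distribution.
    index = int(rng.choice(len(probs), p=probs))
    if action_list is None:
        # No labels supplied: the action is the index itself, in 0..len(probs)-1.
        return index
    if len(action_list) != len(probs):
        raise ValueError("action_list and probs must have the same length")
    return action_list[index]

rng = np.random.default_rng(0)
print(sample_action([0.2, 0.4, 0.4], rng=rng))
print(sample_action([0.5, 0.5], ['a', 'b'], rng=rng))
```

Passing a seeded generator makes the draws reproducible, which helps when debugging an agent's behaviour.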
