Neural Approaches to Conversational AI

Reading notes on a survey paper recommended by a senior labmate

SIGIR 2018

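Main contributions:

Presents a comprehensive survey of the neural approaches developed in recent years for QA, task-oriented, and chitchat dialogue bots

Describes the connections between modern neural approaches and traditional ones, which helps us understand why and how the research has evolved and sheds light on the way forward

Presents state-of-the-art approaches to training dialogue agents, using both supervised learning and reinforcement learning

Surveys the landmark dialogue systems in research and industry, examining what has been achieved so far and the challenges that remain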

Dialogue:

task completion: the agent needs to accomplish user tasks

social chat: the agent needs to converse appropriately with users (like a human, as measured by the Turing test) and provide useful recommendations

bots : task-oriented & chitchat

dialogue as optimal decision making

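Dialogue is a sequential decision-making process. It has a natural hierarchy: a top-level process selects which agent to activate for a particular subtask, and a low-level process, controlled by the selected agent, chooses primitive actions to complete that subtask.

Such a hierarchical decision-making process can be described by Markov decision processes (MDPs) with options, where options generalize primitive actions to higher-level actions. This extends the traditional MDP setting, in which an agent can only choose one primitive action at each time step; with options, the agent can choose a "multi-step" action.

If we view each option as an action, both the top-level and low-level processes are naturally captured by the reinforcement learning framework. The dialogue agent navigates an MDP, interacting with its environment over a sequence of discrete steps. At each time step, the agent observes the current state and chooses an action according to a policy. The agent then receives a reward and observes a new state, and the cycle continues until the episode terminates. The goal of dialogue learning is to find the optimal policy that maximizes the expected reward.

Sounding Board: a social chatbot designed to maximize user engagement, measured by the expected reward function of conversation-turns per session (CPS).

Hybrid approaches combine the strengths of different ML methods; for example, we might use imitation learning or supervised learning methods.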

the transition of NLP to Neural Approaches

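NLP applications differ from other data-processing systems in the variety of linguistic knowledge they use, including phonology, morphology, syntax, semantics, and discourse. The component tasks can be viewed as resolving natural-language ambiguity at different levels, by mapping a natural-language sentence to a series of human-defined, unambiguous symbolic representations, such as POS (part-of-speech) tags, context-free grammar, and first-order predicate calculus.

In contrast, end-to-end systems focus on carefully tailoring the increasingly complex architecture of the neural network.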

machine learning background

supervised learning (SL)

mean squared error (MSE)

stochastic gradient descent (SGD)

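In an unknown domain, the agent has to learn how to choose its next action by interacting with the environment; this is reinforcement learning (RL). RL raises three challenges. 1: exploration vs. exploitation: the agent must exploit what it already knows in order to obtain reward, but it must also explore unknown states and actions in order to make better action selections in the future. 2: delayed reward and temporal credit assignment: the agent does not know whether a dialogue completed the task successfully until the end of the session, so it must determine which actions in the sequence contributed to the final reward; this is the temporal credit assignment problem. 3: partially observed states: neural approaches learn to represent the state by encoding all the information observed in the current turn and in past steps.

Compared with earlier techniques, neural approaches provide a more effective solution by leveraging the representation-learning power of deep neural networks.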

deep learning

multi-layer perceptron (MLP) inputs/outputs/hidden layers

deep neural networks (DNN)

information retrieval (IR)

The main effort in designing a deep learning classifier goes into optimizing the neural network architecture for effective representation learning.

convolutional layers for local word dependencies & recurrent layers for global word sequences

deep semantic similarity model (DSSM)

reinforcement learning

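agent-environment interaction is modeled as a discrete-time Markov decision process (MDP), described by a five-tuple $M = \langle S, A, P, R, \gamma \rangle$

$S$: a possibly infinite set of states the environment can be in; $A$: a possibly infinite set of actions the agent can take; $P(s'|s,a)$: the probability that the environment moves from state $s$ to state $s'$ after action $a$; $R(s,a)$: the average reward the agent receives immediately after taking action $a$ in state $s$; $\gamma \in (0, 1]$: the discount factor

transition: $(s,a,r,s')$; the action-selection policy is denoted by $\pi$ (deterministic or stochastic)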

![1](E:fourth_year_in_buptpaper笔记 eural approach to conversational AI1.JPG)
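
To make the five-tuple concrete, here is a minimal sketch of a tabular MDP and one agent-environment rollout that records transitions $(s,a,r,s')$ and accumulates the discounted return. The two-state toy MDP, the names `P`, `R`, `step`, `rollout`, and `random_policy`, and the value of `GAMMA` are all assumptions made for illustration; none of them come from the survey.

```python
import random

# Hypothetical toy MDP: states 0 and 1, actions "stay" and "move".
# P[s][a] is a list of (next_state, probability); R[s][a] is the immediate reward.
P = {
    0: {"stay": [(0, 1.0)], "move": [(1, 0.9), (0, 0.1)]},
    1: {"stay": [(1, 1.0)], "move": [(0, 0.9), (1, 0.1)]},
}
R = {
    0: {"stay": 0.0, "move": 1.0},
    1: {"stay": 0.5, "move": 0.0},
}
GAMMA = 0.9  # discount factor in (0, 1] (assumed value)


def step(state, action, rng):
    """Sample s' ~ P(.|s, a) and return (reward, next_state)."""
    next_states, probs = zip(*P[state][action])
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    return R[state][action], next_state


def rollout(policy, start_state, horizon, rng):
    """Run the policy for `horizon` steps; return the transitions and the discounted return."""
    state, transitions, ret = start_state, [], 0.0
    for t in range(horizon):
        action = policy(state, rng)
        reward, next_state = step(state, action, rng)
        transitions.append((state, action, reward, next_state))
        ret += (GAMMA ** t) * reward
        state = next_state
    return transitions, ret


def random_policy(state, rng):
    """A stochastic policy: uniform over the two actions."""
    return rng.choice(["stay", "move"])


rng = random.Random(0)
transitions, discounted_return = rollout(random_policy, 0, 5, rng)
```

A deterministic policy fits the same interface, for example `lambda state, rng: "stay"`.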

Q-learning

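The first family of algorithms rests on one observation: an optimal policy can be retrieved immediately if the optimal Q-function is available. The optimal policy is given by:
$$
\pi^*(s) = \mathop{\mathrm{argmax}}_{a} Q^*(s,a).
$$
Consequently, a large family of RL algorithms focuses on learning $Q^*(s,a)$; these are collectively called value-function-based methods.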

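In practice, representing $Q(s,a)$ by a table with one entry per $(s,a)$ pair is expensive, so we usually use a compact representation of Q. In particular, we assume the Q-function has a predefined parametric form. A linear approximation is one example:
$$
Q(s,a;\theta) = \phi(s,a)^T \theta
$$
where $\phi(s,a)$ is a d-dimensional hand-coded feature vector for the state-action pair $(s,a)$, and $\theta$ is the corresponding coefficient vector learned from data. In general, $Q(s,a;\theta)$ can take many different parametric forms, for example a deep Q-network (DQN). The Q-function can also be represented non-parametrically, for example with decision trees or Gaussian processes. After observing a state transition $(s,a,r,s')$, $\theta$ is updated as follows: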

![2](E:fourth_year_in_buptpaper笔记 eural approach to conversational AI2.JPG)

The update above is the Q-learning update; $\nabla$ denotes the gradient.
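
As a rough illustration, the sketch below applies one Q-learning step to a linear Q-function. It assumes the update takes the standard form $\theta \leftarrow \theta + \alpha\,(r + \gamma \max_{a'} Q(s',a';\theta) - Q(s,a;\theta))\,\nabla_\theta Q(s,a;\theta)$, which for a linear Q-function has gradient $\phi(s,a)$. The one-hot feature map `phi`, the two-state, two-action setup, and the values of `GAMMA` and `ALPHA` are assumptions for illustration only.

```python
GAMMA = 0.9   # discount factor (assumed value)
ALPHA = 0.1   # learning rate (assumed value)
ACTIONS = [0, 1]


def phi(state, action):
    """Hypothetical one-hot feature vector over 2 states x 2 actions."""
    features = [0.0] * 4
    features[state * 2 + action] = 1.0
    return features


def q_value(theta, state, action):
    """Linear Q-function: Q(s, a; theta) = phi(s, a)^T theta."""
    return sum(f * w for f, w in zip(phi(state, action), theta))


def q_learning_update(theta, transition):
    """Apply one Q-learning step for a transition (s, a, r, s')."""
    s, a, r, s_next = transition
    target = r + GAMMA * max(q_value(theta, s_next, b) for b in ACTIONS)
    td_error = target - q_value(theta, s, a)
    # For a linear Q-function the gradient with respect to theta is phi(s, a).
    return [w + ALPHA * td_error * f for w, f in zip(theta, phi(s, a))]


theta = [0.0, 0.0, 0.0, 0.0]
theta = q_learning_update(theta, (0, 1, 1.0, 1))
```

With one-hot features this reduces to the tabular case: only the coefficient for the visited $(s,a)$ pair changes.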

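Q-learning is often unstable and needs many examples before it gets close to the optimum $Q^*$. Two modifications help. The first is experience replay: instead of using an observed transition only once to update $\theta$, one may store it in a replay buffer and periodically sample from the buffer to perform Q-learning updates. Each transition can then be used many times, and learning becomes more stable because the data distribution no longer changes too sharply as the parameters are updated.

The second is a two-network implementation. Here the learner maintains an extra copy of the Q-function, called the target network, parameterized by $\theta_{\text{target}}$. During learning, $\theta_{\text{target}}$ is held fixed while $\theta$ is updated: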

![3](E:fourth_year_in_buptpaper笔记 eural approach to conversational AI3.JPG)

Periodically, $\theta_{\text{target}}$ is updated to $\theta$, and the process continues. This is an instance of the fitted value iteration algorithm.
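
The two modifications can be sketched together as follows. This is only an illustration under stated assumptions: the `ReplayBuffer` class, the `train` loop, the tabular `q_update` (where `theta[s * 2 + a]` stores the Q-value for a two-state, two-action problem), and the values of `GAMMA`, `ALPHA`, and the buffer capacity are all hypothetical, not taken from the survey.

```python
import random
from collections import deque

GAMMA, ALPHA = 0.9, 0.5  # assumed values


class ReplayBuffer:
    """Fixed-capacity store of transitions (s, a, r, s')."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng):
        # Sample without replacement, capped by the current buffer size.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def q_update(theta, theta_target, transition):
    """Tabular Q-learning step whose bootstrap target uses the frozen copy."""
    s, a, r, s_next = transition
    target = r + GAMMA * max(theta_target[s_next * 2 + b] for b in (0, 1))
    new_theta = list(theta)
    new_theta[s * 2 + a] += ALPHA * (target - theta[s * 2 + a])
    return new_theta


def train(transitions, theta, sync_every, batch_size, rng):
    """Replay-based training loop with a periodically synced target copy."""
    buffer = ReplayBuffer(capacity=1000)
    theta_target = list(theta)           # extra copy, held fixed between syncs
    for step, transition in enumerate(transitions, start=1):
        buffer.add(transition)
        for sampled in buffer.sample(batch_size, rng):
            theta = q_update(theta, theta_target, sampled)
        if step % sync_every == 0:
            theta_target = list(theta)   # theta_target <- theta
    return theta, theta_target


theta, theta_target = train(
    [(0, 1, 1.0, 1)] * 4, [0.0] * 4, sync_every=2, batch_size=1, rng=random.Random(0)
)
```

Because stored transitions are resampled, a single observed transition can drive several updates, which is the point of the replay buffer.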

dueling Q-network / double Q-learning / SBEED

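policy gradient

The other family of algorithms optimizes the policy directly, without having to learn a Q-function. The policy itself is directly parameterized by $\theta \in \mathbb{R}^d$, and $\pi(s;\theta)$ is often a distribution over actions. A policy is evaluated by the average long-term reward it obtains over a trajectory of length $H$, $\tau = (s_1,a_1,r_1,\ldots,s_H,a_H,r_H)$:
$$
J(\theta) := \mathbb{E}\left[\sum_{t=1}^{H} \gamma^{t-1} r_t \,\middle|\, a_t \sim \pi(s_t;\theta)\right].
$$
It is possible to estimate $\nabla_\theta J$ from sampled trajectories and to maximize $J$ by stochastic gradient ascent:
$$
\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta),
$$
This algorithm is known as REINFORCE: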

![4](E:fourth_year_in_buptpaper笔记 eural approach to conversational AI4.JPG)
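
A minimal sketch of a REINFORCE-style gradient estimate follows. It assumes the common reward-to-go form, $\nabla_\theta J \approx \sum_t \gamma^{t-1} \nabla_\theta \log \pi(a_t|s_t;\theta) \sum_{h \ge t} \gamma^{h-t} r_h$, and a tabular softmax policy in which `theta[s][b]` is the logit of action `b` in state `s`. The policy form, the function names, and the value of `GAMMA` are assumptions for illustration and may differ from the formula in the figure.

```python
import math

GAMMA = 0.9  # discount factor (assumed value)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(theta, state, action):
    """Gradient of log pi(a|s; theta) for a tabular softmax policy."""
    probs = softmax(theta[state])
    grad = [[0.0] * len(row) for row in theta]
    for b, p in enumerate(probs):
        grad[state][b] = (1.0 if b == action else 0.0) - p
    return grad


def reinforce_gradient(theta, trajectory):
    """Single-trajectory estimate of grad J from [(s_1, a_1, r_1), ..., (s_H, a_H, r_H)]."""
    horizon = len(trajectory)
    # Discounted reward-to-go: G_t = sum_{h >= t} gamma^(h - t) * r_h.
    returns, running = [0.0] * horizon, 0.0
    for t in reversed(range(horizon)):
        running = trajectory[t][2] + GAMMA * running
        returns[t] = running
    grad = [[0.0] * len(row) for row in theta]
    for t, (s, a, _) in enumerate(trajectory):
        g = grad_log_pi(theta, s, a)
        weight = (GAMMA ** t) * returns[t]  # t is 0-based here, i.e. gamma^(t-1) in 1-based notation
        for i, row in enumerate(g):
            for j, value in enumerate(row):
                grad[i][j] += weight * value
    return grad


def ascent_step(theta, grad, alpha):
    """theta <- theta + alpha * grad (gradient ascent on J)."""
    return [[w + alpha * g for w, g in zip(wr, gr)] for wr, gr in zip(theta, grad)]


theta = [[0.0, 0.0], [0.0, 0.0]]
trajectory = [(0, 1, 1.0), (1, 0, 0.0)]
grad = reinforce_gradient(theta, trajectory)
theta = ascent_step(theta, grad, alpha=0.1)
```

After the step, the rewarded action in state 0 becomes slightly more likely than the alternative.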

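Actor-critic algorithms: because the estimate above is a direct sum over the trajectory, its variance can be very high. The variance can be reduced by using an estimated value function of the current policy, often referred to as the critic in actor-critic algorithms: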

![5](E:fourth_year_in_buptpaper笔记 eural approach to conversational AI5.JPG)

Moreover, there has been much work on how to compute $\nabla_\theta J$ more effectively than with the steepest-descent update above.

exploration

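The algorithms above update the value function or policy when transitions are given as input. An RL agent must also learn how to select actions so as to collect the transitions it needs for learning. Selecting a novel action is called exploration. Exploration is risky, so the balance between exploration and exploitation is very important.

A basic exploration strategy is $\epsilon$-greedy. The main idea is to choose the action that looks best with high probability (for exploitation), and a random action with small probability (for exploration). In the case of DQN, suppose $\theta$ is the current parameter of the Q-function; the action-selection rule for state $s_t$ is then:
$$
a_t=\begin{cases}
\mathop{\mathrm{argmax}}_{a}Q(s_t,a;\theta) & \text{with probability}\quad 1 - \epsilon \\
\text{random action} & \text{with probability}\quad \epsilon
\end{cases}
$$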
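
The rule can be sketched in a few lines. Here `q_values[a]` stands in for $Q(s_t,a;\theta)$; the function name `epsilon_greedy` and the example values are assumptions for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick argmax_a Q(s_t, a) with probability 1 - epsilon, else a uniformly random action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


rng = random.Random(0)
q_values = [0.1, 0.7, 0.2]
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)

# Count how often each action is chosen with a small exploration rate.
counts = [0, 0, 0]
for _ in range(1000):
    counts[epsilon_greedy(q_values, epsilon=0.1, rng=rng)] += 1
```

Note that the random branch can also return the greedy action, so the greedy action is chosen slightly more often than $1 - \epsilon$ of the time.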

Question Answering and Machine Reading Comprehension

KB (knowledge base) QA / text-QA

Implicit ReasoNet (IRN) / M-walk - KBQA

machine reading comprehension (MRC) :

encoding questions and passages as vectors in a neural space
performing reasoning in the neural space to generate the answer

TREC QA open benchmarks

knowledge base

DBPedia, Freebase, Yago

subject-predicate-object triples (s, r, t) -> knowledge graph (KG)

semantic parsing for KB-QA

paraphrasing in natural language : embedding-based methods

search complexity : multi-step reasoning

embedding-based methods

knowledge base completion (KBC) task: predicting the existence of a triple that is not seen in the KB (whether a fact is true or not)

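bilinear model: the model scores how likely a triple holds using $\mathrm{score}(s,r,t;\theta)=x_s^T W_r x_t$, where $x_e \in \mathbb{R}^d$ is the vector learned for each entity $e$ and $W_r \in \mathbb{R}^{d \times d}$ is the matrix learned for each relation $r$. Every triple that actually appears in the KB is a positive example; negative examples are generated by corrupting one element of a true triple. The margin-based loss is:
$$
L(\theta)=\sum_{(x^+,x^-)\in \mathcal{D}} [\gamma+\mathrm{score}(x^-;\theta) - \mathrm{score}(x^+;\theta)]_+,
$$
where $[x]_+:=\max(0,x)$.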
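
A small sketch of the bilinear score and the margin-based loss, using plain Python lists for vectors and matrices. The 2-dimensional embeddings, the relation matrix `W_capital_of`, and the entity names are made-up values for illustration, not learned parameters.

```python
def bilinear_score(x_s, W_r, x_t):
    """score(s, r, t) = x_s^T W_r x_t."""
    Wx_t = [sum(w * x for w, x in zip(row, x_t)) for row in W_r]
    return sum(a * b for a, b in zip(x_s, Wx_t))


def margin_loss(pairs, gamma):
    """Sum of [gamma + score(x-) - score(x+)]_+ over (positive, negative) score pairs."""
    return sum(max(0.0, gamma + neg - pos) for pos, neg in pairs)


# Hypothetical embeddings and relation matrix, for illustration only.
entity = {"paris": [1.0, 0.0], "france": [0.0, 1.0]}
W_capital_of = [[0.0, 2.0], [0.0, 0.0]]

pos = bilinear_score(entity["paris"], W_capital_of, entity["france"])  # true triple
neg = bilinear_score(entity["paris"], W_capital_of, entity["paris"])   # corrupted tail entity
loss = margin_loss([(pos, neg)], gamma=1.0)
```

The loss is zero once the positive triple outscores its corrupted counterpart by at least the margin $\gamma$.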

path queries: given an initial anchor entity $s$ and a sequence of relations to be traversed $(r_1,\ldots,r_k)$, how likely a path query $(q,t)$ holds, with $q=(s,r_1,\ldots,r_k)$, is scored as:
$$
\mathrm{score}(q,t)=x_s^T W_{r_1} \cdots W_{r_k} x_t.
$$
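
The path score is a chain of vector-matrix products followed by a dot product, as the sketch below shows. The helper names and the toy `swap` relation matrix are assumptions for illustration.

```python
def matvec_left(x, W):
    """Row vector times matrix: returns x^T W as a list."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def path_score(x_s, relation_matrices, x_t):
    """score(q, t) = x_s^T W_{r_1} ... W_{r_k} x_t for q = (s, r_1, ..., r_k)."""
    v = list(x_s)
    for W in relation_matrices:  # traverse the relations in order
        v = matvec_left(v, W)
    return sum(a * b for a, b in zip(v, x_t))


# Hypothetical 2-d example: the relation matrix swaps the two coordinates.
swap = [[0.0, 1.0], [1.0, 0.0]]
one_hop = path_score([1.0, 0.0], [swap], [0.0, 1.0])
two_hop = path_score([1.0, 0.0], [swap, swap], [1.0, 0.0])
```

With a single relation the path score reduces to the bilinear triple score.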

multi-step reasoning on KB

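knowledge base reasoning (KBR): relational paths $\pi = (r_1,\ldots,r_k)$. The three methods introduced below differ in whether reasoning is performed in a discrete symbolic space or a continuous neural space.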

symbolic methods