Relation Extraction via Distant Supervision: "Fine-tuning Pre-Trained Transformer Language Models to Distantly Supervised Relation Extraction"

I. Overview

  The current focus of distantly supervised relation extraction is on eliminating noise. The main approaches are multi-instance learning and supplying linguistic or contextual information to guide relation classification. Although these models achieve state-of-the-art results, they only reach high precision on a limited set of relations and ignore the fact that relations are far more diverse, so they lack generalization ability.

  To address this, the paper proposes a distant-supervision approach built on a pre-trained language model.

  Since GPT and other closely related pre-trained language models have been shown to capture semantic and syntactic features, as well as a significant amount of "common-sense" knowledge, the authors hypothesize that these are important features for recognizing a more diverse set of relations.

  The model is first pre-trained on plain text and then fine-tuned on the NYT10 dataset, and is shown to make high-confidence predictions over a much larger set of relations.

II. Related Work

  1. Relation Extraction (RE)

    ① Multi-instance learning approaches:

      Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In ACL/IJCNLP.

      Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance Multi-label Learning for Relation Extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL ’12, pages 455–465, Stroudsburg, PA, USA. Association for Computational Linguistics.

      Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural Relation Extraction with Selective Attention over Instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2124–2133, Berlin, Germany. Association for Computational Linguistics.

    ② Adding (providing) semantic and syntactic knowledge:

      Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2335–2344. Dublin City University and Association for Computational Linguistics.

      Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018b. Graph Convolution over Pruned Dependency Trees Improves Relation Extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2205–2215, Brussels, Belgium. Association for Computational Linguistics.

    ③ Using side information (e.g., relation aliases, entity types, ...):

      Shikhar Vashishth, Rishabh Joshi, Sai Suman Prayaga, Chiranjib Bhattacharyya, and Partha Talukdar. 2018. Reside: Improving distantly-supervised neural relation extraction using side information. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1257–1266. Association for Computational Linguistics. 

  2. Language Models (LM)

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.  Introduces the Transformer.

      Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Available as a preprint.  (Shows that introducing language modeling as an auxiliary objective during fine-tuning improves generalization and speeds up convergence; this is applied in the loss function, see Section IV.1.)

    Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke S. Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT.

    These two papers show that useful semantic and syntactic properties can be captured purely through unsupervised pre-training.

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. Computing Research Repository (CoRR), abs/1810.04805.  (This paper, together with the three above, demonstrates state-of-the-art performance across a wide range of natural language processing tasks.)

    Christoph Alt, Marc Hübner, and Leonhard Hennig. 2019. Improving relation extraction by pre-trained language representations. In Proceedings of the 2019 Conference on Automated Knowledge Base Construction, Amherst, Massachusetts.  Shows that this approach also performs very well on supervised relation extraction.

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.  Finds that language models answer open-domain questions reasonably well without being trained on the actual task, suggesting that they capture a certain amount of "common-sense" knowledge.

  The authors therefore hypothesize that pre-trained language models provide a stronger signal for distant supervision, guiding relation extraction more effectively through the knowledge acquired during unsupervised pre-training. Replacing explicit linguistic features and side information with implicit features improves domain and language independence, increases the diversity of relations that can be recognized, minimizes explicit feature extraction, and reduces the risk of error accumulation.

III. Transformer Language Model

  1. Transformer-Decoder

   First, the transformer-decoder (see the figure in the original paper) is a variant of the Transformer's decoder, consisting of masked multi-head self-attention followed by a position-wise feed-forward operation.

    

     h_0 is the input to the first block: h_0 = T W_e + W_p. Here T is a matrix with one row per token, each row being a one-hot vector over the vocabulary that identifies the token; W_e is the token embedding matrix and W_p is the position embedding matrix. The transformer-decoder stacks L such blocks, h_l = transformer_block(h_{l-1}) for l = 1, ..., L, and the scores over the vocabulary are obtained as h_L W_e^T (followed by a softmax).
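     A minimal PyTorch sketch of this computation; the dimensions, layer count, and the use of nn.TransformerEncoderLayer with a causal mask as the masked self-attention block are illustrative assumptions, not the paper's exact model:

```python
import torch
import torch.nn as nn

vocab_size, n_positions, d_model, n_heads, n_layers = 1000, 128, 64, 4, 2

tok_emb = nn.Embedding(vocab_size, d_model)   # W_e: token embedding matrix
pos_emb = nn.Embedding(n_positions, d_model)  # W_p: position embedding matrix

# One "transformer block": masked multi-head self-attention + feed-forward.
# An encoder layer restricted by a causal mask behaves like a GPT-style decoder block.
blocks = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                               batch_first=True)
    for _ in range(n_layers)
])

tokens = torch.randint(0, vocab_size, (1, 10))           # a toy sentence of 10 token ids
positions = torch.arange(tokens.size(1)).unsqueeze(0)

h = tok_emb(tokens) + pos_emb(positions)                 # h_0 = T W_e + W_p
causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
for block in blocks:                                     # h_l = transformer_block(h_{l-1})
    h = block(h, src_mask=causal_mask)

logits = h @ tok_emb.weight.T                            # scores over the vocabulary: h_L W_e^T
```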

  

   2. Unsupervised Pre-training of Language Representations (not elaborated here)

  

     L_LM(X) = Σ_i log P(x_i | x_{i-k}, ..., x_{i-1}; Θ), i.e., the standard left-to-right language-modeling objective over a context window of size k.
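     As a toy illustration of this objective, the following uses the standard shift-by-one next-token-prediction recipe with made-up shapes (not the paper's code):

```python
import torch
import torch.nn.functional as F

vocab_size = 1000
tokens = torch.randint(0, vocab_size, (1, 10))     # a toy token-id sequence
logits = torch.randn(1, 10, vocab_size)            # stand-in for the decoder's output scores

# Position i predicts token i+1: drop the last logit and the first target.
lm_loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                          tokens[:, 1:].reshape(-1))
```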

IV. Multi-Instance Learning with the Transformer (fine-tuning; the core of the paper)

  1. Distantly Supervised Fine-tuning on Relation Extraction

    For selective attention, refer to the 2016 paper from Zhiyuan Liu's group (Neural Relation Extraction with Selective Attention over Instances); only the formulas are given here.

    e_i = x_i A r,    α_i = exp(e_i) / Σ_k exp(e_k),    s = Σ_i α_i x_i

    where x_i is the representation of the i-th sentence in the bag, A is a weighted diagonal matrix, r is the query vector associated with the relation, and s is the resulting bag representation.

    Let S denote the set of sentences in a bag. Given S and the trained parameters θ, the model maximizes the probability of the entity pair's relation label l:

    P(l | S, θ) = softmax(W_r s + b)

    where W_r is the relation classification matrix and b is a bias term.
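    A minimal sketch of selective attention over one bag; the shapes (53 relations, a bag of 5 sentences, 64-dimensional representations) are made up for illustration:

```python
import torch
import torch.nn.functional as F

d_model, n_relations, bag_size = 64, 53, 5

x = torch.randn(bag_size, d_model)          # x_i: sentence representations of one bag
A = torch.diag(torch.rand(d_model))         # weighted diagonal matrix A
r = torch.randn(d_model)                    # query vector for the target relation
W_r = torch.randn(n_relations, d_model)     # relation classification matrix W_r
b = torch.zeros(n_relations)

e = x @ A @ r                               # e_i = x_i A r
alpha = F.softmax(e, dim=0)                 # α_i = exp(e_i) / Σ_k exp(e_k)
s = (alpha.unsqueeze(1) * x).sum(dim=0)     # s = Σ_i α_i x_i  (bag representation)

p = F.softmax(W_r @ s + b, dim=0)           # P(l | S, θ) = softmax(W_r s + b)
```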

    During fine-tuning, the loss function can then be set as:

      L_RE(D) = Σ_{i=1}^{|D|} log P(l_i | S_i, θ)

     Since Radford et al. note in "Improving language understanding by generative pre-training" that introducing language modeling as an auxiliary objective during fine-tuning improves generalization and speeds up convergence, the final loss is set as:

    L(D) = λ · L_LM(D) + L_RE(D)

    where L_LM(D) is the language-modeling objective from Section III.2 and λ weights its contribution.
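    A toy sketch of combining the two objectives; all tensors below are random stand-ins, and λ = 0.5 is an illustrative value rather than necessarily the paper's setting:

```python
import torch
import torch.nn.functional as F

lam = 0.5                                          # λ, weight of the auxiliary LM objective

re_logits = torch.randn(8, 53)                     # one bag-level prediction per training bag
re_labels = torch.randint(0, 53, (8,))
lm_logits = torch.randn(8, 20, 1000)               # next-token scores over the vocabulary
lm_labels = torch.randint(0, 1000, (8, 20))

loss_re = F.cross_entropy(re_logits, re_labels)    # corresponds to maximizing L_RE(D)
loss_lm = F.cross_entropy(lm_logits.reshape(-1, 1000), lm_labels.reshape(-1))
loss = lam * loss_lm + loss_re                     # L(D) = λ·L_LM(D) + L_RE(D), as a loss to minimize
```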

   2. Input Representation

    The model input deserves some care.

    First, the sentence is tokenized with byte pair encoding (BPE); see http://www.luyixian.cn/news_show_73867.aspx for details. The token sequence is then arranged as:

       [start][entity1][sep][entity2][sep][sentence][clf]

    The hidden state at the final [clf] token serves as the sentence representation used for relation classification.
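    A small sketch of assembling this input; the special-token strings and the whitespace "tokenizer" are placeholders for the paper's BPE vocabulary:

```python
def build_input(head, tail, sentence, tokenize):
    """Return the token sequence [start] head [sep] tail [sep] sentence [clf]."""
    return (["<start>"] + tokenize(head) + ["<sep>"]
            + tokenize(tail) + ["<sep>"]
            + tokenize(sentence) + ["<clf>"])

# A toy whitespace tokenizer stands in for BPE here.
tokens = build_input("Barack Obama", "Hawaii",
                     "Barack Obama was born in Hawaii .", str.split)
print(tokens)
```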

   

V. Experiments

  Experiments are conducted on the NYT10 dataset.

  Pre-training: because pre-training is computationally expensive and the main goal is to demonstrate its effectiveness through fine-tuning on the distantly supervised relation extraction task, the authors directly reuse the pre-trained model from Radford et al., "Improving language understanding by generative pre-training", trained on the corpus of:

    Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. 2015 IEEE International Conference on Computer Vision (ICCV), pages 19– 27.

  Hyper-parameter settings:

    Adam optimization with β1 = 0.9, β2 = 0.999

    a batch size of 8

    a learning rate of 6.25e-5

    3 epochs

    attention dropout with a rate of 0.1

    classifier dropout with a rate of 0.2
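  A sketch of wiring up these settings; the model below is a trivial stand-in, and only the optimizer and dropout values mirror the list above:

```python
import torch
import torch.nn as nn

# Trivial stand-in model; in the paper this is the pre-trained transformer-decoder.
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 53))

optimizer = torch.optim.Adam(model.parameters(), lr=6.25e-5, betas=(0.9, 0.999))

batch_size, epochs = 8, 3
attention_dropout = 0.1   # applied inside the transformer blocks
classifier_dropout = 0.2  # applied before the relation classifier
```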

  

  

A brief survey of related work:

  Early RE work used statistical classifiers or kernel-based methods combined with discrete syntactic features such as part-of-speech and named-entity tags, morphological features, and WordNet hypernyms:

    Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In ACL/IJCNLP.

    Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. Semeval-2010 Task 8: Multi-Way Classification of Semantic Relations between Pairs of Nominals. In SemEval@ACL.

  These approaches have since been superseded by sequence-based methods, including recursive and recurrent models:

    Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In EMNLP-CoNLL.

    Dongxu Zhang and Dong Wang. 2015. Relation classification via recurrent neural network. arXiv preprint arXiv:1508.01006.

  and convolutional neural networks:

    Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2335–2344. Dublin City University and Association for Computational Linguistics.

    Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1753–1762, Lisbon, Portugal. Association for Computational Linguistics.

  Accordingly, discrete features have been replaced by distributed representations of words and syntactic features:

    Joseph P. Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In ACL.

    Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In EMNLP.

  Shortest dependency path (SDP) information has been integrated into LSTM-based relation classification models:

    Kun Xu, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2015a. Semantic relation classification via convolutional neural networks with simple negative sampling. In EMNLP.

    Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. 2015b. Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1785–1794. Association for Computational Linguistics.

  The SDP is considered useful for relation classification because it concentrates on the action and its agents within a sentence:

    Razvan C. Bunescu and Raymond J. Mooney. 2005. A Shortest Path Dependency Kernel for Relation Extraction. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT ’05, pages 724–731. Association for Computational Linguistics.

    Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. TACL, 2:207– 218.

    Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018b. Graph Convolution over Pruned Dependency Trees Improves Relation Extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2205– 2215, Brussels, Belgium. Association for Computational Linguistics.

    By combining dependency-tree pruning with graph convolutions, this work established a new state of the art for relation extraction on the TACRED dataset.

  

  Distant supervision:

    multi-instance learning

    Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling Relations and Their Mentions without Labeled Text. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD ’10).

    multi-instance multi-label learning:

    Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance Multi-label Learning for Relation Extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL ’12, pages 455–465, Stroudsburg, PA, USA. Association for Computational Linguistics.

    Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 541–550, Portland, Oregon, USA. Association for Computational Linguistics.

    With the adoption of neural networks:

      PCNN,   selective attention

      adversarial training:

        Yi Wu, David Bamman, and Stuart Russell. 2017. Adversarial training for relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1778–1783. Association for Computational Linguistics.

        Pengda Qin, Weiran XU, and William Yang Wang. 2018. Dsgan: Generative adversarial training for distant supervision relation extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 496–505. Association for Computational Linguistics.

      noise models:

        Bingfeng Luo, Yansong Feng, Zheng Wang, Zhanxing Zhu, Songfang Huang, Rui Yan, and Dongyan Zhao. 2017. Learning with noise: Enhance distantly supervised relation extraction with dynamic transition matrix. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 430–439. Association for Computational Linguistics.

      soft labeling:

        Tianyu Liu, Kexiang Wang, Baobao Chang, and Zhifang Sui. 2017. A soft-label method for noise-tolerant distantly supervised relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1790–1795. Association for Computational Linguistics.

            

    In addition, linguistic and semantic background knowledge is helpful for this task, but the proposed systems typically rely on explicit features such as dependency trees, named-entity types, and relation aliases:

      Shikhar Vashishth, Rishabh Joshi, Sai Suman Prayaga, Chiranjib Bhattacharyya, and Partha Talukdar. 2018. Reside: Improving distantly-supervised neural relation extraction using side information. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1257–1266. Association for Computational Linguistics.

      Yadollah Yaghoobzadeh, Heike Adel, and Hinrich Schütze. 2017. Noise mitigation for neural entity typing and relation extraction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1183–1194. Association for Computational Linguistics.

    

 

Original post: https://www.cnblogs.com/dhName/p/11813113.html