Real-Time Natural Language Understanding with BERT Using TensorRT (Part 1)

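Large-scale language models (LSLMs) such as BERT, GPT-2, and XLNet have brought state-of-the-art accuracy leaps to many natural language understanding (NLU) tasks. Since its release in October 2018, BERT [1] (Bidirectional Encoder Representations from Transformers) has remained one of the most popular language models and, at the time of writing, still delivers state-of-the-art accuracy.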

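BERT provided a leap in accuracy for NLU tasks, bringing high-quality language-based services within reach of companies across many industries. To use a model in production, you need to consider factors beyond accuracy, such as latency, which affect end-user satisfaction with the service. Because BERT is a 12- or 24-layer stacked multi-head attention network, it requires significant compute during inference. Until now, this has made it challenging for companies to deploy BERT in real-time applications.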

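NVIDIA has released new TensorRT optimizations for BERT that let you perform inference in 2.2 ms* on T4 GPUs. This is 17x faster than CPU-only platforms and well within the 10 ms latency budget required for conversational AI applications. These optimizations make it practical to use BERT in production, for example, as part of a conversational AI service.

TensorRT is a platform for high-performance deep learning inference. It includes an optimizer and a runtime that minimize latency and maximize throughput in production. With TensorRT, you can optimize models trained in all major frameworks, calibrate for lower precision with high accuracy, and finally deploy in production.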

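All of the optimizations and code used to achieve this performance with BERT are being released as open source in this TensorRT sample repo. We have optimized the Transformer layer, which is the fundamental building block of the BERT encoder, so you can adapt these optimizations to any BERT-based NLP task. BERT is being applied to an expanding set of speech and NLP applications beyond conversational AI, all of which can take advantage of these optimizations.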

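Question answering (QA), or reading comprehension, is a very popular way to test a model's ability to understand context. The SQuAD leaderboard [3] tracks the top performers for this task, along with the dataset and test set they provide. Over the past few years, QA capabilities have developed rapidly, with contributions from academia and companies around the world. In this post, we demonstrate how to create a simple QA application in Python, powered by the TensorRT-optimized BERT code that we are releasing today. The example provides an API to input a passage and questions, and it returns the responses generated by the BERT model.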

Let's start with the steps to perform training and inference using TensorRT for BERT.

BERT Training and Inference Pipeline

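A major problem faced by NLP researchers and developers is the scarcity of high-quality labeled training data for their specific NLP task. To avoid learning a model for the task from scratch, recent breakthroughs in NLP leverage vast amounts of unlabeled text and decompose the NLP task into two parts: 1) learning to represent the meaning of words and the relationships between them, that is, building a language model using auxiliary tasks and a large corpus of text; and 2) specializing the language model to the actual task by augmenting it with a relatively small task-specific network that is trained in a supervised manner.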

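These two stages are typically referred to as pre-training and fine-tuning. This paradigm enables the use of a pre-trained language model for a wide range of tasks without any task-specific change to the model architecture. In our example, BERT provides a high-quality language model that is fine-tuned for question answering, but it is also suitable for other tasks such as sentence classification and sentiment analysis.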

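To pre-train BERT, you can either start from the pre-trained checkpoints available online (Figure 1 (left)) or pre-train BERT on your own custom corpus (Figure 1 (right)). You can also initialize pre-training from a checkpoint and then continue on custom data. While pre-training with custom or domain-specific data may yield interesting results (for example, BioBERT [5]), it is computationally intensive and requires a massively parallel compute infrastructure to complete within a reasonable amount of time. GPU-enabled multi-node training is an ideal solution for such scenarios.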

Learn how NVIDIA developers were able to train BERT in less than an hour in the "Training BERT with GPUs" blog.

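In the fine-tuning step, the task-specific network based on the pre-trained BERT language model is trained using task-specific training data (for question answering, this is (passage, question, answer) triples). Note that fine-tuning is generally far less computationally demanding than pre-training.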
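To make the shape of the fine-tuning data concrete, here is a minimal sketch of a (passage, question, answer) triple in which the answer is a span of the passage, as in SQuAD-style extractive QA. The class name `QAExample` and its fields are illustrative assumptions and are not taken from the sample repo.

```python
from dataclasses import dataclass


@dataclass
class QAExample:
    """One fine-tuning example. Class and field names are hypothetical."""
    passage: str
    question: str
    answer: str

    def answer_start(self) -> int:
        """Character offset of the answer in the passage, or -1 if absent."""
        return self.passage.find(self.answer)


example = QAExample(
    passage="TensorRT is a high performance deep learning inference platform.",
    question="What is TensorRT?",
    answer="a high performance deep learning inference platform",
)
start = example.answer_start()
end = start + len(example.answer)
print(start, repr(example.passage[start:end]))
```

Storing the answer as a span inside the passage is what lets the task-specific network stay small: it only has to predict where the answer starts and ends.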

To perform inference using a QA neural network:

Create a TensorRT engine by passing the fine-tuned weights and network definition to the TensorRT builder.
Start the TensorRT runtime with this engine.
Feed a passage and a question to the TensorRT runtime and receive as output the answer predicted by the network.

This entire workflow is outlined in Figure 2.

Figure 1: Generating BERT TensorRT engine from pretrained checkpoints

Figure 2: Workflow to perform inference with TensorRT runtime engine for BERT QA task
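The last inference step, turning the network output into an answer, can be sketched as follows. For extractive QA, BERT-style models typically produce a start score and an end score for each token, and the answer is the highest-scoring valid span. This is a minimal pure-Python sketch under that assumption. The function name `best_span`, the `max_answer_len` parameter, and the toy scores are illustrative and are not taken from the sample's code.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return (start, end) token indices that maximize start + end score,
    subject to start <= end and a bounded span length."""
    best = (0, 0)
    best_score = float("-inf")
    for i, s in enumerate(start_scores):
        last = min(len(end_scores), i + max_answer_len)
        for j in range(i, last):
            score = s + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best


# Toy scores (assumed values, not real model output).
tokens = ["tensorrt", "is", "a", "high", "performance", "platform"]
start_scores = [0.1, 0.0, 2.5, 0.3, 0.2, 0.1]
end_scores = [0.0, 0.1, 0.2, 0.3, 0.4, 3.0]
i, j = best_span(start_scores, end_scores)
answer = " ".join(tokens[i:j + 1])
print(answer)
```

Searching over pairs instead of taking two independent argmax values prevents spans that end before they start.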

Let’s Run the Sample!

Set up your environment to perform BERT inference with the steps below:

Create a Docker image with the prerequisites
Compile TensorRT optimized plugins
Build the TensorRT engine from the fine-tuned weights
Perform inference given a passage and a query

We use scripts to perform these steps, which you can find in the TensorRT BERT sample repo. While we describe several options you can pass to each script, you could also execute the code below at the command prompt to get started quickly:

# Clone the TensorRT repository, check out the specific release, and navigate to the BERT demo directory

git clone --recursive https://github.com/NVIDIA/TensorRT && cd TensorRT/ && git checkout release/5.1 && cd demo/BERT

# Create and launch the Docker image

sh python/create_docker_container.sh

# Build the plugins and download the fine-tuned models

cd TensorRT/demo/BERT && sh python/build_examples.sh

# Build the TensorRT runtime engine

python python/bert_builder.py -m /workspace/models/fine-tuned/bert_tf_v2_base_fp16_384_v2/model.ckpt-8144 -o bert_base_384.engine -b 1 -s 384 -c /workspace/models/fine-tuned/bert_tf_v2_base_fp16_384_v2

Now, give it a passage and see how much information it can decipher by asking it a few questions.

python python/bert_inference.py -e bert_base_384.engine -p "TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps." -q "What is TensorRT?" -v /workspace/models/fine-tuned/bert_tf_v2_base_fp16_384_v2/vocab.txt -b 1

Passage: TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps.

Question: What is TensorRT?

Answer: 'a high performance deep learning inference platform'

--- Given the same passage with a different question ---

Question: What is included in TensorRT?

Answer: 'parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference'

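The answers provided by the model are accurate, based on the text of the passage provided. The sample uses FP16 precision for performing inference with TensorRT. This helps achieve the highest performance possible on Tensor Cores in NVIDIA GPUs. In our tests, we measured the accuracy of TensorRT to be comparable to in-framework inference with FP16 precision.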
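For a feel of what FP16 precision means numerically, the following sketch rounds a few Python floats to IEEE 754 half precision and back using the standard library's `struct` module (format character `e`). It only illustrates the number format. It does not involve TensorRT and says nothing about model accuracy, and the helper name `to_fp16` is an assumption made for this example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


for value in (0.1, 1.0, 1000.1):
    rounded = to_fp16(value)
    error = abs(rounded - value)
    print(f"{value} -> {rounded} (abs error {error:.6f})")
```

Half precision keeps roughly three significant decimal digits, so the absolute rounding error grows with the magnitude of the value.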

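Let's review the options available for the scripts. The create_docker_container.sh script builds the Docker image using the Dockerfile supplied in the BERT sample, which is based on the TensorRT container in NGC. It installs all the necessary packages and launches the created image, bert_tensorrt, as a running container. Execute the script as: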

sh create_docker_container.sh

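After creating the environment, download the fine-tuned weights for BERT. Note that the pre-trained weights are not needed to create the TensorRT engine (only the fine-tuned weights are). Along with the fine-tuned weights, you also use the associated configuration file, which specifies parameters such as the number of attention heads and the number of layers, and the vocab.txt file, which contains the vocabulary learned during training. These are packaged with the fine-tuned model downloaded from NGC; download them using the build_examples.sh script. As part of this script, you can specify the set of fine-tuned weights for the BERT model that you want to download. The command-line parameters control the exact BERT model that is later used for model building and inference, and can be used as shown below: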

Usage: sh build_examples.sh [base | large] [ft-fp16 | ft-fp32] [128 | 384]

base | large – determine whether to download a BERT-base or BERT-large model to optimize
ft-fp16 | ft-fp32 – determine whether to download a BERT model fine-tuned with precision FP16 or FP32
128 | 384 – determine whether to download a BERT model for sequence length 128 or 384

Examples

# Running with default parameters

sh build_examples.sh

# Running with custom parameters (BERT-large, FP32 fine-tuned weights, 128 sequence length)

sh build_examples.sh large ft-fp32 128

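This script first uses the code from the sample repo and builds the TensorRT plugins for BERT inference. Next, it downloads and installs the NGC CLI to download a fine-tuned model from NVIDIA's NGC model repository. The command-line arguments to build_examples.sh specify the model that you would like to optimize with TensorRT. By default, it downloads the fine-tuned BERT-base model with FP16 precision and a sequence length of 384.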
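The way the three positional arguments and their defaults fit together can be sketched in Python. This is not the shell script itself. The setting names (`model`, `precision`, `sequence_length`) are hypothetical, and allowing trailing arguments to be omitted one at a time is an assumption; the text only confirms the no-argument and three-argument forms.

```python
from typing import Dict, List

# Allowed values and defaults as described in the usage text.
CHOICES = [
    ("model", ["base", "large"], "base"),
    ("precision", ["ft-fp16", "ft-fp32"], "ft-fp16"),
    ("sequence_length", ["128", "384"], "384"),
]


def resolve_args(argv: List[str]) -> Dict[str, str]:
    """Map up to three positional arguments onto named settings,
    filling in the default for any that are omitted."""
    if len(argv) > len(CHOICES):
        raise ValueError("expected at most 3 arguments")
    settings = {}
    for position, (name, allowed, default) in enumerate(CHOICES):
        value = argv[position] if position < len(argv) else default
        if value not in allowed:
            raise ValueError(f"{name} must be one of {allowed}, got {value!r}")
        settings[name] = value
    return settings


print(resolve_args([]))
print(resolve_args(["large", "ft-fp32", "128"]))
```

Validating against the allowed values up front catches typos such as `fp32` instead of `ft-fp32` before any download starts.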

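In addition to the fine-tuned model, we use the configuration file, which enumerates the model parameters, and the vocabulary file, which is used to convert BERT model output into a textual answer. After downloading the model and configuration information for the chosen model, the BERT plugins for TensorRT are built. The shared object files for these plugins are placed in the build directory of the BERT inference sample.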
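As an illustration of converting model output back into text, the sketch below joins WordPiece tokens into words. Standard BERT vocabularies mark a piece that continues the previous word with a `##` prefix. The function name and the toy token list are assumptions for this example, and the sample's own conversion code may differ.

```python
from typing import List


def wordpieces_to_text(pieces: List[str]) -> str:
    """Join WordPiece tokens, merging '##' continuation pieces
    into the preceding token."""
    words: List[str] = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return " ".join(words)


# Toy split of "inference" into two pieces (assumed, not from vocab.txt).
pieces = ["a", "high", "performance", "deep", "learning",
          "in", "##ference", "platform"]
print(wordpieces_to_text(pieces))
```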

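Next, we can build the TensorRT engine and use it for a question-answering example (that is, inference). The script bert_builder.py builds the TensorRT engine for inference based on the downloaded BERT fine-tuned model. It uses the custom TensorRT plugins built in the previous step, along with the downloaded fine-tuned model and configuration file. Make sure that the sequence length provided to this script matches the sequence length of the model that was downloaded. Use the script as follows: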

Usage: python bert_builder.py -m <checkpoint> -o <bert.engine> -b <batch size> -s <sequence length> -c <config_file_directory>

-m, – checkpoint file for the fine-tuned model
-o, – path for the output TensorRT engine file (i.e. bert.engine)
-b, – batch size for inference (default=1)
-s, – sequence length matching the downloaded BERT fine-tuned model
-c, – directory containing configuration file for BERT parameters (attention heads, hidden layers, etc.)

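Example (the same command used in the quick start above):

python python/bert_builder.py -m /workspace/models/fine-tuned/bert_tf_v2_base_fp16_384_v2/model.ckpt-8144 -o bert_base_384.engine -b 1 -s 384 -c /workspace/models/fine-tuned/bert_tf_v2_base_fp16_384_v2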
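Because a mismatched sequence length is an easy mistake to make, a small pre-flight check can help. The sketch below parses the short flags from the usage text with `argparse` and compares `-s` against the sequence length embedded in the model directory name. The `dest` names are hypothetical, marking `-s` as required is a choice made for this sketch, and inferring the length from a `_128_` or `_384_` fragment of the directory name is an assumption based only on the path used in this post.

```python
import argparse
import re
from typing import List, Optional


def parse_builder_args(argv: List[str]) -> argparse.Namespace:
    """Parse the short flags from the usage text. The dest names are hypothetical."""
    parser = argparse.ArgumentParser(description="sketch of bert_builder.py flags")
    parser.add_argument("-m", dest="checkpoint", required=True)
    parser.add_argument("-o", dest="engine_path", required=True)
    parser.add_argument("-b", dest="batch_size", type=int, default=1)
    parser.add_argument("-s", dest="sequence_length", type=int, required=True)
    parser.add_argument("-c", dest="config_dir", required=True)
    return parser.parse_args(argv)


def sequence_length_from_dir(config_dir: str) -> Optional[int]:
    """Assumption: the directory name embeds the sequence length as
    '_128_' or '_384_', as in 'bert_tf_v2_base_fp16_384_v2'."""
    match = re.search(r"_(128|384)_", config_dir)
    return int(match.group(1)) if match else None


model_dir = "/workspace/models/fine-tuned/bert_tf_v2_base_fp16_384_v2"
args = parse_builder_args([
    "-m", model_dir + "/model.ckpt-8144",
    "-o", "bert_base_384.engine",
    "-s", "384",
    "-c", model_dir,
])
expected = sequence_length_from_dir(args.config_dir)
if expected is not None and expected != args.sequence_length:
    raise SystemExit(f"-s {args.sequence_length} does not match model ({expected})")
print(args.batch_size, args.sequence_length, expected)
```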

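You should now have a TensorRT engine (that is, bert.engine) to use in the inference script (bert_inference.py) for QA. We describe the process of building the TensorRT engine in later sections. You can now provide a passage and a query to bert_inference.py and see whether the model answers your queries correctly. There are a few ways to interact with the inference script: the passage and question can be provided as command-line arguments (using the --passage and --question flags), or they can be passed in from a given file (using the --passage_file and --question_file flags). If neither flag is given during execution, the user is prompted to enter the passage and question once execution begins. The parameters for the bert_inference.py script are as follows: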

Usage: python bert_inference.py --bert_engine <bert.engine> [--passage | --passage_file] [--question | --question_file] --vocab_file <vocabulary file> --batch_size <batch_size>

-e, --bert_engine – path to the TensorRT engine created in the previous step
-p, --passage – text for paragraph/passage for BERT QA
-pf, --passage_file – file containing text for paragraph/passage
-q, --question – text for query/question for BERT QA
-qf, --question_file – file containing text for query/question
-v, --vocab_file – file containing entire dictionary of words
-b, --batch_size – batch size for inference
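The three input modes described above (inline flag, file flag, interactive prompt) can be sketched with `argparse`. The flag names come from the option list. The helper `resolve_text`, its `ask` parameter, and the choice to prefer the inline flag over the file flag when both are given are assumptions made for illustration, not the sample's actual logic.

```python
import argparse
from typing import Callable, List, Optional


def resolve_text(inline: Optional[str], path: Optional[str],
                 prompt: str, ask: Callable[[str], str] = input) -> str:
    """Prefer the inline flag, then the file flag, then an interactive prompt."""
    if inline is not None:
        return inline
    if path is not None:
        with open(path, encoding="utf-8") as handle:
            return handle.read().strip()
    return ask(prompt)


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Parse only the passage and question flags from the option list."""
    parser = argparse.ArgumentParser(description="sketch of input handling")
    parser.add_argument("-p", "--passage")
    parser.add_argument("-pf", "--passage_file")
    parser.add_argument("-q", "--question")
    parser.add_argument("-qf", "--question_file")
    return parser.parse_args(argv)


args = parse_args(["-p", "TensorRT is an inference platform.",
                   "-q", "What is TensorRT?"])
passage = resolve_text(args.passage, args.passage_file, "Passage: ")
question = resolve_text(args.question, args.question_file, "Question: ")
print(passage, "|", question)
```

Passing the prompt function in as `ask` keeps the fallback testable without a terminal.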
