Object Detection and Lane Segmentation Using Multiple Accelerators with DRIVE AGX

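Autonomous vehicles need fast, accurate perception of their surroundings so they can handle a wide range of tasks simultaneously and in real time. The system must handle obstacle detection, lane-boundary determination, intersection detection, and sign recognition, among many other functions, across a diverse range of environments, conditions, and situations, and it must do so quickly within the power constraints of an automotive setting. The DRIVE AGX platform is designed specifically to meet these requirements.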

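The DRIVE platform is powered by the Xavier SoC, which uses NVIDIA GPUs and a variety of other accelerators to distribute the compute load, and is designed for safety standards such as ISO 26262/ASIL-D and ISO/PAS 21448. These accelerators include a 64-bit ARM-based octa-core CPU, an integrated Volta GPU, an optional discrete Turing GPU, two deep learning accelerators (DLAs), multiple programmable vision accelerators (PVAs), and an array of other ISPs and video processors. This post dives into an application that runs two deep learning models concurrently to perform object detection and ego-lane segmentation on images. We released this application under the Apache license in the DL4AGX project, an open-source project of tools and applications for developing deep-learning-enabled software for NVIDIA AGX platforms.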

Easy Optimized Inference Pipelines Using TensorRT and DALI

TensorRT

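TensorRT is NVIDIA's high-performance deep learning inference platform. It includes a deep learning inference optimizer and runtime that deliver low latency and high throughput for deep learning inference applications. It makes larger and more complex models practical in latency- and compute-constrained applications such as autonomous driving.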

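We previously discussed how to optimize a semantic segmentation model, designed for object segmentation in autonomous driving scenes, by using TensorRT to quantize the network to an 8-bit integer representation (INT8). Quantization reduces the resources needed to run inference and takes advantage of specialized hardware in the GPU. Optimizations of this kind are typically used to reduce inference latency while preserving the model's accuracy, which helps fit all the required computation within the limited resources available.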
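As a rough illustration of what INT8 quantization involves, the sketch below applies symmetric per-tensor quantization to floating-point values. It is not TensorRT's calibration procedure, which chooses the dynamic range from calibration data. The helper names `computeScale`, `quantize`, and `dequantize` are hypothetical, and the scale is assumed to be simply max|x| / 127.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper: symmetric per-tensor scale so that max|x| maps to 127.
inline float computeScale(const std::vector<float>& values) {
    float maxAbs = 0.0f;
    for (float v : values) maxAbs = std::max(maxAbs, std::fabs(v));
    return maxAbs > 0.0f ? maxAbs / 127.0f : 1.0f;
}

// Quantize one value: divide by the scale, round to nearest, clamp to [-127, 127].
inline std::int8_t quantize(float value, float scale) {
    assert(scale > 0.0f);
    const float q = std::round(value / scale);
    return static_cast<std::int8_t>(std::min(127.0f, std::max(-127.0f, q)));
}

// Recover an approximation of the original value from its INT8 code.
inline float dequantize(std::int8_t q, float scale) {
    return static_cast<float>(q) * scale;
}
```

Each value is stored in one byte instead of four, at the cost of a rounding error of at most half a scale step for values inside the chosen range.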

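TensorRT 5.1 can optimize models not only for the GPUs available on DRIVE AGX systems but also for the integrated deep learning accelerators (DLAs) inside the Xavier SoC, at FP16 and quantized INT8 precision. This lets developers take full advantage of all the compute built into Xavier.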

DALI

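However, the task of running a deep learning model is not confined to the model itself. Preprocessing and postprocessing are key components of the overall inference pipeline. The DALI library provides a collection of highly optimized image-processing primitives and a graph-based execution engine, which makes it easy to string together multiple operations implemented with efficient GPU kernels.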

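Managing computation across multiple devices can be difficult. Considerations such as synchronization, resource management, and compute dependencies all need to be handled. A graph-based execution engine can schedule these computations naturally: you supply the data and let the library worry about the dependency graph, resource management, and data movement.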
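To make the idea of graph-based scheduling concrete, here is a minimal sketch that orders the nodes of a dependency graph so that every operation runs after the operations it depends on (Kahn's algorithm). This is not DALI's scheduler. The function name `scheduleGraph` and the integer node IDs are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Returns an execution order for nodes 0..numNodes-1 in which every node
// appears after all of its producers. Each edge is {producer, consumer}.
// Returns an empty vector if the graph contains a cycle.
inline std::vector<int> scheduleGraph(int numNodes,
                                      const std::vector<std::pair<int, int>>& edges) {
    assert(numNodes >= 0);
    std::vector<std::vector<int>> consumers(numNodes);
    std::vector<int> pendingInputs(numNodes, 0);
    for (const auto& e : edges) {
        assert(e.first >= 0 && e.first < numNodes);
        assert(e.second >= 0 && e.second < numNodes);
        consumers[e.first].push_back(e.second);
        ++pendingInputs[e.second];
    }
    // Nodes with no unmet inputs are ready to run.
    std::queue<int> ready;
    for (int n = 0; n < numNodes; ++n)
        if (pendingInputs[n] == 0) ready.push(n);

    std::vector<int> order;
    while (!ready.empty()) {
        const int node = ready.front();
        ready.pop();
        order.push_back(node);
        // Finishing this node may unblock its consumers.
        for (int next : consumers[node])
            if (--pendingInputs[next] == 0) ready.push(next);
    }
    if (order.size() != static_cast<std::size_t>(numNodes)) order.clear();
    return order;
}
```

For a graph shaped like a decode node feeding two independent preprocess-then-infer chains, the two chains have no edges between them, so a scheduler is free to run them on different devices at the same time.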

Merging DALI and TensorRT

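TensorRT provides the fast inference that autonomous driving applications need. DALI provides fast preprocessing and a simple way to manage the computational graph. Using the two toolkits together is a natural fit that accelerates both preprocessing and inference. We achieve this integration through a DALI plugin that provides a TensorRT operator. This operator places an optimized TensorRT engine directly into the DALI computational graph, consuming and producing the same data formats as other DALI operations. DALI can then manage accelerated preprocessing, optimized inference, and data transfers across the whole pipeline. We have released the source code for the TensorRT inference operator under the Apache license in the DL4AGX project. It has been verified on x86_64-linux, aarch64-linux (DRIVE AGX and Jetson AGX), and aarch64-qnx (DRIVE AGX). For more information, see the plugin's README. Below we look at an application that uses this operator.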

Fast Concurrent Object Detection and Lane Segmentation on Heterogeneous Hardware Using DALI and TensorRT

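We can now have a fully accelerated inference pipeline. Let's use this structure, together with TensorRT's ability to run on both the GPU and the DLA and DALI's data management capabilities, to run two models concurrently on the same data on separate accelerators, making more efficient use of the Xavier SoC and its onboard accelerators. This lets us overlap related tasks, such as performing lane segmentation and object detection at the same time.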

Concurrent inference on multiple different accelerators

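For the concurrent lane segmentation and object detection example, there is a common source image, but it must be preprocessed separately for each model. We can therefore construct a computational graph like the one shown in Figure 1: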

Figure 1. Data pipeline for concurrent lane segmentation and object detection. The green blocks represent tasks running on the GPU, yellow ops run on the DLA, and blue ops run on the CPU.

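Preprocessing is done on the GPU using DALI's kernels, and each arm's inference runs on a different accelerator. This graph is implemented by the inference application in the MultiDeviceInferencePipeline example.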

Walking through the computational graph

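The application uses a configuration file to set all the settings of the computational graph. Walking through this file is a good way to go through the graph node by node. The configuration file defines the input data and the output location for the annotated image. This stands in for taking an image stream as input and feeding the results into some higher-level world representation or other AV use case.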

# From example_conf.toml

# Paths for I/O

input_image = "/path/to/input_image.jpg"

output_image = "/path/to/output_image.jpg"

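Next we configure the branches of the pipeline, starting in this example with the segmentation branch. The model we chose for segmentation is designed to run entirely on the DLA (consult the list of layers supported by the DLA to help determine model compatibility). You can see that this arm of the graph is set to use DLA core 1. Note also that the device (GPU) is set to the device number of the iGPU, which in our case is 1. This is the GPU device used to interface with the DLA. On the Xavier SoC, the iGPU is the only interface to the DLA, so TensorRT must target it; the DLA still performs the computation. Also note the settings that configure the DALI execution engine. In particular, look at the asynchronous execution setting, which means the two branches of the computational graph do not block each other.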

# Configurations for the Segmentation Pipeline

[[inference_pipeline]]

name = "Segmentation"

device = 1                    # ID of GPU acting as DLA Bridge

dla_core = 1                 # Target device (DLA Core 1)

batch_size = 1

async_execution = true   # enable asynchronous execution of branches of pipeline

num_threads = 1             # CPU Thread pool size

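Below are the preprocessing settings for the segmentation arm of the graph. These values configure the optimized GPU kernels that resize and normalize the image to match the input resolution of the TensorRT engine.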

[inference_pipeline.preprocessing]

# Image Pre-processing parameters

resize = [3, 240, 795]

mean = [0.0, 0.0, 0.0]

std_dev = [1.0, 1.0, 1.0]

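Next comes the TensorRT engine itself, which is consumed as a serialized TensorRT engine (here saved to a file on the file system). The sections that follow describe the network's input and output heads.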

[inference_pipeline.engine]

# Path to TensorRT Engine

path = "experiments/deeplabv2_res18_small_240x795_int8_DLA.engine"

[[inference_pipeline.engine.inputs]]

# Name and shape of the model input tensor

name = "ImageTensor"

shape = [3, 240, 795]

[[inference_pipeline.engine.outputs]]

# Name and shape of the model output tensor

name = "logits/semantic/BiasAdd"

shape = [2, 15, 50]
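To show how an output of shape [2, 15, 50] might be turned into a lane mask, the sketch below picks the highest-scoring class at each spatial location. It assumes the tensor holds per-class logits in CHW order (two classes here); the function name `logitsToMask` is hypothetical, and the application's actual postprocessing may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts flattened CHW logits (numClasses x height x width) into a per-pixel
// class mask by choosing the class with the largest logit at each location.
// Ties are resolved in favor of the lower class index.
inline std::vector<std::uint8_t> logitsToMask(const std::vector<float>& logits,
                                              std::size_t numClasses,
                                              std::size_t height,
                                              std::size_t width) {
    assert(numClasses > 0 && numClasses <= 256);
    assert(logits.size() == numClasses * height * width);
    const std::size_t plane = height * width;
    std::vector<std::uint8_t> mask(plane, 0);
    for (std::size_t p = 0; p < plane; ++p) {
        std::size_t best = 0;
        for (std::size_t c = 1; c < numClasses; ++c)
            if (logits[c * plane + p] > logits[best * plane + p]) best = c;
        mask[p] = static_cast<std::uint8_t>(best);
    }
    return mask;
}
```

With the shape from the config, the 1500 output values would collapse to a 15 x 50 mask of 750 class labels, which can then be upsampled and overlaid on the source image.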

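That is all the configuration needed for the segmentation arm. Next, we add another [[inference_pipeline]] instance to configure object detection. (Note: for clarity of the source code, the application is not fully generic; it does not handle an arbitrary set of pipelines. Implementing this in the existing code would be straightforward, however.)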

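We see the same high-level settings again. Here we set dla_core to -1, which means the engine does not use a DLA and runs on the GPU instead (GPU device 1, the iGPU on Xavier). Because the other network runs on the DLA, the two networks actually run on different devices even though they share a device number here.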

[[inference_pipeline]]

name = "Object Detection"

device = 1                    # Target device (Xavier iGPU)

dla_core = -1                 # Disable DLA for this engine

batch_size = 1

async_execution = true     # enable asynchronous execution of branches of pipeline

num_threads = 1              # CPU Thread pool size
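The relationship between `device` and `dla_core` can be summarized in a few lines. The sketch below is only an illustration of the convention described above; the `BranchTarget` struct and `describeTarget` function are hypothetical names and are not part of the application.

```cpp
#include <cassert>
#include <string>

// Hypothetical mirror of the per-branch settings in the config file.
struct BranchTarget {
    int device;   // GPU device ID (on Xavier, the iGPU also acts as the DLA bridge)
    int dlaCore;  // DLA core index, or -1 to run on the GPU itself
};

// Describes which accelerator actually executes the engine for a branch.
inline std::string describeTarget(const BranchTarget& target) {
    assert(target.device >= 0);
    if (target.dlaCore >= 0) {
        return "DLA core " + std::to_string(target.dlaCore) +
               " via GPU " + std::to_string(target.device);
    }
    return "GPU " + std::to_string(target.device);
}
```

Calling `describeTarget` with the segmentation settings (device 1, DLA core 1) and the detection settings (device 1, DLA core -1) yields two different accelerators even though both branches name the same GPU device.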

Once again we set the preprocessing parameters, the path to the engine file, and its input and output heads.

[inference_pipeline.preprocessing]

# Image Pre-processing parameters

resize = [3, 300, 300]

mean = [127.5, 127.5, 127.5]

std_dev = [127.5, 127.5, 127.5]

[inference_pipeline.engine]

# Path to TensorRT Engine

path = "experiments/SSD_resnet18_kitti_int8_iGPU.engine"

[[inference_pipeline.engine.inputs]]

# Name and shape of the model input tensor

name = "Input"

shape = [3, 300, 300]

Here we see that this network has multiple output heads. This is a result of the TensorRT NMS plugin, and DALI handles it accordingly.

[[inference_pipeline.engine.outputs]]

# Name and shape of the model output tensor

name = "NMS"

shape = [1, 100, 7]

[[inference_pipeline.engine.outputs]]

# Name and shape of additional model output tensor (specific to TRT NMS Plugin)

name = "NMS_1"

shape = [1, 1, 1]
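The two outputs can be read together to recover the detections. The sketch below assumes the layout commonly documented for the TensorRT NMS plugin: each row of the "NMS" output holds seven floats [image_id, label, confidence, xmin, ymin, xmax, ymax], and "NMS_1" holds the number of valid rows. Treat that layout as an assumption to verify against the plugin version you use; `Detection` and `decodeDetections` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Detection {
    int label;
    float confidence;
    float xmin, ymin, xmax, ymax;  // assumed to be normalized box coordinates
};

// Decodes rows of 7 floats into detections, keeping at most keepCount rows
// and dropping any detection below minConfidence.
inline std::vector<Detection> decodeDetections(const std::vector<float>& nms,
                                               int keepCount,
                                               float minConfidence) {
    constexpr std::size_t kFields = 7;  // assumed fields per detection
    assert(keepCount >= 0);
    assert(nms.size() % kFields == 0);
    const std::size_t rows =
        std::min(nms.size() / kFields, static_cast<std::size_t>(keepCount));
    std::vector<Detection> result;
    for (std::size_t i = 0; i < rows; ++i) {
        const float* d = &nms[i * kFields];
        if (d[2] < minConfidence) continue;
        result.push_back({static_cast<int>(d[1]), d[2], d[3], d[4], d[5], d[6]});
    }
    return result;
}
```

With the [1, 100, 7] shape above, the buffer holds up to 100 candidate rows, and the keep count tells the caller how many of them are meaningful.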

We also see the use of a custom plugin here, which implements the FlattenConcat operator used by the SSD network.

[[inference_pipeline.engine.plugins]]

# Path to TensorRT Plugin for FlattenConcat Op (see //plugins/TensorRT/FlattenConcat)

path = "/bazel-bin/plugins/FlattenConcatPlugin/libflattenconcatplugin.so"

Taken together, the full configuration file describes the complete pipeline shown in Figure 1.

Generic Inference Pipeline Implementation

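Throughout this post we keep coming back to the basic preprocessing -> inference primitive subgraph. Let's now look at how to implement a generic version of this primitive using DALI and TensorRT together. The DALITRTPipeline class is configured and instantiated for each inference pipeline entry in the config file and serves as a wrapper around the subgraph, taking in raw decoded JPEGs and returning results from the model. Its constructor contains the main logic for creating this basic subgraph (the remaining components are mostly small wrappers around DALI).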

// From DALITRTPipeline.cpp

DALITRTPipeline::DALITRTPipeline(const std::string pipelinePrefix,
                                 preprocessing::PreprocessingSettings preprocessingSettings,
                                 std::string TRTEngineFilePath,
                                 std::vector<std::string> pluginPaths,
                                 std::vector<std::string> engineInputBindings,
                                 std::vector<std::string> engineOutputBindings,
                                 const int deviceId,
                                 const int DLACore,
                                 const int numThreads,
                                 const int batchSize,
                                 const bool pipelineExecution,
                                 const int prefetchQueueDepth,
                                 const bool asyncExecution)

Construction of the pipeline starts here, by adding an input node and the arm's preprocessing steps to a new DALI pipeline.

this->inferencePipeline = new dali::Pipeline(batchSize, numThreads,

                                               deviceId, seed,

                                               pipelineExecution,

                                               prefetchQueueDepth,

                                               asyncExecution); //max_num_stream may become useful here

   //Hardcoded Input node

   const std::string externalInput = "decoded_jpegs";

   this->inputs.push_back({externalInput, "cpu"});

   this->inferencePipeline->AddExternalInput(externalInput);

   const std::vector<std::pair<std::string, std::string>> preprocessingOutputNodes = {std::make_pair("preprocessed_images", "gpu")};

   //Single function to append the preprocessing steps to the pipeline (modify this function in preprocessingPipeline/pipeline.h to change these steps)

   preprocessing::AddOpsToPipeline(this->inferencePipeline, pipelinePrefix,

                                   this->inputs[0], preprocessingOutputNodes,

                                   preprocessingSettings, true);

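You can see the actual operations below. The image is first resized to the model's input size and then normalized. By default, both operations run on the GPU through DALI.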

//From preprocessing.h

inline void AddOpsToPipeline(dali::Pipeline* pipe,
                             const std::string prefix,
                             const std::pair<std::string, std::string> externalInput,
                             const std::vector<std::pair<std::string, std::string>> pipelineOutputs,
                             const preprocessing::PreprocessingSettings& settings,
                             bool gpuMode)
{
   int nChannel = settings.imgDims[0]; //Channels
   int nHeight = settings.imgDims[1];  //Height
   int nWidth = settings.imgDims[2];   //Width
   std::string executionPlatform = gpuMode ? "gpu" : "cpu";

   //Resize the decoded image to the input resolution of the engine
   pipe->AddOperator(
       dali::OpSpec("Resize")
          .AddArg("device", executionPlatform)
          .AddArg("interp_type", dali::DALI_INTERP_CUBIC)
          .AddArg("resize_x", (float) nWidth)
          .AddArg("resize_y", (float) nHeight)
          .AddArg("image_type", dali::DALI_RGB)
          .AddInput("decoded_jpegs", executionPlatform)
          .AddOutput("resized_images", executionPlatform),
       prefix + "_Resize");

   //Normalize with the configured mean and std, converting to planar float output
   pipe->AddOperator(
       dali::OpSpec("NormalizePermute")
          .AddArg("device", executionPlatform)
          .AddArg("output_type", dali::DALI_FLOAT)
          .AddArg("mean", settings.imgMean)
          .AddArg("std", settings.imgStd)
          .AddArg("height", nHeight)
          .AddArg("width", nWidth)
          .AddArg("channels", nChannel)
          .AddInput("resized_images", executionPlatform)
          .AddOutput(pipelineOutputs[0].first, pipelineOutputs[0].second),
       prefix + "_NormalizePermute");
}
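To make the arithmetic behind the normalization step explicit, here is a CPU-side sketch of what a fused normalize-and-permute does: it converts an interleaved HWC uint8 image into planar CHW floats while applying (pixel - mean) / std per channel. This is an illustration only, not DALI's GPU kernel, and `normalizePermute` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts an interleaved HWC uint8 image into planar CHW floats, applying
// (pixel - mean[c]) / stdDev[c] to each channel c.
inline std::vector<float> normalizePermute(const std::vector<std::uint8_t>& hwc,
                                           std::size_t height,
                                           std::size_t width,
                                           const std::vector<float>& mean,
                                           const std::vector<float>& stdDev) {
    const std::size_t channels = mean.size();
    assert(stdDev.size() == channels);
    assert(hwc.size() == height * width * channels);
    std::vector<float> chw(hwc.size());
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            for (std::size_t c = 0; c < channels; ++c) {
                const float pixel =
                    static_cast<float>(hwc[(y * width + x) * channels + c]);
                chw[(c * height + y) * width + x] = (pixel - mean[c]) / stdDev[c];
            }
        }
    }
    return chw;
}
```

With the object detection settings (mean and std both 127.5), pixel values in [0, 255] map to [-1, 1]; with the segmentation settings (mean 0, std 1), the values pass through unchanged and only the layout and type change.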

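Finally, we add the TensorRT engine to the pipeline using the TensorRT plugin for DALI. The engine is ingested as a serialized TensorRT engine. At this point we set the inputs, outputs, and plugins, along with settings for the engine and the inference runtime.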

   //Read in TensorRT Engine

   std::string serializedEngine;

   utils::readSerializedFileToString(TRTEngineFilePath, serializedEngine);

   dali::OpSpec inferOp("TensorRTInfer");

   inferOp.AddArg("device", "gpu")

      .AddArg("inference_batch_size", batchSize)

      .AddArg("engine", serializedEngine)

      .AddArg("plugins", pluginPaths)

      .AddArg("num_outputs", engineOutputBindings.size())

      .AddArg("input_nodes", engineInputBindings)

      .AddArg("output_nodes", engineOutputBindings)

      .AddArg("log_severity", 3);

Whether to use the DLA or the GPU is simply an argument to the TensorRT operator provided by our plugin.

   // Decide whether to use a DLA for the engine or not
   if (DLACore >= 0)
   {
       inferOp.AddArg("use_dla_core", DLACore);
   }

   for (auto& in : preprocessingOutputNodes)
   {
       inferOp.AddInput(in.first, "gpu");
   }

   // Register each engine output with the operator and record it as a pipeline output
   for (auto& out : engineOutputBindings)
   {
       inferOp.AddOutput(out, "gpu");
       this->outputs.push_back({out, "gpu"});
   }

   std::cout << "Registering " << pipelinePrefix << " TensorRT Op" << std::endl;
   this->inferencePipeline->AddOperator(inferOp);

Given this simple primitive and its configuration, we can now build the full multi-device pipeline.

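The first stage loads the image from the file system; in a real autonomous vehicle application this would most likely be an image stream. The image is first decoded, and DALI then makes two copies to feed each branch of the pipeline.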

   std::cout << "Load JPEG images" << std::endl;
   dali::TensorList<dali::CPUBackend> JPEGBatch;
   utils::makeJPEGBatch(settings.inFiles, &JPEGBatch, settings.batchSize);

   // Decode the batch once, then fetch one copy for each branch
   JPEGPipeline.SetPipelineInput(JPEGBatch);
   JPEGPipeline.RunPipeline();

   std::vector<dali::TensorList<dali::CPUBackend>*> detInputBatch;
   std::vector<dali::TensorList<dali::CPUBackend>*> segInputBatch;
   JPEGPipeline.GetPipelineOutput(detInputBatch, segInputBatch);

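We feed the images into the arms of the pipeline starting from the point at which they have been decoded. DALI handles all memory management and data transfers between the CPU and GPU.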

   // Load this image into the pipeline (note there is no cuda memcpy yet, as
   // JPEG decoding is done CPU side; DALI will handle the memcpy between ops)
   std::cout << "Load into inference pipelines" << std::endl;
   detPipeline.SetPipelineInput(detInputBatch);
   segPipeline.SetPipelineInput(segInputBatch);

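Actual execution begins once the inputs are set. The arms run asynchronously, so operations in one arm do not block operations in the other. The GetPipelineOutput calls act as a barrier, synchronizing the two arms before postprocessing.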

   // Run the inference pipeline on both the GPU and DLA
   // While this is done serially in the app context, when the pipelines are built
   // with AsyncExecution enabled (default), the pipelines themselves will run concurrently
   std::cout << "Starting inference pipelines" << std::endl;
   detPipeline.RunPipeline();
   segPipeline.RunPipeline();

   // Now setting a blocking call for the pipelines to synchronize the pipeline executions
   std::cout << "Transferring inference results back to host for postprocessing" << std::endl;
   std::vector<dali::TensorList<dali::GPUBackend>*> detPipelineResults;
   std::vector<dali::TensorList<dali::GPUBackend>*> segPipelineResults;
   detPipeline.GetPipelineOutput(detPipelineResults);
   segPipeline.GetPipelineOutput(segPipelineResults);
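The launch-then-wait pattern used above can be shown with standard C++ alone. In this sketch, `std::async` stands in for the asynchronous pipeline execution and `future::get()` plays the role of the blocking output call. The `runBranch` function is a hypothetical placeholder for a real branch and does no actual inference.

```cpp
#include <cassert>
#include <future>
#include <string>
#include <utility>

// Hypothetical stand-in for one branch of the pipeline; a real branch would
// run preprocessing and a TensorRT engine instead of this toy computation.
inline std::string runBranch(const std::string& name, int input) {
    return name + ":" + std::to_string(input * 2);
}

// Launches both branches without waiting, then blocks on each result.
inline std::pair<std::string, std::string> runBothBranches(int input) {
    // Both tasks start immediately on their own threads.
    auto det = std::async(std::launch::async,
                          [input] { return runBranch("det", input); });
    auto seg = std::async(std::launch::async,
                          [input] { return runBranch("seg", input); });
    // get() is the barrier: each call returns only once that branch is done.
    std::string detResult = det.get();
    std::string segResult = seg.get();
    return {detResult, segResult};
}
```

The launches are issued one after another from a single thread, yet the work overlaps because neither launch waits for completion. Depending on the toolchain, building this may require linking with thread support (for example `-pthread`).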

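Finally, the data is copied back and unpacked for postprocessing and final visualization (in a full AV application, it would likely go on to update a world representation).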

   // Copy data back to host
   // Each host buffer is sized from the binding shape in the config.
   // The float element type is assumed here.
   std::vector<float> detNMSOutput(
       conf::bindingSize(settings.pipelineBindings[kDET_PIPELINE_NAME].outputBindings["NMS"]), 0);
   std::vector<float> detNMS1Output(
       conf::bindingSize(settings.pipelineBindings[kDET_PIPELINE_NAME].outputBindings["NMS_1"]), 0);
   std::vector<float> segOutput(
       conf::bindingSize(settings.pipelineBindings[kSEG_PIPELINE_NAME].outputBindings["logits/semantic/BiasAdd"]), 0);

   utils::GPUTensorListToRawData(detPipelineResults[0], &detNMSOutput);
   utils::GPUTensorListToRawData(detPipelineResults[1], &detNMS1Output);
   utils::GPUTensorListToRawData(segPipelineResults[0], &segOutput);
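The host buffers are sized from the binding shapes given in the config file. As a sketch of the underlying arithmetic, assuming the size is simply the product of the shape's dimensions (the real `conf::bindingSize` may also account for batch size or other details), a hypothetical `elementCount` helper could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a binding-size helper: the number of elements in
// a tensor is the product of its dimensions.
inline std::size_t elementCount(const std::vector<int>& shape) {
    std::size_t count = 1;
    for (int dim : shape) {
        assert(dim > 0);
        count *= static_cast<std::size_t>(dim);
    }
    return count;
}
```

For the shapes in the example config, this gives 700 elements for "NMS" ([1, 100, 7]), 1 for "NMS_1" ([1, 1, 1]), and 1500 for the segmentation logits ([2, 15, 50]).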

Results

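Figure 2 shows an example of what the system produces: an annotated image with detections and a lane segmentation. In a more realistic application, instead of outputting an image, this pipeline could feed data to other components of the autonomous driving system.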

Figure 2. Output from the application showing an annotated image with bounding boxes and segmentation mask

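Accelerating inference with DALI and TensorRT yields significant speedups in both model execution and preprocessing by making full use of the hardware on DRIVE AGX, as Figure 3 shows.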

Figure 3. Performance speed up due to reduced precision inference and preprocessing acceleration

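This post looked at how DALI and TensorRT make heterogeneous compute pipelines simpler to manage. We also see significant performance improvements from using the two libraries together. We tested a ResNet-18 model pipeline implemented with DALI and TensorRT on the Xavier SoC. In this case, inference was executed on the GPU through TensorRT. We saw a 1.57x speedup from GPU-accelerated preprocessing, and a 3.5x performance increase from executing the model with quantized INT8 rather than FP32.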

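So beyond making better use of the various accelerators on the Xavier SoC, using DALI and TensorRT together also improves the performance of the computations themselves.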

Extending This Concept

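The basic two-model example in Figure 1 demonstrates the potential of this approach to inference. From here it can be extended much further, to include more models in the graph or to use different input setups such as stereo image pairs, as shown in Figure 4. DALI could potentially be extended further with operators that use the other accelerators on Xavier, such as the PVA. With the generic components detailed above, implementing these different systems in a way that makes full use of DRIVE AGX's compute capability does not take much effort.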

Figure 4. Potential other inference graph topologies implementable with the same common primitives as the example above.

Trying it for yourself

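The application source code, the model training recipes, and the source code for the TensorRT DALI integration are open source and published in the DL4AGX repo under the MultiDeviceInferencePipeline directory. They have been tested on DRIVE AGX (QNX and Linux), Jetson AGX, and x86_64 (using multiple GPUs instead of GPU + DLA). You will find detailed instructions on how to train the object detection and lane segmentation models on the KITTI dataset, cross-compile all the applications for the target hardware, convert the models to TensorRT engines for use in this pipeline, and run the application.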

Creating and Running Applications on DRIVE AGX

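The DL4AGX project continues to develop deep learning tools and applications for the various AGX platforms. It is based on a containerized build infrastructure using Bazel and Docker, which makes cross-compiling these applications and tools simple and easy to set up. Supported environments include those based on the DRIVE AGX PDKs and JetPack/Jetson AGX. Keep an eye on the repo for new tools and applications that will help make developing deep learning applications for AGX platforms easier.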

References

[Geiger et al. 2013] Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11), 1231-1237.

[Chen et al. 2017] Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834-848.

[Liu et al. 2016] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016, October). SSD: Single shot multibox detector. In European Conference on Computer Vision (pp. 21-37). Springer, Cham.

[Smolyanskiy et al. 2018] Smolyanskiy, N., Kamenev, A., & Birchfield, S. (2018). On the importance of stereo for accurate depth estimation: An efficient semi-supervised deep neural network approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 1007-1015).

[Redmon et al. 2018] Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.