Validating distributed multi-node autonomous driving AI training with NVIDIA DGX systems on OpenShift

Developing deep neural networks (DNNs) for autonomous vehicles is a demanding task. This post validates multi-node, multi-GPU, distributed training on DGX systems running in the DXC Robotic Drive environment.

The Robotic Drive containerized compute platform, running OpenShift 3.11, was used to drive the deep learning (DL) workloads. OpenShift 3.11 is currently deployed in many large-scale, GPU-accelerated autonomous driving (AD) development and test environments. The approach shown here works equally well with newer OpenShift versions and can be transferred to other OpenShift-based clusters.

DXC Robotic Drive is a data-driven development platform for autonomous driving that significantly reduces risk and accelerates the development, testing, and validation of ADAS/AD functions, supporting Level 2+ up to Level 5 autonomy. It is the largest known exabyte-scale development solution, leveraging industry-proven on-premises and cloud infrastructure, methods, tools, and accelerators to enable a highly automated AD development process.

The interoperability test environment was the Robotic Drive Innovation Lab, running OpenShift 3.11 and 4.3.

DL workloads at scale

Data parallelism is the most common design pattern for scaling DL workloads. There are many references and best practices on how to accelerate both vision and recurrent neural networks this way.

The DL model is instantiated multiple times, and the data is streamed through these instances in parallel. The instances exchange gradients with each other so that they work collaboratively rather than independently.

This is a classic compute pattern from the Message Passing Interface (MPI) framework known from the high-performance computing (HPC) domain. It is therefore straightforward to orchestrate such workloads with the help of the well-known MPI, which also scales easily beyond multiple nodes.

Multi-GPU-capable DL frameworks such as PyTorch and TensorFlow are well suited for use from the start of any project, ensuring that the workload can run anywhere from a single GPU workstation up to a large GPU cluster.

These frameworks also support data-parallel training natively with MPI, and the workloads can be triggered with MPI tooling such as mpirun or mpiexec. Multiple implementations of the data-parallel pattern, such as Horovod, follow this approach.
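As a hedged sketch of this pattern with plain MPI tooling (the host names are placeholders, and the benchmark script is the one used later in this post), a two-node, 16-process Horovod launch could look like the following:

mpirun -np 16 -H dgx01:8,dgx02:8 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    python pytorch_synthetic_benchmark.py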

Red Hat OpenShift Container Platform (OCP) is a platform-as-a-service built around containers, orchestrated by Kubernetes on a Docker or CRI-O runtime. OpenShift focuses on security and includes fixes for defect, security, and performance issues in upstream Kubernetes. Like Kubernetes, OpenShift allows clusters to be deployed and managed at scale, with support from Red Hat.

Kubernetes and OpenShift handle MPI workloads with ease. One integration exists in the form of the Kubeflow MPI Operator, which coordinates the resources in the background and launches the workload.

Figure 1 shows a DL workload using two DGX-1 systems. In this case, there are 16 individual processes. In Horovod, each is given a unique ID called a rank to distinguish it: rank 0 through rank 15. All individual processes work in parallel on different parts of the input data and exchange their gradients to work collaboratively.

 

 Figure 1. DL workload using two DGX-1 systems.

For efficient communication between the individual model replicas, the NVIDIA Collective Communications Library (NCCL) is used. NCCL implements multi-GPU and multi-node collective communication primitives that are performance-optimized for NVIDIA GPUs.

Low-latency POSIX storage for the training data is provided through Robotic Drive persistent volumes. Handling storage at scale is critical, but it is not the focus here.

Installation steps

The following describes how the DGX systems were installed in a DXC Robotic Drive environment running OpenShift v3.11.

Overview of the test system

OpenShift v3.11 requires at least a temporary bootstrap machine, three master nodes, and at least two compute nodes.

Because DL can be a data-intensive workload, the cluster needs a suitable network solution. The Robotic Drive Innovation Lab provides HPE FlexFabric 5945 32QSFP28 switches, which are used in combination with the Mellanox adapters of the DGX systems.

All cluster interconnect adapters of the DGX-1 systems are used in Ethernet mode and bonded together using LACP bonding mode.

The following table summarizes the hardware and software configuration of the cluster.

 

 Table 1. Overview of HW/SW configuration.

Preparing the DGX-1 systems

To install RHEL 7.7 on the DGX systems, use the installation instructions provided by NVIDIA. These steps also cover the installation of the DGX-specific software.

Connecting the DGX systems to the OpenShift cluster

Install the cluster by following the instructions in the Red Hat documentation for OpenShift 3.11.

After the cluster is up and running, use the official scale-up playbooks provided by Red Hat to extend the cluster to include the two DGX systems. These playbooks add the necessary libraries, configure the nodes, and add them to the cluster itself.
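A hedged sketch of such a scale-up run (the inventory path and playbook location assume the openshift-ansible package defaults for OpenShift 3.11; adjust them for your installation):

# Add the DGX hosts to the [new_nodes] group of the Ansible inventory first,
# then run the official node scale-up playbook.
ansible-playbook -i /etc/ansible/hosts \
    /usr/share/ansible/openshift-ansible/playbooks/openshift-node/scaleup.yml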

Figure 2 shows the start screen of the OCP dashboard, which allows interaction with the OpenShift cluster. These interactions include monitoring resources, creating pods, and retrieving logs.

 

 Figure 2. Start screen of an OpenShift cluster.

Enabling GPUs in the cluster

OpenShift (and Kubernetes) support standard resources such as CPU, memory, and available disk space out of the box. Other resources are handled with device plug-ins or Operators. In this setup, the NVIDIA GPU device plug-in for OpenShift 3.11 is used.

In newer versions of Kubernetes and OpenShift (v1.13+, v4.1+), the Operator Framework was introduced. Using this framework, the NVIDIA GPU Operator automates the deployment of components that previously had to be deployed manually: the NVIDIA driver, the Kubernetes device plug-in for GPUs, the NVIDIA Container Runtime, automatic node labeling, and DCGM-based monitoring.

For more information, see the device plug-in and NVIDIA GPU Operator documentation.
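After the device plug-in (or GPU Operator) is in place, a quick way to confirm that the GPUs are exposed as a schedulable resource is to inspect one of the DGX nodes (the node name below is an example):

oc describe node dgx01.dev.example | grep -i nvidia.com/gpu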

 

 Figure 3. Multi-GPU, single-node workload example.

In the following examples, we show you how to trigger an MPI-based DL workload using the Horovod framework. For this test, we used an MPI-enabled Docker container with the required frameworks, such as the NVIDIA GPU Cloud (NGC) containers. NGC is a GPU-optimized software hub, simplifying AI and HPC workflows.

Start the workload, then wrap the command in an OpenShift YAML for orchestration.

To run a scalable ResNet-50 training with randomly generated data natively, run the following command:

docker run --gpus all -it horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6

The previous example makes use of all GPUs available on the system with the --gpus all flag. To allocate only a smaller number of GPUs, change that to --gpus 1. You can also specify individual GPUs by using their device IDs.
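For example, to hand the container only two specific GPUs by device ID, a minimal sketch using the same image looks like this:

docker run --gpus '"device=0,1"' -it horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6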

Inside the interactively started container, a typical command to launch the workload looks like the following code example. In this example, eight GPUs are used:

horovodrun -np 8 -H localhost:8 python pytorch_synthetic_benchmark.py

A corresponding YAML file is used to start this workload through OpenShift. It can be deployed straight from an OpenShift login node, an OpenShift master node, or the GUI.

oc create -f horovod_example_8gpus.yaml

 

kubectl create -f horovod_example_8gpus.yaml

This is the content of the YAML file used:

apiVersion: v1
kind: Pod
metadata:
  name: horovod-new-test
  namespace: managed-machine-learning
spec:
  serviceAccount: tensorflow-sa
  restartPolicy: OnFailure
  containers:
  - name: horovod-test
    image: horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6
    command: [ "horovodrun", "-np", "8", "python", "pytorch_synthetic_benchmark.py" ]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "compute,utility"
    - name: NVIDIA_REQUIRE_CUDA
      value: "cuda>=9.0"
    resources:
      limits:
        nvidia.com/gpu: "8"
      requests:
        nvidia.com/gpu: "8"
    securityContext:
      privileged: true

To influence the scalability of the workload, set the following parameters accordingly.

The command section specifies the command to be run, which is the same as used in the Docker example. The number of processes spawned can be controlled via the -np 8 flag.

command: [ "horovodrun", "-np", "8", "python" , "pytorch_synthetic_benchmark.py" ]

Allocate the proper number of GPUs in the resources section. The number should correspond to the number of processes set in the horovodrun command.

    resources:
      limits:
        nvidia.com/gpu: "8"
      requests:
        nvidia.com/gpu: "8"

A pod is a group of one or more containers. Figure 4 shows the creation of pods using the OpenShift web interface. As mentioned earlier, you can also create a new pod from the CLI using either oc or kubectl.

 

 Figure 4. Created pods inside the OpenShift dashboard.
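The same information is also available from the CLI; the namespace is the one used in the YAML shown earlier:

oc get pods -n managed-machine-learning -o wide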

There are several ways to retrieve the output or logs of pods running in an OpenShift cluster. One way is to access the logs through the CLI by using either kubectl logs or oc logs:

kubectl logs <pod-name> -n <namespace> 

 

oc logs <pod-name> -n <namespace>

Figure 5 shows the other possibility, using the OpenShift GUI.

 

 Figure 5. Access to the pod output using the dashboard.

Multi-GPU, multi-node workload

There are several ways to leverage the capabilities of multiple systems for a single DL training job. The compute resources of several systems can be aggregated to speed up training. This is especially beneficial for workloads such as DNN training in autonomous vehicle development, where the turnaround time of experiments on large datasets is often a critical factor.

Figure 1 shows two DGX systems working collaboratively on a single DL job. The DL workload uses a total of 16 GPUs, eight on each DGX-1 system. Each worker processes its computation on a partition of the data.

All workers synchronize their gradients with their peers using NVIDIA NVLink, a communication protocol developed by NVIDIA that allows data and control transfers between CPUs and GPUs or between GPUs, as well as over the network between nodes. This synchronization step is shown as dashed lines.

The same codebase and containers as before are reused. The only difference from the previous orchestration approach is the use of the well-known MPI Operator, which simplifies distributing and deploying the workload across multiple DGX systems.

MPI Operator

There are several options for running multi-GPU, multi-node workloads on an OpenShift or Kubernetes cluster. One common framework is the MPI Operator, provided by the Kubeflow project. The MPI Operator handles the orchestration of DL workloads as shown earlier.
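As a hedged sketch of making this job type available (the manifest path and branch are assumptions and depend on the mpi-operator release; v1alpha2 is assumed here to match the job specifications below):

# Deploy the Kubeflow MPI Operator; check the kubeflow/mpi-operator repository
# for the manifest path of the release you use.
oc create -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v1alpha2/mpi-operator.yaml

# Verify that the MPIJob custom resource definition is registered.
oc get crd mpijobs.kubeflow.org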

Installing the MPI Operator introduces a new job type in the cluster: MPIJob. The following job specification shows the corresponding YAML file for launching such an MPIJob workload through OpenShift or Kubernetes. It can be deployed directly from an OpenShift login node, a master node, or the GUI.

Unlike the YAML file in the previous section, this example file uses the NVIDIA TensorFlow Docker image and runs a synthetic ResNet-50 benchmark on the 16 GPUs of two DGX-1 systems. Using the TensorFlow Docker image provided through NGC lets you benefit from NVIDIA's continuous performance improvements. The communication between the three created pods (one launcher pod and two worker pods) is handled by the MPI Operator.

As with the YAML definition of the Horovod distributed model, the number of processes started with mpirun -np 16 again corresponds to the number of GPUs requested. In this example, there are two worker pods, each with eight GPUs attached. The example can easily be scaled by changing the number of worker pods and the number of spawned processes.

The job specification for the 16-GPU training job looks like the following:

apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: 16gpu-tensorflow-benchmark-imagenet
spec:
  slotsPerWorker: 8
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: nvcr.io/nvidia/tensorflow:19.10-py3
            name: tensorflow-benchmarks
            env:
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: NVIDIA_REQUIRE_CUDA
              value: cuda>=9.0
            command:
            - mpirun
            - -np
            - "16"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - nvidia-examples/resnet50v1.5/main.py
            - --mode=training_benchmark
            - --batch_size=128
            - --num_iter=90
            - --iter_unit=epoch
            - --results_dir=/efs
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: nvcr.io/nvidia/tensorflow:19.10-py3
            name: tensorflow-benchmarks
            env:
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: NVIDIA_REQUIRE_CUDA
              value: cuda>=9.0
            resources:
              limits:
                nvidia.com/gpu: 8

On Kubernetes, the YAML file looks like the one used in OpenShift. One important difference is that environment variables in OpenShift v3.11 must be set inside the spec of the pods.

            env:
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: NVIDIA_REQUIRE_CUDA
              value: cuda>=9.0
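Once the job specification is in place, the MPIJob can be submitted and followed like any other resource; the file name and launcher pod suffix below are placeholders:

oc create -f 16gpu-tensorflow-benchmark-imagenet.yaml
oc get mpijobs
oc logs -f 16gpu-tensorflow-benchmark-imagenet-launcher-<suffix>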

Aggregating cutoff resources

In large computing environments, there is often a cutoff or some unused resources. The MPI Operator is of great use in such a case, as it can aggregate those cutoffs and avoid waste. This is a property that only the MPI Operator can deliver.
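To see how many leftover GPUs a node still has before sizing such a job, compare its allocatable GPU count with what is already requested (the node name below is an example):

oc describe node dgx01.dev.example | grep -A8 "Allocated resources"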

The following code example shows an exemplary two-GPU job that can aggregate resources from multiple systems. The job requests two GPUs for the workload. You can customize this example to fit almost every situation, for example three GPUs on three different nodes or four GPUs on two different nodes.

apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: tensorflow-benchmarks-gpu-v1a2
spec:
  slotsPerWorker: 1
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks-gpu
            env:
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: NVIDIA_REQUIRE_CUDA
              value: cuda>=9.0
            command:
            - mpirun
            - -np
            - "2"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - -mca
            - plm_base_verbose
            - "100"
            - -mca
            - btl_base_verbose
            - "30"
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --model=resnet101
            - --batch_size=64
            - --variable_update=horovod
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            env:
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: NVIDIA_REQUIRE_CUDA
              value: cuda>=9.0
            resources:
              limits:
                nvidia.com/gpu: 1

Running this YAML file produces a log like the following, which can be collected using either the GUI or the CLI.

oc logs <pod-name> -n <namespace>

 

kubectl logs <pod-name> -n <namespace>

tensorflow-benchmarks-gpu-v1a2-worker-1:26:156 [0] NCCL INFO Ring 01 : 1 -> 0 [send] via NET/Socket/0

tensorflow-benchmarks-gpu-v1a2-worker-0:26:156 [0] NCCL INFO Ring 01 : 0 -> 1 [send] via NET/Socket/0

tensorflow-benchmarks-gpu-v1a2-worker-0:26:156 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled

tensorflow-benchmarks-gpu-v1a2-worker-1:26:156 [0] NCCL INFO comm 0x7f887032e120 rank 1 nranks 2 cudaDev 0 nvmlDev 7 - Init COMPLETE

tensorflow-benchmarks-gpu-v1a2-worker-0:26:156 [0] NCCL INFO comm 0x7f003434e940 rank 0 nranks 2 cudaDev 0 nvmlDev 3 - Init COMPLETE

tensorflow-benchmarks-gpu-v1a2-worker-0:26:156 [0] NCCL INFO Launch mode Parallel

Done warm up

Step    Img/sec    total_loss

Done warm up

Step    Img/sec    total_loss

1    images/sec: 147.6 +/- 0.0 (jitter = 0.0)    8.308

1    images/sec: 147.8 +/- 0.0 (jitter = 0.0)    8.378

10    images/sec: 159.1 +/- 2.3 (jitter = 4.7)    8.526

Querying the running pods inside the cluster shows that cutoff resources from two nodes are being used.

$ oc get pods -o wide

NAME                                             READY   STATUS    RESTARTS   AGE       IP                NODE         

tensorflow-benchmarks-gpu-v1a2-launcher-dqgsk  1/1 Running   0      3m45s   10.233.XXX.XXX   dgx01.dev.XXX 

tensorflow-benchmarks-gpu-v1a2-worker-0          1/1   Running   0      3m45s   10.233.XXX.XXX   dgx02.dev.XXX

tensorflow-benchmarks-gpu-v1a2-worker-1          1/1   Running   0      3m45s   10.233.XXX.XXX   dgx01.dev.XXX

Docker network configuration

NCCL discovers the topology with its peers. The following output was taken from one of the running pods. It shows that, besides the standard loopback adapter, only one additional interface is configured.

eth0      Link encap:Ethernet  HWaddr 0a:58:0a:81:07:b2

          inet addr:10.129.7.178  Bcast:10.129.7.255  Mask:255.255.254.0

          inet6 addr: fe80::419:d8ff:fe19:1bbd/64 Scope:Link

          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1

          RX packets:21275 errors:0 dropped:0 overruns:0 frame:0

          TX packets:33993 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:0

          RX bytes:11305584 (11.3 MB)  TX bytes:4496191 (4.4 MB)

 

lo        Link encap:Local Loopback

          inet addr:127.0.0.1  Mask:255.0.0.0

          inet6 addr: ::1/128 Scope:Host

          UP LOOPBACK RUNNING  MTU:65536  Metric:1

          RX packets:28205726 errors:0 dropped:0 overruns:0 frame:0

          TX packets:28205726 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:1000

          RX bytes:12809324160 (12.8 GB)  TX bytes:12809324160 (12.8 GB)

Having a single Docker NIC (here, eth0) is a starting point. A more sophisticated setup may use multiple Docker network adapters.
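If a pod has more than one network interface, NCCL can be pinned to the preferred one through its environment variables. A minimal sketch for the container spec, assuming an additional high-bandwidth interface named net1 is attached to the pods:

    env:
    - name: NCCL_SOCKET_IFNAME   # restrict NCCL to the high-bandwidth interface (net1 is an example name)
      value: net1
    - name: NCCL_DEBUG
      value: INFO                # log which interfaces NCCL selects during initialization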

Get started with multi-GPU, multi-node training running on OpenShift

Adopt industry state-of-the-art DL workloads and deploy them at scale today. In this post, we showed that data-parallel training using the MPI paradigm is highly flexible across different environments.

With this approach, DL engineers working on a scalable cluster of DGX systems get the following benefits, regardless of their orchestration software:

  • Scale beyond the limitations of a single node and enable DL in a larger cluster.
  • Prevent resource cutoffs from turning into waste by aggregating leftover resources effectively.

The Robotic Drive containerized compute platform on OpenShift orchestrates DL workloads at scale, including multi-GPU, multi-node jobs on NVIDIA DGX systems. Vision-based ML models provide a solid technical basis for state-of-the-art, complex driving-behavior tasks such as motion planning. These tasks require extensive exploration and experimentation, which are addressed in the Robotic Drive Innovation Lab.

Original article: https://www.cnblogs.com/wujianming-110117/p/14022767.html