[Translation] Vulkan Renderpasses

Original article: https://gpuopen.com/vulkan-renderpasses/

Vulkan™ is a high performance, low overhead graphics API designed to allow advanced applications to drive modern GPUs to their fullest capacity. Where traditional APIs have presented an abstraction that behaves as if commands are executed immediately, Vulkan uses a model that exposes what’s really going on; that, in fact, GPUs execute commands placed in memory, potentially out of order, and where those buffers of commands may be built in parallel across many software threads. Furthermore, large pieces of interrelated state are presented to the graphics driver at the same time through state objects. This provides drivers with an opportunity to fully optimize GPU state long ahead of render time in order to maximize performance without risking stuttering and other issues associated with just-in-time optimization. The end result is lower, more consistent frame times and lower CPU overhead, meaning more CPU cycles for your application.

Vulkan is derived from AMD’s trail-blazing Mantle API. AMD donated the Mantle specification, headers and other technology to Khronos Group to use as the basis of their (at the time unnamed) next-generation API. With the help of other industry players over the course of more than a year, we eventually evolved Mantle into what became Vulkan. It was a long process, and perhaps some of the most significant changes came from our members in the mobile field, who primarily use tiled architectures in their GPUs which are designed to minimize off-chip traffic to memory in an effort to save power. Among the features proposed by our mobile members was the renderpass — an object designed to allow an application to communicate the high-level structure of a frame to the driver. Tiling GPU drivers can use this information to determine when to bring data on and off chip, whether or not to flush data out to memory or discard the content of tile buffers and even to do things like size memory allocations used for binning and other internal operations. This is a feature that Mantle did not have, and is not part of Direct3D® 12 either.

To Tile or Not To Tile

A tiled GPU will batch up geometry, determine which parts of the framebuffer that geometry lands in, and then for each region of the framebuffer, render the parts of geometry that hit that tile. This makes framebuffer access very coherent and in many cases, can allow the GPU to complete rendering of one framebuffer tile entirely on-chip before moving to the next. AMD does not make tiling GPUs. Our GPUs are what is known as forward or immediate renderers. This means that when a command comes in to draw some geometry, the GPU will render it to wherever it lands and complete processing it before moving on to the next command. Things are pipelined, and commands can overlap and even finish out of order, but there is special hardware built into the GPU to get things back into the right order before any data is written to memory. Our drivers don’t generally need to worry about this. So, what do these renderpass objects have to do with us? Why do we care?
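
The binning step described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular GPU works: the tile size, the `bin_triangles` helper and the bounding-box overlap test are all hypothetical choices made for illustration.

```python
from collections import defaultdict

TILE_SIZE = 32  # hypothetical tile edge in pixels; real hardware varies


def bin_triangles(triangles, width, height, tile_size=TILE_SIZE):
    """Map each tile (tx, ty) to the indices of triangles whose bounding box overlaps it."""
    bins = defaultdict(list)
    max_tx = (width - 1) // tile_size
    max_ty = (height - 1) // tile_size
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Skip triangles that lie entirely outside the framebuffer.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= width or min(ys) >= height:
            continue
        tx0 = max(0, int(min(xs)) // tile_size)
        tx1 = min(max_tx, int(max(xs)) // tile_size)
        ty0 = max(0, int(min(ys)) // tile_size)
        ty1 = min(max_ty, int(max(ys)) // tile_size)
        # Bounding-box binning is conservative: a tile may be listed
        # even if the triangle's actual area never touches it.
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return bins


# A 64x64 framebuffer split into four 32x32 tiles.
triangles = [
    [(2, 2), (10, 2), (2, 10)],       # fits inside tile (0, 0)
    [(20, 20), (40, 20), (20, 40)],   # bounding box straddles all four tiles
]
bins = bin_triangles(triangles, width=64, height=64)
```

Once the bins are built, a tiler can walk the tiles one at a time and render only the triangles listed for each, which is what keeps framebuffer traffic on-chip.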

In Vulkan, a renderpass object contains the structure of the frame. In its simplest form, a renderpass encapsulates the set of framebuffer attachments, basic information about pipeline state and not much more. However, a renderpass can contain one or more subpasses and information about how those subpasses relate to one another. This is where things get interesting.

Each subpass can reference a subset of the framebuffer attachments for writing and also a subset of the framebuffer attachments for reading. These readable framebuffer attachments are known as input attachments and effectively contain the result of an earlier subpass at the same pixel. Unlike traditional render-to-texture techniques, where each pass may read any pixel produced by a previous pass, input attachments guarantee that each fragment shader only accesses data produced by shader invocations at the same pixel. Further, each subpass contains information about what to do with each of the attachments when it begins (clear it, restore it from memory, or leave it uninitialized) and what to do with the attachments when it ends (store it back to memory or throw it away). The dependencies between the subpasses are explicitly spelled out by the application. This allows a tiled renderer to know, exactly, when it needs to flush its tile buffer, clear it, restore it from memory, and so on.
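
To make the structure concrete, here is a small Python data model of a renderpass, a sketch rather than the Vulkan API itself. The class names and the `missing_dependencies` helper are hypothetical. The sketch records load and store operations once per attachment, which is how the released API's `VkAttachmentDescription` expresses them, and it only checks for direct (not transitive) dependencies.

```python
from dataclasses import dataclass, field
from enum import Enum


class LoadOp(Enum):
    LOAD = "load"            # restore previous contents from memory
    CLEAR = "clear"          # start from a clear value
    DONT_CARE = "dont_care"  # contents may start undefined


class StoreOp(Enum):
    STORE = "store"          # write results back to memory
    DONT_CARE = "dont_care"  # results may be thrown away


@dataclass
class Attachment:
    name: str
    load_op: LoadOp
    store_op: StoreOp


@dataclass
class Subpass:
    color_outputs: list                                    # attachment indices written
    input_attachments: list = field(default_factory=list)  # indices read at the same pixel


@dataclass
class RenderPass:
    attachments: list
    subpasses: list
    dependencies: list  # (producer_subpass, consumer_subpass) pairs


def missing_dependencies(rp):
    """Return (producer, consumer, attachment) triples for input-attachment
    reads that have no directly declared dependency on the last writer."""
    declared = set(rp.dependencies)
    missing = []
    last_writer = {}
    for s_index, subpass in enumerate(rp.subpasses):
        for a in subpass.input_attachments:
            producer = last_writer.get(a)
            if producer is not None and (producer, s_index) not in declared:
                missing.append((producer, s_index, a))
        for a in subpass.color_outputs:
            last_writer[a] = s_index
    return missing


# A two-subpass deferred-shading layout: geometry writes a G-buffer,
# lighting reads it as an input attachment and writes the final color.
gbuffer = Attachment("gbuffer", LoadOp.CLEAR, StoreOp.DONT_CARE)
final = Attachment("final_color", LoadOp.DONT_CARE, StoreOp.STORE)
geometry = Subpass(color_outputs=[0])
lighting = Subpass(color_outputs=[1], input_attachments=[0])

deferred = RenderPass([gbuffer, final], [geometry, lighting], dependencies=[(0, 1)])
undeclared = RenderPass([gbuffer, final], [geometry, lighting], dependencies=[])
```

With the dependency declared, `missing_dependencies(deferred)` is empty; dropping it makes the helper report the lighting subpass reading attachment 0 without an ordering guarantee.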

Go Forward Faster

As it turns out, a forward renderer such as ours can take advantage of this kind of information as well. Here are a few examples of the types of optimizations we can make.

Just as we can tell that one subpass depends on the result of an earlier one, we can tell when a subpass does not depend on an earlier one. Therefore, we can sometimes render those subpasses in parallel or even out of order without synchronization. If one subpass depends on the result of a previous subpass, then with a traditional graphics API, the driver would need to inject a bubble into the GPU pipeline in order to synchronize the render backends’ output caches with the texture units’ input caches. However, by rescheduling work, we can instruct the render backends to flush their caches, process unrelated work and then invalidate the texture caches before initiating the consuming subpass. This eliminates the bubble and saves GPU time.
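
The rescheduling idea can be illustrated with a toy greedy scheduler. This is an assumption-laden sketch, not a description of AMD's driver: the `schedule` and `back_to_back_pairs` functions are hypothetical, and the heuristic (prefer the ready subpass whose most recent producer ran longest ago) is just one simple way to push unrelated work between a producer and its consumer.

```python
def schedule(num_subpasses, dependencies):
    """Order subpasses so consumers run as far from their producers as possible.

    dependencies: iterable of (producer, consumer) subpass index pairs.
    """
    producers = {s: set() for s in range(num_subpasses)}
    for producer, consumer in dependencies:
        producers[consumer].add(producer)

    position = {}  # subpass index -> slot in the final order
    order = []
    while len(order) < num_subpasses:
        ready = [
            s for s in range(num_subpasses)
            if s not in position and all(p in position for p in producers[s])
        ]
        if not ready:
            raise ValueError("dependency cycle between subpasses")

        def latest_producer(s):
            # Slot of the most recently scheduled producer, or -1 if none.
            return max((position[p] for p in producers[s]), default=-1)

        # Prefer work with no recent producer; break ties by submission order.
        chosen = min(ready, key=lambda s: (latest_producer(s), s))
        position[chosen] = len(order)
        order.append(chosen)
    return order


def back_to_back_pairs(order, dependencies):
    """List (producer, consumer) pairs that run in adjacent slots."""
    deps = set(dependencies)
    return [(a, b) for a, b in zip(order, order[1:]) if (a, b) in deps]


# Subpass 1 consumes subpass 0; subpass 2 is unrelated to both.
deps = [(0, 1)]
reordered = schedule(3, deps)
```

In submission order 0, 1, 2 the consumer sits directly behind its producer, which is where a bubble would appear. The reordered schedule slots the unrelated subpass between them, so `back_to_back_pairs(reordered, deps)` comes back empty.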

Because each subpass includes information about what to do with its attachments, we can tell that an application is going to clear an attachment, or that it doesn’t care about the content of that attachment. This allows the driver to schedule clears far ahead of real rendering work, or to intelligently decide what method to use to clear an attachment (such as using a compute shader, fixed function hardware or a DMA engine). If an application says that it doesn’t need an attachment to have defined data, we can bring the attachment into a partially compressed state. In that state, the data it contains is undefined, but its state, as far as the hardware is concerned, is optimal for rendering.
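
As a rough illustration of how declared load operations could drive these decisions, the sketch below maps each attachment's load operation to a setup step. The function name and the action labels are hypothetical placeholders for whatever a real driver does internally.

```python
def plan_attachment_setup(attachments):
    """Turn per-attachment load ops into a list of (action, name) setup steps.

    attachments: list of (name, load_op), load_op in {"load", "clear", "dont_care"}.
    """
    actions = {
        "clear": "schedule_early_clear",    # can be hoisted ahead of rendering work
        "dont_care": "init_metadata_only",  # contents undefined, state made render-ready
        "load": "keep_existing_contents",   # previous data must be preserved
    }
    return [(actions[load_op], name) for name, load_op in attachments]


steps = plan_attachment_setup([
    ("depth", "clear"),
    ("scratch", "dont_care"),
    ("history", "load"),
])
```

Only the `"load"` case forces the existing data to be kept; the other two give the driver freedom over how and when the attachment is prepared.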

In some cases, the in-memory layout of data is different for optimal rendering and reading via the texture units. By analyzing the data dependencies that an application provides, our drivers can decide when it is best to perform layout changes, decompression, format conversion and so on. They can also split some of these operations into phases, interleaving them with application-supplied rendering work, which, again, eliminates pipeline bubbles and improves efficiency.

Finally, Vulkan includes the concept of transient attachments. These are framebuffer attachments that begin in an uninitialized or cleared state at the beginning of a renderpass, are written by one or more subpasses, consumed by one or more subpasses and are ultimately discarded at the end of the renderpass. In this scenario, the data in the attachments only lives within the renderpass and never needs to be written to main memory. Although we’ll still allocate memory for such an attachment, the data may never leave the GPU, instead only ever living in cache. This saves bandwidth, reduces latency and improves power efficiency.
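
One way to spot attachments that fit this description is to look at their load and store operations: anything that never loads old contents and never stores its result lives entirely inside the renderpass. The helper below is a hypothetical sketch of that check. In the real API, an application would additionally mark such images with `VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT` and may back them with lazily allocated memory.

```python
def transient_candidates(attachments):
    """Return the names of attachments whose data never needs to reach memory.

    attachments: dict mapping name -> (load_op, store_op), using the strings
    "load" / "clear" / "dont_care" and "store" / "dont_care".
    """
    return sorted(
        name
        for name, (load_op, store_op) in attachments.items()
        if load_op in ("clear", "dont_care") and store_op == "dont_care"
    )


frame_attachments = {
    "gbuffer_albedo": ("clear", "dont_care"),   # written and consumed in-pass
    "depth": ("clear", "dont_care"),            # only needed while rendering
    "final_color": ("dont_care", "store"),      # must be kept for presentation
    "history": ("load", "store"),               # carries data across frames
}
candidates = transient_candidates(frame_attachments)
```

Here the G-buffer and depth attachments qualify, while the final color and history buffers do not because their contents have to survive the renderpass.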

A First-Class Feature

Renderpasses should not be seen as a “mobile-only” feature. They are a first-class feature of the Vulkan API, one that presents a lot of opportunities for optimization and efficiency on the GPU, even for forward, immediate renderers such as the GCN architecture. Our initial early-look drivers include a renderpass compiler which already performs some of the optimizations outlined above. We have a laundry list of experiments to perform and we’ll be bringing more and more features online over the coming months. Simply combining a couple of passes together into a single subpass probably won’t yield much improvement. However, getting as much of your frame inside as few renderpass objects as possible is potentially going to be a huge win in both software and hardware performance.

Graham Sellers is a Fellow Software Architect at AMD working on graphics drivers. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.
