Tile-Based Rendering // the later part is mobile optimization advice

https://www.imgtec.com/blog/a-look-at-the-powervr-graphics-architecture-tile-based-rendering/

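A hardware architecture.

The color target is split into tiles.

This reduces memory bandwidth.

Hidden surface removal is done with depth before the fragment shader (FS) runs; same idea as early-Z.

Fewer cache misses, since each row is shorter.

So avoid anything that defeats early-Z, such as FS operations that modify depth.

For example, discard in the FS (masking or alpha test) and alpha to coverage.

These bypass the on-chip depth and fetch depth from memory instead.

Always clear: it saves one load of the previous frame's contents into the tile buffer.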
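A minimal sketch of what "split into tiles" means in numbers. The 32x32 tile size and the 8 bytes per pixel (4 for color, 4 for depth/stencil) are assumptions for illustration only; real tile sizes depend on the GPU. The function names are made up for this sketch.

```python
import math

def tile_grid(width, height, tile=32):
    """Number of tiles needed to cover the render target (columns, rows).
    Partial tiles at the right/bottom edge still count as a whole tile."""
    return math.ceil(width / tile), math.ceil(height / tile)

def tile_bytes(tile=32, bytes_per_pixel=8):
    """On-chip storage for one tile, assuming color + depth/stencil
    both live in tile memory (bytes_per_pixel is an assumed value)."""
    return tile * tile * bytes_per_pixel

cols, rows = tile_grid(1920, 1080)
print(cols, rows, cols * rows)   # tile columns, tile rows, total tiles
print(tile_bytes())              # bytes of on-chip memory per tile
```

The point is only that each tile is small enough to stay on chip, so depth tests and blending inside a tile do not touch external memory.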

========================================

http://aras-p.info/texts/files/FastMobileShaders_siggraph2011.pdf

These optimization strategies date from 2011; a lot has changed since then (ETC2, for example).

Tiled deferred: PowerVR

Tiled: Mali, Adreno

Immediate: Tegra

1) TBDR: Render everything in tiles, shade only visible pixels

2) Tiled: Render everything in tiles

3) Classic: Render everything

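Because rendering is tiled, there are fewer cache misses when sampling than with one large framebuffer.

So mipmaps matter less for performance, but they are still worth having for visual quality (aliasing).

Compress texture assets per platform.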

PVRTC for PowerVR; DXT for Tegra; ATC for Adreno

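ETC2 for Android with OpenGL ES 3.0

TBDR: iPad 2

MSAA is cheaper than on immediate-mode GPUs

4x MSAA: 2-4 ms

Anisotropic filtering: 3 ms

aniso = 2

With mipmaps turned off, the iPad 2 gets 2-3 ms slower

Tegra falls over completely

On TBDR there is no per-draw-call GPU time, and without GPU timings optimization is harder.

Adreno and Tegra still have it.

If a frame's vertex data is too large it gets split, which hurts efficiency (it cannot be processed in one pass, so it takes two). 1000 thousand vertices on iPad 2.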

=====================================

Reduce overdraw from alpha blending.

PerfHUD ES profiler

============================

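Optimization example: Tegra

Draw the skybox last.

Draw opaque geometry front to back (not fully realistic, since it would need sorting at polygon granularity).

Sort large nearby objects this way; group distant ones by material to merge batches and reduce render state changes.

(Brilliant. I had only seen these two goals as conflicting and never thought of splitting by distance to use both strategies.)

Draw the main character first, and enemies after the scene (they are occluded).

Because rejecting occluded geometry still costs 1 ms on Tegra 2 (vertex shader), we can set up trigger zones that turn the skybox off there, which removes the VS cost as well.

Sorting opaque geometry gave a 15 ms improvement.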
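A rough sketch of the mixed ordering described above: large nearby objects front to back, everything else grouped by material, skybox last. The `Draw` record, the `near_limit` and `big_radius` thresholds, and the function name are all hypothetical; the slides do not give concrete values.

```python
from dataclasses import dataclass

@dataclass
class Draw:
    name: str
    material: str
    distance: float          # distance from the camera
    radius: float            # rough object size
    is_skybox: bool = False

def order_opaque(draws, near_limit=20.0, big_radius=2.0):
    """Return draws in submission order: near+big by distance,
    the rest by material, skybox at the very end."""
    sky = [d for d in draws if d.is_skybox]
    rest = [d for d in draws if not d.is_skybox]
    near = [d for d in rest
            if d.distance <= near_limit and d.radius >= big_radius]
    near_ids = {id(d) for d in near}
    far = [d for d in rest if id(d) not in near_ids]
    near.sort(key=lambda d: d.distance)   # front to back: good occluders first
    far.sort(key=lambda d: d.material)    # stable sort keeps materials together
    return near + far + sky

scene = [
    Draw("sky", "sky", 1000.0, 500.0, is_skybox=True),
    Draw("rockB", "rock", 80.0, 1.0),
    Draw("wall", "brick", 10.0, 5.0),
    Draw("treeA", "bark", 60.0, 1.0),
    Draw("hero", "hero", 3.0, 2.0),
    Draw("rockA", "rock", 70.0, 1.0),
]
print([d.name for d in order_opaque(scene)])
```

Only the few big occluders pay for a distance sort; the long tail of small distant objects keeps cheap material batching.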

------

Shader optimization

Shader metric: cycles/pixel. Static analysis tools exist; see my other posts.

Lighting in a lookup texture (LUT)

Sampled as tex2D(N.L, N.H) (I have used a Beckmann one before).
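To make the LUT idea concrete, here is a small sketch that bakes a table indexed by (N.L, N.H) and samples it with nearest-neighbour lookup. It stores a Lambert diffuse term and a Blinn-Phong style specular term, which is a swapped-in choice for illustration, not the Beckmann table mentioned above. The table size, the `shininess` value, and the function names are assumptions.

```python
def bake_light_lut(size=16, shininess=32.0):
    """Build a size x size table; rows index N.H, columns index N.L.
    Each entry is (diffuse, specular)."""
    lut = []
    for row in range(size):
        n_dot_h = row / (size - 1)
        lut_row = []
        for col in range(size):
            n_dot_l = col / (size - 1)
            diffuse = n_dot_l
            # No highlight on surfaces facing away from the light.
            specular = n_dot_h ** shininess if n_dot_l > 0.0 else 0.0
            lut_row.append((diffuse, specular))
        lut.append(lut_row)
    return lut

def sample_lut(lut, n_dot_l, n_dot_h):
    """Nearest-neighbour lookup with inputs clamped to [0, 1].
    A GPU sampler would normally apply bilinear filtering instead."""
    size = len(lut)
    clamp = lambda v: max(0.0, min(1.0, v))
    col = round(clamp(n_dot_l) * (size - 1))
    row = round(clamp(n_dot_h) * (size - 1))
    return lut[row][col]

lut = bake_light_lut()
print(sample_lut(lut, 0.8, 0.95))
```

In a shader the pow() and related ALU work is replaced by a single texture fetch, which only pays off when the shader is ALU bound rather than texture bound.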

----------

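Texture compression: formats supported by the hardware are sampled directly.

Tools

iOS + PowerVR

Unity profiler

Apple Instruments

PowerVR has its own tool suite (see the official site). PVRUniSCo, the shader analyzer, shows cycle counts.

Android + Tegra

NVIDIA PerfHUD ES

GPU time per draw call

Shader cycles

2x2 textures and null view rectangle: both overrides are very handy for ruling out bottlenecks

The author likes this tool a lot, but it seems to need physical dev-kit hardware, which looks inconvenient.

Mali

Adreno: both have their own vendor tools

Frame capture

Shader analysis and live editing (I really like this feature)

I mostly use the Snapdragon tools.

RenderDoc recently added Android support too, and it works reasonably well.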

============================================

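Shader optimization: floating-point precision

float/half/fixed correspond to highp/mediump/lowp

Do not trust your intuition.

lowp: 8 bits of precision, range -2.0 to +2.0

Use it for colors and normalized vectors; do not scale or split apart lowp values.

mediump: 16 bits. UVs, 2D vectors, and quantities that do not need high precision.

highp: 24-32 bits, depending on the platform.

World-space positions, scalars, UVs for large textures, offsets with high precision requirements, and the like.

Precision behavior differs per platform, and some GPUs are more sensitive to it than others, so check the vendor's manual.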
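A small sketch that emulates the precision loss on the CPU, to show why intuition is unreliable here. `to_half` uses IEEE half floats as a stand-in for mediump, and `to_lowp` models lowp as fixed point with 1/256 steps clamped to [-2, 2]. Both are simplifying assumptions: the real representation is implementation-defined and varies by GPU.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest IEEE 754 half float
    (used here as an approximation of mediump)."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_lowp(x):
    """Assumed lowp model: clamp to [-2, 2], quantize to 1/256 steps."""
    clamped = max(-2.0, min(2.0, x))
    return round(clamped * 256) / 256

# A color channel survives lowp fine.
print(to_lowp(0.5))
# Values outside the range are clamped, so scaling lowp data is risky.
print(to_lowp(3.0))
# A world-space coordinate loses its fraction in half precision.
print(to_half(1024.3))
# A UV offset of 1/8192 near 0.5 disappears in half precision.
print(to_half(0.5 + 1 / 8192))
```

Half floats have a spacing of 1/2048 between 0.5 and 1.0, so UVs addressing a 4096-wide texture cannot hit every texel, which matches the note about large-texture UVs needing highp.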

===============================

Likewise, do not pack 2 UVs into one float4/vec4 varying for PowerVR

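float4 uv, read as uv.xy and uv.zw: do not do this on PowerVR.

Varyings and interpolation

Varying cost differs per platform; check the manual.

Adreno is less sensitive to shader complexity.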

==============

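The next example is an iOS optimization case.

glFinish wait: this gives you the GPU time. Use the profiler to see how long the CPU waited.

Post-processing: bloom and heat distortion cost 10+ ms.

Floating-point precision; merge heat distortion and bloom to save one blit.

This saved 10 ms (I know this one too: I removed two blits in PPv2 and also saved 10+ ms).

It has a fire wall shader that is used everywhere.

Determine whether it is ALU bound or texture bound.

ALU bound

Floating-point precision, per-vertex computation, lighting lookup texture.

Analyze the shader with a tool: PVRUniSCo.

Reduce the vertex count, which had been causing a scene split: 3 ms (Apple’s Instruments show this)

Particle optimization: reduce overdraw and simplify the shader.

The budget saved went to MSAA and anisotropic filtering.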

======================

TBDR

• Hidden Surface Removal

– For opaque only

– Don’t keep alpha-test enabled all the time (use it sparingly; enable it only when needed)

– Don’t keep “discard” keyword in shader source, even if it’s not used (remove unused discards)

• Group opaque drawcalls together

• Sort on state, not distance

============================

Snapdragon optimization advice

Qualcomm Snapdragon Rendering Tips

• Traditional handling of overdraw (via depth test)

– Cull as much as you can on CPU, to avoid both CPU and GPU cost

– Sort on distance (front to back) to maximize early z-rejection

• The Adreno SIMD is wide

– Check your ALU utilization in the Adreno Profiler and optimize

– Minimize temp register usage

– Use long shaders with a lot of ALU instructions

– Avoid dependent texture fetches (or cover the latency with a lot of ALUs)

==================

Switching FBOs on tile-based GPUs is expensive because the framebuffer has to be stored to memory.

Expensive to switch Frame Buffer Object on Tile-based GPUs

– Saves the current FBO to RAM

– Reloads the new FBO from RAM

High bandwidth cost.

Framebuffer Resolve/Restore

• Clear ALL FBO attachments after new frame/rendertarget

– Clear after eglSwapBuffers / glBindFramebuffer

– Avoids reloading FBO from RAM

– NOTE: Do NOT do unnecessary clears on non-tile-based GPUs (e.g. NVIDIA)

• Discard unused attachments before new frame/rendertarget

– Discard before eglSwapBuffers / glBindFramebuffer

– Avoids saving unused FBO attachments to RAM

– glDiscardFramebufferEXT / glInvalidateFramebuffer

All of this is to avoid reading and writing the framebuffer from and to memory.
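A back-of-the-envelope sketch of what clear and discard save. It models each attachment as either restored (loaded from RAM at the start of the pass) or resolved (stored to RAM at the end), and sums the bytes moved per frame. The resolution, the 4 bytes per pixel for both color and depth/stencil, and the function name are assumptions for illustration; it ignores compression and any driver-side optimizations.

```python
def frame_traffic_bytes(width, height, attachments):
    """attachments: list of (bytes_per_pixel, restored, resolved).
    restored = attachment is reloaded from RAM (no clear was issued).
    resolved = attachment is written back to RAM (not discarded)."""
    total = 0
    for bytes_per_pixel, restored, resolved in attachments:
        size = width * height * bytes_per_pixel
        total += size * (int(restored) + int(resolved))
    return total

# No clear, no discard: color and depth are both loaded and stored.
naive = frame_traffic_bytes(1920, 1080, [(4, True, True), (4, True, True)])

# Clear everything up front, discard depth at the end: only color is stored.
tuned = frame_traffic_bytes(1920, 1080, [(4, False, True), (4, False, False)])

print(naive, tuned, naive / tuned)
```

Under these assumptions the tuned pass moves a quarter of the data per frame, and the saving repeats for every render target switch.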

=============================================================

https://de45xmedrsdbp.cloudfront.net/Resources/files/GDC2014_Next_Generation_Mobile_Rendering-2033767592.pdf

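The difference between TBR and TBDR is HSR. Vertex processing is the same in both (VS, VS, VS, then PS).

But TBDR sends only the frontmost layer to the PS, because it sorts first.

TBR sends everything to the PS, so it suffers from overdraw.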
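A toy model of that difference for a single pixel covered by several opaque fragments. It counts pixel shader runs under an early-Z scheme, where the result depends on submission order, and under HSR, where only the frontmost fragment is shaded. The function names are made up, and the model ignores blending, discard, and everything else that disables these optimizations.

```python
def shaded_with_early_z(depths):
    """Pixel shader runs for one pixel when fragments arrive in the
    given order and each is tested against the depth seen so far."""
    nearest = float("inf")
    shaded = 0
    for z in depths:
        if z < nearest:      # passes the depth test, so it gets shaded
            nearest = z
            shaded += 1
    return shaded

def shaded_with_hsr(depths):
    """With hidden surface removal only the frontmost opaque
    fragment reaches the pixel shader, whatever the order."""
    return 1 if depths else 0

fragments = [0.9, 0.5, 0.7, 0.2]           # back-ish to front submission
print(shaded_with_early_z(fragments))       # overdraw from bad ordering
print(shaded_with_early_z(sorted(fragments)))  # front to back
print(shaded_with_hsr(fragments))
```

This is why front-to-back sorting matters on TBR and immediate GPUs, while on TBDR the advice is to sort on state instead.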