A Full Hardware Guide to Deep Learning

https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/


Deep Learning is very computationally intensive, so you will need a fast CPU with many cores, right? Or is it maybe wasteful to buy a fast CPU? One of the worst things you can do when building a deep learning system is to waste money on hardware that is unnecessary. Here I will guide you step by step through the hardware you will need for a cheap high-performance system.

Over the years, I have built a total of 7 different deep learning workstations, and despite careful research and reasoning, I made my fair share of mistakes in selecting hardware parts. In this guide, I want to share the experience I have gained over the years so that you do not make the same mistakes that I did.

The blog post is ordered by mistake severity. This means the mistakes where people usually waste the most money come first.

GPU

This blog post assumes that you will use a GPU for deep learning. If you are building or upgrading your system for deep learning, it is not sensible to leave out the GPU. The GPU is the heart of deep learning applications – the improvement in processing speed is simply too huge to ignore.

I talked at length about GPU choice in my GPU recommendations blog post, and the choice of your GPU is probably the most critical choice for your deep learning system. There are three main mistakes that you can make when choosing a GPU: (1) bad cost/performance, (2) not enough memory, (3) poor cooling.

For good cost/performance, I generally recommend an RTX 2070 or an RTX 2080 Ti. If you use these cards you should use 16-bit models. Otherwise, GTX 1070, GTX 1080, GTX 1070 Ti, and GTX 1080 Ti from eBay are fair choices and you can use these GPUs with 32-bit (but not 16-bit).
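As a rough illustration of what 16-bit training involves, here is a minimal PyTorch sketch using the automatic mixed-precision API (an assumption on my part — this API postdates the post, and tools like NVIDIA Apex served the same purpose at the time):

    import torch
    from torch import nn

    # Toy model and data, purely to illustrate the mechanics.
    model = nn.Linear(1024, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scaler = torch.cuda.amp.GradScaler()   # scales the loss to avoid 16-bit underflow

    x = torch.randn(32, 1024, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # forward pass runs in 16-bit where safe
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()          # backward pass on the scaled loss
    scaler.step(optimizer)
    scaler.update()

Activations and most matrix multiplications are then stored and executed in 16-bit, which is where the memory saving over plain 32-bit training comes from.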

Be careful about the memory requirements when you pick your GPU. RTX cards, which can run in 16-bit, can train models that are twice as big with the same memory compared to GTX cards. As such, RTX cards have a memory advantage, and picking RTX cards and learning how to use 16-bit models effectively will carry you a long way. In general, the requirements for memory are roughly the following:

  • Research that is hunting state-of-the-art scores: >=11 GB
  • Research that is hunting for interesting architectures: >=8 GB
  • Any other research: 8 GB
  • Kaggle: 4 – 8 GB
  • Startups: 8 GB (but check the specific application area for model sizes)
  • Companies: 8 GB for prototyping, >=11 GB for training

Another problem to watch out for, especially if you buy multiple RTX cards, is cooling. If you want to stick GPUs into PCIe slots which are next to each other, you should make sure that you get GPUs with a blower-style fan. Otherwise you might run into temperature issues, and your GPUs will be slower (by about 30%) and die faster.

Suspect line-up
Can you identify the hardware part which is at fault for bad performance? One of these GPUs? Or maybe it is the fault of the CPU after all?

RAM

The main mistake with RAM is to buy RAM with too high a clock rate. The second mistake is to buy too little RAM to have a smooth prototyping experience.

Needed RAM Clock Rate

RAM clock rates are marketing stunts, where RAM companies lure you into buying “faster” RAM which actually yields little to no performance gains. This is best explained by the video “Does RAM speed REALLY matter?” by Linus Tech Tips.

Furthermore, it is important to know that RAM speed is pretty much irrelevant for fast CPU RAM->GPU RAM transfers. This is so because (1) if you use pinned memory, your mini-batches will be transferred to the GPU without involvement from the CPU, and (2) if you do not use pinned memory, the performance gain of fast vs. slow RAM is about 0-3% — spend your money elsewhere!

RAM Size

RAM size does not affect deep learning performance. However, it might hinder you from executing your GPU code comfortably (without swapping to disk). You should have enough RAM to work comfortably with your GPU. This means you should have at least the amount of RAM that matches your biggest GPU. For example, if you have a Titan RTX with 24 GB of memory, you should have at least 24 GB of RAM. However, if you have more GPUs, you do not necessarily need more RAM.

The problem with this “match largest GPU memory in RAM” strategy is that you might still fall short of RAM if you are processing large datasets. The best strategy here is to match your GPU, and if you feel that you do not have enough RAM, just buy some more.

A different strategy is influenced by psychology: Psychology tells us that concentration is a resource that is depleted over time. RAM is one of the few hardware pieces that allows you to conserve your concentration resource for more difficult programming problems. Rather than spending lots of time on circumnavigating RAM bottlenecks, you can invest your concentration on more pressing matters if you have more RAM.  With a lot of RAM you can avoid those bottlenecks, save time and increase productivity on more pressing problems. Especially in Kaggle competitions, I found additional RAM very useful for feature engineering. So if you have the money and do a lot of pre-processing then additional RAM might be a good choice. So with this strategy, you want to have more, cheap RAM now rather than later.

CPU

The main mistake people make is paying too much attention to the PCIe lanes of a CPU. You should not care much about PCIe lanes. Instead, just look up whether your CPU and motherboard combination supports the number of GPUs that you want to run. The second most common mistake is to get a CPU which is too powerful.

CPU and PCI-Express

People go crazy about PCIe lanes! However, the thing is that they have almost no effect on deep learning performance. If you have a single GPU, PCIe lanes are only needed to transfer data from your CPU RAM to your GPU RAM quickly. An ImageNet batch of 32 images (32x225x225x3) at 32-bit needs 1.1 milliseconds with 16 lanes, 2.3 milliseconds with 8 lanes, and 4.5 milliseconds with 4 lanes. These are theoretical numbers, and in practice you often see PCIe transfers be about twice as slow — but this is still lightning fast! PCIe lanes often have a latency in the nanosecond range, and thus latency can be ignored.
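To see where these numbers come from, here is the back-of-the-envelope arithmetic (a sketch assuming PCIe 3.0 at roughly 985 MB/s of usable bandwidth per lane):

    # Theoretical PCIe transfer time for one ImageNet mini-batch.
    batch_bytes = 32 * 225 * 225 * 3 * 4   # 32 images, 225x225x3, 32-bit floats: ~19.4 MB
    lane_bytes_per_s = 985e6               # assumed usable PCIe 3.0 bandwidth per lane

    for lanes in (16, 8, 4):
        ms = batch_bytes / (lanes * lane_bytes_per_s) * 1000
        print(f"{lanes:2d} lanes: {ms:.1f} ms")   # ~1.2, ~2.5, ~4.9 ms, close to the figures above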

Putting this together, for an ImageNet mini-batch of 32 images and a ResNet-152, we have the following timing:

  • Forward and backward pass: 216 milliseconds (ms)
  • 16 PCIe lanes CPU->GPU transfer: About 2 ms (1.1 ms theoretical)
  • 8 PCIe lanes CPU->GPU transfer: About 5 ms (2.3 ms)
  • 4 PCIe lanes CPU->GPU transfer: About 9 ms (4.5 ms)

Thus going from 4 to 16 PCIe lanes will give you a performance increase of roughly 3.2%. However, if you use PyTorch’s data loader with pinned memory you gain exactly 0% performance. So do not waste your money on PCIe lanes if you are using a single GPU!
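For reference, a pinned-memory loader looks roughly like this in PyTorch (a sketch; MyDataset and train_step are hypothetical stand-ins for your own dataset and training code):

    from torch.utils.data import DataLoader

    # pin_memory=True places batches in page-locked RAM, so the CPU->GPU
    # copy can be done via DMA without involving the CPU.
    loader = DataLoader(MyDataset(), batch_size=32, shuffle=True,
                        num_workers=2, pin_memory=True)

    for x, y in loader:
        # non_blocking=True lets the transfer overlap with GPU compute.
        x = x.cuda(non_blocking=True)
        y = y.cuda(non_blocking=True)
        train_step(x, y)   # hypothetical training step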

When you select CPU PCIe lanes and motherboard PCIe lanes make sure that you select a combination which supports the desired number of GPUs. If you buy a motherboard that supports 2 GPUs, and you want to have 2 GPUs eventually, make sure that you buy a CPU that supports 2 GPUs, but do not necessarily look at PCIe lanes.

PCIe Lanes and Multi-GPU Parallelism

Are PCIe lanes important if you train networks on multiple GPUs with data parallelism? I have published a paper on this at ICLR 2016, and I can tell you that if you have 96 GPUs, then PCIe lanes are really important. However, if you have 4 or fewer GPUs, this does not matter much. If you parallelize across 2-3 GPUs, I would not care at all about PCIe lanes. With 4 GPUs, I would make sure that I can get 8 PCIe lanes per GPU (32 PCIe lanes in total). Since almost nobody runs a system with more than 4 GPUs, as a rule of thumb: do not spend extra money to get more PCIe lanes per GPU — it does not matter!

Needed CPU Cores

To be able to make a wise choice for the CPU we first need to understand the CPU and how it relates to deep learning. What does the CPU do for deep learning? The CPU does little computation when you run your deep nets on a GPU. Mostly it (1) initiates GPU function calls, (2) executes CPU functions.

By far the most useful application for your CPU is data preprocessing. There are two different common data processing strategies which have different CPU needs.

The first strategy is preprocessing while you train:

Loop:

  1. Load mini-batch
  2. Preprocess mini-batch
  3. Train on mini-batch

The second strategy is preprocessing before any training:

  1. Preprocess data
  2. Loop:
    1. Load preprocessed mini-batch
    2. Train on mini-batch

For the first strategy, a good CPU with many cores can boost performance significantly. For the second strategy, you do not need a very good CPU. For the first strategy, I recommend a minimum of 4 threads per GPU — that is usually two cores per GPU. I have not done hard tests for this, but you should gain about 0-5% additional performance per additional core/GPU.

For the second strategy, I recommend a minimum of 2 threads per GPU — that is usually one core per GPU. You will not see significant gains in performance when you have more cores if you are using the second strategy.
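In PyTorch terms, the difference between the two strategies is mainly where the preprocessing happens (a sketch; the transforms and the data path are placeholders):

    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # Strategy 1: preprocess on the fly in the loader's worker processes.
    # This is where extra cores help -- e.g. num_workers=4 per GPU.
    augment = transforms.Compose([transforms.RandomResizedCrop(224),
                                  transforms.ToTensor()])
    loader = DataLoader(datasets.ImageFolder("data/train", transform=augment),
                        batch_size=32, shuffle=True, num_workers=4)

    # Strategy 2: run the preprocessing once, save the results to disk, and
    # train on the preprocessed tensors -- 1-2 workers per GPU are then enough.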

Needed CPU Clock Rate (Frequency)

When people think about fast CPUs, they usually first think about the clock rate. 4 GHz is better than 3.5 GHz, or is it? This is generally true for comparing processors with the same architecture, e.g. “Ivy Bridge”, but it does not compare well between processors of different architectures. Also, it is not always the best measure of performance.

In the case of deep learning there is very little computation to be done by the CPU: Increase a few variables here, evaluate some Boolean expression there, make some function calls on the GPU or within the program – all these depend on the CPU core clock rate.

While this reasoning seems sensible, there is the fact that the CPU has 100% usage when I run deep learning programs, so what is the issue here? I did some CPU core rate underclocking experiments to find out.

CPU underclocking on MNIST and ImageNet: performance is measured as the time taken for 200 epochs on MNIST or a quarter of an epoch on ImageNet at different CPU core clock rates, where the maximum clock rate is taken as the baseline for each CPU. For comparison: upgrading from a GTX 680 to a GTX Titan is about +15% performance; from a GTX Titan to a GTX 980, another +20% performance; GPU overclocking yields about +5% performance for any GPU.

Note that these experiments were run on hardware that is dated; however, the results should still be the same for modern CPUs/GPUs.

Hard drive/SSD

The hard drive is not usually a bottleneck for deep learning. However, if you do stupid things it will hurt you: if you read your data from disk when it is needed (blocking wait), then a 100 MB/s hard drive will cost you about 185 milliseconds for an ImageNet mini-batch of size 32 — ouch! However, if you asynchronously fetch the data before it is used (for example, with torchvision loaders), then you will have loaded the mini-batch in those 185 milliseconds while the compute time for most deep neural networks on ImageNet is about 200 milliseconds. Thus you will not face any performance penalty, since you load the next mini-batch while the current one is still computing.
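The 185 milliseconds follow directly from the mini-batch size and the drive’s throughput (a quick sanity check, reading 100 MB/s as 100 MiB/s):

    # Blocking read of one raw ImageNet mini-batch from a 100 MB/s hard drive.
    batch_bytes = 32 * 225 * 225 * 3 * 4          # ~19.4 MB of 32-bit image data
    disk_bytes_per_s = 100 * 1024**2              # 100 MiB/s
    print(batch_bytes / disk_bytes_per_s * 1000)  # ~185 ms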

However, I recommend an SSD for comfort and productivity: Programs start and respond more quickly, and pre-processing with large files is quite a bit faster. If you buy an NVMe SSD you will have an even smoother experience when compared to a regular SSD.

Thus the ideal setup is to have a large and slow hard drive for datasets and an SSD for productivity and comfort.

Power supply unit (PSU)

Generally, you want a PSU that is sufficient to accommodate all your future GPUs. GPUs typically get more energy efficient over time; so while other components will need to be replaced, a PSU should last a long while, and a good PSU is therefore a good investment.

You can calculate the required watts by adding up the wattage of your CPU and GPUs, with an additional 10% for other components and as a buffer for power spikes. For example, if you have 4 GPUs with 250 watts TDP each and a CPU with 150 watts TDP, then you will need a PSU with a minimum of 4×250 + 150 + 100 = 1250 watts. I would usually add another 10% just to be sure everything works out, which in this case would result in a total of 1375 watts. I would round up in this case and get a 1400 watt PSU.
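The same estimate as a tiny helper (a sketch; the flat +100 watts for other components from the example is approximated as +10% here, so the result comes out slightly higher):

    def psu_watts(gpu_tdps, cpu_tdp):
        """Estimate required PSU wattage: components +10%, then +10% headroom."""
        base = sum(gpu_tdps) + cpu_tdp
        return base * 1.10 * 1.10

    print(psu_watts([250] * 4, 150))   # ~1391 watts -> round up to a 1400 watt PSU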

One important part to be aware of is that even if a PSU has the required wattage, it might not have enough PCIe 8-pin or 6-pin connectors. Make sure you have enough connectors on the PSU to support all your GPUs!

Another important thing is to buy a PSU with a high power efficiency rating – especially if you run many GPUs and will run them for a longer time.

Running a 4 GPU system at full power (1000-1500 watts) to train a convolutional net for two weeks will amount to 300-500 kWh, which in Germany – with rather high power costs of 20 cents per kWh – will amount to 60-100€ ($66-111). If this price assumes 100% efficiency, then training such a net with an 80% efficient power supply would increase the costs by an additional 18-26€ – ouch! This is much less for a single GPU, but the point still holds – spending a bit more money on an efficient power supply makes good sense.

Using a couple of GPUs around the clock will significantly increase your carbon footprint and it will overshadow transportation (mainly airplane) and other factors that contribute to your footprint. If you want to be responsible, please consider going carbon neutral like the NYU Machine Learning for Language Group (ML2) — it is easy to do, cheap, and should be standard for deep learning researchers.

CPU and GPU Cooling

Cooling is important, and it can be a significant bottleneck which reduces performance more than poor hardware choices do. You should be fine with a standard heat sink or an all-in-one (AIO) water cooling solution for your CPU, but for your GPU you will need to make special considerations.

Air Cooling GPUs

Air cooling is safe and solid for a single GPU or if you have multiple GPUs with space between them (2 GPUs in a 3-4 GPU case). However, one of the biggest mistakes is made when you try to cool 3-4 GPUs, and you need to think carefully about your options in this case.

Modern GPUs will increase their speed – and thus power consumption – up to their maximum when they run an algorithm, but as soon as the GPU hits a temperature barrier – often 80 °C – the GPU will decrease the speed so that the temperature threshold is not breached. This enables the best performance while keeping your GPU safe from overheating.

However, typical pre-programmed fan speed schedules are badly designed for deep learning programs, so that this temperature threshold is reached within seconds after starting a deep learning program. The result is decreased performance (0-10%), which can be significant for multiple GPUs (10-25%) where the GPUs heat each other up.

Since NVIDIA GPUs are first and foremost gaming GPUs, they are optimized for Windows. You can change the fan schedule with a few clicks in Windows, but not so in Linux, and as most deep learning libraries are written for Linux this is a problem.

The only option under Linux is to set a configuration for your Xorg server (Ubuntu) in which you set the “coolbits” option. This works very well for a single GPU, but if you have multiple GPUs where some of them are headless, i.e. they have no monitor attached to them, you have to emulate a monitor, which is hard and hacky. I tried it for a long time and had frustrating hours with a live boot CD to recover my graphics settings – I could never get it running properly on headless GPUs.

The most important point of consideration if you run 3-4 GPUs on air cooling is to pay attention to the fan design. The “blower” fan design pushes the air out of the back of the case so that fresh, cooler air is pushed into the GPU. Non-blower fans suck in air in the vicinity of the GPU and cool the GPU. However, if you have multiple GPUs next to each other, then there is no cool air around, and GPUs with non-blower fans will heat up more and more until they throttle themselves down to reach cooler temperatures. Avoid non-blower fans in 3-4 GPU setups at all costs.

Water Cooling GPUs For Multiple GPUs

Another, more costly and craftier, option is to use water cooling. I do not recommend water cooling if you have a single GPU or if you have space between your two GPUs (2 GPUs in a 3-4 GPU board). However, water cooling makes sure that even the beefiest GPUs stay cool in a 4 GPU setup, which is not possible when you cool with air. Another advantage of water cooling is that it operates much more silently, which is a big plus if you run multiple GPUs in an area where other people work. Water cooling will cost you about $100 for each GPU and some additional upfront costs (something like $50). Water cooling will also require some additional effort to assemble your computer, but there are many detailed guides on that, and it should only require a few more hours of time in total. Maintenance should not be that complicated or effortful.

A Big Case for Cooling?

I bought large towers for my deep learning cluster because they have additional fans for the GPU area, but I found this to be largely irrelevant: about a 2-5 °C decrease, not worth the investment and the bulkiness of the cases. The most important part is really the cooling solution directly on your GPU — do not select an expensive case for its GPU cooling capability. Go cheap here. The case should fit your GPUs, but that's it!

Conclusion Cooling

So in the end it is simple: for 1 GPU, air cooling is best. For multiple GPUs, you should get blower-style air cooling and accept a tiny performance penalty (10-15%), or you pay extra for water cooling, which is also more difficult to set up correctly but gives you no performance penalty. Air and water cooling are both reasonable choices in certain situations. I would, however, recommend air cooling for simplicity in general — get a blower-style GPU if you run multiple GPUs. If you want to use water cooling, try to find all-in-one (AIO) water cooling solutions for GPUs.

Motherboard

Your motherboard should have enough PCIe ports to support the number of GPUs you want to run (usually limited to four GPUs, even if you have more PCIe slots); remember that most GPUs have a width of two PCIe slots, so buy a motherboard that has enough space between PCIe slots if you intend to use multiple GPUs. Make sure your motherboard not only has the PCIe slots, but actually supports the GPU setup that you want to run. You can usually find this information if you search for your motherboard of choice on Newegg and look at the PCIe section of the specification page.

Computer Case

When you select a case, you should make sure that it supports full-length GPUs that sit on top of your motherboard. Most cases support full-length GPUs, but you should be suspicious if you buy a small case. Check its dimensions and specifications; you can also try a Google image search of that model and see if you find pictures with GPUs in them.

If you use custom water cooling, make sure your case has enough space for the radiators. This is especially true if you use water cooling for your GPUs. The radiator of each GPU will need some space — make sure your setup actually fits into the case.

Monitors

I first thought it would be silly to write about monitors too, but they make such a huge difference and are so important that I just have to write about them.

The money I spent on my three 27-inch monitors is probably the best money I have ever spent. Productivity goes up by a lot when using multiple monitors. I feel desperately crippled if I have to work with a single monitor. Do not short-change yourself on this matter. What good is a fast deep learning system if you are not able to operate it in an efficient manner?

Typical monitor layout when I do deep learning: left: papers, Google searches, Gmail, stackoverflow; middle: code; right: output windows, R, folders, system monitors, GPU monitors, to-do list, and other small applications.

Some words on building a PC

Many people are scared to build computers. The hardware components are expensive and you do not want to do something wrong. But it is really simple, as components that do not belong together do not fit together. The motherboard manual is often very specific about how to assemble everything, and there are tons of guides and step-by-step videos which guide you through the process if you have no experience.

The great thing about building a computer is that, once you have done it, you know everything there is to know about building a computer, because all computers are built in the very same way – so building a computer will become a life skill that you will be able to apply again and again. So there is no reason to hold back!

Conclusion / TL;DR

GPU: RTX 2070 or RTX 2080 Ti. GTX 1070, GTX 1080, GTX 1070 Ti, and GTX 1080 Ti from eBay are good too!
CPU: 1-2 cores per GPU, depending on how you preprocess data. > 2 GHz; the CPU should support the number of GPUs that you want to run. PCIe lanes do not matter.

RAM:
– Clock rates do not matter — buy the cheapest RAM.
– Buy at least enough CPU RAM to match the RAM of your largest GPU.
– Buy more RAM only when needed.
– More RAM can be useful if you frequently work with large datasets.

Hard drive/SSD:
– Hard drive for data (>= 3TB)
– Use SSD for comfort and preprocessing small datasets.

PSU:
– Add up the watts of GPUs + CPU. Then multiply the total by 110% for the required wattage.
– Get a high efficiency rating if you use multiple GPUs.
– Make sure the PSU has enough PCIe connectors (6+8-pin)

Cooling:
– CPU: get standard CPU cooler or all-in-one (AIO) water cooling solution
– GPU:
– Use air cooling
– Get GPUs with “blower-style” fans if you buy multiple GPUs
– Set the coolbits flag in your Xorg config to control fan speeds

Motherboard:
– Get as many PCIe slots as you need for your (future) GPUs (one GPU takes two slots; max 4 GPUs per system)

Monitors:
– An additional monitor might make you more productive than an additional GPU.

Original post: https://www.cnblogs.com/webRobot/p/10996109.html