End-to-End DL Networks Based on OpenVINO: the Image Enhancement Part of "A Year in Computer Vision"

http://www.themtank.org/a-year-in-computer-vision

A partial Chinese translation is collected at: https://blog.csdn.net/chengyq116/article/details/78660521

 

 

The M Tank compiled a report, "A Year in Computer Vision", documenting research results in computer vision from 2016 to 2017; it is a rare and detailed resource for developers and researchers. Although the report is now more than a year old, given the lag between theory and real-world deployment, much of its content reads with fresh relevance today.

In computer vision today there are two families of methods: deep learning and traditional computer vision. In object detection and recognition, deep learning has begun to show clear advantages and to replace traditional computer vision; in other areas, however, including optical flow and image enhancement, traditional computer vision methods still hold the edge.

The translation follows; the original English text is appended at the end.

 

Super-resolution, Style Transfer and Colourisation

Not all research in computer vision serves to extend the cognitive abilities of machines. Neural networks and other ML techniques often lend themselves to a variety of other novel applications, many of them closely tied to everyday life. In this respect, advances in super-resolution, style transfer and colourisation have occupied that space.

1. Super-resolution refers to the process of estimating a high-resolution image from a low-resolution counterpart, as well as the prediction of image features at different magnifications, something the human brain does almost effortlessly. Originally super-resolution was performed with simple techniques such as bicubic interpolation and nearest-neighbour upsampling. On the commercial side, the desire to overcome low-resolution constraints and to realise "CSI Miami"-style image enhancement has driven research in the field. Some of this year's advances and their potential impact follow (a minimal interpolation baseline is sketched after this list):

  • Neural Enhance is the brainchild of Alex J. Champandard and combines approaches from four different research papers to achieve its super-resolution method.

  • Real-time video super-resolution was also attempted in two notable instances in 2016.

  • RAISR: Rapid and Accurate Image Super-Resolution, from Google, avoids the costly memory and speed requirements of neural-network approaches by training filters on pairs of low-resolution and high-resolution images. As a learning-based framework, RAISR is two orders of magnitude faster than competing algorithms and has minimal memory requirements compared with neural-network-based methods, so super-resolution can be extended to personal devices.
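Before any of the learned methods, the baselines mentioned above are plain interpolation. Below is a minimal sketch of those baselines, assuming OpenCV is installed and that a local low-resolution file named input.jpg exists (both the library choice and the file name are assumptions for illustration, not part of the original report):

```python
import cv2

# Hypothetical low-resolution input image.
lr = cv2.imread("input.jpg")
scale = 4  # 4x upscaling, the same factor discussed for SRGAN below

h, w = lr.shape[:2]
new_size = (w * scale, h * scale)  # cv2.resize expects (width, height)

# Classical baselines: nearest-neighbour and bicubic interpolation.
# They only resample existing pixels and cannot invent fine texture,
# which is exactly the gap the learned methods try to close.
nearest = cv2.resize(lr, new_size, interpolation=cv2.INTER_NEAREST)
bicubic = cv2.resize(lr, new_size, interpolation=cv2.INTER_CUBIC)

cv2.imwrite("nearest_x4.jpg", nearest)
cv2.imwrite("bicubic_x4.jpg", bicubic)
```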

The use of generative adversarial networks (GANs) represents the current SOTA for super-resolution:

  • SRGAN provides photo-realistic textures from heavily downsampled images on public benchmarks, using a discriminator network trained to distinguish super-resolved images from original photo-realistic images.

Although SRResNet performs best on the peak signal-to-noise ratio (PSNR) metric, SRGAN recovers finer texture details and achieves the best Mean Opinion Score (MOS); in other words, SRGAN wins in human subjective evaluation.
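PSNR, the metric on which SRResNet wins, is a purely pixel-wise measure. A small sketch of how it is typically computed (NumPy assumed; inputs are two 8-bit images of the same shape):

```python
import numpy as np

def psnr(reference: np.ndarray, estimate: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two images of identical shape."""
    mse = np.mean((reference.astype(np.float64) - estimate.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Because PSNR only rewards low pixel-wise error, a slightly blurry "average" reconstruction can outscore a sharper, more realistic one, which is why SRGAN's perceptual advantage shows up in MOS rather than in PSNR.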

"To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4× upscaling factors." All previous approaches fail to recover the finer texture details at large upscaling factors. (A simplified sketch of the generator objective behind this kind of result is given after the next item.)

  • Amortised MAP Inference for Image Super-resolution proposes a method for computing maximum a posteriori (MAP) inference with a convolutional neural network. Their work presents three optimisation approaches, of which the GAN-based one currently performs markedly better on real image data.
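To illustrate why GAN-based super-resolution looks more natural, here is a rough sketch of the kind of generator objective SRGAN-style methods optimise: a content term (plain MSE here for brevity; the SRGAN paper prefers a VGG feature loss) plus a small adversarial term. This is a simplified PyTorch sketch under those assumptions, not the authors' actual implementation:

```python
import torch
import torch.nn.functional as F

def generator_loss(sr, hr, disc_logits_on_sr, adv_weight=1e-3):
    """Simplified SRGAN-style generator objective.

    sr:                 generator output (super-resolved batch)
    hr:                 ground-truth high-resolution batch
    disc_logits_on_sr:  discriminator logits for the generated batch
    """
    # Content term: pixel-wise MSE (the paper actually prefers a VGG
    # feature loss, which correlates better with perceived quality).
    content = F.mse_loss(sr, hr)
    # Adversarial term: encourage outputs the discriminator labels as real.
    adversarial = F.binary_cross_entropy_with_logits(
        disc_logits_on_sr, torch.ones_like(disc_logits_on_sr))
    return content + adv_weight * adversarial
```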

2. Style Transfer epitomises a novel use of neural networks that has spread into the public domain, in particular through last year's Facebook integration and companies such as Prisma (https://prisma-ai.com/) and Artomatix (https://services.artomatix.com/). Style transfer is an older technique, but it was recast with neural networks in 2015 with the publication of A Neural Algorithm of Artistic Style. Since then, the concept has been extended by Nikulin and Novak and also applied to video, following the common progression within computer vision.

Figure: examples of style transfer

Style transfer as a topic is fairly intuitive once visualised: take an image and render it with the stylistic features of a different image, for example in the style of a famous painting or artist. This year Facebook released Caffe2Go, its deep learning system that integrates into mobile devices. Google also released some interesting work that sought to blend multiple styles to generate entirely unique image styles.
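The core of the 2015 neural algorithm mentioned above is matching feature statistics: content is compared directly between CNN feature maps, while style is compared through Gram matrices of feature maps. A minimal PyTorch sketch of these two loss terms (the feature maps themselves would come from a pretrained network such as VGG, which is assumed and not shown):

```python
import torch.nn.functional as F

def gram_matrix(features):
    """Channel-to-channel correlations of a feature map of shape (B, C, H, W)."""
    b, c, h, w = features.shape
    f = features.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def content_loss(gen_feat, content_feat):
    # Match the generated image's features to the content image's features.
    return F.mse_loss(gen_feat, content_feat)

def style_loss(gen_feat, style_feat):
    # Match feature statistics (Gram matrices) rather than exact features;
    # this captures texture and colour style while discarding spatial layout.
    return F.mse_loss(gram_matrix(gen_feat), gram_matrix(style_feat))
```

The stylised image is then obtained by optimising the pixels (or a feed-forward network) to minimise a weighted sum of the two terms.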

Besides mobile integration, style transfer can also be used to create game assets. Members of our team recently saw a presentation by Eric Risser, founder and CTO of Artomatix, who discussed the technique's novel application to in-game content generation (texture mutation, etc.), which dramatically reduces the work of a conventional texture artist. [Translator's note: this clearly has great potential in animation and games.]

3. Colourisation is the process of turning a monochrome image into a new full-colour version. Originally this was done manually by people who painstakingly chose the colour of specific pixels in each image. In 2016 it became possible to automate this process while maintaining the realistic appearance that characterises human-centred colourisation. Although humans may not reproduce the true colours of a given scene accurately, their real-world knowledge lets them apply colours in a way that is consistent both with the image and with how another person viewing that image would perceive it.

The colourisation process is interesting because the network assigns the most likely colouring to an image based on its understanding of object location, texture and environment; for example, it learns that skin is pinkish and the sky is bluish.
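A common way to set this up, and a reasonable mental model for the CNN-based approaches discussed here, is to work in Lab colour space: the network receives the lightness channel L and predicts the two chrominance channels ab. Below is a minimal PyTorch sketch of such a fully convolutional predictor; it is an illustrative toy model, not any of the cited architectures:

```python
import torch.nn as nn

class TinyColorizer(nn.Module):
    """Toy fully convolutional network: L channel in, ab channels out."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 2, kernel_size=3, padding=1), nn.Tanh(),  # ab scaled to [-1, 1]
        )

    def forward(self, lightness):
        # lightness: tensor of shape (B, 1, H, W). Because the model is fully
        # convolutional, it accepts any input resolution, which is the
        # property highlighted in the quote below.
        return self.net(lightness)
```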

"Moreover, our architecture can process images of any resolution, unlike most existing CNN-based approaches."

In a test of how natural their colourisation looked, users were shown a random image produced by their model and asked, "Does this image look natural to you?"

Their approach was judged natural 92.6% of the time, the baseline roughly 70%, and the ground truth (the actual colour photos) 97.7%.

Why am I so interested in these three topics, and why study them separately? There are several reasons:

1. Unlike classification or segmentation, super-resolution, style transfer and colourisation have gained new vitality, even a rebirth, from the development of DL technology;
2. All three are enhancement algorithms: the output is an enhanced image rather than quantitative data;
3. All three are very easy to put into practice;
4. The models are relatively simple, and in particular the computational load is small, which makes them convenient to study.
 
Behind all three stands the generative adversarial network (GAN), a field that deserves separate study in its own right.
Appendix: original text

Super-resolution, Style Transfer & Colourisation

Not all research in Computer Vision serves to extend the pseudo-cognitive abilities of machines, and often the fabled malleability of neural networks, as well as other ML techniques, lend themselves to a variety of other novel applications that spill into the public space. Last year’s advancements in Super-resolution, Style Transfer & Colourisation occupied that space for us.

 

Super-resolution refers to the process of estimating a high resolution image from a low resolution counterpart, and also the prediction of image features at different magnifications, something which the human brain can do almost effortlessly. Originally super-resolution was performed by simple techniques like bicubic-interpolation and nearest neighbours. In terms of commercial applications, the desire to overcome low-resolution constraints stemming from source quality and realisation of ‘CSI Miami’ style image enhancement has driven research in the field. Here are some of the year’s advances and their potential impact:

 

  • Neural Enhance[65] is the brainchild of Alex J. Champandard and combines approaches from four different research papers to achieve its Super-resolution method.
  • Real-Time Video Super Resolution was also attempted in 2016 in two notable instances.[66],[67]
  • RAISR: Rapid and Accurate Image Super-Resolution[68] from Google avoids the costly memory and speed requirements of neural network approaches by training filters with low-resolution and high-resolution image pairs. RAISR, as a learning-based framework, is two orders of magnitude faster than competing algorithms and has minimal memory requirements when compared with neural network-based approaches. Hence super-resolution is extendable to personal devices. There is a research blog available here.[69] 

 

Figure 7: Super-resolution SRGAN example

Note: From left to right: bicubic interpolation (the objective worst performer for focus), Deep residual network optimised for MSE, deep residual generative adversarial network optimized for a loss more sensitive to human perception, original High Resolution (HR) image. Corresponding peak signal to noise ratio (PSNR) and structural similarity (SSIM) are shown in two brackets. [4 x upscaling] The reader may wish to zoom in on the middle two images (SRResNet and SRGAN) to see the difference between image smoothness vs more realistic fine details.

Source: Ledig et al. (2017)[70]

 

The use of Generative Adversarial Networks (GANs) represent current SOTA for Super-resolution:

  • SRGAN[71] provides photo-realistic textures from heavily downsampled images on public benchmarks, using a discriminator network trained to differentiate between super-resolved and original photo-realistic images.

 

Qualitatively SRGAN performs the best: although SRResNet performs best on the peak-signal-to-noise-ratio (PSNR) metric, SRGAN gets the finer texture details and achieves the best Mean Opinion Score (MOS). “To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4× upscaling factors.”[72] All previous approaches fail to recover the finer texture details at large upscaling factors.

 

  • Amortised MAP Inference for Image Super-resolution[73] proposes a method for calculation of Maximum a Posteriori (MAP) inference using a Convolutional Neural Network. Their research presents three approaches for optimisation, of which the GAN-based approach performs markedly better on real image data at present. 

Figure 8: Style Transfer from Nikulin & Novak


Note: Transferring different styles to a photo of a cat (original top left).
Source: Nikulin & Novak (2016)

Undoubtedly, Style Transfer epitomises a novel use of neural networks that has ebbed into the public domain, specifically through last year’s Facebook integrations and companies like Prisma[74] and Artomatix[75]. Style transfer is an older technique but was converted to neural networks in 2015 with the publication of a Neural Algorithm of Artistic Style.[76] Since then, the concept of style transfer was expanded upon by Nikulin and Novak[77] and also applied to video,[78] as is the common progression within Computer Vision.

Figure 9: Further examples of Style Transfer


Note: The top row (left to right) represents the artistic style which is transposed onto the original images displayed in the first column (Woman, Golden Gate Bridge and Meadow Environment). Using conditional instance normalisation a single style transfer network can capture 32 styles simultaneously, five of which are displayed here. The full suite of images is available in the source paper’s appendix. This work will feature in the International Conference on Learning Representations (ICLR) 2017.
Source: Dumoulin et al. (2017, p. 2)[79]

Style transfer as a topic is fairly intuitive once visualised; take an image and imagine it with the stylistic features of a different image. For example, in the style of a famous painting or artist. This year Facebook released Caffe2Go,[80] their deep learning system which integrates into mobile devices. Google also released some interesting work which sought to blend multiple styles to generate entirely unique image styles: Research blog[81] and full paper.[82] 

Besides mobile integrations, style transfer has applications in the creation of game assets. Members of our team recently saw a presentation by the Founder and CTO of Artomatix, Eric Risser, who discussed the technique’s novel application for content generation in games (texture mutation, etc.) and, therefore, dramatically minimises the work of a conventional texture artist.

Colourisation is the process of changing monochrome images to new full-colour versions. Originally this was done manually by people who painstakingly selected colours to represent specific pixels in each image. In 2016, it became possible to automate this process while maintaining the appearance of realism indicative of the human-centric colourisation process. While humans may not accurately represent the true colours of a given scene, their real world knowledge allows the application of colours in a way which is consistent with the image and another person viewing said image.

The process of colourisation is interesting in that the network assigns the most likely colouring for images based on its understanding of object location, textures and environment, e.g. it learns that skin is pinkish and the sky is blueish.

Three of the most influential works of the year are as follows:

  • Zhang et al.[83] produced a method that was able to successfully fool humans on 32% of their trials. Their methodology is comparable to a “colourisation Turing test.”

 

  • Larsson et al.[84] fully automate their image colourisation system using Deep Learning for Histogram estimation.

 

  • Finally, Iizuka, Simo-Serra and Ishikawa[85] demonstrate a colourisation model also based upon CNNs. The work outperformed the existing SOTA; we [the team] feel this work is also qualitatively the best, appearing to be the most realistic. Figure 10 provides comparisons; the image is taken from Iizuka et al.

Figure 10: Comparison of Colourisation Research

Note: From top to bottom - column one contains the original monochrome image input which is subsequently colourised through various techniques. The remaining columns display the results generated by other prominent colourisation research in 2016. Viewed from left to right, these are Larsson et al. (2016)[84] (column two), Zhang et al. (2016)[83] (column three), and Iizuka, Simo-Serra and Ishikawa (2016)[85], also referred to as “ours” by the authors (column four). The quality difference in colourisation is most evident in row three (from the top), which depicts a group of young boys. We believe Iizuka et al.’s work to be qualitatively superior (column four).

Source: Iizuka et al. (2016)[86]


“Furthermore, our architecture can process images of any resolution, unlike most existing approaches based on CNN.”

In a test to see how natural their colourisation was, users were given a random image from their models and were asked, "does this image look natural to you?"

Their approach achieved 92.6%, the baseline achieved roughly 70% and the ground truth (the actual colour photos) were considered 97.7% of the time to be natural.





Original post: https://www.cnblogs.com/jsxyhelu/p/11100618.html