【Paper】Image Inpainting Must-Read Papers (Personal Notes)

Image Inpainting Must-Read Papers

The goal is to build a mental tree of Image Inpainting papers, know the insight of each paper, and understand as much as possible.

2016

The pioneering work: Context Encoders: Feature Learning by Inpainting

  • link

  • CVPR 2016. Authors: Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, Alexei A. Efros

  • We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. By analogy with auto-encoders, we propose Context Encoders -- a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. In order to succeed at this task, context encoders need to both understand the content of the entire image, as well as produce a plausible hypothesis for the missing part(s). When training context encoders, we have experimented with both a standard pixel-wise reconstruction loss, as well as a reconstruction plus an adversarial loss. The latter produces much sharper results because it can better handle multiple modes in the output. We found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures. We quantitatively demonstrate the effectiveness of our learned features for CNN pre-training on classification, detection, and segmentation tasks. Furthermore, context encoders can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.

  • Image inpainting is achieved with an encoder-decoder network plus a generative adversarial network. Both the encoder and the decoder are fully convolutional; the last encoder layer passes information on through a channel-wise fully connected layer, and five transposed-convolution (deconvolution) layers then generate the content of the missing region. The network structure is as follows:

  • Analysis: the encoder's input is an image with a mask, and the decoder outputs an inpainted image. This output image is then trained with the discriminator idea from generative adversarial networks. L2 loss and adversarial loss are used as the loss functions: the L2 loss drives the recovery of the content inside the mask, while the adversarial loss makes the recovered content look more realistic. [In the end this is still a GAN; the generator is simply replaced by an encoder plus decoder (a context encoder-decoder). The input is the original image at imageSize, and the output is the inpainted region at its imageSize.]

  • Insight

    1. Between the encoder features and the decoder features there is a channel-wise fully connected layer, which lets each unit in the decoder reason about the entire image content. Bridging the encoder and decoder with an ordinary fully connected layer would cause a parameter explosion.
    2. From the paper: "rather, the network is trained for context prediction 'from scratch' with randomly initialized weights." The encoder structure is derived from the AlexNet architecture (up to the 512x4x4 layer), but it is not used for classification; instead it is trained with randomly initialized weights to predict the contextual content. [My understanding: the point is to give this middle layer better contextual information. Without the encoder in front, we would just have a decoder running on a random vector; adding the encoder supplies the context.]

    3. Loss functions (the three terms are written out right after these notes)

      L2 Loss:

      x is the original input image (without any processing);

      the mask matrix has the same size as x and contains only 0s and 1s, where 1 marks a masked pixel and 0 an unmasked one;

      the whole F(...) term represents the prediction of the masked region of x, e.g. the 3x64x64 center crop shown in the paper's figure;

      M ⊙ x - F(...) is the difference between the original masked region of x and the generated region;

      squaring this difference gives the L2 loss.

      Adversarial Loss:

      the adversarial loss is simply the standard GAN loss, an old friend by now.

      Joint loss:

      the final joint loss just assigns two different weight coefficients to the L2 loss and the adversarial loss. [In the code the two coefficients sum to 1.]

      [Judging from the code, training the discriminator optimizes only the adversarial loss, while training the generator, i.e. the context encoder-decoder, optimizes the joint loss.]
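
      In symbols (a reconstruction written for this note, since the screenshots of the original formulas are not included; following the paper's notation, M̂ is the binary mask with 1 marking dropped pixels, F is the context encoder-decoder and D the discriminator), the description above corresponds to:

      \mathcal{L}_{rec}(x) = \left\lVert \hat{M} \odot \big( x - F\big((1-\hat{M}) \odot x\big) \big) \right\rVert_2^2

      \mathcal{L}_{adv} = \max_{D} \; \mathbb{E}_{x \in \mathcal{X}} \Big[ \log D(x) + \log\big(1 - D\big(F((1-\hat{M}) \odot x)\big)\big) \Big]

      \mathcal{L} = \lambda_{rec}\,\mathcal{L}_{rec} + \lambda_{adv}\,\mathcal{L}_{adv}

      with λ_rec + λ_adv = 1 in the training code shown below (wtl2 and 1 - wtl2).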

  • The results look as follows:

  • Cropping the image center (a rectangular region)

# imageSize = 128 (the shapes in the comments below, 3*128*128 and 3*64*64, assume 128)
# overlapPred = 4
# overlapPred: width in pixels of the overlapping edge between the predicted region
# and the surrounding context; the filled area is inset by overlapPred on each side,
# and this border later gets a higher weight in the L2 loss
w = int(imageSize/4+overlapPred)
w_ = int(imageSize/4+imageSize/2-overlapPred)

# input_cropped is the generator's input, with img_size 3*128*128;
# the center region is overwritten with a constant mean pixel value, normalized to [-1, 1]
input_cropped.data[:,0,w:w_,w:w_] = 2*117.0/255.0 - 1.0
input_cropped.data[:,1,w:w_,w:w_] = 2*104.0/255.0 - 1.0
input_cropped.data[:,2,w:w_,w:w_] = 2*123.0/255.0 - 1.0


crop_size = int(imageSize/4)
crop_size_w = int(imageSize/4)+int(imageSize/2)
# real_center_cpu is the discriminator's (real) input, with img_size 3*64*64
real_center_cpu = real_cpu[:,:,crop_size:crop_size_w,crop_size:crop_size_w]
  • Model architecture
# nc = 3
# nef = 64: number of encoder filters in the first conv layer
# ngf = 64: number of generator (decoder) filters
# nBottleneck = 4000: dimensionality of the encoder bottleneck
import torch
import torch.nn as nn

class _netG(nn.Module):
    def __init__(self, opt):
        super(_netG, self).__init__()
        self.ngpu = opt.ngpu
        # hyperparameters are read from the opt namespace used by the training script
        nc, nef, ngf, nBottleneck = opt.nc, opt.nef, opt.ngf, opt.nBottleneck
        self.main = nn.Sequential(
            # input is (nc) x 128 x 128
            nn.Conv2d(nc,nef,4,2,1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # state size: (nef) x 64 x 64
            nn.Conv2d(nef,nef,4,2,1, bias=False),
            nn.BatchNorm2d(nef),
            nn.LeakyReLU(0.2, inplace=True),
            # state size: (nef) x 32 x 32
            nn.Conv2d(nef,nef*2,4,2,1, bias=False),
            nn.BatchNorm2d(nef*2),
            nn.LeakyReLU(0.2, inplace=True),
            # state size: (nef*2) x 16 x 16
            nn.Conv2d(nef*2,nef*4,4,2,1, bias=False),
            nn.BatchNorm2d(nef*4),
            nn.LeakyReLU(0.2, inplace=True),
            # state size: (nef*4) x 8 x 8
            nn.Conv2d(nef*4,nef*8,4,2,1, bias=False),
            nn.BatchNorm2d(nef*8),
            nn.LeakyReLU(0.2, inplace=True),
            # state size: (nef*8) x 4 x 4
            nn.Conv2d(nef*8,nBottleneck,4, bias=False),
            # state size: (nBottleneck) x 1 x 1
            nn.BatchNorm2d(nBottleneck),
            nn.LeakyReLU(0.2, inplace=True),
            # input is Bottleneck, going into a convolution
            nn.ConvTranspose2d(nBottleneck, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),
            # state size. (ngf*8) x 4 x 4
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            # state size. (ngf*4) x 8 x 8
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            # state size. (ngf*2) x 16 x 16
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),
            # state size. (ngf) x 32 x 32
            nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
            nn.Tanh()
            # state size. (nc) x 64 x 64
        )

    def forward(self, input):
        # forward must be a method of the module, not nested inside __init__
        if isinstance(input.data, torch.cuda.FloatTensor) and self.ngpu > 1:
            output = nn.parallel.data_parallel(self.main, input, range(self.ngpu))
        else:
            output = self.main(input)
        return output
  
class _netlocalD(nn.Module):
    def __init__(self, opt):
        super(_netlocalD, self).__init__()
        self.ngpu = opt.ngpu
        # ndf: number of discriminator filters in the first conv layer
        nc, ndf = opt.nc, opt.ndf
        self.main = nn.Sequential(
            # input is (nc) x 64 x 64
            nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf) x 32 x 32
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*2) x 16 x 16
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*4) x 8 x 8
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 8),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*8) x 4 x 4
            nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
        )

    def forward(self, input):
        if isinstance(input.data, torch.cuda.FloatTensor) and self.ngpu > 1:
            output = nn.parallel.data_parallel(self.main, input, range(self.ngpu))
        else:
            output = self.main(input)

        return output.view(-1, 1)
  • Loss functions
# wtl2: 0 means do not use the L2 term, otherwise use it with this weight
criterion = nn.BCELoss() # binary cross-entropy loss
#------- discriminator D loss -------
# (in the training script, output = netD(real_center) with label filled with 1)
errD_real = criterion(output, label)
errD_real.backward()

# (here output = netD(fake.detach()) with label filled with 0)
errD_fake = criterion(output, label)
errD_fake.backward()
errD = errD_real + errD_fake

#------- generator G loss -------
# (here output = netD(fake) with label filled with 1: fool the discriminator)
errG_D = criterion(output, label)
# per-pixel weights for the L2 term: the overlapping border of width overlapPred
# gets wtl2*overlapL2Weight, the interior only wtl2
wtl2Matrix = real_center_cpu.clone()
wtl2Matrix.data.fill_(wtl2*overlapL2Weight)
x = int(opt.imageSize/2 - opt.overlapPred)
wtl2Matrix.data[:,:,int(opt.overlapPred):x,int(opt.overlapPred):x] = wtl2

# errG_l2 = criterionMSE(fake, real_center) would also work as the L2 term
errG_l2 = (fake-real_center).pow(2)
errG_l2 = errG_l2 * wtl2Matrix
errG_l2 = errG_l2.mean()

errG = (1-wtl2) * errG_D + wtl2 * errG_l2
errG.backward()
  • A few points I do not fully understand yet:

    1. Grouped convolution? What does overlapPred mean? (see the note and sketch after this list)
    2. The parameters of nn.Conv2d(); the purpose and parameters of nn.BatchNorm2d() and nn.ConvTranspose2d()
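
  • A note on question 1, plus an illustrative sketch. Judging from the cropping and loss code above, overlapPred is the width in pixels of the border where the predicted region overlaps the known context: the gray fill is inset by overlapPred on each side, and that border receives the higher weight wtl2*overlapL2Weight in the L2 term. As for grouped convolution, the channel-wise fully connected layer from the Insight notes can be written as a grouped convolution with groups equal to the number of channels and a kernel covering the whole 4x4 feature map (the paper follows it with a stride-1 convolution to mix information across channels). The sketch below is written for this note and is not the authors' code. On question 2: Conv2d with kernel 4, stride 2, padding 1 maps H to (H + 2 - 4)/2 + 1 = H/2, and the matching ConvTranspose2d doubles it, which is exactly what the "state size" comments in the model trace.

import torch
import torch.nn as nn

class ChannelWiseFC(nn.Module):
    # Illustrative sketch only: a channel-wise fully connected layer maps each
    # channel's h*w activations to h*w outputs of the same channel, with no
    # cross-channel weights. For the 512 x 4 x 4 encoder output this costs
    # 512 * (16*16) parameters instead of (512*16)^2 for a plain FC layer.
    def __init__(self, channels, h, w):
        super(ChannelWiseFC, self).__init__()
        self.channels, self.h, self.w = channels, h, w
        # grouped convolution with groups == channels and a full-size kernel
        self.fc = nn.Conv2d(channels, channels * h * w, kernel_size=(h, w),
                            groups=channels, bias=True)

    def forward(self, x):
        out = self.fc(x)                              # (N, channels*h*w, 1, 1)
        return out.view(-1, self.channels, self.h, self.w)

feat = torch.randn(2, 512, 4, 4)                      # encoder output size in the paper
print(ChannelWiseFC(512, 4, 4)(feat).shape)           # torch.Size([2, 512, 4, 4])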

2017

High-Resolution Image Inpainting using Multi-Scale Neural Patch Synthesis

  • link
  • CVPR 2017. Authors: Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, Hao Li
  • Recent advances in deep learning have shown exciting promise in filling large holes in natural images with semantically plausible and context aware details, impacting fundamental image manipulation tasks such as object removal. While these learning-based methods are significantly more effective in capturing high-level features than prior techniques, they can only handle very low-resolution inputs due to memory limitations and difficulty in training. Even for slightly larger images, the inpainted regions would appear blurry and unpleasant boundaries become visible. We propose a multi-scale neural patch synthesis approach based on joint optimization of image content and texture constraints, which not only preserves contextual structures but also produces high-frequency details by matching and adapting patches with the most similar mid-layer feature correlations of a deep classification network. We evaluate our method on the ImageNet and Paris Streetview datasets and achieved state-of-the-art inpainting accuracy. We show our approach produces sharper and more coherent results than prior methods, especially for high-resolution images.

Generative Image Inpainting with Contextual Attention

  • CVPR 2018. Authors: Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, Thomas Huang

Globally and Locally Consistent Image Completion

  • ACM Transactions on Graphics, 2017. Authors (Waseda University): Satoshi Iizuka, Edgar Simo-Serra, Hiroshi Ishikawa
  • We present a novel approach for image completion that results in images that are both locally and globally consistent. With a fully-convolutional neural network, we can complete images of arbitrary resolutions by filling in missing regions of any shape. To train this image completion network to be consistent, we use global and local context discriminators that are trained to distinguish real images from completed ones. The global discriminator looks at the entire image to assess if it is coherent as a whole, while the local discriminator looks only at a small area centered at the completed region to ensure the local consistency of the generated patches. The image completion network is then trained to fool the both context discriminator networks, which requires it to generate images that are indistinguishable from real ones with regard to overall consistency as well as in details. We show that our approach can be used to complete a wide variety of scenes. Furthermore, in contrast with the patch-based approaches such as PatchMatch, our approach can generate fragments that do not appear elsewhere in the image, which allows us to naturally complete the images of objects with familiar and highly specific structures, such as faces.
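
  • A minimal, illustrative sketch of the global + local discriminator idea (layer widths and the fusion head below are placeholders chosen for this note, not the paper's exact architecture): the global branch scores the whole completed image, the local branch scores a patch centered on the filled region, and the two feature vectors are fused into a single real/fake decision.

import torch
import torch.nn as nn

def conv_branch():
    # small convolutional feature extractor; both branches share this structure
    layers, ch = [], 3
    for out_ch in (64, 128, 256, 512):
        layers += [nn.Conv2d(ch, out_ch, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True)]
        ch = out_ch
    return nn.Sequential(*layers, nn.AdaptiveAvgPool2d(1), nn.Flatten())

class GlobalLocalDiscriminator(nn.Module):
    def __init__(self):
        super(GlobalLocalDiscriminator, self).__init__()
        self.global_d = conv_branch()      # looks at the entire completed image
        self.local_d = conv_branch()       # looks at a patch around the completed region
        self.fc = nn.Linear(512 + 512, 1)  # fused real/fake score

    def forward(self, full_image, local_patch):
        g = self.global_d(full_image)
        l = self.local_d(local_patch)
        return self.fc(torch.cat([g, l], dim=1))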

2018

Contextual-based Image Inpainting: Infer, Match, and Translate

  • link
  • ECCV 2018. Authors: Yuhang Song, Chao Yang, Zhe Lin, Xiaofeng Liu, Qin Huang, Hao Li
  • We study the task of image inpainting, which is to fill in the missing region of an incomplete image with plausible contents. To this end, we propose a learning-based approach to generate visually coherent completion given a high-resolution image with missing components. In order to overcome the difficulty to directly learn the distribution of high-dimensional image data, we divide the task into inference and translation as two separate steps and model each step with a deep neural network. We also use simple heuristics to guide the propagation of local textures from the boundary to the hole. We show that, by using such techniques, inpainting reduces to the problem of learning two image-feature translation functions in much smaller space and hence easier to train. We evaluate our method on several public datasets and show that we generate results of better visual quality than previous state-of-the-art methods.

Image Inpainting for Irregular Holes Using Partial Convolutions

  • link
  • ECCV 2018. Authors: Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao
  • Existing deep learning based image inpainting methods use a standard convolutional network over the corrupted image, using convolutional filter responses conditioned on both valid pixels as well as the substitute values in the masked holes (typically the mean value). This often leads to artifacts such as color discrepancy and blurriness. Post-processing is usually used to reduce such artifacts, but are expensive and may fail. We propose the use of partial convolutions, where the convolution is masked and renormalized to be conditioned on only valid pixels. We further include a mechanism to automatically generate an updated mask for the next layer as part of the forward pass. Our model outperforms other methods for irregular masks. We show qualitative and quantitative comparisons with other methods to validate our approach.
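
  • A minimal, illustrative sketch of the partial-convolution idea from the abstract (masked and renormalized convolution plus automatic mask update). This is a simplified, single-channel-mask version written for this note, not the official implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Conv2d):
    # Sketch: convolve only over valid pixels, renormalize by the number of
    # valid pixels in each window, and pass an updated mask to the next layer.
    def __init__(self, *args, **kwargs):
        super(PartialConv2d, self).__init__(*args, **kwargs)
        # fixed all-ones kernel used to count valid pixels per sliding window
        self.register_buffer(
            "mask_kernel", torch.ones(1, 1, self.kernel_size[0], self.kernel_size[1]))

    def forward(self, x, mask):
        # x: (N, C, H, W) features; mask: (N, 1, H, W), 1 = valid, 0 = hole
        with torch.no_grad():
            valid = F.conv2d(mask, self.mask_kernel,
                             stride=self.stride, padding=self.padding)
            window = self.kernel_size[0] * self.kernel_size[1]
            scale = window / valid.clamp(min=1e-8)   # sum(1) / sum(M) in the paper
            new_mask = (valid > 0).float()           # window saw at least one valid pixel
        out = super(PartialConv2d, self).forward(x * mask)   # zero out hole pixels first
        if self.bias is not None:
            b = self.bias.view(1, -1, 1, 1)
            out = (out - b) * scale * new_mask + b   # renormalize, keep bias unscaled
        else:
            out = out * scale * new_mask
        return out, new_mask

Usage would look like out, mask = PartialConv2d(3, 64, 4, 2, 1)(image, mask); stacking such layers shrinks the hole in the mask layer by layer as valid information propagates inward.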

2019

SinGAN: Learning a Generative Model from a Single Natural Image

  • link

  • ICCV 2019. Authors: Tamar Rott Shaham, Tali Dekel, Tomer Michaeli

  • We introduce SinGAN, an unconditional generative model that can be learned from a single natural image. Our model is trained to capture the internal distribution of patches within the image, and is then able to generate high quality, diverse samples that carry the same visual content as the image. SinGAN contains a pyramid of fully convolutional GANs, each responsible for learning the patch distribution at a different scale of the image. This allows generating new samples of arbitrary size and aspect ratio, that have significant variability, yet maintain both the global structure and the fine textures of the training image. In contrast to previous single image GAN schemes, our approach is not limited to texture images, and is not conditional (i.e. it generates samples from noise). User studies confirm that the generated samples are commonly confused to be real images. We illustrate the utility of SinGAN in a wide range of image manipulation tasks.

Free-Form Image Inpainting with Gated Convolution

  • link

  • ICCV 2019. Authors: Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, Thomas Huang

  • We present a generative image inpainting system to complete images with free-form mask and guidance. The system is based on gated convolutions learned from millions of images without additional labelling efforts. The proposed gated convolution solves the issue of vanilla convolution that treats all input pixels as valid ones, generalizes partial convolution by providing a learnable dynamic feature selection mechanism for each channel at each spatial location across all layers. Moreover, as free-form masks may appear anywhere in images with any shape, global and local GANs designed for a single rectangular mask are not applicable. Thus, we also present a patch-based GAN loss, named SN-PatchGAN, by applying spectral-normalized discriminator on dense image patches. SN-PatchGAN is simple in formulation, fast and stable in training. Results on automatic image inpainting and user-guided extension demonstrate that our system generates higher-quality and more flexible results than previous methods. Our system helps user quickly remove distracting objects, modify image layouts, clear watermarks and edit faces. Code, demo and models are available at: https://github.com/JiahuiYu/generative_inpainting
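
  • A minimal, illustrative sketch of the gated convolution described in the abstract: two parallel convolutions over the same input, one producing features and one producing a soft sigmoid gate per channel and per spatial location. The layer below is written for this note, not the authors' released code:

import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    # the sigmoid gate acts as a learnable, dynamic feature-selection mechanism
    # for each channel at each spatial location
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super(GatedConv2d, self).__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.act = nn.ELU()

    def forward(self, x):
        return self.act(self.feature(x)) * torch.sigmoid(self.gate(x))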

Coherent Semantic Attention for Image Inpainting

  • link
  • ICCV 2019. Authors: Hongyu Liu, Bin Jiang, Yi Xiao, Chao Yang
  • The latest deep learning-based approaches have shown promising results for the challenging task of inpainting missing regions of an image. However, the existing methods often generate contents with blurry textures and distorted structures due to the discontinuity of the local pixels. From a semantic-level perspective, the local pixel discontinuity is mainly because these methods ignore the semantic relevance and feature continuity of hole regions. To handle this problem, we investigate the human behavior in repairing pictures and propose a fined deep generative model-based approach with a novel coherent semantic attention (CSA) layer, which can not only preserve contextual structure but also make more effective predictions of missing parts by modeling the semantic relevance between the holes features. The task is divided into rough, refinement as two steps and model each step with a neural network under the U-Net architecture, where the CSA layer is embedded into the encoder of refinement step. To stabilize the network training process and promote the CSA layer to learn more effective parameters, we propose a consistency loss to enforce the both the CSA layer and the corresponding layer of the CSA in decoder to be close to the VGG feature layer of a ground truth image simultaneously. The experiments on CelebA, Places2, and Paris StreetView datasets have validated the effectiveness of our proposed methods in image inpainting tasks and can obtain images with a higher quality as compared with the existing state-of-the-art approaches.

EdgeConnect: Generative Image Inpainting with Adversarial Edge Learning

  • link
  • 2019. Authors: Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z. Qureshi, Mehran Ebrahimi
  • Over the last few years, deep learning techniques have yielded significant improvements in image inpainting. However, many of these techniques fail to reconstruct reasonable structures as they are commonly over-smoothed and/or blurry. This paper develops a new approach for image inpainting that does a better job of reproducing filled regions exhibiting fine details. We propose a two-stage adversarial model EdgeConnect that comprises of an edge generator followed by an image completion network. The edge generator hallucinates edges of the missing region (both regular and irregular) of the image, and the image completion network fills in the missing regions using hallucinated edges as a priori. We evaluate our model end-to-end over the publicly available datasets CelebA, Places2, and Paris StreetView, and show that it outperforms current state-of-the-art techniques quantitatively and qualitatively. Code and models available at: https://github.com/knazeri/edge-connect

2020

Rethinking Image Inpainting via a Mutual Encoder-Decoder with Feature Equalizations

  • link
  • ECCV 2020. Authors: Hongyu Liu, Bin Jiang, Yibing Song, Wei Huang, Chao Yang
  • paper link, code
  • Deep encoder-decoder based CNNs have advanced image inpainting methods for hole filling. While existing methods recover structures and textures step-by-step in the hole regions, they typically use two encoder-decoders for separate recovery. The CNN features of each encoder are learned to capture either missing structures or textures without considering them as a whole. The insufficient utilization of these encoder features limit the performance of recovering both structures and textures. In this paper, we propose a mutual encoder-decoder CNN for joint recovery of both. We use CNN features from the deep and shallow layers of the encoder to represent structures and textures of an input image, respectively. The deep layer features are sent to a structure branch and the shallow layer features are sent to a texture branch. In each branch, we fill holes in multiple scales of the CNN features. The filled CNN features from both branches are concatenated and then equalized. During feature equalization, we reweigh channel attentions first and propose a bilateral propagation activation function to enable spatial equalization. To this end, the filled CNN features of structure and texture mutually benefit each other to represent image content at all feature levels. We use the equalized feature to supplement decoder features for output image generation through skip connections. Experiments on the benchmark datasets show the proposed method is effective to recover structures and textures and performs favorably against state-of-the-art approaches.

Contextual Residual Aggregation for Ultra High-Resolution Image Inpainting

  • link
  • CVPR 2020. Authors: Zili Yi, Qiang Tang, Shekoofeh Azizi, Daesik Jang, Zhan Xu
  • Recently data-driven image inpainting methods have made inspiring progress, impacting fundamental image editing tasks such as object removal and damaged image repairing. These methods are more effective than classic approaches, however, due to memory limitations they can only handle low-resolution inputs, typically smaller than 1K. Meanwhile, the resolution of photos captured with mobile devices increases up to 8K. Naive up-sampling of the low-resolution inpainted result can merely yield a large yet blurry result. Whereas, adding a high-frequency residual image onto the large blurry image can generate a sharp result, rich in details and textures. Motivated by this, we propose a Contextual Residual Aggregation (CRA) mechanism that can produce high-frequency residuals for missing contents by weighted aggregating residuals from contextual patches, thus only requiring a low-resolution prediction from the network. Since convolutional layers of the neural network only need to operate on low-resolution inputs and outputs, the cost of memory and computing power is thus well suppressed. Moreover, the need for high-resolution training datasets is alleviated. In our experiments, we train the proposed model on small images with resolutions 512x512 and perform inference on high-resolution images, achieving compelling inpainting quality. Our model can inpaint images as large as 8K with considerable hole sizes, which is intractable with previous learning-based approaches. We further elaborate on the light-weight design of the network architecture, achieving real-time performance on 2K images on a GTX 1080 Ti GPU. Codes are available at: Atlas200dk/sample-imageinpainting-HiFill

2021

Generating Diverse Structure for Image Inpainting With Hierarchical VQ-VAE

  • link
  • CVPR 2021. Authors: Jialun Peng, Dong Liu, Songcen Xu, Houqiang Li
  • paper link, code
  • Given an incomplete image without additional constraint, image inpainting natively allows for multiple solutions as long as they appear plausible. Recently, multiple-solution inpainting methods have been proposed and shown the potential of generating diverse results. However, these methods have difficulty in ensuring the quality of each solution, e.g. they produce distorted structure and/or blurry texture. We propose a two-stage model for diverse inpainting, where the first stage generates multiple coarse results each of which has a different structure, and the second stage refines each coarse result separately by augmenting texture. The proposed model is inspired by the hierarchical vector quantized variational auto-encoder (VQ-VAE), whose hierarchical architecture disentangles structural and textural information. In addition, the vector quantization in VQ-VAE enables autoregressive modeling of the discrete distribution over the structural information. Sampling from the distribution can easily generate diverse and high-quality structures, making up the first stage of our model. In the second stage, we propose a structural attention module inside the texture generation network, where the module utilizes the structural information to capture distant correlations. We further reuse the VQ-VAE to calculate two feature losses, which help improve structure coherence and texture realism, respectively. Experimental results on CelebA-HQ, Places2, and ImageNet datasets show that our method not only enhances the diversity of the inpainting solutions but also improves the visual quality of the generated multiple images. Code and models are available at: https://github.com/USTC-JialunPeng/Diverse-Structure-Inpainting.

Image Inpainting with External-internal Learning and Monochromic Bottleneck

  • link
  • CVPR 2021. Authors: Tengfei Wang, Hao Ouyang, Qifeng Chen
  • paper link, code
  • Although recent inpainting approaches have demonstrated significant improvements with deep neural networks, they still suffer from artifacts such as blunt structures and abrupt colors when filling in the missing regions. To address these issues, we propose an external-internal inpainting scheme with a monochromic bottleneck that helps image inpainting models remove these artifacts. In the external learning stage, we reconstruct missing structures and details in the monochromic space to reduce the learning dimension. In the internal learning stage, we propose a novel internal color propagation method with progressive learning strategies for consistent color restoration. Extensive experiments demonstrate that our proposed scheme helps image inpainting models produce more structure-preserved and visually compelling results.

PD-GAN: Probabilistic Diverse GAN for Image Inpainting

  • link
  • CVPR 2021. Authors: Hongyu Liu, Ziyu Wan, Wei Huang, Yibing Song, Xintong Han, Jing Liao
  • We propose PD-GAN, a probabilistic diverse GAN for image inpainting. Given an input image with arbitrary hole regions, PD-GAN produces multiple inpainting results with diverse and visually realistic content. Our PD-GAN is built upon a vanilla GAN which generates images based on random noise. During image generation, we modulate deep features of input random noise from coarse-to-fine by injecting an initially restored image and the hole regions in multiple scales. We argue that during hole filling, the pixels near the hole boundary should be more deterministic (i.e., with higher probability trusting the context and initially restored image to create natural inpainting boundary), while those pixels lie in the center of the hole should enjoy more degrees of freedom (i.e., more likely to depend on the random noise for enhancing diversity). To this end, we propose spatially probabilistic diversity normalization (SPDNorm) inside the modulation to model the probability of generating a pixel conditioned on the context information. SPDNorm dynamically balances the realism and diversity inside the hole region, making the generated content more diverse towards the hole center and resemble neighboring image content more towards the hole boundary. Meanwhile, we propose a perceptual diversity loss to further empower PD-GAN for diverse content generation. Experiments on benchmark datasets including CelebA-HQ, Places2 and Paris Street View indicate that PD-GAN is effective for diverse and visually realistic image restoration.

High-Fidelity Pluralistic Image Completion with Transformers

  • link
  • 2021. Authors: Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao
  • Image completion has made tremendous progress with convolutional neural networks (CNNs), because of their powerful texture modeling capacity. However, due to some inherent properties (e.g., local inductive prior, spatial-invariant kernels), CNNs do not perform well in understanding global structures or naturally support pluralistic completion. Recently, transformers demonstrate their power in modeling the long-term relationship and generating diverse results, but their computation complexity is quadratic to input length, thus hampering the application in processing high-resolution images. This paper brings the best of both worlds to pluralistic image completion: appearance prior reconstruction with transformer and texture replenishment with CNN. The former transformer recovers pluralistic coherent structures together with some coarse textures, while the latter CNN enhances the local texture details of coarse priors guided by the high-resolution masked images. The proposed method vastly outperforms state-of-the-art methods in terms of three aspects: 1) large performance boost on image fidelity even compared to deterministic completion methods; 2) better diversity and higher fidelity for pluralistic completion; 3) exceptional generalization ability on large masks and generic dataset, like ImageNet.

Large Scale Image Completion via Co-Modulated Generative Adversarial Networks

  • link
  • ICLR 2021. Authors: Shengyu Zhao, Jonathan Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric I Chang, Yan Xu
  • Numerous task-specific variants of conditional generative adversarial networks have been developed for image completion. Yet, a serious limitation remains that all existing algorithms tend to fail when handling large-scale missing regions. To overcome this challenge, we propose a generic new approach that bridges the gap between image-conditional and recent modulated unconditional generative architectures via co-modulation of both conditional and stochastic style representations. Also, due to the lack of good quantitative metrics for image completion, we propose the new Paired/Unpaired Inception Discriminative Score (P-IDS/U-IDS), which robustly measures the perceptual fidelity of inpainted images compared to real images via linear separability in a feature space. Experiments demonstrate superior performance in terms of both quality and diversity over state-of-the-art methods in free-form image completion and easy generalization to image-to-image translation. Code is available at https://github.com/zsyzzsoft/co-mod-gan.
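
  • A rough, illustrative sketch of the co-modulation idea from the abstract: the style vector that modulates the generator is produced jointly from the image-conditional encoder output and a mapped random latent. All names and sizes below are placeholders invented for this note, not the official co-mod-gan code:

import torch
import torch.nn as nn

class CoModulatedStyle(nn.Module):
    # joint (conditional + stochastic) style vector, in the spirit of co-modulation
    def __init__(self, cond_dim, z_dim, style_dim):
        super(CoModulatedStyle, self).__init__()
        self.mapping = nn.Sequential(nn.Linear(z_dim, style_dim), nn.LeakyReLU(0.2))
        self.joint = nn.Linear(cond_dim + style_dim, style_dim)

    def forward(self, cond_feat, z):
        # cond_feat: flattened encoder features of the masked image, (N, cond_dim)
        # z: random latent vector, (N, z_dim)
        w = self.mapping(z)
        return self.joint(torch.cat([cond_feat, w], dim=1))  # modulates generator layers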

Image Inpainting Surveys

Image inpainting: A review

  • link

  • 2019. Authors: Omar Elharrouss, Noor Almaadeed, Somaya Al-Maadeed, Younes Akbari

  • Although image inpainting, or the art of repairing the old and deteriorated images, has been around for many years, it has gained even more popularity because of the recent development in image processing techniques. With the improvement of image processing tools and the flexibility of digital image editing, automatic image inpainting has found important applications in computer vision and has also become an important and challenging topic of research in image processing. This paper is a brief review of the existing image inpainting approaches we first present a global vision on the existing methods for image inpainting. We attempt to collect most of the existing approaches and classify them into three categories, namely, sequential-based, CNN-based and GAN-based methods. In addition, for each category, a list of methods for the different types of distortion on the images is presented. Furthermore, collect a list of the available datasets and discuss these in our paper. This is a contribution for digital image inpainting researchers trying to look for the available datasets because there is a lack of datasets available for image inpainting. As the final step in this overview, we present the results of real evaluations of the three categories of image inpainting methods performed on the datasets used, for the different types of image distortion. In the end, we also present the evaluations metrics and discuss the performance of these methods in terms of these metrics. This overview can be used as a reference for image inpainting researchers, and it can also facilitate the comparison of the methods as well as the datasets used. The main contribution of this paper is the presentation of the three categories of image inpainting methods along with a list of available datasets that the researchers can use to evaluate their proposed methodology against.

Original post: https://www.cnblogs.com/lwp-nicol/p/15016914.html