A Closer Look at Fourier Spectrum Discrepancies for CNN-generated Images Detection

Background

Recent research suggests that CNN-based generation methods are unable to reproduce the high-frequency distribution of real images. This paper therefore investigates the validity of claims that CNN-generated images cannot achieve high-frequency spectral decay consistency.

Assumption

The authors hypothesize that the last upsampling layer is directly responsible for the high-frequency discrepancies in the generated images.
Under this hypothesis, the high-frequency discrepancies are not intrinsic to CNN-generated images.
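
The intuition behind this hypothesis can be illustrated with a small 1D toy example. This is only a sketch and not the paper's experiment: the function names, the signal length, and the test frequency are all assumptions chosen for illustration. It upsamples a low-frequency cosine by a factor of 2, once by zero insertion and once by nearest-neighbour repetition, and measures how much of the spectral power ends up in the upper half of the frequency band.

```python
import numpy as np

def upsample_zero(x):
    """2x upsampling by inserting a zero after every sample."""
    y = np.zeros(2 * len(x))
    y[::2] = x
    return y

def upsample_nearest(x):
    """2x upsampling by repeating every sample once."""
    return np.repeat(x, 2)

def high_band_fraction(y):
    """Fraction of spectral power in the upper half of the frequency band."""
    power = np.abs(np.fft.rfft(y)) ** 2
    half = len(power) // 2
    return power[half:].sum() / power.sum()

# Hypothetical test signal: a pure low-frequency cosine (3 cycles in 64 samples).
n = 64
x = np.cos(2 * np.pi * 3 * np.arange(n) / n)

zero_frac = high_band_fraction(upsample_zero(x))
nearest_frac = high_band_fraction(upsample_nearest(x))
print(f"zero insertion: {zero_frac:.4f}, nearest: {nearest_frac:.4f}")
```

In this toy setting, zero insertion mirrors the low-frequency peak into the high band, so about half of the power lands there, while nearest-neighbour repetition acts as a short averaging filter and suppresses most of that replica. Whether the convolutional layers after the upsampling step can remove the replica in a trained GAN is the empirical question the paper studies.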

Experiments

These experiments mainly study how the last upsampling layer and the convolutional layers that follow it affect the high-frequency discrepancies.

As shown in Table 1, the paper investigates different settings, such as the upsampling method, the kernel size, and the number of convolutional blocks in the tail of the network.

As shown in Fig. 3, images from the Z.1.5 and Baseline experiments are spectrally inconsistent for all 3 GANs, whereas nearest and bilinear interpolation replicate the spectral distribution of real data reasonably well across all 3 GAN models. These results seem to show that the high-frequency Fourier discrepancies come from the choice of upsampling layer rather than from GANs as such. Fig. 4 seems to show that larger kernels work better than smaller kernels.
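
Spectral comparisons of this kind are commonly made with an azimuthally averaged power spectrum, which reduces the 2D Fourier spectrum of an image to a 1D profile over spatial frequency. The sketch below assumes a single-channel image and integer radius bins; the function name and the binning are my own choices and may differ from what the authors used to produce their figures.

```python
import numpy as np

def radial_power_spectrum(image):
    """Azimuthally averaged power spectrum of a 2D grayscale image."""
    image = np.asarray(image, dtype=np.float64)
    h, w = image.shape
    # Centered 2D power spectrum (zero frequency in the middle).
    power = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    # Integer distance of every frequency bin from the center.
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    # Mean power within each radius bin.
    total = np.bincount(radius.ravel(), weights=power.ravel())
    count = np.bincount(radius.ravel())
    profile = total / np.maximum(count, 1)
    # Keep radii up to the Nyquist frequency of the shorter axis.
    return profile[: min(h, w) // 2]
```

Applying this to real and generated images and plotting the two profiles on a log scale makes a difference in high-frequency decay visible as a gap at the right-hand end of the curves.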

Because nearest and bilinear interpolation yield spectrally consistent GANs, the authors test whether the classifier proposed above is robust enough to detect these samples as fake. As shown in Table 2, the classifier performs best on samples from the Z methods and poorly on the others, which suggests that it relies only on high-frequency attributes to classify images.

Strength

This paper provides counterexamples showing that high-frequency spectral decay discrepancies are not an inherent characteristic of CNN-generated images. By modifying the last upsampling layer, the high-frequency discrepancies can apparently be mitigated.

Weakness

To show that CNNs do not inherently produce high-frequency discrepancies, the authors conduct many experiments. These experiments seem to show that transposed convolution and zero interpolation lead to high-frequency discrepancies. I think this is because transposed convolution is a learning-based method and zero interpolation makes the image too sparse. However, for image restoration tasks such as image super-resolution, transposed convolution is currently a better choice than nearest or bicubic interpolation. I think extending the experiments to image restoration tasks would make the paper more persuasive.