In real-world scenarios, both face images and videos may suffer from unknown and varied types of degradation, such as down-sampling, noise, blur, and compression. Blind Face Restoration (BFR) is a challenging task that aims to restore low-quality faces suffering from unknown degradation. Existing BFR methods usually build facial priors, such as reference priors, geometry priors, and generative priors, into the network structure. Although these methods work well on the blind face image restoration (BFIR) problem, they do not fully consider blind face videos. To the best of our knowledge, there is still no specialized method for restoring blind face videos.
In this paper, we present Stable Blind Face Video Restoration (StableBFVR). As shown in Fig. 1, we introduce temporal layers into Stable Diffusion [1] to preserve temporal consistency. First, we propose the Shift-ResBlock, which implicitly captures global information for long-term aggregation. Second, we further improve restoration performance and temporal consistency by introducing Nearby-Frame Attention to aggregate short-term information. Moreover, to enable adaptive responses to complex and wide-ranging blind degradation, we propose a degradation-aware prompt module that encodes degradation-specific information as prompts to guide the restoration network.
Figure 1. The architecture of the proposed StableBFVR.
StableBFVR uses the pre-trained latent diffusion model (LDM) Stable Diffusion as a facial prior. In this work, we start from pre-trained Stable Diffusion and build a new video diffusion model for blind face video restoration. By adopting temporal strategies within the LDM framework, our method achieves temporal consistency while leveraging the prior knowledge in Stable Diffusion.
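The summary above does not specify how the temporal layers are attached to the pre-trained network. The following is a minimal PyTorch sketch, assuming the common pattern used by video diffusion models of following each frozen spatial block of the U-Net with a newly initialized, trainable temporal layer; the TemporalWrapper name and the frozen/trainable split are illustrative assumptions, not the exact implementation.

```python
import torch.nn as nn

class TemporalWrapper(nn.Module):
    """Illustrative wrapper: a frozen pre-trained spatial block of the
    Stable Diffusion U-Net followed by a trainable temporal layer, so
    the generative prior is preserved while cross-frame modeling is
    learned from scratch."""

    def __init__(self, spatial_block: nn.Module, temporal_layer: nn.Module):
        super().__init__()
        self.spatial_block = spatial_block
        for p in self.spatial_block.parameters():
            p.requires_grad_(False)            # keep the LDM prior intact
        self.temporal_layer = temporal_layer   # e.g., Shift-ResBlock or NFA

    def forward(self, x, num_frames):
        # x: (B*T, C, H, W) -- video frames folded into the batch dimension
        x = self.spatial_block(x)                   # per-frame spatial processing
        return self.temporal_layer(x, num_frames)  # cross-frame aggregation
```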
To maintain temporal consistency and exploit multi-frame information for better restoration, we introduce temporal layers into Stable Diffusion. As shown in Fig. 2, these temporal layers consider both long-term and short-term information in the video. Specifically, we present the Shift-ResBlock, which applies the proposed forward and backward temporal shift blocks alternately to achieve bi-directional aggregation; stacking Shift-ResBlocks repeatedly aggregates long-term information. For short-term aggregation, we introduce Nearby-Frame Attention (NFA), which seeks complementary sharp information in neighboring frames to refine restoration details. A sketch of both modules is given below Fig. 2.
Figure 2. The structure of the proposed Shift-ResBlock and Nearby-Frame Attention (NFA).
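As a minimal sketch of the two temporal modules, the PyTorch code below assumes a TSM-style channel shift for the forward/backward temporal shift blocks (applied inside a residual block before its convolutions) and a window of immediately adjacent frames for NFA. The function and module names, the shift fraction fold_ratio, and the residual connection are assumptions for illustration.

```python
import torch
import torch.nn as nn

def temporal_shift(x, num_frames, direction="forward", fold_ratio=8):
    """TSM-style shift: move a fraction of channels one step along time.
    'forward' passes features to the next frame, 'backward' to the
    previous one; alternating the two gives bi-directional aggregation,
    and stacking blocks grows the temporal receptive field, enabling
    long-term information flow."""
    bt, c, h, w = x.shape
    b = bt // num_frames
    x = x.reshape(b, num_frames, c, h, w)
    fold = c // fold_ratio                      # shifted fraction (assumed)
    out = x.clone()                             # boundary frames keep their own features
    if direction == "forward":
        out[:, 1:, :fold] = x[:, :-1, :fold]    # frame t receives frame t-1
    else:
        out[:, :-1, :fold] = x[:, 1:, :fold]    # frame t receives frame t+1
    return out.reshape(bt, c, h, w)

class NearbyFrameAttention(nn.Module):
    """Cross-attention in which each frame queries its immediate
    neighbors, pulling in complementary sharp details (short-term)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, num_frames):
        # x: (B*T, C, H, W) -> per-frame token sequences of shape (H*W, C)
        bt, c, h, w = x.shape
        b = bt // num_frames
        tokens = x.flatten(2).transpose(1, 2).reshape(b, num_frames, h * w, c)
        out = torch.empty_like(tokens)
        for t in range(num_frames):
            nbrs = [tokens[:, max(t - 1, 0)], tokens[:, t],
                    tokens[:, min(t + 1, num_frames - 1)]]
            kv = torch.cat(nbrs, dim=1)        # keys/values from frames t-1, t, t+1
            out[:, t], _ = self.attn(tokens[:, t], kv, kv)
        out = out.reshape(bt, h * w, c).transpose(1, 2).reshape(bt, c, h, w)
        return x + out                         # residual connection (assumed)
```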
To further improve restoration performance, we propose a Degradation-Aware Prompt Module (DAPM). As shown in Fig. 1, DAPM first extracts degradation-aware features from the input frames to predict prompt weights for different types of degradation. It then uses these weights to modulate the prompts corresponding to the different degradation types and fuses them into degradation-aware prompts that encode discriminative information about the various degradations.
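A rough sketch of this weighted-prompt fusion, in the spirit of prompt-based restoration methods such as PromptIR, is shown below; the number of degradation types, the prompt length, and the pooling-based weight predictor are assumptions rather than the exact DAPM design.

```python
import torch
import torch.nn as nn

class DegradationAwarePrompt(nn.Module):
    """Keeps one learnable prompt per degradation type, predicts soft
    weights from the input features, and fuses the prompts by those
    weights into a single degradation-aware prompt."""
    def __init__(self, dim, num_degradations=4, prompt_len=16):
        super().__init__()
        # one learnable prompt per degradation type (e.g., down-sampling,
        # noise, blur, compression) -- sizes here are assumptions
        self.prompts = nn.Parameter(
            torch.randn(num_degradations, prompt_len, dim))
        self.weight_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(dim, num_degradations))

    def forward(self, feat):
        # feat: (B, C, H, W) degradation-aware features from the input frames
        w = self.weight_head(feat).softmax(dim=-1)          # (B, K) prompt weights
        prompt = torch.einsum("bk,kld->bld", w, self.prompts)
        return prompt   # (B, prompt_len, dim), injected to guide restoration
```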
For the synthetic VFHQ-Test, the quantitative results are shown in Tab. 1. The results indicate that our method achieves state-of-the-art performance on all perceptual metrics. Specifically, StableBFVR achieves the best LPIPS, indicating that the perceptual quality of the restored face videos is closest to the ground truth. StableBFVR also obtains the best NIQE, MUSIQ, and CLIP-IQA scores, showing that its outputs align better with human visual perception. To assess generalization, we extend the evaluation to the real-world dataset WebVideo-Test. StableBFVR exhibits superior performance on all three metrics (NIQE, MUSIQ, and CLIP-IQA), demonstrating its remarkable generalization capability. Furthermore, compared with video restoration methods, BFIR methods also perform satisfactorily, suggesting the importance of generative priors under the unknown degradations found in the real world.
Table 1. Quantitative comparison on VFHQ-Test and WebVideo-Test for blind face video restoration.
Qualitative results on the synthetic VFHQ-Test are shown in Fig. 3. Compared with video restoration methods, our method recovers faithful details in the eyes, mouth, and beard. Because it treats the input as a whole during restoration, it performs well in all regions, and it can aggregate information from other frames to further improve performance.
Figure 3. Visual comparison results of different methods on the VFHQ-Test.
For WebVideo-Test, our method produces realistic facial textures under complicated real-world degradation. As shown in the last column of Fig. 4, previous BFIR methods fail to restore the hair textures at the image boundary, whereas ours succeeds. Compared with video restoration methods, our method produces significantly more texture detail.
Figure 4. Visual comparison results of different methods on the real-world dataset WebVideo-Test.
To further verify our method, we visualize consecutive frames generated by different methods in Fig. 5. Although the sequences restored by BFIR methods exhibit realistic texture, there are noticeable differences between the textures of consecutive frames. Conversely, sequences restored by video restoration methods demonstrate commendable temporal consistency but tend to be excessively smooth and lack texture. StableBFVR strikes a favorable balance, reconstructing more texture while preserving temporal consistency.
Figure 5. Visual comparisons of the temporal consistency of restored videos. (a) GFP-GAN, (b) CodeFormer, (c) DiffBIR, (d) RestoreFormer, (e) BasicVSR++ [2], (f) RVRT, (g) DSTNet, (h) Ours.
In this work, we tackle the blind face video restoration (BFVR) problem for the first time. We propose StableBFVR, which leverages the strong generative prior of the pre-trained generative model Stable Diffusion to restore face videos with realistic details. To ensure content consistency among frames and exploit multi-frame information for improved restoration, we develop the Shift-ResBlock and Nearby-Frame Attention to aggregate long-term and short-term information, respectively. Additionally, we propose a Degradation-Aware Prompt Module to dynamically guide the restoration process and further enhance performance. Extensive experiments show that StableBFVR outperforms both video restoration methods and blind face image restoration methods.
https://openreview.net/forum?id=qaIS3nvAem
[1] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In IEEE Conference on Computer Vision and Pattern Recognition.
[2] Kelvin C.K. Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. 2022. BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment. In IEEE Conference on Computer Vision and Pattern Recognition.