Despite substantial progress, all-in-one image restoration (IR) grapples with persistent challenges in handling intricate real-world degradations. This paper introduces MPerceiver: a novel multimodal prompt learning approach that harnesses Stable Diffusion (SD) priors to enhance adaptiveness, generalizability and fidelity for all-in-one image restoration. Specifically, we develop a dual-branch module to master two types of SD prompts: textual for holistic representation and visual for multiscale detail representation. Both prompts are dynamically adjusted by degradation predictions from the CLIP image encoder, enabling adaptive responses to diverse unknown degradations. Moreover, a plug-in detail refinement module improves restoration fidelity via direct encoder-to-decoder information transformation. To assess our method, we train MPerceiver on 9 tasks for all-in-one IR, where it outperforms state-of-the-art task-specific methods across many of them. After multitask pre-training, MPerceiver attains a generalized representation in low-level vision, exhibiting remarkable zero-shot and few-shot capabilities on unseen tasks. Extensive experiments on 16 IR tasks underscore the superiority of MPerceiver in terms of adaptiveness, generalizability and fidelity.
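The prompt-adjustment idea above lends itself to a short illustration. Below is a minimal PyTorch sketch (not the released code) of how degradation probabilities predicted from a frozen CLIP image embedding could softly combine a pool of learnable prompts; the class name `PromptModulator` and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PromptModulator(nn.Module):
    """Hypothetical sketch: weight a pool of learnable prompts by the
    degradation mix predicted from a CLIP image embedding."""

    def __init__(self, clip_dim=768, num_degradations=9, prompt_dim=512):
        super().__init__()
        # Lightweight head predicting a distribution over degradation types
        # from the CLIP image embedding of the degraded input.
        self.degradation_head = nn.Linear(clip_dim, num_degradations)
        # One learnable prompt vector per degradation type.
        self.prompt_pool = nn.Parameter(torch.randn(num_degradations, prompt_dim))

    def forward(self, clip_image_embedding):
        # clip_image_embedding: (B, clip_dim), typically from a frozen CLIP encoder.
        probs = self.degradation_head(clip_image_embedding).softmax(dim=-1)  # (B, D)
        # Soft combination lets unknown or mixed degradations receive a blended prompt.
        adaptive_prompt = probs @ self.prompt_pool  # (B, prompt_dim)
        return adaptive_prompt, probs
```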
Illustration of MPerceiver’s dual-branch module with multimodal prompts. Textual Branch: CLIP image embeddings are transformed into text vectors through cross-modal inversion, which are then used alongside textual prompts as holistic representations for SD. Visual Branch: IR-Adapter decomposes VAE image embeddings into multi-scale features, which are then dynamically modulated by visual prompts to adaptively provide detail guidance for SD.
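To make the two branches concrete, here is a hedged PyTorch sketch under stated assumptions: the textual branch inverts a CLIP image embedding into pseudo text tokens for SD's cross-attention, and the visual branch (mirroring the IR-Adapter named above, with internals we assume) decomposes the VAE latent into multi-scale features modulated by degradation-weighted visual prompts. Class names, token counts and channel widths are illustrative, not the official implementation.

```python
import torch
import torch.nn as nn


class CrossModalInversion(nn.Module):
    """Textual branch sketch: invert a CLIP image embedding into pseudo
    text tokens that can be fed to SD's cross-attention layers."""

    def __init__(self, clip_dim=768, text_dim=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.text_dim = text_dim
        self.proj = nn.Linear(clip_dim, num_tokens * text_dim)

    def forward(self, clip_image_embedding):  # (B, clip_dim)
        tokens = self.proj(clip_image_embedding)
        return tokens.view(-1, self.num_tokens, self.text_dim)  # (B, T, text_dim)


class IRAdapter(nn.Module):
    """Visual branch sketch: decompose the VAE latent into multi-scale
    features and modulate them with degradation-weighted visual prompts."""

    def __init__(self, latent_channels=4, widths=(64, 128, 256), num_degradations=9):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = latent_channels
        for w in widths:
            self.stages.append(nn.Conv2d(in_ch, w, kernel_size=3, stride=2, padding=1))
            in_ch = w
        # Per degradation type and per scale: a (scale, shift) visual prompt.
        self.visual_prompts = nn.ParameterList(
            [nn.Parameter(torch.zeros(num_degradations, 2, w)) for w in widths]
        )

    def forward(self, vae_latent, degradation_probs):  # (B, 4, H, W), (B, D)
        feats, x = [], vae_latent
        for stage, prompts in zip(self.stages, self.visual_prompts):
            x = torch.relu(stage(x))
            # Blend the per-degradation prompts by the predicted probabilities.
            mod = torch.einsum("bd,dcw->bcw", degradation_probs, prompts)  # (B, 2, w)
            scale, shift = mod[:, 0], mod[:, 1]
            x = x * (1 + scale[..., None, None]) + shift[..., None, None]
            feats.append(x)  # to be added to matching SD U-Net encoder levels
        return feats
```

In this sketch the pseudo text tokens would be concatenated with the textual prompts before SD's cross-attention, and the multi-scale features would be injected additively into the frozen SD U-Net, in the spirit of adapter-style conditioning.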
Our multimodal prompt learning can effectively enable the network to distinguish different degradations.
[All-in-one] Quantitative comparison with state-of-the-art task-specific methods and all-in-one methods on 9 tasks.
[All-in-one] Quantitative comparison on the proposed mixed degradation benchmark MID6.
[All-in-one] Quantitative comparison on real-world datasets of deraining, desnowing and motion deblurring.
Real-world visual results on dehazing and deraining.
Visual results on the MID6 benchmark (R: Rain; RD: RainDrop; N: Noise; LL: Low-Light; B: Blur).
If you have any questions, please feel free to contact Yuang Ai at shallowdream555@gmail.com.
@InProceedings{ai2024mperceiver,
author = {Ai, Yuang and Huang, Huaibo and Zhou, Xiaoqiang and Wang, Jiexiang and He, Ran},
title = {Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {25432-25444}
}