Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration

CVPR 2024

1MAIS & CRIPAC, Institute of Automation, Chinese Academy of Sciences, 2School of Artificial Intelligence, University of Chinese Academy of Sciences, 3University of Science and Technology of China

Our MPerceiver excels in image restoration tasks with: (I) All-in-one: Addressing diverse degradations, including challenging mixed ones, through a single pretrained network. (II) Zero-shot: Handling training-unseen degradations effortlessly. (III) Few-shot: Adapting to new tasks with minimal data (about 3%-5% of the data used by task-specific methods).

Abstract

Despite substantial progress, all-in-one image restoration (IR) grapples with persistent challenges in handling intricate real-world degradations. This paper introduces MPerceiver: a novel multimodal prompt learning approach that harnesses Stable Diffusion (SD) priors to enhance adaptiveness, generalizability and fidelity for all-in-one image restoration. Specifically, we develop a dual-branch module to master two types of SD prompts: textual for holistic representation and visual for multiscale detail representation. Both prompts are dynamically adjusted by degradation predictions from the CLIP image encoder, enabling adaptive responses to diverse unknown degradations. Moreover, a plug-in detail refinement module improves restoration fidelity via direct encoder-to-decoder information transformation. To assess our method, MPerceiver is trained on 9 tasks for all-in-one IR and outperforms state-of-the-art task-specific methods across many tasks. After multitask pre-training, MPerceiver attains a generalized representation in low-level vision, exhibiting remarkable zero-shot and few-shot capabilities on unseen tasks. Extensive experiments on 16 IR tasks underscore the superiority of MPerceiver in terms of adaptiveness, generalizability and fidelity.

Method

Dual-branch Module

Illustration of MPerceiver’s dual-branch module with multimodal prompts. Textual Branch: CLIP image embeddings are transformed into text vectors through cross-modal inversion, which are then used alongside textual prompts as holistic representations for SD. Visual Branch: IR-Adapter decomposes VAE image embeddings into multi-scale features, which are then dynamically modulated by visual prompts to provide detail guidance for SD adaptively.
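The adaptive-modulation idea can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation: all shapes and names (`prompt_bank`, `degradation_logits`, the FiLM-style scaling) are hypothetical stand-ins for the actual CLIP predictor and IR-Adapter. It shows how degradation probabilities can mix a bank of learnable prompts, whose result then modulates a multiscale feature map.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# --- Hypothetical components (shapes chosen for illustration only) ---
n_degradations = 9   # the paper trains on 9 restoration tasks
prompt_dim = 16      # toy prompt dimensionality

# Bank of learnable visual prompts, one per degradation type
prompt_bank = rng.standard_normal((n_degradations, prompt_dim))

# Degradation logits as a stand-in for the CLIP image encoder's prediction
degradation_logits = rng.standard_normal(n_degradations)
weights = softmax(degradation_logits)   # adaptive mixing weights, sum to 1

# Dynamically adjusted prompt: weighted combination over the bank
prompt = weights @ prompt_bank          # shape (prompt_dim,)

# FiLM-style modulation of one multiscale feature map (C, H, W)
feature = rng.standard_normal((prompt_dim, 8, 8))
scale = np.tanh(prompt)[:, None, None]  # per-channel scale from the prompt
modulated = feature * (1.0 + scale)     # detail guidance passed on to SD

print(modulated.shape)  # (16, 8, 8)
```

In the actual method this modulation happens at several scales of the IR-Adapter; the single-scale FiLM-style product here only conveys the "prompt conditions feature" pattern.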


Detail Refinement Module


t-SNE Visualizations

Our multimodal prompt learning can effectively enable the network to distinguish different degradations.
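A visualization of this kind can be reproduced with scikit-learn's `TSNE`. The sketch below uses synthetic clustered vectors as a stand-in for per-image degradation embeddings (the data, cluster count, and dimensions are illustrative, not the paper's features):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)

# Synthetic stand-in for degradation embeddings:
# 3 degradation types, 20 samples each, clustered in a 32-D space
centers = rng.standard_normal((3, 32)) * 5.0
embeddings = np.concatenate(
    [c + rng.standard_normal((20, 32)) for c in centers]
)
labels = np.repeat(np.arange(3), 20)

# Project to 2-D; perplexity must stay below the number of samples
coords = TSNE(n_components=2, perplexity=10,
              random_state=0).fit_transform(embeddings)
print(coords.shape)  # (60, 2)
```

Plotting `coords` with a scatter colored by `labels` (e.g. `matplotlib.pyplot.scatter(coords[:, 0], coords[:, 1], c=labels)`) yields the kind of separated clusters shown in the figure when the embeddings discriminate degradations well.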


Experiments

Quantitative Comparison

[All-in-one] Quantitative comparison with state-of-the-art task-specific methods and all-in-one methods on 9 tasks.


[All-in-one] Quantitative comparison on the proposed mixed degradation benchmark MID6.


[All-in-one] Quantitative comparison on real-world datasets of deraining, desnowing and motion deblurring.


Qualitative Comparison


Real-world visual results on dehazing and deraining.


Visual results on the MID6 benchmark (R: Rain; RD: RainDrop; N: Noise; LL: Low-Light; B: Blur).

Contact

If you have any questions, please feel free to contact Yuang Ai at shallowdream555@gmail.com.

BibTeX

@InProceedings{ai2024mperceiver,
    author    = {Ai, Yuang and Huang, Huaibo and Zhou, Xiaoqiang and Wang, Jiexiang and He, Ran},
    title     = {Multimodal Prompt Perceiver: Empower Adaptiveness Generalizability and Fidelity for All-in-One Image Restoration},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {25432-25444}
}