CoDi-Any2Any Generation Model

Introduction#

This article briefly introduces CoDi.

CoDi can generate any combination of output modalities (language, image, video, or audio) from any combination of input modalities.

Main Content#

1. What is CoDi#

CoDi can generate any combination of output modalities (language, image, video, or audio) from any combination of input modalities of the same kinds. It aligns the modalities in both the input and output spaces and builds a shared multimodal space by bridging these alignments during the diffusion process. As a result, it can condition on any combination of inputs and generate any set of modalities.
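To make "conditioning on any input combination" more concrete, here is a minimal sketch. It assumes each modality has already been encoded into a vector in the same shared space, and that several conditions are combined by a weighted average. The function name combine_conditions, the weighting scheme, and the toy vectors are illustrative assumptions, not CoDi's actual code or API.

```python
from typing import Dict, List, Optional


def combine_conditions(
    embeddings: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Blend per-modality embeddings into one conditioning vector.

    Assumption: every embedding already lives in the same shared space,
    so vectors from different modalities can be averaged directly.
    """
    if not embeddings:
        raise ValueError("at least one modality embedding is required")

    dims = {len(vec) for vec in embeddings.values()}
    if len(dims) != 1:
        raise ValueError("all embeddings must have the same dimension")
    dim = dims.pop()

    # Default to equal weights when none are given.
    if weights is None:
        weights = {name: 1.0 for name in embeddings}

    total = sum(weights[name] for name in embeddings)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Normalized weighted sum, so the result stays in the same space.
    combined = [0.0] * dim
    for name, vec in embeddings.items():
        w = weights[name] / total
        for i, value in enumerate(vec):
            combined[i] += w * value
    return combined


# Toy 2-dimensional "embeddings" standing in for real encoder outputs.
text_emb = [1.0, 0.0]
audio_emb = [0.0, 1.0]
condition = combine_conditions({"text": text_emb, "audio": audio_emb})
```

Because the inputs are assumed to share one space, the same function works for one, two, or more modalities, which is the property that lets a single generator accept arbitrary input combinations.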

2. Generation Modes of CoDi#

Unlike existing generative AI systems, CoDi can generate multiple modalities in parallel, and its input is not limited to a subset of modalities such as text or images. In addition, because modalities are aligned in both the input and output spaces, CoDi can condition on any combination of inputs and generate any combination of modalities, even combinations that do not appear in the training data.

[Figure: CoDi architecture]

CoDi adopts a novel composable generation strategy: it builds a shared multimodal space during the diffusion process, which enables synchronized generation of intertwined modalities, such as temporally aligned video and audio. Highly customizable and flexible, CoDi achieves strong joint multimodal generation quality, and for single-modality synthesis it either outperforms or is on par with state-of-the-art single-modality methods.

3. Conclusion#

The official project provides a demo, along with the model structure and parameters.

Multimodal models have been developing rapidly. Researchers need to keep up with this work, but general users do not need to read about it in depth, since doing so can be mentally taxing.

Lastly#

References:

Official Homepage

Disclaimer#

This article is solely for personal learning purposes.

This article is synchronized with hblog.