What is multimodal inference?

Multimodal inference processes multiple data types (text, images, audio, video) simultaneously in a single model run. Unlike traditional single-modality approaches, it lets models reason across formats in one pass. vllm-omni simplifies this with efficient backends for diffusion, audio, and video processing.