Google used its I/O 2026 keynote on Tuesday, May 19, to introduce Gemini Omni, a new model the lab described as a leap forward in world understanding, multimodality, and editing. Omni is designed to generate any output from any input, with launch capabilities centered on video.

The release lands alongside two adjacent Gemini moves: Gemini 3.5 Flash, the lab's new lightweight model, positioned at roughly half — and in some cases close to one-third — the price of comparable frontier models; and Gemini Spark, a new general-purpose AI agent that reasons across information in connected apps, in beta for trusted testers and Google AI Ultra subscribers.

For the multimodal beat, Omni is the headline. The framing, "any output from any input," is more ambitious than the headline of the GPT-5.5 Instant release two weeks earlier, and it positions Google deliberately on the modality axis rather than on the reliability axis OpenAI has been emphasizing.

What we know from Google's launch communications:

Modality envelope. Omni accepts text, image, audio, and video inputs, and generates text, image, audio, and video outputs. The video output capability is the headline new piece.
Editing. Google emphasized editing as a primary modality, not just generation — Omni is positioned as a system that can take an existing video or image and modify it in coherent multimodal ways.
World understanding. Google's framing — distinct from "image classification" or "video tagging" — claims a level of scene comprehension that supports complex multimodal reasoning across the inputs.

What we don't know from the launch:

Benchmark scores on contemporary multimodal evaluations (MMMU, ChartQA, DocVQA, Video-MME).
Token-equivalent costs of generation versus input across the modalities.
Latency profiles for video-out at production resolutions.
The model's behavior on adversarial prompts that mix modalities (image + text injection, audio + text contradictions).

Operators interested in deploying Omni for production workloads will want to wait for the third-party leaderboard results that typically land in the two weeks after a major Google model release. The frontier multimodal benchmark suite (Video-MME for long video, MMMU for cross-modal reasoning, ChartQA and DocVQA for charts and documents) gives a like-for-like comparison against Anthropic's Opus 4.7 and OpenAI's GPT-5.5 family.

Google's choice to position Omni alongside Spark suggests the lab expects the next twelve months of competitive differentiation to come from the multimodal-plus-agentic combination rather than from text-only frontier gains. That is a defensible strategic read of where the field is heading; whether Omni's actual quality matches the framing is the question the next month will answer.


Sources