It is also developing the technology for a new generative model for images and video, with a planned debut in the first half of 2026, according to the reporting, which cites an internal Q&A with executives. The “Mango” multimodal system is being built in Meta’s AI lab, which focuses on creating artificial general intelligence and is led by Scale AI co-founder Alexandr Wang. A second, text-based model, “Avocado,” is also in development, with an emphasis on coding and tool use, the outlets report.
Inside Meta’s New Multimodal Push for Images and Video
Mango is designed for high-fidelity image generation and long-form video creation, the kind of capability that has lifted Sora and Veo into the spotlight. People familiar with the roadmap describe a system capable of text-to-video, video-to-video and fine-grained editing, in addition to robust still-image performance. The strategic gamble is on what are known as “world models,” which learn to approximate realistic representations of objects, physics and scenes so the system can reason, plan and act instead of just stringing together plausible-looking frames.

That framing fits with public discussion of Meta’s research direction on grounded perception and planning, including work informed by energy-based learning and JEPA-style approaches favored by senior researchers. If Mango can preserve scene coherence over many seconds, maintain identity across shot changes and respect real-world constraints like lighting and gravity, it will clear some of the most conspicuous technical hurdles holding back today’s consumer-facing video generators.
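For readers who want a concrete sense of what a JEPA-style objective looks like, the sketch below is a minimal, illustrative PyTorch example: it predicts a future frame’s latent representation from a context frame and scores the prediction in latent space rather than pixel space. It is a generic reconstruction of the published research idea, not Meta’s actual architecture or training code, and every module and variable name here is a placeholder.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy frame encoder: flattens an image-like tensor into a latent vector."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class JEPASketch(nn.Module):
    """Joint-embedding predictive setup: predict the latent of a future frame
    from the latent of a context frame. The loss lives in latent space, so the
    model never has to reconstruct pixels."""
    def __init__(self, in_dim: int = 3 * 32 * 32, latent_dim: int = 128):
        super().__init__()
        self.context_encoder = TinyEncoder(in_dim, latent_dim)
        self.target_encoder = TinyEncoder(in_dim, latent_dim)  # updated by EMA, not gradients
        self.predictor = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )
        for p in self.target_encoder.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def ema_update(self, momentum: float = 0.99):
        # Slowly track the context encoder to keep targets stable.
        for tp, cp in zip(self.target_encoder.parameters(), self.context_encoder.parameters()):
            tp.data.mul_(momentum).add_(cp.data, alpha=1 - momentum)

    def forward(self, context_frame, future_frame):
        z_context = self.context_encoder(context_frame)
        z_pred = self.predictor(z_context)                # "world model" step: guess what comes next
        with torch.no_grad():
            z_target = self.target_encoder(future_frame)
        return nn.functional.mse_loss(z_pred, z_target)   # compare in latent space

# Toy usage with random "frames" standing in for video data.
model = JEPASketch()
opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
ctx = torch.randn(8, 3, 32, 32)
fut = torch.randn(8, 3, 32, 32)
loss = model(ctx, fut)
loss.backward()
opt.step()
model.ema_update()
```

The design choice the sketch highlights is the one the research literature emphasizes: scoring predictions in an abstract representation space encourages the model to capture object identity and physics rather than per-pixel detail.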
Timeline, compute, and ambition behind Mango’s launch plan
A 2026 debut would be aggressive but conceivable if Meta keeps scaling infrastructure rapidly. The company has said it is working toward roughly 350,000 Nvidia H100-class GPUs and about 600,000 H100 equivalents of compute across its fleet, a capacity investment on the scale needed to train frontier models. If Mango is to handle billions of daily requests across Instagram, WhatsApp and Facebook, it will need dedicated AI data centers, high-bandwidth interconnects and optimized inference stacks.
The competitive bar is high. OpenAI’s Sora has shown minute-long clips with strong temporal coherence; Google’s Veo touts 1080p generation and finer camera controls; startups like Runway and Pika are iterating fast on creator-friendly tooling. To differentiate itself, Mango will need to deliver high resolution with fine control over style and motion, robust audio and caption alignment, and near real-time rendering, ideally at a fraction of today’s inference cost.
What Avocado Means For Coding (And Agents)
Avocado, the text-based model reportedly optimized for coding and tool use, hints at an underlying agentic stack.
Strong code synthesis and tool orchestration could let Meta chain models together: Avocado for planning and tool use, Mango for visual content, with specialized modules for search, memory and safety. Such a system could automatically generate Reels variations, localize ads, assemble highlight clips from longer videos or support creators with storyboard-to-shot workflows.
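As an illustration of how such a chain might be wired together, here is a hedged Python sketch of a planner model proposing tool calls that a dispatcher routes to a visual generator, a search module and a safety check. All function names and the pipeline shape are hypothetical stand-ins for the pattern described above, not real Meta, Avocado or Mango APIs.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for the models described above; none of these are real APIs.
def planner_model(prompt: str) -> List[dict]:
    """Pretend 'Avocado-like' planner: turns a request into an ordered list of tool calls."""
    return [
        {"tool": "search_assets", "args": {"query": prompt}},
        {"tool": "generate_video", "args": {"prompt": prompt, "duration_s": 15}},
        {"tool": "safety_check", "args": {"asset_id": "draft-001"}},
    ]

def search_assets(query: str) -> str:
    return f"found 3 reference clips for '{query}'"

def generate_video(prompt: str, duration_s: int) -> str:
    # A 'Mango-like' visual model would be invoked here.
    return f"rendered {duration_s}s draft for '{prompt}'"

def safety_check(asset_id: str) -> str:
    return f"{asset_id}: passed provenance and policy checks"

TOOLS: Dict[str, Callable[..., str]] = {
    "search_assets": search_assets,
    "generate_video": generate_video,
    "safety_check": safety_check,
}

def run_pipeline(request: str) -> List[str]:
    """Planner proposes steps; the dispatcher executes them with specialized modules."""
    results = []
    for step in planner_model(request):
        tool = TOOLS[step["tool"]]
        results.append(tool(**step["args"]))
    return results

if __name__ == "__main__":
    for line in run_pipeline("30-second highlight reel from last week's match footage"):
        print(line)
```

The point of the pattern is the separation of concerns: the text model reasons about what to do, while specialized generators and checkers each do one thing well.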

Meta also has a foundation to build on. Its Emu family delivered fast image and short-form video generation, and its Llama models broadened developer mindshare in open ecosystems. If Avocado delivers a step-change in code reasoning and API invocation, it could unlock new levels of internal automation and shrink the gap between an idea and a publishable asset.
Data and safety are the biggest obstacles to Mango’s rollout
State-of-the-art video models require large, high-quality, rights-cleared datasets for training. Legal scholars note that the industry has moved toward a mix of licensed material, publicly available data and synthetic augmentation, especially after major companies began announcing deals to license media and news in 2024 and 2025. Meta’s potential edge is the massive trove of user-generated content across its apps, but that comes with heightened privacy, consent and regional regulatory constraints, particularly in the EU under the DSA and GDPR.
Safety will be equally pivotal. Realistic video generation carries immense risk: deepfakes threaten elections, public figures and brand integrity. Expect watermarking, provenance capabilities through Content Credentials or C2PA, classifier-based misuse detection and policy enforcement aligned with standards frameworks such as NIST’s AI Risk Management Framework. Public demos of Mango will be judged as much on the robustness of its watermarking and guardrails as on its cinematic flair.
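To make the provenance idea concrete, the sketch below shows the general shape of a Content Credentials-style record: a content hash bound to an assertion that the asset was AI-generated, plus a simple integrity check. It is not the C2PA specification or any real library API, just a toy illustration of why tamper-evident metadata helps downstream verification.

```python
import hashlib
import json
from datetime import datetime, timezone

def build_provenance_manifest(asset_bytes: bytes, generator: str) -> dict:
    """Toy manifest inspired by C2PA-style Content Credentials: bind a content hash
    to an assertion about how the asset was made. Real implementations embed a signed
    manifest in the file itself; this sketch only illustrates the shape of the record."""
    return {
        "claim_generator": generator,
        "created": datetime.now(timezone.utc).isoformat(),
        "assertions": [
            {"label": "ai_generated", "data": {"model": "video-model-demo"}},
        ],
        "content_hash": hashlib.sha256(asset_bytes).hexdigest(),
    }

def verify_manifest(asset_bytes: bytes, manifest: dict) -> bool:
    """Check that the asset has not been altered since the manifest was produced."""
    return manifest["content_hash"] == hashlib.sha256(asset_bytes).hexdigest()

# Demo with placeholder bytes standing in for a rendered video file.
fake_video = b"\x00\x01demo-bytes"
manifest = build_provenance_manifest(fake_video, generator="example-studio-pipeline")
print(json.dumps(manifest, indent=2))
print("intact:", verify_manifest(fake_video, manifest))
print("tampered:", verify_manifest(fake_video + b"!", manifest))
```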
Why Meta wants a win in next-gen video and image AI
Today, Meta’s AI traction derives mostly from distribution: its assistant appears inside apps used by billions of people, even if it is not yet the go-to destination for power users. Meanwhile, rivals have captured mindshare with flashy model launches and fast product loops. Reports of leadership reshuffling, researcher turnover in Meta’s superintelligence group and high-profile exits are reminders that Mango and Avocado are under pressure to deliver.
If Mango hits its 2026 target with compelling quality, controllability and safety, Meta could reshape creative workflows for consumers, advertisers and professionals within its ecosystem. Should it slip or disappoint, the center of gravity for video foundation models could settle elsewhere. Over the next year, watch for research teasers, third-party results on video and image leaderboards, compute disclosures and content provenance commitments; those will be the clearest signals of how close Meta is to turning Mango into a flagship product.
