NVIDIA Cosmos: WFM (World Foundation Model) platform

Cosmos and its WFMs excel in the following areas:

Performance: The Cosmos tokenizer outperforms all known tokenizers, running 12 times faster.
Generating High-Quality Videos: Can generate detailed videos from text/image input data.
Predicting Scene Evolution: Can predict the evolution of a scene, helping developers imagine and simulate future scenarios.
Safety Guardrails: Provides a robust safety system that protects developers from harmful inputs and outputs.
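
To make the last item concrete, here is a minimal sketch of checking both the input and the output of a generator. It is not NVIDIA's implementation: the blocklist, `is_safe`, `guarded_generate`, and `toy_generator` are all hypothetical names invented for illustration, and a real guardrail would likely rely on learned classifiers rather than keyword matching.

```python
from typing import Callable, Optional

# Hypothetical blocklist, used only for this illustration.
BLOCKED_TERMS = {"weapon", "gore"}


def is_safe(text: str) -> bool:
    """Return True when no blocked term appears as a word in the text."""
    words = text.lower().split()
    return not any(term in words for term in BLOCKED_TERMS)


def guarded_generate(prompt: str, generate: Callable[[str], str]) -> Optional[str]:
    """Run `generate` only on safe prompts and release only safe outputs."""
    if not is_safe(prompt):   # check the input before generation
        return None
    output = generate(prompt)
    if not is_safe(output):   # check the output after generation
        return None
    return output


def toy_generator(prompt: str) -> str:
    """Stand-in for a video model: returns a text description instead of video."""
    return f"video of {prompt}"
```

Calling `guarded_generate("a robot folding clothes", toy_generator)` passes both checks and returns the generated description, while a prompt containing a blocked word returns `None` before the generator is ever called.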

Additionally:

Video Prediction Tasks: Researchers demonstrated two tasks, instruction-based video prediction and action-based next-frame prediction.
Cosmos-1X Dataset: Created a dataset of approximately 200 hours of first-person video covering tasks such as navigation, folding clothes, and cleaning a desk.
Bridge Dataset: Used a public dataset containing approximately 20,000 third-person perspective videos, showing a robotic arm performing various tasks in a kitchen environment.
Real-World Driving Scene (RDS) Dataset: Created an internal dataset of approximately 3.6 million 20-second surround-view video clips.
Multi-View World Model: Fine-tuned the Cosmos-1.0-Diffusion-7B-Text2World model on RDS data to create a multi-view world model.
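
The action-based next-frame prediction task listed above can be pictured as an autoregressive loop: each predicted frame, together with the next action, becomes the input for the following prediction. The sketch below only illustrates that loop. `rollout` and `toy_predictor` are hypothetical names, a frame is reduced to a flat list of numbers, and the predictor is a trivial stand-in rather than a Cosmos model.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # toy "frame": a flat list of pixel values


def rollout(
    first_frame: Frame,
    actions: Sequence[float],
    predict_next: Callable[[Frame, float], Frame],
) -> List[Frame]:
    """Predict one frame per action, feeding each prediction back as input."""
    frames = [first_frame]
    for action in actions:
        frames.append(predict_next(frames[-1], action))
    return frames


def toy_predictor(frame: Frame, action: float) -> Frame:
    """Hypothetical stand-in for a learned model: shifts every pixel by the action."""
    return [pixel + action for pixel in frame]
```

With a starting frame and two actions, `rollout` returns three frames: the original plus one prediction per action. In a real system the predictor would be the fine-tuned world model and the actions would be robot commands.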

In summary, everything in the physical simulation world can be generated through NVIDIA’s Cosmos.

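The paper presents Cosmos, which includes a new tokenizer that far outperforms all currently known tokenizers. By fine-tuning Cosmos, the researchers created post-trained WFMs (World Foundation Models) that can be applied to a variety of physical AI tasks. These models can generate detailed videos from text or image input and predict how a scene will evolve.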

https://arxiv.org/pdf/2501.03575