Nvidia Cosmos: WFM platform
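The paper introduces Cosmos, a platform whose new video tokenizer is reported to outperform all known tokenizers. The researchers fine-tune Cosmos models to produce post-trained World Foundation Models (WFMs) suited to a range of physical AI tasks. These models can generate detailed videos from text or image input and predict how a scene evolves.
Cosmos and its WFMs stand out in the following areas: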
- Performance: The Cosmos tokenizer outperforms all known tokenizers and runs up to 12 times faster.
- Generating High-Quality Videos: Generates detailed videos from text or image input.
- Predicting Scene Evolution: Predicts how a scene will evolve, helping developers imagine and simulate future scenarios.
- Guardrail System: Provides a robust guardrail system that protects developers from harmful inputs and outputs.
In addition, the paper covers the following:
- Instruction-Based Video Prediction: The researchers demonstrate two tasks: instruction-based video prediction and action-based next-frame prediction.
- Cosmos-1X Dataset: A dataset of approximately 200 hours of first-person video covering tasks such as navigation, folding clothes, and cleaning a desk.
- Bridge Dataset: A public dataset of approximately 20,000 third-person videos showing a robotic arm performing various tasks in a kitchen environment.
- Real-World Driving Scene (RDS) Dataset: An internal dataset of approximately 3.6 million 20-second surround-view video clips.
- Multi-View World Model: The Cosmos-1.0-Diffusion-7B-Text2World model is fine-tuned on RDS data to produce a multi-view world model.
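As a quick sanity check on the scale of the RDS dataset, the sketch below converts a clip count into hours of footage. It assumes about 3.6 million clips (360万 in the original Chinese text) of 20 seconds each; the function and constant names are illustrative and do not come from the paper.

```python
# Back-of-the-envelope estimate of total RDS footage.
# Assumption: ~3.6 million clips, each 20 seconds long.
RDS_CLIPS = 3_600_000
CLIP_SECONDS = 20


def total_hours(num_clips: int, seconds_per_clip: int) -> float:
    """Return the total footage duration in hours."""
    return num_clips * seconds_per_clip / 3600


rds_hours = total_hours(RDS_CLIPS, CLIP_SECONDS)
print(f"RDS footage: about {rds_hours:,.0f} hours")
```

Under these assumptions the driving data comes to roughly 20,000 hours, about two orders of magnitude more than the roughly 200 hours in the Cosmos-1X dataset.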
In summary, NVIDIA’s Cosmos is presented as a way to generate virtually anything in a physically simulated world.
Reference:
https://arxiv.org/pdf/2501.03575