V-JEPA 2: the next-generation LWM?
Meta recently launched V-JEPA 2, a video-trained world model positioned as a step toward artificial general intelligence (AGI). The model can understand and predict its environment and plan zero-shot, enabling robots to carry out complex tasks such as grasping and placing objects and performing household operations. With 1.2 billion parameters, V-JEPA 2 shows significant improvements over its predecessor in robot control (e.g., 100% on the Reach benchmark and 45% on the Grasp benchmark), action prediction (e.g., 39.7% on Epic-Kitchens-100), and visual understanding (e.g., 77.3% on the Something-Something v2 benchmark). Its key innovation is a two-stage training recipe: an action-free first stage pre-trains on 1 million hours of video, and an action-conditioned second stage learns planning from just 62 hours of robot data. Meta also introduced three new benchmarks (IntPhys 2, MVPBench, CausalVQA), which show that current models still lag well behind humans at physical reasoning and causal understanding. Future plans include hierarchical JEPA models that operate across multiple temporal and spatial scales, as well as multimodal perception systems.
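At its core, V-JEPA follows a latent-prediction recipe: rather than reconstructing pixels, an encoder and a predictor are trained to predict the representations that a slow-moving (EMA) target encoder produces for masked space-time regions of the video. The snippet below is a minimal, illustrative sketch of that objective, not Meta's code: the tiny MLP encoder/predictor, zero-masking of inputs, and every hyperparameter are simplifying assumptions (the real model is a billion-parameter video transformer with token dropping).

```python
# Minimal sketch of a JEPA-style latent-prediction objective (illustrative only).
import copy
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the large video encoder: maps patch features to embeddings."""
    def __init__(self, patch_dim=768, embed_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(patch_dim, embed_dim), nn.GELU(),
                                  nn.Linear(embed_dim, embed_dim))

    def forward(self, patches):              # patches: [B, N, patch_dim]
        return self.proj(patches)            # -> [B, N, embed_dim]

class TinyPredictor(nn.Module):
    """Stand-in for the predictor: guesses target embeddings from context embeddings."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.GELU(),
                                 nn.Linear(embed_dim, embed_dim))

    def forward(self, context_embeddings):
        return self.net(context_embeddings)

encoder, predictor = TinyEncoder(), TinyPredictor()
target_encoder = copy.deepcopy(encoder)      # EMA copy, never backpropagated through
for p in target_encoder.parameters():
    p.requires_grad_(False)
opt = torch.optim.AdamW(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

def jepa_step(video_patches, mask):
    """video_patches: [B, N, patch_dim]; mask: [B, N] bool, True = masked target patch."""
    with torch.no_grad():
        targets = target_encoder(video_patches)            # representations of the full clip
    context = video_patches * (~mask).unsqueeze(-1)        # hide the target patches from the context
    preds = predictor(encoder(context))
    loss = (preds[mask] - targets[mask]).abs().mean()      # regress latents only at masked positions
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                                  # EMA update of the target encoder
        for p_t, p in zip(target_encoder.parameters(), encoder.parameters()):
            p_t.mul_(0.999).add_(p, alpha=0.001)
    return loss.item()

# Toy usage: 8 clips, 64 space-time patches each, roughly half the patches masked
print(jepa_step(torch.randn(8, 64, 768), torch.rand(8, 64) > 0.5))
```

Predicting in representation space rather than pixel space lets the model ignore unpredictable low-level detail and focus on scene dynamics; the action-conditioned second stage then trains a predictor that additionally takes robot actions as input.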
Key Points
- V-JEPA 2 Core Features: Learns from video to understand and predict its environment and to plan zero-shot, supporting robots in performing complex tasks (see the planning sketch after this list).
- Performance Breakthroughs: Demonstrates significant improvements over its predecessor in robot control (e.g., 100% on the Reach benchmark and 45% on the Grasp benchmark, along with pick-and-place tasks), action prediction (e.g., 39.7% on Epic-Kitchens-100), and visual understanding (e.g., 77.3% on the Something-Something v2 benchmark).
- Two-Stage Training: An action-free first stage pre-trains on 1 million hours of video data; an action-conditioned second stage learns planning from just 62 hours of robot data.
- Multimodal Integration: Aligned with a language model, it achieves strong video question-answering performance and supports cross-modal task execution (e.g., visual sub-goals guiding long-horizon tasks).
- Benchmark Progress: Introduces three new benchmarks (IntPhys 2, MVPBench, CausalVQA), which reveal that current models still lag behind humans at physical reasoning and causal understanding.
- Future Directions: Plans to develop hierarchical JEPA models that operate across multiple temporal and spatial scales, plus multimodal perception systems integrating vision, hearing, and touch.
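The zero-shot planning behind the robot results is done by searching over action sequences in representation space: an action-conditioned predictor rolls candidate actions forward from the current frame's embedding, and each sequence is scored by how close its predicted state lands to the embedding of a goal image. The sketch below illustrates that loop with a simple cross-entropy-method planner; the `encode`/`predict` callables, the 7-D action space, and the horizon and population settings are hypothetical stand-ins, not the released V-JEPA 2-AC interface.

```python
# Hedged sketch of zero-shot planning with an action-conditioned latent predictor.
import torch

def plan_actions(encode, predict, current_frame, goal_frame,
                 horizon=5, action_dim=7, iters=4, pop=256, elite=32):
    """CEM planner: encode(frame) -> latent z; predict(z, action) -> next latent."""
    with torch.no_grad():
        z0, z_goal = encode(current_frame), encode(goal_frame)
        mean = torch.zeros(horizon, action_dim)
        std = torch.ones(horizon, action_dim)
        for _ in range(iters):
            # Sample a population of candidate action sequences: [pop, horizon, action_dim]
            actions = mean + std * torch.randn(pop, horizon, action_dim)
            z = z0.expand(pop, -1)                      # roll every candidate forward in latent space
            for t in range(horizon):
                z = predict(z, actions[:, t])
            energy = (z - z_goal).abs().mean(dim=-1)    # distance to the goal representation
            elite_seqs = actions[energy.topk(elite, largest=False).indices]
            mean, std = elite_seqs.mean(dim=0), elite_seqs.std(dim=0) + 1e-4
        return mean[0]                                  # best first action

# Toy usage with random linear stand-ins for the encoder and predictor
D = 64
enc_w, pred_w = torch.randn(3 * 32 * 32, D), torch.randn(D + 7, D)
encode = lambda img: img.flatten(-3) @ enc_w            # [3, 32, 32] -> [D]
predict = lambda z, a: torch.cat([z, a], dim=-1) @ pred_w
first_action = plan_actions(encode, predict, torch.randn(3, 32, 32), torch.randn(3, 32, 32))
print(first_action.shape)                               # torch.Size([7])
```

In practice such planners run in a receding-horizon loop: only the first action of the best sequence is executed, the new observation is encoded, and the search is repeated.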
References:
- https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
- https://ai.meta.com/vjepa/
- https://arxiv.org/pdf/2506.09985