Meta recently launched V-JEPA 2, a video-trained world model positioned as a step toward artificial general intelligence (AGI). The model can understand and predict its environment and plan zero-shot, enabling robots to perform complex tasks such as object grasping, placing, and household chores. With 1.2 billion parameters, V-JEPA 2 shows significant improvements over its predecessor in robot control (e.g., 100% on the Reach benchmark, 45% on the Grasp benchmark), action prediction (39.7% on the Epic-Kitchens-100 task), and visual understanding (77.3% on the Something-Something v2 benchmark). Its innovation lies in a two-stage training approach: an action-free pre-training stage on 1 million hours of video, followed by an action-conditioned stage that needs only 62 hours of robot data to add planning capability; combined with a language model, it also reaches strong video question-answering performance. Additionally, Meta introduced three new benchmarks (IntPhys 2, MVPBench, CausalVQA), revealing that current models still lag behind humans at physical reasoning and causal understanding. Future plans include hierarchical JEPA models operating across temporal and spatial scales, and multimodal perception systems.
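
To make the two-stage recipe concrete, below is a minimal sketch of what the second, action-conditioned stage could look like: the stage-one video encoder is kept frozen, and a small predictor is trained on robot trajectories to map a current embedding plus an action to the next embedding. All module names, shapes, and the L1 loss here are illustrative assumptions, not Meta's released V-JEPA 2 code.

```python
# Illustrative sketch of one action-conditioned training step (stage two).
# Modules, shapes, and the loss are hypothetical stand-ins, not Meta's code.
import torch
import torch.nn as nn

emb_dim, act_dim = 32, 7
encoder = nn.Linear(64, emb_dim)            # stand-in for the frozen stage-1 video encoder
predictor = nn.Sequential(                  # action-conditioned predictor trained in stage 2
    nn.Linear(emb_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))
for p in encoder.parameters():
    p.requires_grad_(False)                 # stage-1 weights stay fixed

opt = torch.optim.AdamW(predictor.parameters(), lr=1e-4)

# Toy batch of (frame_t, action_t, frame_t+1) triples from robot trajectories.
frames_t = torch.randn(8, 64)
actions_t = torch.randn(8, act_dim)
frames_t1 = torch.randn(8, 64)

with torch.no_grad():                       # no gradients flow into the encoder
    z_t, z_t1 = encoder(frames_t), encoder(frames_t1)
pred_t1 = predictor(torch.cat([z_t, actions_t], dim=-1))
loss = nn.functional.l1_loss(pred_t1, z_t1) # regress the next-frame embedding
opt.zero_grad(); loss.backward(); opt.step()
print(f"stage-2 step loss: {loss.item():.4f}")
```

The appeal of this split is that the expensive part, learning visual dynamics from 1 million hours of video, happens once without any action labels, while only the lightweight action-conditioned predictor needs the 62 hours of robot data.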

Key Points

  1. V-JEPA 2 Core Features: Enables environment understanding, prediction, and zero-shot planning through video training, supporting robots in performing complex tasks (see the planning sketch after this list).
  2. Performance Breakthroughs: Demonstrates significant improvements over previous models in robot control (e.g., 100% on Reach, 45% on Grasp, plus pick-and-place tasks), action prediction (39.7% on Epic-Kitchens-100), and visual understanding (77.3% on Something-Something v2).
  3. Two-Stage Training: The first, action-free stage pre-trains on 1 million hours of video; the second, action-conditioned stage needs only 62 hours of robot data to optimize planning capability.
  4. Multimodal Integration: Pairs the model with a language model to improve video question-answering performance and supports cross-modal task execution (e.g., long-horizon tasks guided by visual sub-goals).
  5. Benchmark Progress: Introduced three new tests (IntPhys 2, MVPBench, CausalVQA), revealing that current models still lag behind humans in tasks like physical reasoning and causal relationship understanding.
  6. Future Directions: Plans to develop hierarchical JEPA models operating across temporal and spatial scales, and multimodal perception systems integrating vision, audio, and touch.
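
The zero-shot planning in point 1 can be pictured as model-predictive control in embedding space: encode the current observation and a goal image, imagine the outcome of many candidate action sequences with the action-conditioned predictor, and execute the first action of the sequence whose imagined outcome lands closest to the goal. The sketch below uses random shooting and toy linear modules as stand-ins; all names and interfaces are hypothetical, not the released V-JEPA 2 API.

```python
# Hypothetical sketch of goal-image planning with a learned world model.
# Toy modules and random-shooting search; not Meta's V-JEPA 2 code.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for the video encoder: observation -> embedding."""
    def __init__(self, obs_dim=64, emb_dim=32):
        super().__init__()
        self.net = nn.Linear(obs_dim, emb_dim)

    def forward(self, obs):
        return self.net(obs)

class ToyPredictor(nn.Module):
    """Stand-in for the action-conditioned predictor: (embedding, action) -> next embedding."""
    def __init__(self, emb_dim=32, act_dim=7):
        super().__init__()
        self.net = nn.Linear(emb_dim + act_dim, emb_dim)

    def forward(self, emb, action):
        return self.net(torch.cat([emb, action], dim=-1))

def plan_action(encoder, predictor, obs, goal_obs, horizon=5, n_samples=256, act_dim=7):
    """Sample action sequences, roll each out in embedding space,
    and return the first action of the lowest-cost sequence."""
    with torch.no_grad():
        z = encoder(obs).expand(n_samples, -1)            # current-state embedding
        z_goal = encoder(goal_obs).expand(n_samples, -1)  # goal-image embedding
        actions = torch.randn(n_samples, horizon, act_dim)
        for t in range(horizon):
            z = predictor(z, actions[:, t])               # imagined rollout, no real execution
        cost = (z - z_goal).abs().sum(dim=-1)             # distance to the goal in embedding space
        best = cost.argmin()
    return actions[best, 0]                               # execute only the first action

# Usage with toy observation and goal-image tensors.
enc, pred = ToyEncoder(), ToyPredictor()
action = plan_action(enc, pred, torch.randn(1, 64), torch.randn(1, 64))
print(action.shape)  # torch.Size([7])
```

Re-encoding the new observation and replanning after each executed action gives the receding-horizon loop that lets a robot pursue a goal image without task-specific training.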

Reference:

https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
https://ai.meta.com/vjepa/
https://arxiv.org/pdf/2506.09985

