I-JEPA: A Human-Like World Model by Yann LeCun
I-JEPA is an open-source computer vision model from Meta AI that aims to learn highly semantic image representations without relying on hand-crafted prior knowledge such as data augmentations. It uses a self-supervised learning approach that predicts missing information in an abstract representation space rather than in pixel space.
There are two common approaches to self-supervised learning: invariance-based and generative. I-JEPA's approach is more human-like: it predicts embeddings that capture the semantics of an image rather than reconstructing explicit pixels.
I-JEPA has three components: a context encoder, a target encoder, and a predictor. The input image is split into a sequence of non-overlapping patches. The target encoder produces representations for sampled target blocks, while the context encoder embeds a larger context block sampled from the same image, with the target regions masked out. The predictor then predicts the representations of the target blocks from the context representation.
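The data flow above can be sketched as follows. This is a minimal toy illustration, not Meta AI's implementation: the encoders and predictor are simple stand-ins (a random linear map and an averaging step) instead of the Vision Transformers used in the paper, and all names and dimensions are illustrative.

```python
import random

DIM = 4  # toy embedding dimension (illustrative)

def encode(patches, seed):
    """Stand-in encoder: maps each patch to a DIM-dim representation
    via a fixed random linear map (a placeholder for a ViT encoder)."""
    rng = random.Random(seed)
    n_in = len(patches[0])
    weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(n_in)]
    return [[sum(p[i] * weights[i][j] for i in range(n_in)) for j in range(DIM)]
            for p in patches]

# 1. Split the "image" into non-overlapping patches (toy 2-value patches).
image = [0.1, 0.4, 0.3, 0.9, 0.5, 0.2, 0.8, 0.7]
patches = [image[i:i + 2] for i in range(0, len(image), 2)]

# 2. The target encoder embeds the patches; target blocks are sampled subsets.
target_reprs = encode(patches, seed=0)
target_idx = [1, 3]  # indices of the masked target blocks

# 3. The context encoder sees only the context block (target regions masked out).
context_idx = [0, 2]
context_reprs = encode([patches[i] for i in context_idx], seed=1)

# 4. The predictor maps context representations to a prediction per target
#    block (here a trivial averaging stand-in for the transformer predictor).
def predict(context, n_targets):
    avg = [sum(v[j] for v in context) / len(context) for j in range(DIM)]
    return [list(avg) for _ in range(n_targets)]

predictions = predict(context_reprs, len(target_idx))
```

In the real model, both encoders share the ViT architecture and the target encoder's weights are an exponential moving average of the context encoder's; the sketch only mirrors the shape of the computation.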
The loss is the average L2 distance between the predicted and actual target representations. After training, the context encoder produces highly semantic representations for input images.
References:
https://arxiv.org/abs/2301.08243
https://www.youtube.com/watch?v=xL6Y0dpXEwc