DeepSeek: Engram architecture
Summary:
The DeepSeek team proposed a new architecture called Engram that aims to improve the efficiency and performance of large-scale models by decoupling local pattern reconstruction from global reasoning. Engram introduces a Conditional Memory Module that stores static knowledge in a parameterized memory table, avoiding the routing cost and training instability that traditional MoE models face at ultra-large scale. Experiments show that Engram outperforms traditional MoE models on long-context processing and complex retrieval tasks (such as the RULER benchmark), with clear advantages in inference efficiency and parameter scalability. The architecture supports efficient pre-training and deployment and also enables future knowledge updates and dynamic learning.
Key Points:
- Core Idea of Engram Architecture
- Decoupling local pattern reconstruction (static knowledge) from global reasoning tasks to reduce the computational burden on the Transformer backbone.
- Storing static knowledge in a parameterized memory table that supports direct O(1) retrieval (a minimal lookup sketch follows the Key Points list).
- Technological Innovations
- Memory Module: Transferring local dependency modeling tasks to a dedicated memory module, freeing the attention mechanism for global processing.
- Context-Aware Gating: Dynamically activating the memory module based on the input; the gate value rises sharply for proper nouns (e.g., “Alexander the Great”) and fixed titles (e.g., “Princess of Wales”). A gating sketch follows this list.
- Lightweight Optimizations: Improving efficiency and reducing validation loss through components such as tokenizer compression and branch-specific fusion.
- Performance Advantages
- Training Efficiency: Engram-27B matches MoE-27B performance with only 82% of the pre-training compute, and even surpasses it on the RULER benchmark.
- Inference Efficiency: Even with a 100-billion-parameter memory table offloaded to host memory, inference throughput on H800 hardware drops by less than 3%.
- Long-Context Processing: Engram outperforms traditional architectures on long-context tasks at 32,768 tokens.
- Experimental Results
- Model Comparison: Engram-27B significantly outperforms MoE-27B in multi-query tasks on the RULER benchmark.
- Knowledge Storage: Masking the Engram module drops performance on factual-knowledge tasks to 29-44%, demonstrating its critical role as a parameterized knowledge store.
- Scalability: Engram-40B further reduces pre-training loss, and the training-loss gap continues to widen, indicating the memory capacity has not yet reached its full potential.
- Engineering and Application Value
- Deployment Advantages: The memory module's deterministic access pattern supports efficient prefetching and hardware optimization, making it well suited to large-scale deployment (a prefetch sketch follows this list).
- Knowledge Updating: Factual errors can later be corrected by directly editing the memory table, without costly fine-tuning (a hypothetical patching sketch follows this list).
- Future Directions: Exploring online learning or dynamically updated memory modules so that models can acquire new knowledge in real time.
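The summary does not specify how the memory table is addressed; one common way to obtain O(1) retrieval is to key an embedding table by a hash of the local token n-gram. The PyTorch sketch below illustrates only that idea; the hash function, n-gram width, table size, and dimensions are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ConditionalMemory(nn.Module):
    """Minimal sketch of a parameterized memory table.

    Assumption: each position is keyed by a hash of its trailing `ngram`
    token ids, so retrieval is a single table lookup, O(1) per position.
    """

    def __init__(self, table_size: int = 2**16, dim: int = 256, ngram: int = 2):
        super().__init__()
        self.table = nn.Embedding(table_size, dim)  # the static-knowledge table
        self.table_size = table_size
        self.ngram = ngram

    def keys(self, token_ids: torch.LongTensor) -> torch.LongTensor:
        # Hash the trailing n-gram at every position into a table index.
        # (A toy rolling hash; the real keying scheme is not specified here.)
        key = token_ids.clone()
        for k in range(1, self.ngram):
            shifted = torch.roll(token_ids, shifts=k, dims=-1)
            shifted[..., :k] = 0  # positions without a full trailing n-gram
            key = key * 1000003 + shifted
        return key.abs() % self.table_size

    def forward(self, token_ids: torch.LongTensor) -> torch.Tensor:
        # One embedding lookup per position: no routing, no attention.
        return self.table(self.keys(token_ids))  # [batch, seq, dim]
```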
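Context-aware gating can likewise be pictured as a learned per-position gate that decides how much of the retrieved memory vector to inject into the hidden state, so it can open wide on proper nouns or fixed titles. A minimal sketch, assuming a scalar sigmoid gate and residual fusion (both assumptions, not confirmed details):

```python
import torch
import torch.nn as nn

class GatedMemoryFusion(nn.Module):
    """Sketch: fuse a retrieved memory vector into the hidden state via a gate.

    Assumption: a per-position scalar gate in (0, 1), computed from the
    current hidden state, scales the memory contribution.
    """

    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate_proj = nn.Linear(dim, 1)
        self.mem_proj = nn.Linear(dim, dim)

    def forward(self, hidden: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # The gate is expected to rise when the local context is highly
        # predictive (e.g. multi-token names such as "Alexander the Great").
        gate = torch.sigmoid(self.gate_proj(hidden))   # [batch, seq, 1]
        return hidden + gate * self.mem_proj(memory)   # residual fusion
```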
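The deployment advantage follows from the lookup keys depending only on token ids: they are known before the layer runs, so the needed rows of a host-resident table can be gathered and copied to the GPU ahead of time. A rough sketch of that deterministic prefetch, assuming the offloaded table is a plain CPU tensor and using an asynchronous host-to-device copy (not DeepSeek's deployment code):

```python
import torch

def prefetch_memory_rows(cpu_table: torch.Tensor,
                         keys: torch.LongTensor,
                         device: str = "cuda") -> torch.Tensor:
    """Gather the needed rows from a host-memory table and ship them to the GPU.

    cpu_table: [table_size, dim] tensor kept in host memory (the offloaded table).
    keys:      [batch, seq] indices, computable from token ids alone, so this
               can run ahead of the forward pass and overlap with compute.
    """
    rows = cpu_table[keys.cpu().flatten()]   # gather on the host
    rows = rows.pin_memory()                 # enable an async host-to-device copy
    return rows.to(device, non_blocking=True).view(*keys.shape, -1)
```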
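Knowledge updating then amounts to overwriting the addressed rows in place instead of fine-tuning. A hypothetical helper built on the ConditionalMemory sketch above; how rows are addressed and where the replacement vectors come from are both assumptions:

```python
import torch

@torch.no_grad()
def patch_memory(memory: "ConditionalMemory",
                 token_ids: torch.LongTensor,
                 new_vectors: torch.Tensor) -> None:
    """Hypothetical knowledge edit: replace the table rows for a phrase.

    The rows are located with the same keying used at inference time and
    overwritten in place, so no gradient-based fine-tuning is involved.
    """
    idx = memory.keys(token_ids).flatten()
    dim = memory.table.weight.shape[1]
    memory.table.weight[idx] = new_vectors.reshape(-1, dim)
```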
References/Technologies
- RULER Benchmark: Used to evaluate models’ ability to handle long-range dependencies in complex retrieval tasks.
- DeepSeek-V3 and YaRN: Used to extend the context window to 32,768 tokens, improving the model's understanding of long texts.
- Comparison with Traditional MoE: Engram addresses MoE’s high routing costs and training instability at ultra-large scales.
- Comparison with RAG (Retrieval-Augmented Generation): Engram internalizes knowledge into parameter tables, offering lower latency and stronger knowledge consistency compared to external database retrieval.
Note: This is an interpretation of the DeepSeek paper; it does not cite the original references directly but covers the paper's core innovations and experimental conclusions.
Reference:
https://github.com/deepseek-ai/Engram