OpenAI: Weight-sparse transformers have interpretable circuits
Summary
This paper proposes an interpretability-first training paradigm for large models, using weight-sparsity constraints to force the model to solve tasks with compact, modular circuits. Core contributions include:
- Sparse Model Design: By constraining weights and activations to be sparse, the model's internal computation becomes more transparent, with nodes corresponding to natural concepts and connections between them that are easy to follow.
- Bridging Method: Lets sparse models explain the internal mechanisms of existing dense models, easing the tension between interpretability and performance.
- Experimental Validation: Verified the interpretability of sparse circuits on tasks such as string closure, nested-depth judgment, and variable-type tracking, and confirmed their logical correctness with adversarial examples.
- Limitations and Future Directions: Although training is inefficient and polysemantic features are not fully resolved, the approach opens a new path for interpretability research on large models. Future work will explore universal circuit motifs in larger models and automated interpretability techniques.
Key Points
- Model Design
- Weight Sparsity Constraints: By limiting the number of non-zero parameters, the model is forced to solve tasks with modular circuits (a minimal sketch follows this list).
- Activation Sparsity: Nodes become more monosemantic, reducing concept overlap and improving interpretability.
- Bridging Method: Trains encoders/decoders between the dense and sparse models to map between their activation spaces, making the sparse model an “interpretable surrogate” for the dense model.
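As a reference point for the weight-sparsity constraint above, here is a minimal sketch of one way such a constraint could be enforced, assuming a magnitude-based top-k projection applied to each weight matrix after every optimizer step; the function name and the `keep_fraction` value are illustrative assumptions rather than details from the paper.

```python
import torch

def project_to_topk_sparsity(weight: torch.Tensor, keep_fraction: float) -> torch.Tensor:
    """Zero all but the largest-magnitude entries of a weight matrix.

    Keeping only a small fraction of parameters non-zero forces the model to
    route each task through a small, modular set of connections.
    """
    k = max(1, int(weight.numel() * keep_fraction))
    # The magnitude of the k-th largest entry becomes the pruning threshold.
    threshold = torch.topk(weight.abs().flatten(), k).values[-1]
    return weight * (weight.abs() >= threshold)

# Example: keep a toy linear layer at 5% non-zero weights after an update step.
layer = torch.nn.Linear(256, 256, bias=False)
with torch.no_grad():
    layer.weight.copy_(project_to_topk_sparsity(layer.weight, keep_fraction=0.05))
```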
- Technical Details
- Loss Function Design: Combines a normalized MSE with a bidirectional KL divergence to keep the sparse model's computation consistent with the dense model's (see the sketch after this list).
- Adversarial Example Validation: Checks that the circuit's logic is correct by perturbing activation values (e.g., the averaging step in the nested-depth computation).
- Mean Ablation Experiment: Fixes node activations to their mean over the pretraining distribution to test the necessity and sufficiency of the identified circuits.
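A minimal sketch of what a consistency loss of this shape could look like, assuming the dense model's activations have already been mapped into the sparse model's space by the bridge encoder, that the MSE is normalized by the target's variance (one common convention), and that the two output distributions are compared with KL divergence in both directions; the weighting `kl_weight` and all function names are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def normalized_mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # MSE divided by the target's variance, so activation scale does not
    # dominate the loss (a common normalization convention; an assumption here).
    return F.mse_loss(pred, target) / target.var().clamp_min(1e-8)

def bridge_loss(sparse_acts: torch.Tensor, dense_acts_mapped: torch.Tensor,
                sparse_logits: torch.Tensor, dense_logits: torch.Tensor,
                kl_weight: float = 1.0) -> torch.Tensor:
    """Consistency loss between a sparse model and a dense model.

    - The normalized MSE pulls the sparse model's activations toward the dense
      model's activations after they are mapped through the bridge encoder.
    - The bidirectional KL keeps the two models' next-token distributions close.
    """
    act_term = normalized_mse(sparse_acts, dense_acts_mapped)
    log_p_sparse = F.log_softmax(sparse_logits, dim=-1)
    log_p_dense = F.log_softmax(dense_logits, dim=-1)
    kl_term = (
        F.kl_div(log_p_sparse, log_p_dense, log_target=True, reduction="batchmean")
        + F.kl_div(log_p_dense, log_p_sparse, log_target=True, reduction="batchmean")
    )
    return act_term + kl_weight * kl_term
```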
- Experimental Analysis
- Task Cases:
- String Closure: Validated the sparse model's influence on the dense model by intervening on quote-type classifier channels.
- Nested-Depth Judgment: Constructed adversarial examples from average activation values to show that the circuit's logic matches its interpretation.
- Variable Type Tracking: Transfers type information and selects the correct operator through a two-hop composition of attention heads.
- Performance Validation: Task performance dropped sharply after ablating circuit nodes, confirming that they form the core computational path (see the sketch below).
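A minimal sketch of the mean-ablation procedure behind these checks: selected channels of a layer are clamped to their precomputed mean activation (estimated over a pretraining sample) and the task metric is re-measured. A performance drop when circuit nodes are clamped supports necessity; stable performance when everything outside the circuit is clamped supports sufficiency. The hook-based implementation and the names in the commented usage are illustrative assumptions, not the paper's code.

```python
import torch

def mean_ablate(module: torch.nn.Module, node_indices: list[int],
                mean_activation: torch.Tensor):
    """Register a forward hook that clamps selected output channels of a
    module to their precomputed per-channel mean activation."""
    def hook(_module, _inputs, output):
        output = output.clone()
        output[..., node_indices] = mean_activation[node_indices]
        return output

    return module.register_forward_hook(hook)

# Illustrative usage (hypothetical names): clamp channels 3 and 17 of one MLP
# layer, re-run the evaluation, then remove the hook.
# handle = mean_ablate(model.mlp_layers[4], [3, 17], means[4])
# accuracy_after = evaluate(model, task_dataset)
# handle.remove()
```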
- Bridging Method
- Role: Enables sparse models to explain dense models’ internal mechanisms, providing interpretability tools for existing large models.
- Experimental Results: Intervening on specific channels in the sparse model caused significant changes in the dense model's output probabilities, validating that the two models' circuits are consistent (see the sketch below).
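A minimal sketch of this style of intervention, assuming the bridge decoder maps edited sparse-model activations back into the dense model's activation space; the channel index, the decoder, and the commented downstream call are hypothetical names used only to illustrate the flow, not the paper's interface.

```python
import torch

@torch.no_grad()
def intervene_on_channel(sparse_acts: torch.Tensor, bridge_decoder: torch.nn.Module,
                         channel: int, new_value: float) -> torch.Tensor:
    """Overwrite one channel of the sparse model's activations and decode the
    result back into the dense model's activation space via the bridge."""
    edited = sparse_acts.clone()
    edited[..., channel] = new_value
    return bridge_decoder(edited)

# Illustrative flow (hypothetical names): flip the quote-type channel and
# compare the dense model's probability of closing the string with ' vs. "
# before and after the edit.
# patched_acts = intervene_on_channel(sparse_acts, bridge_decoder,
#                                     channel=QUOTE_TYPE_CHANNEL, new_value=1.0)
# logits_after = run_dense_model_from_layer(patched_acts, layer_index)
```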
- Limitations
- Low Training Efficiency: The non-zero parameters are unstructured, which is incompatible with GPU tensor-core optimizations, making training roughly 100-1000x slower than for comparable dense models.
- Polysemantic Features: On complex tasks, concepts may still overlap slightly across a few nodes, so full monosemanticity is not achieved.
- Non-Binary Activations: Interpreting a circuit still requires reasoning about activation magnitudes (e.g., the averages in the nested-depth computation), which adds to the interpretive burden.
- Future Directions
- Scaling Interpretable Models: Explore universal circuit motifs in GPT-3-scale sparse models to guide the design of large models.
- Task-Oriented Sparse Models: Train sparse bridge models for AI-safety-relevant behaviors (e.g., deception, refusal).
- Automated Interpretability: Exploit the simple computational language of sparse circuits to break through current bottlenecks in automated interpretation.
Reference Documents and Links
- OpenAI paper: https://cdn.openai.com/pdf/41df8f28-d4ef-43e9-aed2-823f9393e470/circuit-sparsity-paper.pdf