Kimi K2 is a large Mixture-of-Experts (MoE) model with 1 trillion total parameters, of which 32 billion are activated per token. It is trained with the MuonClip optimizer, whose QK-Clip mechanism addresses instabilities that arise during training. The pre-training phase uses synthetic data generation (rewriting) to raise token efficiency, particularly in the knowledge and mathematics domains, which significantly improves accuracy. With a hidden dimension of 7168 and an expert hidden dimension of 2048, the configuration follows sparsity scaling laws to set the number of experts and the attention-head layout, improving long-context efficiency. Training runs on NVIDIA H800 GPU clusters with flexible parallelism strategies. Post-training combines supervised fine-tuning (SFT) and reinforcement learning (RL) to strengthen tool use, with strong results in code execution and mathematical reasoning. Its benchmark performance surpasses many open-source and closed-source models, making it the top open-source model on LMSYS Arena. The Kimi K2-Base model also performs well on multilingual tasks and passes safety evaluations at a high rate. Remaining limitations include redundant outputs on complex reasoning tasks and incomplete tool calls, which call for further optimization.
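The report's headline training technique is MuonClip, which pairs the Muon optimizer with QK-Clip: roughly, whenever a head's maximum pre-softmax attention logit grows past a threshold, that head's query and key projection weights are rescaled so the logit is pulled back under the cap. The snippet below is a minimal sketch of that idea under those assumptions; the function name, threshold value, and per-head max-logit bookkeeping are illustrative, not the report's implementation.

```python
# Hedged sketch of the QK-Clip idea: after an optimizer step, rescale the Q/K
# projections of any head whose observed max attention logit exceeded `tau`.
# All names and the default threshold are illustrative assumptions.
import torch

@torch.no_grad()
def clip_qk_weights(w_q: torch.Tensor,       # [n_heads, head_dim, hidden]
                    w_k: torch.Tensor,       # [n_heads, head_dim, hidden]
                    max_logit: torch.Tensor, # [n_heads], max pre-softmax logit seen
                    tau: float = 100.0) -> None:
    """Rescale per-head Q/K projection weights in place when logits exceed tau."""
    # gamma < 1 only for heads whose max logit overshot the threshold
    gamma = torch.clamp(tau / max_logit, max=1.0)
    # Split the correction evenly: scaling both W_q and W_k by sqrt(gamma)
    # scales the bilinear form q^T k, and hence the logit, by gamma.
    scale = gamma.sqrt().view(-1, 1, 1)
    w_q.mul_(scale)
    w_k.mul_(scale)
```

Because the clip only fires on heads that overshoot the threshold, well-behaved heads are left untouched.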

Key points:

  1. Kimi K2 adopts an MoE architecture, dynamically activating 32 billion of its 1 trillion parameters per token, balancing performance and computational cost (see the routing sketch after this list).
  2. The MuonClip optimizer combined with QK-Clip technology resolves attention logits explosion issues during training.
  3. Pre-training uses synthetic data rewriting to increase data diversity; the rewritten data yields significantly higher accuracy.
  4. The parameter configuration follows sparsity scaling laws, tuning the number of experts and improving long-context processing efficiency.
  5. Training runs on NVIDIA H800 GPU clusters, using flexible parallelism strategies and memory optimization techniques.
  6. Post-training combines SFT and RL to strengthen tool usage capabilities, particularly excelling in code execution and mathematical reasoning.
  7. Performance surpasses most open-source models, with some metrics approaching closed-source models like Claude 4 Sonnet.
  8. Kimi K2-Base demonstrates strong performance in multilingual tasks with high safety evaluation pass rates.
  9. Limitations include redundant outputs in complex reasoning tasks and incomplete tool calls, requiring further optimization.
  10. Facing competition from models like Qwen3, Kimi K2 needs continuous iteration to maintain its advantage.
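For readers new to MoE, the sketch below illustrates the routing mechanism behind point 1: a small router scores all experts for each token, only the top-k experts actually execute, and their outputs are mixed with the renormalized router weights. The hidden and expert-hidden sizes reuse the 7168/2048 figures quoted above, while the expert count and top-k value are placeholders for illustration, not Kimi K2's published configuration.

```python
# Minimal top-k MoE layer: only `top_k` of `n_experts` expert MLPs run per token,
# so activated parameters grow with top_k rather than with the total expert count.
# Expert count and top_k below are illustrative, not Kimi K2's actual settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, hidden=7168, expert_hidden=2048, n_experts=64, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden, expert_hidden), nn.SiLU(),
                          nn.Linear(expert_hidden, hidden))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: [tokens, hidden]
        scores = self.router(x)                  # [tokens, n_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # at most top_k experts run per token
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e         # tokens routed to expert e in this slot
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

# Tiny smoke test with scaled-down dimensions.
moe = TopKMoE(hidden=64, expert_hidden=32, n_experts=8, top_k=2)
print(moe(torch.randn(5, 64)).shape)             # torch.Size([5, 64])
```

This is the sense in which a 1-trillion-parameter model can be served while touching only about 32 billion parameters per token: total size scales with the number of experts, while per-token compute scales with how many of them the router activates.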

Translation

Kimi K2 is a Mixture-of-Experts (MoE) model with 1 trillion total parameters and 32 billion activated parameters; through several novel techniques it performs strongly on many benchmarks, surpassing a number of open-source and closed-source models. Its core architecture is an MoE design that dynamically activates a subset of expert sub-networks, balancing scale against computational cost. Training uses the MuonClip optimizer, whose QK-Clip mechanism resolves attention-logit explosions and keeps training stable. In pre-training, synthetic data generation raises token efficiency, with rewriting applied especially to the knowledge and mathematics domains, significantly improving model accuracy. The hidden dimension is 7168 and the expert hidden dimension 2048; sparsity scaling laws guide the number of experts and the attention-head configuration, improving long-context efficiency. The training infrastructure is built on NVIDIA H800 GPU clusters with flexible parallelism strategies. Post-training combines supervised fine-tuning (SFT) and reinforcement learning (RL) to strengthen tool use. Performance is strong, with leading scores on SWE-bench, Tau2-Bench, and mathematics benchmarks, placing it at the top of the LMSYS Arena open-source leaderboard. The Kimi K2-Base model also performs well on multilingual tasks and passes safety evaluations at a high rate. Remaining weaknesses include redundant outputs on complex reasoning tasks and incomplete tool calls, which will need future optimization.

Key points:

  1. Kimi K2 adopts an MoE architecture, dynamically activating 32 billion parameters, balancing performance and computational cost.
  2. The MuonClip optimizer, combined with the QK-Clip mechanism, resolves attention-logit explosions during training.
  3. Pre-training uses synthetic data generation to increase data diversity; the rewritten data shows significantly higher accuracy.
  4. The parameter configuration follows sparsity scaling laws, tuning the number of experts and improving long-context processing efficiency.
  5. Training runs on NVIDIA H800 GPU clusters, using flexible parallelism strategies and memory optimization techniques.
  6. Post-training combines SFT and RL to strengthen tool use, with particularly strong results in code execution and mathematical reasoning.
  7. Performance surpasses most open-source models, with some metrics approaching closed-source models such as Claude 4 Sonnet.
  8. Kimi K2-Base performs strongly on multilingual tasks and passes safety evaluations at a high rate.
  9. Limitations include redundant outputs on complex reasoning tasks and incomplete tool calls, requiring further optimization.
  10. Facing competition from newer models such as Qwen3, Kimi K2 will need continuous iteration to maintain its edge.

Reference:

https://github.com/MoonshotAI/Kimi-K2/blob/main/tech_report.pdf

