Weak to strong generalization
This paper centers on the core idea of superalignment: train a human-level model, then have that model supervise a model stronger than humans. The authors invert the usual alignment setup, using weak supervision (standing in for human labels) to train a strong model. This raises several questions:
- At what point does the model surpass human level?
- Which comes first: a stronger aligned model, or a stronger base model that is then aligned? The two schemes differ substantially.
- How can we tell whether the reward model used in training is being over-optimized, and how should the effectiveness of reward modeling be evaluated?
In short, this paper proposes a new idea, but more research is needed to answer these questions.
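The weak-to-strong setup summarized above can be sketched on a toy task. This is a minimal illustration, not the paper's actual experiment: the paper fine-tunes large pretrained models on labels produced by smaller models, whereas here the "weak supervisor" is simply noisy labels (25% flipped), the "strong student" is a logistic regression trained on them, and the ceiling is the same student trained on ground truth. PGR (performance gap recovered) is the paper's metric; everything else (data, dimensions, learning rate) is made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary task: the true label is the sign of a fixed linear function.
d = 20
w_true = rng.normal(size=d)
X_train = rng.normal(size=(2000, d))
X_test = rng.normal(size=(1000, d))
y_train = (X_train @ w_true > 0).astype(int)
y_test = (X_test @ w_true > 0).astype(int)

# "Weak supervisor": correct labels flipped 25% of the time,
# so its own task accuracy is about 0.75.
noise_rate = 0.25
weak_labels = y_train ^ (rng.random(len(y_train)) < noise_rate)

def fit_logreg(X, y, steps=300, lr=0.5):
    """Gradient-descent logistic regression, standing in for a model."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def acc(w, X, y):
    return float((((X @ w) > 0).astype(int) == y).mean())

# Weak-to-strong: the student sees only the weak labels, yet averages
# out the label noise and ends up more accurate than its supervisor.
w_student = fit_logreg(X_train, weak_labels)
# Ceiling: the same student trained directly on ground truth.
w_ceiling = fit_logreg(X_train, y_train)

acc_weak = 1.0 - noise_rate          # expected supervisor accuracy
acc_student = acc(w_student, X_test, y_test)
acc_ceiling = acc(w_ceiling, X_test, y_test)

# Performance gap recovered (PGR): the fraction of the gap between the
# weak supervisor and the ground-truth ceiling that the student closes.
pgr = (acc_student - acc_weak) / (acc_ceiling - acc_weak)
print(f"weak={acc_weak:.2f} student={acc_student:.2f} "
      f"ceiling={acc_ceiling:.2f} PGR={pgr:.2f}")
```

In this toy case the student beats its supervisor only because the label noise is symmetric and averages out; in the paper, the analogous effect is attributed to the strong model's pretrained representations, and PGR is typically well below 1.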
Reference:
https://cdn.openai.com/papers/weak-to-strong-generalization.pdf