Weak to strong generalization
This paper centers on the core idea of superalignment: train a human-level model, then have that model supervise a model stronger than humans. The authors invert the usual alignment setup, using weak supervision (standing in for human labels) to train a strong model. This raises several questions:
- At what point does the model surpass human level?
- Which comes first: a stronger aligned model, or a stronger base model that is then aligned? The two schemes differ substantially.
- How can we tell whether the reward model used in training is being over-optimized, and how should the effectiveness of reward modeling be evaluated?
In short, this paper proposes a new idea, but more research is needed to answer these questions.
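The weak-to-strong setup summarized above can be sketched on a toy task. This is a minimal illustration, not the paper's actual experiment: the paper fine-tunes large pretrained models on labels produced by smaller models, whereas here the "weak supervisor" is simply noisy labels (25% flipped), the "strong student" is a logistic regression trained on them, and the ceiling is the same student trained on ground truth. PGR (performance gap recovered) is the paper's metric; everything else (data, dimensions, learning rate) is made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary task: the true label is the sign of a fixed linear function.
d = 20
w_true = rng.normal(size=d)
X_train = rng.normal(size=(2000, d))
X_test = rng.normal(size=(1000, d))
y_train = (X_train @ w_true > 0).astype(int)
y_test = (X_test @ w_true > 0).astype(int)

# "Weak supervisor": correct labels flipped 25% of the time,
# so its own task accuracy is about 0.75.
noise_rate = 0.25
weak_labels = y_train ^ (rng.random(len(y_train)) < noise_rate)

def fit_logreg(X, y, steps=300, lr=0.5):
    """Gradient-descent logistic regression, standing in for a model."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def acc(w, X, y):
    return float((((X @ w) > 0).astype(int) == y).mean())

# Weak-to-strong: the student sees only the weak labels, yet averages
# out the label noise and ends up more accurate than its supervisor.
w_student = fit_logreg(X_train, weak_labels)
# Ceiling: the same student trained directly on ground truth.
w_ceiling = fit_logreg(X_train, y_train)

acc_weak = 1.0 - noise_rate          # expected supervisor accuracy
acc_student = acc(w_student, X_test, y_test)
acc_ceiling = acc(w_ceiling, X_test, y_test)

# Performance gap recovered (PGR): the fraction of the gap between the
# weak supervisor and the ground-truth ceiling that the student closes.
pgr = (acc_student - acc_weak) / (acc_ceiling - acc_weak)
print(f"weak={acc_weak:.2f} student={acc_student:.2f} "
      f"ceiling={acc_ceiling:.2f} PGR={pgr:.2f}")
```

In this toy case the student beats its supervisor only because the label noise is symmetric and averages out; in the paper, the analogous effect is attributed to the strong model's pretrained representations, and PGR is typically well below 1.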
Reference:
https://cdn.openai.com/papers/weak-to-strong-generalization.pdf