AI: The Second Half (by Shunyu Yao @OpenAI)
Here is the translation of the contents into English:
-
Utility Problem: Despite AI having surpassed human capabilities, there hasn’t been a significant change in the world from an economic and GDP perspective. This is because the current assessment settings are out of sync with reality.
- Limitations of Assessment Settings:
- Assessments typically require automatic agents to complete tasks independently and receive corresponding rewards. However, in real life, agents often interact with humans throughout the task process.
- To resolve issues, back-and-forth communication is necessary, which is closer to real-world scenarios compared to traditional evaluation methods.
- Independence and Identical Distribution (i.i.d.) Assumption:
- Assessments often assume independent and identical distribution, meaning a test set of 500 tasks can be run independently, with metrics calculated for each task, and then averaged to obtain an overall indicator.
- However, in reality, solving tasks is not done in parallel but sequentially.
Yao Shunyu emphasizes that to overcome these challenges and problems, we need to adopt new development patterns, including:
- Developing novel assessment settings or tasks tailored to real-world practicality and using general methods to solve them.
- Or enhancing existing methods by adding novel components, then cycling through this process repeatedly.
This indicates that AI development has reached a critical juncture, shifting from training-focused to evaluation-centered. Reconsidering how we evaluate will be the key driver of AI’s continued growth and development.
Translation
姚顺雨关于AI发展“下半场”的观点主要涉及到几个方面:
-
效用问题:即使AI已经具备了超出人类的能力,但从经济和GDP的角度来看,世界并没有发生太大的变化。这是因为现有的评估设置与现实世界的实际情况存在差异。
- 评估设置的局限性:
- 评估通常要求自动运行一个Agent独立完成任务,然后获得相应的任务奖励。但是在现实生活中,Agent往往需要在整个任务过程中与人类进行互动。
- 要来回沟通几次,以解决问题,这是与传统的评估方式相比更贴近现实场景。
- 独立同分布(i i d)的假设:
- 评估往往要求在独立同分布的情况下进行,即假设有一个包含500个任务的测试集,独立运行每个任务,计算每个任务的指标,然后取平均值得到一个整体指标。
- 然而,在现实世界中,解决任务的方式并不是并行的,而是顺序进行的。
姚顺雨强调,为了解决这些问题和挑战,我们需要采用新的发展模式。这包括:
- 为现实世界的实用性去开发新颖的评估设置或任务,然后用通用的方法去解决这些任务。
- 或者通过添加新颖的组件来增强这些方法,然后再不断循环这个过程。
这表明,AI的发展正处于一个关键的转折点,从注重训练到重视评估,如何重新思考评估的方式,将是推动AI持续发展的关键所在。
Reference:
https://ysymyth.github.io/The-Second-Half/