AI: The Second Half (by Shunyu Yao @OpenAI)


Utility problem: Even though AI now exceeds human performance on many tasks, the world has changed little in economic and GDP terms. The reason is that current evaluation setups are out of step with the real world.
Limitations of evaluation setups:
- Evaluations typically require an agent to run automatically and complete a task on its own, then receive a task reward. In real life, however, an agent often has to interact with humans throughout the task.
- Solving a problem usually takes several rounds of back-and-forth communication, and an evaluation that includes this is closer to real-world use than the traditional setup.
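The difference between the two setups can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_autonomous`, `run_interactive`, `toy_agent`, and `toy_user` are hypothetical names invented here, and the rule "a reply ending in `?` is a clarifying question" is a simplification, not something taken from the original post.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (speaker, text)
Agent = Callable[[str, List[Message]], str]


def run_autonomous(agent: Agent, task: str) -> str:
    """Traditional setup: the agent gets the task once and must answer alone."""
    return agent(task, [])


def run_interactive(agent: Agent, user: Callable[[str], str],
                    task: str, max_turns: int = 3) -> str:
    """Interactive setup: the agent may ask a (simulated) user questions first."""
    history: List[Message] = []
    for _ in range(max_turns):
        reply = agent(task, history)
        if not reply.endswith("?"):  # assumption: "?" marks a clarifying question
            return reply
        history.append(("agent", reply))
        history.append(("user", user(reply)))
    return agent(task, history)


def toy_agent(task: str, history: List[Message]) -> str:
    """Hypothetical agent: asks for the date until the user has supplied one."""
    answers = [text for speaker, text in history if speaker == "user"]
    if not answers:
        return "Which date?"
    return "Booked for " + answers[-1]


def toy_user(question: str) -> str:
    """Hypothetical simulated user with a fixed answer."""
    return "Friday"


print(run_autonomous(toy_agent, "Book a table"))              # Which date?
print(run_interactive(toy_agent, toy_user, "Book a table"))   # Booked for Friday
```

In the autonomous run the toy agent never gets the information it needs, while the interactive run lets it finish the task after one exchange.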
The independent and identically distributed (i.i.d.) assumption:
- Evaluations usually assume i.i.d. tasks: given a test set of 500 tasks, each task is run independently, a metric is computed per task, and the metrics are averaged into one overall score.
- In the real world, however, tasks are solved sequentially rather than in parallel, so experience from earlier tasks can carry over to later ones.
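A minimal sketch of the contrast, assuming a toy scoring setup: `evaluate_iid`, `evaluate_sequential`, and both toy agents are hypothetical names, and the scores are arbitrary illustrative values, not measurements from any benchmark.

```python
from statistics import mean
from typing import Callable, List


def evaluate_iid(solve: Callable[[str], float], tasks: List[str]) -> float:
    """Benchmark style: score every task in isolation, then average."""
    return mean([solve(task) for task in tasks])


def evaluate_sequential(solve: Callable[[str, List[str]], float],
                        tasks: List[str]) -> float:
    """Sequential style: tasks arrive in order and memory is carried forward."""
    memory: List[str] = []
    scores: List[float] = []
    for task in tasks:
        scores.append(solve(task, memory))
        memory.append(task)  # the agent remembers what it has already seen
    return mean(scores)


def stateless_agent(task: str) -> float:
    """Hypothetical agent that starts from scratch on every task."""
    return 0.5


def learning_agent(task: str, memory: List[str]) -> float:
    """Hypothetical agent whose score improves with each task it remembers."""
    return min(1.0, 0.5 + 0.25 * len(memory))


tasks = ["task-1", "task-2", "task-3", "task-4"]
print(evaluate_iid(stateless_agent, tasks))        # 0.5
print(evaluate_sequential(learning_agent, tasks))  # 0.8125
```

An i.i.d. average has no way to reward the second agent for improving over time, because it never lets state flow from one task to the next.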

Shunyu Yao emphasizes that addressing these problems calls for a new development pattern:

Develop novel evaluation setups or tasks aimed at real-world utility, then solve them with general methods.
Or augment those methods with novel components, and keep repeating this loop.

This suggests that AI development has reached a turning point, with the focus shifting from training to evaluation. Rethinking how we evaluate will be key to AI’s continued progress.

Reference:

https://ysymyth.github.io/The-Second-Half/

Richard Sutton vs David Silver: Welcome to the Era of Experience

Aidan Gomez interview: Scaling Limits Emerging, AI Use-cases with PMF & Life After Transformers