Parables on the Power of Planning in AI: From Poker to Diplomacy (Noam Brown, OpenAI)

Human Notes

  • planning significantly improves machine learning system performance
  • planning in machine learning systems can be implemented via e.g. Monte Carlo search (see the rollout sketch after these notes)
  • planning is useful in domains with a large gap between the generator and the verifier
  • the generator-verifier gap refers to problems where it is much easier to verify a solution than to generate one; this asymmetry is exactly what search can exploit
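
As a rough illustration of the Monte Carlo search note above, here is a minimal rollout-based action selector. It is a sketch under assumed interfaces, not anything presented in the talk: legal_actions and simulate_playout(state, action), which returns a payoff, are hypothetical stand-ins.

```python
import random

def monte_carlo_choose(state, legal_actions, simulate_playout, n_rollouts=200):
    """Pick the action whose random rollouts score best on average."""
    def avg_value(action):
        # Average payoff over n_rollouts simulated playouts that start with `action`.
        return sum(simulate_playout(state, action) for _ in range(n_rollouts)) / n_rollouts
    return max(legal_actions, key=avg_value)

# Toy usage with a hypothetical noisy payoff: action 2 has the highest
# expected value, so it should almost always be selected.
if __name__ == "__main__":
    def noisy_payoff(state, action):
        return action + random.gauss(0, 1.0)
    print(monte_carlo_choose(None, [0, 1, 2], noisy_payoff))
```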

AI Summary

Scaling AI Through Search and Planning: Lessons from Game AI Research

Introduction

Noam Brown, a leading AI researcher at OpenAI, presents insights from his work on game-playing AI, particularly in poker and strategy games like Diplomacy. His talk explores the power of search and planning in AI, demonstrating that increasing inference compute, rather than just scaling model size, can lead to dramatic improvements in performance.

This blog post distills Brown's key findings and discusses how they apply beyond games, particularly in large language models (LLMs) and real-world AI systems.

AI in Poker: From Claudico to Pluribus

โณ Initial Challenges (2012-2015)

  • Early poker AI was based on precomputed strategies, without real-time adaptation.
  • The 2015 Brains vs. AI match saw the AI Claudico lose to human professionals, revealing flaws in its approach.

Breakthrough with Search and Planning (2017-2019)

  • Brown's team introduced real-time search and strategic planning, improving decision-making.
  • Their 2017 system, Libratus, became the first AI to defeat top human professionals in heads-up no-limit hold'em.
  • Pluribus (2019) achieved superhuman performance in six-player poker, running on just 28 CPU cores with a $150 training cost.

Key Lesson:

Strategic search massively outperforms naive scaling; a 100,000x larger model would be needed to match the gains achieved by introducing search.

AI in Diplomacy: Cicero and Language-Based Strategy

Diplomacy, unlike poker, requires natural language negotiation. Brown's team built Cicero, an AI that played at a top human level, using three components (a simplified skeleton of this loop follows the list):

  1. Dialogue-Conditional Action Model – predicting what each player is likely to do, conditioned on the negotiation dialogue as well as the board.
  2. Iterative Search – refining plans by simulating what other players might believe and do.
  3. Language Model + Planning – instead of instantly responding like ChatGPT, Cicero strategically planned each message (often taking 10+ seconds per response).
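
As referenced above, here is a purely illustrative skeleton of that plan-then-speak loop. The three callables (action_model, evaluate_plan, message_model) are hypothetical stand-ins, not Cicero's actual components or interfaces.

```python
def cicero_style_turn(dialogue, game_state, action_model, evaluate_plan, message_model,
                      n_refinements=3):
    """Illustrative plan-then-speak loop; all model callables are hypothetical."""
    # 1. Dialogue-conditional action model: predict what the other players are
    #    likely to do, given both the board and the negotiation so far.
    predicted_others = action_model(dialogue, game_state)

    # 2. Iterative search: repeatedly refine our own plan against those predictions.
    plan = None
    for _ in range(n_refinements):
        plan = evaluate_plan(game_state, predicted_others, plan)

    # 3. Generate a message consistent with the chosen plan, rather than
    #    replying instantly token by token.
    message = message_model(dialogue, plan)
    return plan, message
```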

Key Lesson:

๐Ÿค AI communication and planning can be tightly integrated to create agents that ๐Ÿค” reason, negotiate, and adapt dynamically.

โ“ Why Planning Works: The Generator-Verifier Gap

Brown introduces the Generator-Verifier Gap, where:

  • Some problems are much easier to verify than to generate solutions for.
  • Example: Chess → Recognizing a winning board state is easy; finding the path to it is hard.
  • AI should search over multiple candidate solutions and then verify which is best, instead of relying on a single direct output (a toy sketch appears after the examples below).

This principle applies to:

  • Math & Programming → Checking correctness is easier than solving from scratch.
  • Proof Generation → Verifying a proof is far easier than discovering it.
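
A toy sketch of the generate-then-verify pattern referenced above, using integer factoring purely as an example of a task where checking a candidate (one division) is far cheaper than producing an answer directly. Real systems would sample candidates from a trained model rather than uniformly at random.

```python
import random

def solve_by_search(n, n_samples=10_000):
    """Exploit a generator-verifier gap: propose many cheap candidates and
    keep the first one that passes the cheap verification check."""
    for _ in range(n_samples):
        p = random.randint(2, n - 1)   # naive "generator": a random candidate factor
        if n % p == 0:                 # cheap "verifier": a single divisibility check
            return p, n // p
    return None                        # no verified candidate within the search budget

print(solve_by_search(91))   # typically prints (7, 13) or (13, 7)
```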

Key Lesson:

Search-based AI leverages the generator-verifier gap to find high-quality solutions in complex domains.

Scaling Compute: Consensus, Best-of-N, and Process Reward Models

Brown discusses techniques for scaling inference compute, particularly in LLMs and mathematical reasoning:

๐Ÿค 1. Consensus Sampling

  • The AI generates multiple solutions and selects the most common answer (a minimal sketch follows this list).
  • Used in Google's Minerva, improving math accuracy from 33.6% to 50.3%.
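
A minimal sketch of consensus (majority-vote) sampling. The sample_answer(question) callable is an assumed stand-in for a stochastic model call, e.g. sampling with nonzero temperature; it is not Minerva's actual interface.

```python
from collections import Counter

def consensus_answer(question, sample_answer, n_samples=32):
    """Sample several independent answers and return the most common one,
    along with the fraction of samples that agreed with it."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n_samples
```

Consensus only helps when answers can be compared for exact agreement (e.g. a final numeric result), which is one reason it is a natural fit for math benchmarks.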

๐Ÿ† 2. Best-of-N Selection

  • The AI generates N solutions and picks the best one using a reward model (a minimal sketch follows this list).
  • Works well in structured domains like chess and Sudoku, but struggles when reward models are weak.
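
A minimal sketch of best-of-N selection; sample_answer and reward_model are assumed, illustrative callables rather than any specific system's API.

```python
def best_of_n(question, sample_answer, reward_model, n=16):
    """Generate N candidate answers and keep the one the reward model scores highest."""
    candidates = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=lambda answer: reward_model(question, answer))
```

As the bullet above notes, the weak link is the reward model: with a poor scorer, increasing N tends to surface answers that exploit the scorer rather than genuinely better ones.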

3. Process Reward Models (Step-by-Step Verification)

  • Instead of verifying only the final answer, the AI evaluates each intermediate reasoning step (a minimal sketch follows this list).
  • Boosted math accuracy to 78.2%, surpassing all previous techniques.
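
A minimal sketch of ranking candidates with a process (step-level) reward model. The step_scorer(question, steps_so_far) callable, assumed to return the probability that the latest step is correct, is a hypothetical stand-in, and aggregating step scores as a product (a sum of log-probabilities) is one common choice rather than the exact method from the talk.

```python
import math

def process_reward_score(question, solution_steps, step_scorer):
    """Score a step-by-step solution by its intermediate steps, not just its final answer."""
    log_score = 0.0
    for i in range(1, len(solution_steps) + 1):
        p = step_scorer(question, solution_steps[:i])   # prob. the i-th step is correct
        log_score += math.log(max(p, 1e-9))             # clamp to avoid log(0)
    return log_score

def best_by_process_reward(question, candidate_solutions, step_scorer):
    """Best-of-N, but ranked by the process (step-by-step) score."""
    return max(candidate_solutions,
               key=lambda steps: process_reward_score(question, steps, step_scorer))
```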

Key Lesson:

Future AI will scale inference compute dynamically, balancing speed and correctness for different tasks.

Future of AI: Rethinking Inference vs. Training Costs

Brown argues that:

  1. Current AI is biased toward low-cost inference (ChatGPT generates instant responses).
  2. Some tasks justify high inference compute (e.g., theorem proving, drug discovery).
  3. General methods for scaling inference are still an open research challenge.

Practical Implications

  • LLMs should incorporate search and planning (beyond simple token prediction).
  • Academia should explore high-compute inference strategies, as industry favors low-cost, high-speed AI.

Conclusion: The Future is Search + Learning

Brown concludes with a reminder that AI progress has always been driven by two scalable methods:

  • Deep Learning (learning from data)
  • Search & Planning (efficient inference computation)

While deep learning has seen explosive growth, search-based methods remain underutilized. Future AI will likely combine both approaches to unlock new capabilities in reasoning, problem-solving, and decision-making.

Final Takeaways

✅ Scaling inference compute is as important as scaling training compute.
✅ Search & planning improve AI performance across many domains.
✅ Future AI should balance speed vs. accuracy dynamically.