Parables on the Power of Planning in AI: From Poker to Diplomacy (Noam Brown, OpenAI)
Human Notes
- planning significantly improves the performance of machine learning systems
- planning can be implemented in machine learning systems via, e.g., Monte Carlo search (a minimal sketch follows these notes)
- planning is most useful in domains where the gap between the generator and the verifier is large
- the generator-verifier gap refers to problems where it is much easier to verify a solution than it is to generate one; this type of problem presents unique challenges
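To make the Monte Carlo idea concrete, here is a minimal planning sketch in Python. The toy game, and names such as `GOAL`, `rollout`, and `plan`, are invented for illustration and are not from the talk: each candidate action is scored by averaging the outcomes of random rollouts, and the agent then acts greedily on those estimates.

```python
import random

# Toy game (hypothetical): the state is an integer, each action shifts it by
# -1 or +1, and success means reaching GOAL within HORIZON further steps.
GOAL, HORIZON, N_ROLLOUTS = 5, 10, 200
ACTIONS = (-1, +1)

def rollout(state: int) -> float:
    """Play random moves from `state`; return 1.0 if the goal is ever reached."""
    for _ in range(HORIZON):
        if state == GOAL:
            return 1.0
        state += random.choice(ACTIONS)
    return 1.0 if state == GOAL else 0.0

def plan(state: int) -> int:
    """Monte Carlo planning: estimate each action's value by averaging random
    rollouts from the state it leads to, then pick the best-looking action."""
    def value(action: int) -> float:
        return sum(rollout(state + action) for _ in range(N_ROLLOUTS)) / N_ROLLOUTS
    return max(ACTIONS, key=value)

if __name__ == "__main__":
    print(plan(0))  # usually +1: rollouts that start closer to GOAL succeed more often
```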
AI Summary
Scaling AI Through Search and Planning: Lessons from Game AI Research
Introduction
Noam Brown, a leading AI researcher at OpenAI, presents insights from his work on game-playing AI, particularly in poker and strategy games like Diplomacy. His talk explores the power of search and planning in AI, demonstrating that increasing inference compute, rather than just scaling model size, can lead to dramatic improvements in performance.
This blog post distills Brown's key findings and discusses how they apply beyond games, particularly in large language models (LLMs) and real-world AI systems.
AI in Poker: From Claudico to Pluribus
Initial Challenges (2012-2015)
- Early poker AI relied on precomputed strategies, with no real-time adaptation.
- In the 2015 Brains vs. AI match, the AI Claudico lost to human professionals, revealing flaws in this approach.
Breakthrough with Search and Planning (2017-2019)
- Brown's team introduced real-time search and strategic planning, improving decision-making.
- Their 2017 system, Libratus, became the first AI to defeat top human professionals at heads-up no-limit poker.
- Pluribus (2019) achieved superhuman performance in 6-player poker, running on just 28 CPUs with about $150 in training cost.
Key Lesson:
Strategic search massively outperforms naive scaling: a model roughly 100,000x larger would have been needed to match the gains achieved by adding search.
Further Reading:
- Libratus: The AI that Beat Humans in Heads-Up No-Limit Poker (CMU)
- Pluribus: Superhuman AI for Multiplayer Poker (Science)
AI in Diplomacy: Cicero and Language-Based Strategy
Diplomacy, unlike poker, requires natural language negotiation. Brown's team built Cicero, an AI that played at a top human level, using:
- Dialogue-Conditional Action Models: predicting not only moves but also negotiation strategies.
- Iterative Search: refining plans by simulating what other players might believe and do.
- Language Model + Planning: instead of responding instantly like ChatGPT, Cicero strategically planned each message (often taking 10+ seconds per response).
Key Lesson:
AI communication and planning can be tightly integrated to create agents that reason, negotiate, and adapt dynamically.
Why Planning Works: The Generator-Verifier Gap
Brown introduces the generator-verifier gap, where:
- Some problems are much easier to verify than to generate solutions for.
- Example: in chess, recognizing a winning board state is easy; finding the path to it is hard.
- AI should therefore search over multiple candidate solutions and verify which is best, instead of relying on a single direct output.
This principle applies to:
- Math & Programming: checking correctness is easier than solving from scratch.
- Proof Generation: verifying a proof is far easier than discovering it.
Key Lesson:
Search-based AI exploits the generator-verifier gap to find strong solutions in complex domains, as the toy sketch below illustrates.
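A minimal generate-then-verify sketch in Python (the integer-square-root task and the `verify`/`generate` names are invented here for illustration, not taken from the talk): verification is a single cheap check, while generation, done naively, must search over many candidates and call the verifier on each one.

```python
def verify(candidate: int, target: int) -> bool:
    # Verification is cheap: one multiplication and a comparison.
    return candidate * candidate == target

def generate(target: int) -> int | None:
    # Generation is expensive: lacking a better idea, brute-force search over
    # candidates, relying on the cheap verifier to recognize the answer.
    for candidate in range(target + 1):
        if verify(candidate, target):
            return candidate
    return None

print(generate(15_129))  # -> 123
```

The same asymmetry motivates spending extra inference compute in domains like math and proof generation: proposing many candidates is affordable precisely because checking each one is cheap.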
Scaling Compute: Consensus, Best-of-N, and Process Reward Models
Brown discusses techniques for scaling inference compute, particularly in LLMs and mathematical reasoning:
1. Consensus Sampling
- The model generates multiple solutions and selects the most common answer.
- Used in Google's Minerva, improving math accuracy from 33.6% to 50.3%.
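A toy sketch of consensus sampling (majority voting over sampled answers); the `consensus_answer` helper and the example answers are hypothetical, not taken from Minerva:

```python
from collections import Counter

def consensus_answer(sampled_answers: list[str]) -> str:
    """Majority vote: return the answer that appears most often across
    independently sampled solutions."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Each string stands in for the final answer parsed from one sampled solution.
print(consensus_answer(["42", "41", "42", "42", "7"]))  # -> "42"
```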
2. Best-of-N Selection
- The model generates N solutions and a reward model picks the best one.
- Works well in structured domains like chess and Sudoku, but struggles when the reward model is weak.
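A minimal best-of-N sketch; the `best_of_n` helper and the toy reward function are invented for illustration (a real system would use a learned reward model):

```python
from typing import Callable

def best_of_n(candidates: list[str], reward_model: Callable[[str], float]) -> str:
    """Generate-then-rank: keep the candidate the reward model scores highest."""
    return max(candidates, key=reward_model)

# Stand-in reward model that simply prefers longer answers; a weak or
# exploitable scorer like this is exactly why best-of-N can struggle.
toy_reward = lambda answer: float(len(answer))
print(best_of_n(["x = 4", "the answer is x = 4 because 2x = 8"], toy_reward))
```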
3. Process Reward Models (Step-by-Step Verification)
- Instead of verifying only the final answer, a verifier evaluates each intermediate step.
- Boosted math accuracy to 78.2%, surpassing the previous techniques.
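A sketch of how per-step scores from a process reward model might be combined to rank candidate solutions; the aggregation rule and the numbers below are illustrative assumptions, not details from the talk:

```python
import math

def solution_score(step_scores: list[float]) -> float:
    """Aggregate per-step scores from a process reward model into one score.
    The product of step probabilities is used here; the minimum step score is
    another common choice."""
    return math.prod(step_scores)

# Two candidate derivations with hypothetical per-step scores. The second wins
# because every step looks sound, even though the first starts more confidently.
candidates = [[0.95, 0.90, 0.20], [0.85, 0.80, 0.90]]
best = max(range(len(candidates)), key=lambda i: solution_score(candidates[i]))
print(best)  # -> 1
```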
Key Lesson:
Future AI will scale inference compute dynamically, balancing speed and correctness for different tasks.
Further Reading:
- Minerva: Solving Math with LLMs (Google Research)
- Scaling Laws for Reward Models (DeepMind)
Future of AI: Rethinking Inference vs. Training Costs
Brown argues that:
- Current AI is biased toward low-cost inference (ChatGPT generates near-instant responses).
- Some tasks justify high inference compute (e.g., theorem proving, drug discovery).
- General methods for scaling inference compute remain an open research challenge.
Practical Implications
- LLMs should incorporate search and planning, beyond simple next-token prediction.
- Academia should explore high-compute inference strategies, since industry favors low-cost, high-speed AI.
Conclusion: The Future is Search + Learning
Brown concludes with a reminder that AI progress has always been driven by two scalable methods:
- Deep Learning (learning from data)
- Search & Planning (efficient inference computation)
While deep learning has seen explosive growth, search-based methods remain underutilized. Future AI will likely combine both approaches to unlock new capabilities in reasoning, problem-solving, and decision-making.
Final Takeaways
- Scaling inference compute is as important as scaling training compute.
- Search and planning improve AI performance across many domains.
- Future AI should balance speed vs. accuracy dynamically.