Parables on the Power of Planning in AI: From Poker to Diplomacy (Noam Brown, OpenAI)
Human Notes
- planning significantly improves the performance of machine learning systems
- planning can be implemented in machine learning systems via, e.g., Monte Carlo search (a minimal sketch follows these notes)
- planning is most useful in domains where the gap between the generator and the verifier is large
- the generator-verifier gap refers to problems where it is much easier to verify a solution than it is to generate one; this type of problem presents unique challenges
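To make the Monte Carlo idea concrete, here is a minimal planning sketch in Python. The toy game, and names such as `GOAL`, `rollout`, and `plan`, are invented for illustration and are not from the talk: each candidate action is scored by averaging the outcomes of random rollouts, and the agent then acts greedily on those estimates.

```python
import random

# Toy game (hypothetical): the state is an integer, each action shifts it by
# -1 or +1, and success means reaching GOAL within HORIZON further steps.
GOAL, HORIZON, N_ROLLOUTS = 5, 10, 200
ACTIONS = (-1, +1)

def rollout(state: int) -> float:
    """Play random moves from `state`; return 1.0 if the goal is ever reached."""
    for _ in range(HORIZON):
        if state == GOAL:
            return 1.0
        state += random.choice(ACTIONS)
    return 1.0 if state == GOAL else 0.0

def plan(state: int) -> int:
    """Monte Carlo planning: estimate each action's value by averaging random
    rollouts from the state it leads to, then pick the best-looking action."""
    def value(action: int) -> float:
        return sum(rollout(state + action) for _ in range(N_ROLLOUTS)) / N_ROLLOUTS
    return max(ACTIONS, key=value)

if __name__ == "__main__":
    print(plan(0))  # usually +1: rollouts that start closer to GOAL succeed more often
```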
AI Summary
Scaling AI Through Search and Planning: Lessons from Game AI Research
Introduction
Noam Brown, a leading AI researcher at OpenAI, presents insights from his work on game-playing AI, particularly in poker and strategy games like Diplomacy. His talk explores the power of search and planning in AI, demonstrating that increasing inference compute, rather than just scaling model size, can lead to dramatic improvements in performance.
This blog post distills Brown's key findings and discusses how they apply beyond games, particularly in large language models (LLMs) and real-world AI systems.
AI in Poker: From Claudico to Pluribus
Initial Challenges (2012-2015)
- Early poker AI relied on precomputed strategies, with no real-time adaptation.
- In the 2015 Brains vs. AI match, the AI Claudico lost to human professionals, revealing flaws in this approach.
Breakthrough with Search and Planning (2017-2019)
- Brown's team introduced real-time search and strategic planning, improving decision-making.
- Their 2017 system, Libratus, became the first AI to defeat top human professionals at heads-up no-limit poker.
- Pluribus (2019) achieved superhuman performance in 6-player poker, running on just 28 CPUs with about $150 in training cost.
Key Lesson:
Strategic search massively outperforms naive scaling: a model roughly 100,000x larger would have been needed to match the gains achieved by adding search.
Further Reading:
- Libratus: The AI that Beat Humans in Heads-Up No-Limit Poker (CMU)
- Pluribus: Superhuman AI for Multiplayer Poker (Science)
AI in Diplomacy: Cicero and Language-Based Strategy
Diplomacy, unlike poker, requires natural language negotiation. Brown's team built Cicero, an AI that played at a top human level, using:
- Dialogue-Conditional Action Models: predicting not only moves but also negotiation strategies.
- Iterative Search: refining plans by simulating what other players might believe and do.
- Language Model + Planning: instead of responding instantly like ChatGPT, Cicero strategically planned each message (often taking 10+ seconds per response).
Key Lesson:
AI communication and planning can be tightly integrated to create agents that reason, negotiate, and adapt dynamically.
Why Planning Works: The Generator-Verifier Gap
Brown introduces the generator-verifier gap, where:
- Some problems are much easier to verify than to generate solutions for.
- Example: in chess, recognizing a winning board state is easy; finding the path to it is hard.
- AI should therefore search over multiple candidate solutions and verify which is best, instead of relying on a single direct output.
This principle applies to:
- Math & Programming: checking correctness is easier than solving from scratch.
- Proof Generation: verifying a proof is far easier than discovering it.
Key Lesson:
Search-based AI exploits the generator-verifier gap to find strong solutions in complex domains, as the toy sketch below illustrates.
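A minimal generate-then-verify sketch in Python (the integer-square-root task and the `verify`/`generate` names are invented here for illustration, not taken from the talk): verification is a single cheap check, while generation, done naively, must search over many candidates and call the verifier on each one.

```python
def verify(candidate: int, target: int) -> bool:
    # Verification is cheap: one multiplication and a comparison.
    return candidate * candidate == target

def generate(target: int) -> int | None:
    # Generation is expensive: lacking a better idea, brute-force search over
    # candidates, relying on the cheap verifier to recognize the answer.
    for candidate in range(target + 1):
        if verify(candidate, target):
            return candidate
    return None

print(generate(15_129))  # -> 123
```

The same asymmetry motivates spending extra inference compute in domains like math and proof generation: proposing many candidates is affordable precisely because checking each one is cheap.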
Scaling Compute: Consensus, Best-of-N, and Process Reward Models
Brown discusses techniques for scaling inference compute, particularly in LLMs and mathematical reasoning:
1. Consensus Sampling
- The model generates multiple solutions and selects the most common answer.
- Used in Google's Minerva, improving math accuracy from 33.6% to 50.3%.
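A toy sketch of consensus sampling (majority voting over sampled answers); the `consensus_answer` helper and the example answers are hypothetical, not taken from Minerva:

```python
from collections import Counter

def consensus_answer(sampled_answers: list[str]) -> str:
    """Majority vote: return the answer that appears most often across
    independently sampled solutions."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Each string stands in for the final answer parsed from one sampled solution.
print(consensus_answer(["42", "41", "42", "42", "7"]))  # -> "42"
```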
2. Best-of-N Selection
- The model generates N solutions and a reward model picks the best one.
- Works well in structured domains like chess and Sudoku, but struggles when the reward model is weak.
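A minimal best-of-N sketch; the `best_of_n` helper and the toy reward function are invented for illustration (a real system would use a learned reward model):

```python
from typing import Callable

def best_of_n(candidates: list[str], reward_model: Callable[[str], float]) -> str:
    """Generate-then-rank: keep the candidate the reward model scores highest."""
    return max(candidates, key=reward_model)

# Stand-in reward model that simply prefers longer answers; a weak or
# exploitable scorer like this is exactly why best-of-N can struggle.
toy_reward = lambda answer: float(len(answer))
print(best_of_n(["x = 4", "the answer is x = 4 because 2x = 8"], toy_reward))
```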
3. Process Reward Models (Step-by-Step Verification)
- Instead of verifying only the final answer, a verifier evaluates each intermediate step.
- Boosted math accuracy to 78.2%, surpassing the previous techniques.
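A sketch of how per-step scores from a process reward model might be combined to rank candidate solutions; the aggregation rule and the numbers below are illustrative assumptions, not details from the talk:

```python
import math

def solution_score(step_scores: list[float]) -> float:
    """Aggregate per-step scores from a process reward model into one score.
    The product of step probabilities is used here; the minimum step score is
    another common choice."""
    return math.prod(step_scores)

# Two candidate derivations with hypothetical per-step scores. The second wins
# because every step looks sound, even though the first starts more confidently.
candidates = [[0.95, 0.90, 0.20], [0.85, 0.80, 0.90]]
best = max(range(len(candidates)), key=lambda i: solution_score(candidates[i]))
print(best)  # -> 1
```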
Key Lesson:
Future AI will scale inference compute dynamically, balancing speed and correctness for different tasks.
Further Reading:
- Minerva: Solving Math with LLMs (Google Research)
- Scaling Laws for Reward Models (DeepMind)
Future of AI: Rethinking Inference vs. Training Costs
Brown argues that:
- Current AI is biased toward low-cost inference (ChatGPT generates near-instant responses).
- Some tasks justify high inference compute (e.g., theorem proving, drug discovery).
- General methods for scaling inference compute remain an open research challenge.
Practical Implications
- LLMs should incorporate search and planning, beyond simple next-token prediction.
- Academia should explore high-compute inference strategies, since industry favors low-cost, high-speed AI.
Conclusion: The Future is Search + Learning
Brown concludes with a reminder that AI progress has always been driven by two scalable methods:
- Deep Learning (learning from data)
- Search & Planning (efficient inference computation)
While deep learning has seen explosive growth, search-based methods remain underutilized. Future AI will likely combine both approaches to unlock new capabilities in reasoning, problem-solving, and decision-making.
Final Takeaways
- Scaling inference compute is as important as scaling training compute.
- Search and planning improve AI performance across many domains.
- Future AI should balance speed vs. accuracy dynamically.