🔗🧠 Internal Linking: Teaching a Knowledge Base to Weave Its Own Web
🎯 The Problem
🤔 Imagine you have a personal knowledge base with thousands of notes — book reports, topic summaries, software docs, people profiles — and your daily reflections reference these concepts constantly.
📝 Sometimes you mention “Thinking, Fast and Slow” in a book review, but forget to link it to your actual book report page.
🕸️ These missed connections mean readers (and your future self) lose the opportunity to discover related content through the natural web of your writing.
🏗️ The Architecture
🎨 We designed a hybrid approach combining deterministic text matching with AI validation:
| 🔧 Phase | 📋 Description | 🎯 Goal |
|---|---|---|
| 📇 Index Building | 🔍 Scan all content directories for pages with titles | 📊 Build the knowledge graph vocabulary |
| 🌊 BFS Traversal | 🚶 Walk the link graph from the most recent reflection | 🗺️ Prioritize recently-relevant content |
| 🎭 Masking | 🛡️ Replace protected regions (frontmatter, code, existing links, headings) with spaces | 🚫 Prevent false matches in structured content |
| 🔎 Candidate Discovery | ✨ Word-boundary regex matching of plain titles against masked content | 📋 Find potential link insertion points |
| 🤖 AI Validation | 🧪 Gemini reviews each candidate in context | ✅ Ensure every link is correct |
| 🔗 Replacement | 📝 Insert wikilinks with emoji-rich aliases | 🎀 Maintain the knowledge base aesthetic |
🧩 Key Design Decisions
🛡️ Correctness Over Coverage
🏆 The most important principle: it’s better to miss link opportunities than to insert broken or nonsensical links.
🔒 Every safeguard serves this principle:
- 📏 Minimum title length of 8 characters filters out short false positives
- 🔤 Word boundary matching prevents partial title matches
- 🎭 Protected region masking prevents matches inside existing links, code blocks, headings, and frontmatter
- 📖 Only first match per target per file (conservative linking)
- 🤖 Gemini validates every candidate before insertion
- ❌ If Gemini fails or errors, the entire file is skipped (no unvalidated changes)
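🧪 The deterministic safeguards above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `Candidate` and `find_candidates` are hypothetical, and case-insensitive matching and longest-title-first ordering are assumptions.

```python
import re
from typing import NamedTuple

MIN_TITLE_LENGTH = 8  # the minimum title length named in the list above


class Candidate(NamedTuple):
    title: str
    start: int
    end: int


def find_candidates(masked: str, titles: list[str]) -> list[Candidate]:
    """Return at most one non-overlapping match per title.

    Assumption: longer titles are tried first, so a longer title wins
    over a shorter title contained in it.
    """
    found: list[Candidate] = []
    for title in sorted(titles, key=len, reverse=True):
        if len(title) < MIN_TITLE_LENGTH:
            continue  # short titles produce too many false positives
        # Lookarounds act as word boundaries even when a title starts or
        # ends with punctuation, where a plain \b would behave differently.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(title) + r"(?!\w)", re.IGNORECASE
        )
        for m in pattern.finditer(masked):
            overlaps = any(m.start() < c.end and c.start < m.end() for c in found)
            if not overlaps:
                found.append(Candidate(title, m.start(), m.end()))
                break  # conservative linking: first usable match only
    return sorted(found, key=lambda c: c.start)
```

Each candidate would still go to the validator before anything is written, so this step only proposes and never decides.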
🌊 BFS Strategy
🧭 Starting from the most recent daily reflection and following links breadth-first means:
- 📅 The freshest content gets processed first
- 🔗 Content reachable from reflections (the core of the knowledge graph) is prioritized
- 🌐 Because the reflections form a doubly-linked list, the traversal can eventually reach the entire connected component
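🧪 A breadth-first walk of this kind might look like the sketch below. It assumes the link graph is already available as a dictionary from page path to outgoing link targets; the function name `bfs_order` and the `max_files` cap on visited pages are illustrative.

```python
from collections import deque
from typing import Optional


def bfs_order(
    start: str,
    links: dict[str, list[str]],
    max_files: Optional[int] = None,
) -> list[str]:
    """Return pages in breadth-first order, starting from `start`."""
    seen = {start}
    queue = deque([start])
    order: list[str] = []
    while queue and (max_files is None or len(order) < max_files):
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # visit each page once, even in cycles
                seen.add(target)
                queue.append(target)
    return order
```

Starting at the newest reflection, pages one link away are processed before pages two links away, which is what puts the freshest content first.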
🤖 AI as Validator, Not Discoverer
🧪 Gemini doesn’t search for links — it only validates candidates found deterministically.
💡 This gives us:
- 🎯 Precise control over what gets proposed (no hallucinated links)
- ⚡ Efficient API usage (small prompts with just the candidates + context)
- 🔄 Reproducible behavior (deterministic discovery, AI only says yes/no)
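🧪 The yes/no gate could be sketched like this. The validator is passed in as a plain callable, so no particular Gemini client API is assumed; `approved_candidates` and the `Validator` type are hypothetical names. Returning `None` stands for "skip this file entirely".

```python
from typing import Callable, Optional

# Hypothetical validator shape: given the file content and the proposed
# candidates, return one yes/no verdict per candidate.
Validator = Callable[[str, list[str]], list[bool]]


def approved_candidates(
    content: str, candidates: list[str], validate: Validator
) -> Optional[list[str]]:
    """Return the approved candidates, or None if the file must be skipped."""
    try:
        verdicts = validate(content, candidates)
    except Exception:
        return None  # validator failed: no unvalidated changes
    if len(verdicts) != len(candidates):
        return None  # malformed response is treated as a failure
    return [c for c, ok in zip(candidates, verdicts) if ok]
```

Because the model only answers yes or no on a fixed list, it cannot introduce a link that the deterministic step did not propose.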
🔍 Single-Word Title Safety
⚠️ Without AI validation, single-word titles like “Engineering” or “Philosophy” match too broadly.
🛡️ Solution: when running without Gemini, only multi-word titles are eligible for matching.
📊 With Gemini, even single-word matches are considered — the AI validates whether “engineering” in “biological engineering” actually refers to the Engineering topic page (it doesn’t).
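🧪 As a small sketch of this rule (the name `eligible_titles` and the `ai_available` flag are illustrative, and "multi-word" is assumed to mean more than one whitespace-separated token):

```python
def eligible_titles(titles: list[str], ai_available: bool) -> list[str]:
    """Filter titles according to whether AI validation is available."""
    if ai_available:
        # The validator decides in context, so single words are allowed.
        return list(titles)
    # Without validation, keep only titles with at least two words.
    return [t for t in titles if len(t.split()) > 1]
```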
🔧 The Implementation
📇 Content Index
🏗️ The index scans 10 content directories (books, articles, topics, software, people, products, games, videos, presentations, tools) and extracts:
- 📁 Relative file path
- 🏷️ Full emoji title from frontmatter
- 📝 Plain title with emojis stripped
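🧪 One way to model an index entry is sketched below. The emoji stripping is a rough heuristic based on Unicode categories, not necessarily what the real implementation does, and `IndexEntry`, `strip_emojis`, and `make_entry` are hypothetical names. It also removes non-emoji symbols such as `©` or `^`, which a production version would want to handle more carefully.

```python
import unicodedata
from dataclasses import dataclass

# Assumption: emoji code points are mostly category "So"; skin-tone
# modifiers are "Sk"; the zero-width joiner is "Cf".
_EMOJI_CATEGORIES = {"So", "Sk", "Cf"}
_VARIATION_SELECTOR_16 = "\ufe0f"  # requests emoji presentation


def strip_emojis(title: str) -> str:
    """Drop emoji-like characters and collapse leftover whitespace."""
    kept = "".join(
        ch
        for ch in title
        if unicodedata.category(ch) not in _EMOJI_CATEGORIES
        and ch != _VARIATION_SELECTOR_16
    )
    return " ".join(kept.split())


@dataclass(frozen=True)
class IndexEntry:
    path: str         # relative file path
    full_title: str   # emoji title from frontmatter
    plain_title: str  # title with emojis stripped


def make_entry(path: str, full_title: str) -> IndexEntry:
    return IndexEntry(path, full_title, strip_emojis(full_title))
```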
🎭 Protected Region Masking
🛡️ The technique: replace each protected region with a run of spaces of the same length.
✨ This preserves character positions so that match indices map directly back to the original content — no complex offset tracking needed.
📝 Wikilink Format
🔗 Links use the Obsidian wikilink format with path aliases:
[[books/thinking-fast-and-slow|🤔🐇🐢 Thinking, Fast and Slow]]
🎀 The emoji-rich alias ensures the link displays beautifully in both Obsidian and the published site.
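🧪 Building the link and splicing it into a file might look like the sketch below. The function names are hypothetical. Stripping the `.md` extension from the target is an assumption based on the example above, and applying replacements from right to left is one simple way to keep earlier offsets valid.

```python
def format_wikilink(path: str, alias: str) -> str:
    """Build an Obsidian-style wikilink with a display alias."""
    target = path[:-3] if path.endswith(".md") else path
    return f"[[{target}|{alias}]]"


def apply_replacements(
    content: str, replacements: list[tuple[int, int, str]]
) -> str:
    """Apply (start, end, link) replacements without shifting offsets.

    Working from the end of the text backwards means an insertion never
    moves the positions of replacements that are still pending.
    """
    for start, end, link in sorted(replacements, reverse=True):
        content = content[:start] + link + content[end:]
    return content
```

An alias containing `|` or `]]` would break the link syntax, so a real implementation would need to check for that.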
⚙️ The Workflow
🕐 The GitHub Action runs daily at approximately 11:30 PM Pacific time.
🎛️ It’s also manually triggerable with configurable parameters:
- 📊 `max_files`: Maximum number of files to process (default: 10)
- 🔍 `dry_run`: Preview mode that logs candidates without writing changes
🔄 After making changes, modified files are synced to the Obsidian vault via the headless sync mechanism.
📊 Results
🧪 Running a dry-run across 30 files found candidates like:
- ✅ “Large Language Models” → `topics/large-language-models.md`
- ✅ “Jonathan Haidt” → `people/jonathan-haidt.md`
- ✅ “Cal Newport” → `people/cal-newport.md`
- ✅ “software engineering” → `topics/software-engineering.md`
- ✅ “Deep Learning” → `books/deep-learning.md`
🎯 All high-confidence matches that would enrich the knowledge graph.
🧪 Testing
📊 89 new tests across 19 test suites covering:
- 😀 Emoji stripping (ZWJ sequences, skin tones, flags)
- 🔤 Regex escaping
- 🔗 Wikilink formatting
- 📇 Content index building
- 🌊 BFS traversal
- 🎭 Protected region masking
- 🔎 Candidate discovery with overlap prevention
- 📝 Replacement application with position preservation
- 🤖 AI validation prompt building
- 🔒 Single-word title safety filtering
🌟 What’s Next
🚀 Potential enhancements for future iterations:
- 📈 Fuzzy matching for titles that don’t appear verbatim
- 🔄 Bidirectional link detection (if A mentions B, should B link to A?)
- 📊 Link density analysis to avoid over-linking
- 🧪 A/B testing of link insertion strategies