2026-03-21 | Internal Linking: Teaching a Knowledge Base to Weave Its Own Web

The Problem
Imagine you have a personal knowledge base with thousands of notes (book reports, topic summaries, software docs, people profiles), and your daily reflections reference these concepts constantly.
Sometimes you mention "Thinking, Fast and Slow" in a book review but forget to link it to your actual book report page.
These missed connections mean readers (and your future self) lose the opportunity to discover related content through the natural web of your writing.
The Architecture
We designed a hybrid approach combining deterministic text matching with AI validation:
| Phase | Description | Goal |
|---|---|---|
| Index Building | Scan all content directories for pages with titles | Build the knowledge graph vocabulary |
| BFS Traversal | Walk the link graph from the most recent reflection | Prioritize recently relevant content |
| Masking | Replace protected regions (frontmatter, code, existing links, headings) with spaces | Prevent false matches in structured content |
| Candidate Discovery | Word-boundary regex matching of plain titles against masked content | Find potential link insertion points |
| AI Validation | Gemini reviews each candidate in context | Ensure every link is correct |
| Replacement | Insert wikilinks with emoji-rich aliases | Maintain the knowledge base aesthetic |
Key Design Decisions
Correctness Over Coverage
The most important principle: it's better to miss link opportunities than to insert broken or nonsensical links.
Every safeguard serves this principle:
- Minimum title length of 8 characters filters out short false positives
- Word boundary matching prevents partial title matches
- Protected region masking prevents matches inside existing links, code blocks, headings, and frontmatter
- Only the first match per target per file is linked (conservative linking)
- Gemini validates every candidate before insertion
- If Gemini fails or returns an error, the entire file is skipped (no unvalidated changes)
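As a rough illustration of how the deterministic safeguards in this list could fit together, here is a minimal Python sketch. The names `Candidate`, `find_candidates`, and `MIN_TITLE_LENGTH` are hypothetical, and the case-insensitive matching and longest-title-first ordering are assumptions rather than details confirmed by this post.

```python
import re
from dataclasses import dataclass

MIN_TITLE_LENGTH = 8  # threshold stated in the post


@dataclass(frozen=True)
class Candidate:
    title: str   # plain title that matched
    target: str  # path of the page to link to
    start: int   # match offsets in the (masked) content
    end: int


def find_candidates(masked: str, titles: dict[str, str]) -> list[Candidate]:
    """Return at most one word-bounded, non-overlapping match per target."""
    found: list[Candidate] = []
    # Assumption: longer titles go first so "Software Engineering"
    # claims its span before the shorter "Engineering" is tried.
    for title, target in sorted(titles.items(), key=lambda kv: -len(kv[0])):
        if len(title) < MIN_TITLE_LENGTH:
            continue
        # Lookarounds act as word boundaries even if a title starts or
        # ends with punctuation, where a plain \b would not apply.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(title) + r"(?!\w)", re.IGNORECASE
        )
        for match in pattern.finditer(masked):
            overlaps = any(
                match.start() < c.end and c.start < match.end() for c in found
            )
            if not overlaps:
                found.append(Candidate(title, target, match.start(), match.end()))
                break  # conservative: first usable match per target only
    return sorted(found, key=lambda c: c.start)
```

Escaping the title with `re.escape` matters here because titles such as "Thinking, Fast and Slow" may contain characters that are special in regular expressions.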
BFS Strategy
Starting from the most recent daily reflection and following links breadth-first means:
- The freshest content gets processed first
- Content reachable from reflections (the core of the knowledge graph) is prioritized
- The entire connected component is eventually reachable, because the reflections link to one another as a doubly-linked list
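The traversal order described above can be sketched as a standard breadth-first search. This is only an illustration: `bfs_order` and the `links_of` callback are hypothetical names, and capping the walk with `max_files` inside the traversal is an assumption about where that limit applies.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_order(
    start: str,
    links_of: Callable[[str], Iterable[str]],
    max_files: int,
) -> list[str]:
    """Visit pages breadth-first from `start`, stopping after `max_files`."""
    order: list[str] = []
    seen = {start}
    queue = deque([start])
    while queue and len(order) < max_files:
        page = queue.popleft()
        order.append(page)
        for neighbor in links_of(page):
            if neighbor not in seen:  # each page is queued at most once
                seen.add(neighbor)
                queue.append(neighbor)
    return order
```

Because every reflection links back to its predecessor, a walk that starts at the newest reflection reaches older reflections one hop at a time, and the pages they mention along the way.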
AI as Validator, Not Discoverer
Gemini doesn't search for links; it only validates candidates found deterministically.
This gives us:
- Precise control over what gets proposed (no hallucinated links)
- Efficient API usage (small prompts with just the candidates + context)
- Reproducible behavior (deterministic discovery, AI only says yes/no)
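One way to picture the validator role is a function that accepts deterministic candidates and a pluggable yes/no reviewer, and that fails closed. The sketch below does not call any real Gemini API; `Validator` and `approved_candidates` are hypothetical names, and treating a malformed response like an error is an assumption.

```python
from typing import Callable, Optional, Sequence

# Hypothetical validator shape: (content, candidate titles) -> one verdict each.
Validator = Callable[[str, Sequence[str]], Sequence[bool]]


def approved_candidates(
    content: str,
    candidates: Sequence[str],
    validate: Validator,
) -> Optional[list[str]]:
    """Return the approved candidates, or None if the file should be skipped."""
    if not candidates:
        return []  # nothing to review, so no API call is needed
    try:
        verdicts = validate(content, candidates)
    except Exception:
        return None  # fail closed: no unvalidated changes
    if len(verdicts) != len(candidates):
        return None  # assumption: a malformed response counts as a failure
    return [c for c, ok in zip(candidates, verdicts) if ok]
```

The caller only ever inserts links for titles it proposed itself, so the model cannot introduce a link target that the deterministic step did not find.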
Single-Word Title Safety
Without AI validation, single-word titles like "Engineering" or "Philosophy" match too broadly.
Solution: when running without Gemini, only multi-word titles are eligible for matching.
With Gemini, even single-word matches are considered: the AI validates whether "engineering" in "biological engineering" actually refers to the Engineering topic page (it doesn't).
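The eligibility rule is small enough to sketch in a few lines. The function name `eligible_titles` and the `ai_enabled` flag are hypothetical, and counting words by splitting on whitespace is an assumption.

```python
def eligible_titles(titles: list[str], ai_enabled: bool) -> list[str]:
    """Drop single-word titles unless an AI validator will review matches."""
    if ai_enabled:
        return list(titles)
    # Without validation, keep only titles that contain at least two words.
    return [title for title in titles if len(title.split()) > 1]
```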
The Implementation
Content Index
The index scans 10 content directories (books, articles, topics, software, people, products, games, videos, presentations, tools) and extracts:
- Relative file path
- Full emoji title from frontmatter
- Plain title with emojis stripped
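A minimal sketch of such an index, using only the Python standard library, might look like the following. Everything here is illustrative: `IndexEntry`, `strip_emojis`, `read_title`, and `build_index` are hypothetical names, the frontmatter reader only understands a simple `title:` line, and the emoji stripping is a heuristic based on Unicode categories (it would also remove symbols such as ©).

```python
import unicodedata
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Directory names as listed in the post.
CONTENT_DIRS = ["books", "articles", "topics", "software", "people",
                "products", "games", "videos", "presentations", "tools"]


@dataclass(frozen=True)
class IndexEntry:
    path: str         # relative file path, e.g. "books/deep-learning.md"
    emoji_title: str  # full title from frontmatter
    plain_title: str  # title with emojis stripped


def strip_emojis(title: str) -> str:
    """Heuristic: drop symbol characters, skin-tone modifiers, ZWJ and VS16."""
    kept = []
    for ch in title:
        is_symbol = unicodedata.category(ch) == "So"
        is_skin_tone = "\U0001F3FB" <= ch <= "\U0001F3FF"
        is_joiner = ch in "\u200d\ufe0f"
        if not (is_symbol or is_skin_tone or is_joiner):
            kept.append(ch)
    return " ".join("".join(kept).split())  # collapse leftover whitespace


def read_title(text: str) -> Optional[str]:
    """Return the `title:` value from a leading frontmatter block, if any."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter reached without a title
        if line.startswith("title:"):
            return line[len("title:"):].strip().strip("\"'") or None
    return None


def build_index(root: Path) -> list[IndexEntry]:
    """Collect one entry per titled markdown page under the content dirs."""
    entries = []
    for name in CONTENT_DIRS:
        directory = root / name
        if not directory.is_dir():
            continue
        for file in sorted(directory.rglob("*.md")):
            title = read_title(file.read_text(encoding="utf-8"))
            if title:  # pages without a title are not link targets
                rel = file.relative_to(root).as_posix()
                entries.append(IndexEntry(rel, title, strip_emojis(title)))
    return entries
```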
Protected Region Masking
A clever technique: replace protected regions with equal-length space strings.
This preserves character positions, so match indices map directly back to the original content with no complex offset tracking needed.
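To make the position-preserving idea concrete, here is a small sketch of masking with regular expressions. The patterns are simplified assumptions about what counts as frontmatter, code, links, and headings, and `mask_protected` is a hypothetical name. One deliberate variation: this sketch keeps newlines inside masked regions rather than turning them into spaces, so that line-anchored patterns such as the heading rule still work after earlier regions are masked.

```python
import re

# Simplified, assumed patterns for the protected regions named in the post.
PROTECTED_PATTERNS = [
    re.compile(r"\A---\n.*?\n---[ \t]*$", re.DOTALL | re.MULTILINE),   # frontmatter
    re.compile(r"^(```|~~~).*?^\1[ \t]*$", re.DOTALL | re.MULTILINE),  # fenced code
    re.compile(r"`[^`\n]+`"),                                          # inline code
    re.compile(r"\[\[[^\]\n]*\]\]"),                                   # wikilinks
    re.compile(r"\[[^\]\n]*\]\([^)\n]*\)"),                            # markdown links
    re.compile(r"^#{1,6} .*$", re.MULTILINE),                          # headings
]


def _blank(match: re.Match) -> str:
    # Replace every character except newlines with a space: same length,
    # same line structure.
    return re.sub(r"[^\n]", " ", match.group(0))


def mask_protected(content: str) -> str:
    """Return a same-length copy of `content` with protected regions blanked."""
    masked = content
    for pattern in PROTECTED_PATTERNS:
        masked = pattern.sub(_blank, masked)
    return masked
```

Since `len(mask_protected(text)) == len(text)`, an offset found in the masked string can be used directly to slice the original.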
Wikilink Format
Links use the Obsidian wikilink format with path aliases:
[[books/thinking-fast-and-slow|🤔🐇🐢 Thinking, Fast and Slow]]
The emoji-rich alias ensures the link displays beautifully in both Obsidian and the published site.
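Formatting the link and splicing it into the text can be sketched as below. Both function names are hypothetical, as is the choice to drop a trailing `.md` from the target path. Applying replacements from right to left is one simple way to keep earlier offsets valid, not necessarily how the actual tool does it.

```python
def format_wikilink(path: str, emoji_title: str) -> str:
    """Build an Obsidian-style [[target|alias]] link for an indexed page."""
    target = path[:-3] if path.endswith(".md") else path
    return f"[[{target}|{emoji_title}]]"


def apply_replacements(
    content: str,
    replacements: list[tuple[int, int, str]],
) -> str:
    """Apply non-overlapping (start, end, text) replacements to `content`."""
    # Work from the last offset to the first so that inserting a longer
    # string never shifts the positions of replacements still to come.
    for start, end, text in sorted(replacements, reverse=True):
        content = content[:start] + text + content[end:]
    return content
```

An alias that itself contains `|` or `]]` would break the link syntax, so a fuller implementation would need to guard against that case.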
The Workflow
The GitHub Action runs daily at approximately 11:30 PM Pacific time.
It's also manually triggerable with configurable parameters:
- `max_files`: Maximum number of files to process (default: 10)
- `dry_run`: Preview mode that logs candidates without writing changes
After making changes, modified files are synced to the Obsidian vault via the headless sync mechanism.
Results
Running a dry run across 30 files found candidates like:
- "Large Language Models" → `topics/large-language-models.md`
- "Jonathan Haidt" → `people/jonathan-haidt.md`
- "Cal Newport" → `people/cal-newport.md`
- "software engineering" → `topics/software-engineering.md`
- "Deep Learning" → `books/deep-learning.md`
All are high-confidence matches that would enrich the knowledge graph.
Testing
89 new tests across 19 test suites covering:
- Emoji stripping (ZWJ sequences, skin tones, flags)
- Regex escaping
- Wikilink formatting
- Content index building
- BFS traversal
- Protected region masking
- Candidate discovery with overlap prevention
- Replacement application with position preservation
- AI validation prompt building
- Single-word title safety filtering
What's Next
Potential enhancements for future iterations:
- Fuzzy matching for titles that don't appear verbatim
- Bidirectional link detection (if A mentions B, should B link to A?)
- Link density analysis to avoid over-linking
- A/B testing of link insertion strategies