# 2026-03-21 | Book-Only Internal Linking: AI-Driven, Vault-Native, Incrementally Tracked

A fundamental redesign of the internal linking system: Gemini AI identifies genuine book references, writes directly to the Obsidian vault, tracks analysis progress via frontmatter, handles JSON parsing edge cases, and logs diffs for all runs.
## The Problem With the Previous Architecture

The original system had five distinct problems:
- **False positive matches**: "Foundation" in "a strong foundation for…" matched the book *Foundation* by Asimov. Deterministic regex matched first, then Gemini verified, biasing toward "yes".
- **No rate-limit resilience**: HTTP 429 silently treated all candidates as invalid.
- **No incremental progress**: every daily run re-analyzed every file from scratch.
- **Writing to the `content/` directory**: the `content/` directory is a read-only mirror of the Obsidian vault, but the system was writing directly to it and then syncing back.
- **Fragile JSON parsing**: Gemini sometimes returns JSON wrapped in markdown code fences or with trailing text, causing `JSON.parse` to fail silently.
## The Architecture: AI Identifies, Vault Stores, Frontmatter Tracks

### Vault-Native Operation

The biggest architectural change: the entire pipeline now operates on the Obsidian vault directly instead of the `content/` directory.
```text
Old: Read content/ → Write content/ → Sync to vault
New: Pull vault → Read/Write vault → Push vault
```
The workflow:

1. Pull the vault via `obsidian-headless` (`ob sync`)
2. Run linking with `--content-dir` pointing to the vault directory
3. Push the vault with all changes (links + frontmatter)
This respects the principle that the Obsidian vault is the source of truth. The `content/` directory remains a read-only mirror that Enveloppe syncs from the vault. The BFS timestamp trail was removed; it is no longer needed since changes go directly to the vault.
### Gemini as Identifier (Not Verifier)

Instead of "here are deterministic matches, verify them", we now ask Gemini "here is the document and the available books: which books are actually referenced?"
```text
Old: Content → Regex Match → Gemini Verify → Insert Links
New: Content + Book List → Gemini Identify → Find Positions → Insert Links
```
`buildIdentificationPrompt` sends the full document body plus all available book titles. Gemini returns only `relativePath` strings for books genuinely referenced as literary works.
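As a sketch, the prompt construction might look like the following. The exact prompt wording and the `Book` shape are assumptions; only the overall shape (full body plus the book list, asking for an array of `relativePath` strings) comes from this post.

```typescript
// Hypothetical sketch of buildIdentificationPrompt. The real prompt
// text is not shown in the post; this only illustrates the shape.
interface Book {
  title: string;
  relativePath: string;
}

function buildIdentificationPrompt(body: string, books: Book[]): string {
  // List every available book so the model can only pick from known paths.
  const bookList = books
    .map((b) => `- ${b.title} (${b.relativePath})`)
    .join("\n");
  return [
    "You are identifying genuine book references in a document.",
    "Only count mentions of these books as literary works,",
    "not incidental uses of the same words.",
    "",
    "Available books:",
    bookList,
    "",
    "Document:",
    body,
    "",
    'Respond with a JSON array of relativePath strings, e.g. ["books/foo.md"].',
  ].join("\n");
}
```

Because the whole document is in the prompt, the model sees "a strong foundation for…" in context and can decline to match *Foundation*.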
### Robust JSON Parsing with `extractJsonArray`

Root cause analysis (5 whys):
1. Why does `JSON.parse` fail? → Gemini returns extra content after the JSON array.
2. Why extra content? → Even with `responseMimeType: "application/json"`, the model sometimes wraps JSON in markdown code blocks or adds explanation text.
3. Why isn't this handled? → The code called `JSON.parse(text)` directly.
4. Why no extraction? → The code trusted `responseMimeType` to always produce clean JSON.
5. Why isn't that reliable? → Gemini models don't always honor the mime type constraint.
`extractJsonArray` handles:

- Clean JSON (direct parse)
- Markdown code fences (```` ```json ... ``` ````)
- Trailing explanation text
- Preceding text with bracket extraction
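A minimal sketch of such a fallback chain, assuming the three strategies above (the real `extractJsonArray` may differ in details such as error reporting):

```typescript
// Hypothetical sketch: try a direct parse, then strip markdown fences,
// then fall back to outermost-bracket extraction.
function extractJsonArray(text: string): unknown[] | null {
  // 1. Clean JSON: direct parse.
  try {
    const direct = JSON.parse(text);
    if (Array.isArray(direct)) return direct;
  } catch {
    /* fall through */
  }

  // 2. Markdown code fence (```json ... ``` or a bare ``` fence).
  const fence = text.match(/```(?:json)?\s*([\s\S]*?)```/);
  if (fence) {
    try {
      const fenced = JSON.parse(fence[1]);
      if (Array.isArray(fenced)) return fenced;
    } catch {
      /* fall through */
    }
  }

  // 3. Bracket extraction: take the outermost [...] span, ignoring
  // any preceding or trailing prose the model added.
  const start = text.indexOf("[");
  const end = text.lastIndexOf("]");
  if (start !== -1 && end > start) {
    try {
      const sliced = JSON.parse(text.slice(start, end + 1));
      if (Array.isArray(sliced)) return sliced;
    } catch {
      /* not parseable */
    }
  }
  return null;
}
```

Returning `null` (rather than throwing) lets the caller treat an unparseable response the same as an empty result and move on to the next file.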
### Incremental Analysis via Frontmatter

Each file analyzed by Gemini gets frontmatter metadata:
```yaml
---
link_analysis_model: gemini-3.1-flash-lite-preview
link_analysis_time: 2026-03-22T03:00:00.000Z
force_analyze_links: false
---
```

On subsequent runs, `alreadyAnalyzed(content)` checks for the presence of `link_analysis_model` and skips already-analyzed files. The `link_analysis_model` value is informational only: changing models does NOT trigger re-analysis.
To manually request re-analysis, set `force_analyze_links: true` in a file's frontmatter. `recordLinkAnalysis` clears the flag after processing.
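The skip check can be sketched as follows. The function name and the two frontmatter keys come from this post; the frontmatter parsing itself is a simplified assumption (the real code may use a YAML parser):

```typescript
// Hypothetical sketch of alreadyAnalyzed: a file is skipped when its
// frontmatter contains link_analysis_model, unless the
// force_analyze_links: true override is present.
function alreadyAnalyzed(content: string): boolean {
  // Grab the first frontmatter block, if any.
  const fm = content.match(/^---\n([\s\S]*?)\n---/);
  if (!fm) return false; // no frontmatter → never analyzed
  const block = fm[1];
  // Manual override always wins.
  if (/^force_analyze_links:\s*true\s*$/m.test(block)) return false;
  // Presence of the model field marks the file as analyzed,
  // regardless of which model it names.
  return /^link_analysis_model:/m.test(block);
}
```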
### Diff Logging for All Runs

Both dry runs and live runs now emit unified diff events:

```json
{"event":"diff","file":"books/example.md","dryRun":false,"diff":["@@ line 42 @@","- I recommend Thinking, Fast and Slow","+ I recommend [Thinking, Fast and Slow](books/thinking-fast-and-slow.md)"]}
```

### Summary Statistics

The completion event now includes `filesSkipped`:

```json
{"event":"internal_linking_complete","filesVisited":50,"filesModified":2,"totalLinksAdded":3,"filesSkipped":35}
```

## Rate Limit Handling
| Error Type | Behavior |
|---|---|
| Per-minute rate limit (429) | Retry up to 3 times with exponential backoff (5s → 10s → 20s) |
| Daily quota exhaustion | Throw `QuotaExhaustedError`, halting the pipeline |
| Other API errors | Return empty array (skip file, continue pipeline) |
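The three rows above can be sketched as one retry wrapper. `QuotaExhaustedError` is named in this post; `withRateLimitRetry` and the predicate signatures are hypothetical glue for illustration:

```typescript
// Hypothetical sketch of the retry policy from the table above.
class QuotaExhaustedError extends Error {
  constructor(message = "Daily Gemini quota exhausted") {
    super(message);
    this.name = "QuotaExhaustedError";
  }
}

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function withRateLimitRetry<T>(
  call: () => Promise<T>,
  isDailyQuotaError: (e: unknown) => boolean,
  isRateLimitError: (e: unknown) => boolean,
): Promise<T | []> {
  let delay = 5_000; // 5s → 10s → 20s backoff
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await call();
    } catch (e) {
      // Daily quota exhaustion: halt the whole pipeline.
      if (isDailyQuotaError(e)) throw new QuotaExhaustedError();
      // Per-minute 429: back off and retry, up to 3 attempts total.
      if (isRateLimitError(e) && attempt < 2) {
        await sleep(delay);
        delay *= 2;
        continue;
      }
      // Anything else: skip this file, keep the pipeline going.
      return [];
    }
  }
  return [];
}
```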
## Test Coverage

145 tests in the internal-linking test suite (882 total repo-wide):

| Test Suite | Count | Coverage |
|---|---|---|
| `buildIdentificationPrompt` | 4 | Book list, content, warnings, literary work references |
| `extractBody` | 3 | Frontmatter extraction, no frontmatter, unclosed |
| `extractJsonArray` | 7 | Clean JSON, empty, code fence, trailing text, preceding text, no array, fence without tag |
| `isRateLimitError` | 5 | 429, RESOURCE_EXHAUSTED, quota, negatives |
| `isDailyQuotaError` | 4 | Daily, PerDay, per-minute (false), non-quota (false) |
| `parseRetryDelay` | 4 | "retry in Ns", "retryDelay", no delay, null |
| `QuotaExhaustedError` | 3 | instanceof, default message, custom message |
| `contentAlreadyLinksTo` | 5 | Wikilinks, markdown, anchors, negatives, prefix safety |
| `generateDiff` | 5 | Identical, changed, added, removed, unchanged |
| `updateFrontmatterTimestamp` | 4 | Update, insert, create, nonexistent |
| `updateFrontmatterFields` | 4 | Multi-field, update existing, create block, nonexistent |
| `recordLinkAnalysis` | 2 | Writes model + time, clears `force_analyze_links` |
| `alreadyAnalyzed` | 5 | Present, different model (still true), force flag, missing field, no frontmatter |
## Lessons Learned: The Top-Level Await Trap

The vault-native workflow failed in CI with:

```text
Top-level await is currently not supported with the "cjs" output format
```

### 5 Whys: Root Cause Analysis
1. Why did the workflow fail? → The `npx tsx -e` command crashed with an esbuild transform error.
2. Why did esbuild reject the code? → The inline TypeScript used `await import(...)` at the top level, i.e. a top-level await.
3. Why doesn't top-level await work? → `tsx -e` (eval mode) uses esbuild's CJS output format, which doesn't support top-level await.
4. Why was CJS used instead of ESM? → The `-e` flag triggers eval mode, where esbuild defaults to CJS. Unlike file-based execution (where `tsx` can infer ESM from `package.json` or the `.ts` extension), eval mode has no module format hints.
5. Why wasn't this caught earlier? → The workflow was written by pattern-matching against file-based `npx tsx` commands (which support top-level await), not by studying the existing IIFE-wrapped `npx tsx -e` patterns already used in `auto-blog-zero.yml` and `chickie-loo.yml`.
### The Fix

Wrap all `npx tsx -e` code in an async IIFE, `(async () => { ... })()`, matching the pattern used by the working workflows:
```shell
# Broken: top-level await in eval mode
VAULT_DIR=$(npx tsx -e "
const { sync } = await import('./lib.ts');
process.stdout.write(await sync());
")

# Fixed: IIFE wrapper
VAULT_DIR=$(npx tsx -e "(async () => {
  const { sync } = await import('./lib.ts');
  process.stdout.write(await sync());
})()")
```

### Takeaway
Always study existing patterns in the codebase before writing new workflow steps. The IIFE pattern was already established in two other workflows; copying it would have avoided this failure entirely. Test `npx tsx -e` commands locally before committing: the CJS limitation is silent until runtime.
## Lessons Learned: The Stdout Capture Trap

The vault pull step failed in CI with:

```text
Error: Unable to process file command 'output' successfully.
Error: Invalid format 'Pulling latest vault content (warm cache fast path)...'
```

### 5 Whys: Root Cause Analysis
1. Why did `$GITHUB_OUTPUT` reject the value? → The `vault_dir=...` line was followed by additional lines that weren't in `key=value` format, breaking GitHub Actions' single-line output parser.
2. Why were there extra lines? → `VAULT_DIR` contained not just the path, but also log messages like `Re-using cached vault...` and `Pulling latest vault content...`.
3. Why did log messages end up in `VAULT_DIR`? → `syncObsidianVault()` uses `console.log()` for status messages, which go to stdout, and the bash `$(...)` capture grabs all stdout.
4. Why was stdout capture used? → The pattern `VAULT_DIR=$(npx tsx -e "... process.stdout.write(dir) ...")` assumes the script writes only the result to stdout. But `syncObsidianVault` is a library function with its own logging.
5. Why wasn't this caught? → The other workflows (auto-blog-zero, chickie-loo) avoid this by doing everything inside the IIFE; they never need to pass the vault dir to a later step via `$GITHUB_OUTPUT`. The internal-linking workflow is the first to need the vault dir in separate steps.
### The Fix

Write the vault dir to a temp file from inside the IIFE, then read it back in bash, completely isolating the result from `console.log` output:
```shell
# Broken: console.log pollutes $(...)
VAULT_DIR=$(npx tsx -e "(async () => {
  const dir = await syncObsidianVault({...});
  process.stdout.write(dir);
})()")

# Fixed: temp file sidesteps stdout
npx tsx -e "(async () => {
  const fs = await import('fs');
  const dir = await syncObsidianVault({...});
  fs.writeFileSync('/tmp/vault-dir.txt', dir);
})()"
VAULT_DIR=$(cat /tmp/vault-dir.txt)
```

### Takeaway
Never capture stdout when calling library functions that log. Library functions like `syncObsidianVault` write status messages to stdout via `console.log`. When you need a return value from a Node.js script, write it to a temp file or use `$GITHUB_OUTPUT` from inside the script. The other workflows avoided this by not needing to pass values between steps.
## Lessons Learned: The Push Vault Self-Kill

The Push Obsidian Vault step failed with exit code 143 (SIGTERM):

```text
Killing 3 lingering ob process(es): 2729, 2741, 2742
Terminated
```

### 5 Whys: Root Cause Analysis
1. Why did the Push step exit with SIGTERM? → `killObProcesses`, called inside `pushObsidianVault`, sent SIGTERM to the tsx process itself (or its parent).
2. Why did `killObProcesses` target the tsx process? → It uses `ps -o pid,args | grep -E 'pattern'` to find processes matching `obsidian-headless` or the vault dir path, and the tsx process matched.
3. Why did the tsx process match the vault dir pattern? → The vault path `/tmp/obsidian-vault-cache` appeared literally in the `npx tsx -e "(... pushObsidianVault('/tmp/obsidian-vault-cache', ...) ...)"` command-line args visible to `ps`.
4. Why was the vault path literal in the eval string? → The workflow used `${{ steps.vault.outputs.vault_dir }}`, which expands at YAML parse time, embedding the path directly in the `-e` string.
5. Why didn't the `process.pid` filter catch this? → `killObProcesses` filters out `process.pid` (the Node process), but `npx` spawns multiple child processes; the parent `npm`/`sh` processes also have the vault path in their args and aren't filtered.
### The Fix

Pass the vault dir via an environment variable instead of literal interpolation in the eval string:
```yaml
# Broken: vault path appears in ps -o args
run: |
  npx tsx -e "(async () => {
    await pushObsidianVault('${{ steps.vault.outputs.vault_dir }}', ...);
  })()"

# Fixed: env var resolved at runtime, invisible to ps
env:
  VAULT_DIR: ${{ steps.vault.outputs.vault_dir }}
run: |
  npx tsx -e "(async () => {
    await pushObsidianVault(process.env.VAULT_DIR, ...);
  })()"
```

Verified: with the literal path in `-e`, `grep` finds 4 matching processes; with the env var, it finds zero matches, since the vault path no longer appears in any process's command-line args.
### Takeaway

Never embed dynamic paths literally in `npx tsx -e` strings when the called function kills processes by pattern. `killObProcesses` greps for the vault dir in all process args, so any process with the path in its command line gets killed. Pass paths via env vars so they are resolved at runtime and invisible to `ps`.
## Summary

The internal linking system is now vault-native, AI-driven, and incrementally tracked. Changes write directly to the Obsidian vault instead of the `content/` directory. Gemini identifies genuine book references with full document context. Robust JSON extraction handles Gemini's formatting quirks. Frontmatter tracking enables incremental progress with a manual override via `force_analyze_links`. Both live and dry runs log diffs and summary statistics.