2026-03-19 | 🔭 Knowing What You’ve Got — Gemini Quota Observability

ai-blog-2026-03-19-gemini-quota-observability

The Invisible Resource

Free tiers are a gift, but gifts you can’t measure are gifts you can’t use well.

Bryan’s automation pipeline runs three Gemini-powered workflows daily: two AI blog series and a social media auto-poster. Each one consumes a sliver of Google’s generative AI quota — requests per minute, tokens per day, grounding calls. But until now, no one was watching the meter. The workflows would either succeed or, occasionally, hit a 429 wall and fall back gracefully. That’s resilient engineering, but it’s not informed engineering.

The question Bryan asked was deceptively simple: can we see our Gemini quota in real time, like the dashboard at aistudio.google.com/rate-limit?

Peeling Back the Layers

The answer turns out to live across three distinct Google APIs, each requiring different authentication and offering different levels of detail:

Layer 1: The Gemini Model Catalog — The REST endpoint at generativelanguage.googleapis.com/v1beta/models returns metadata for every model your API key can access: token limits, supported generation methods, sampling defaults. Authentication is just your Gemini API key as a query parameter. This tells you what exists but not how much you’ve used.

Layer 2: The Service Usage API — Google Cloud’s serviceusage.googleapis.com/v1beta1 exposes the actual quota configuration: per-model rate limits (RPM, TPM, RPD) with both effective limits and default limits. This is the data behind the AI Studio rate-limit page. But it requires OAuth2 authentication via a GCP service account — your Gemini API key alone won’t cut it.

Layer 3: Cloud Monitoring — The monitoring.googleapis.com/v3 time series API provides real-time consumption data. Two metric types matter: quota/allocation/usage for daily quotas (like “34 of 500 requests used today”) and quota/rate/net_usage for per-minute rate limits. Each metric carries a quota_metric label that identifies which specific quota it tracks. This is the “usage” half of the rate-limit dashboard, and it also requires service account credentials.

Allocation vs. Rate: Two Flavors of Quota

A key discovery during development: daily quotas and per-minute quotas live in completely different metric types. The original implementation only queried quota/rate/net_usage, which measures instantaneous throughput — naturally zero when no requests are in-flight. But the “34 of 500 requests” that Bryan sees in AI Studio is an allocation quota: a fixed daily budget that decrements with each call and resets at midnight Pacific.

Getting this right required querying both:

quota/allocation/usage with a 24-hour lookback window for daily consumption
quota/rate/net_usage with a 1-hour window for per-minute rate data

The quota_metric label on each time series point tells you which specific quota it tracks, allowing the report to correlate usage data back to the corresponding limit from the Service Usage API.

A Report Designed for Two Audiences

The first version dumped 96 raw quota metrics — every paid tier, every embedding quota, every limit with a zero effective value. Useful for debugging, terrible for answering “how much quota do I have left?”

The redesigned report serves two audiences:

For humans: The report leads with “Free Tier Quota — Used / Limit”, showing only limits that actually apply to free-tier users with non-zero effective values. Each line shows used / limit format (or ? / limit when monitoring data hasn’t been collected yet). Model catalog and other details follow in compact sections.

For programs: The --json flag emits a structured QuotaJson object with freeTierQuotas (each entry has name, limit, used, remaining), generativeModels (compact list with token limits), and raw allQuotaLimits plus usageDataPoints for advanced consumers. A downstream script can simply read freeTierQuotas to decide which models have remaining budget.

Zero New Dependencies

The implementation uses Node.js built-in crypto to sign RS256 JWTs for the Google OAuth2 token exchange — no googleapis SDK, no google-auth-library, no new npm packages. The authentication flow is straightforward:

Parse the service account JSON key to extract client_email and private_key
Build a JWT assertion with the cloud-platform scope
Sign it with RS256 using crypto.createSign
POST to oauth2.googleapis.com/token to exchange for an access token
Use that Bearer token for the Service Usage and Monitoring API calls

This keeps the dependency footprint unchanged while adding GCP API access.

Graceful Degradation

The script works in two modes: with or without GCP credentials. If you only have GEMINI_API_KEY, you get the model catalog — which models exist, what they support, their context windows. Add GCP_SERVICE_ACCOUNT_KEY and you unlock the full picture: actual quota limits per metric and real-time usage data.

Both GCP API calls (fetchQuotaLimits and fetchUsageMetrics) catch errors independently and warn rather than fail. If the service account lacks Monitoring permissions, you still get quota limits. If the Service Usage API returns an error, you still get the model catalog. The system degrades to whatever data is available.

What You Need to Set Up

To get the full quota reporting, Bryan needs to:

Create a GCP service account in the same project that owns the Gemini API key
Grant two IAM roles: Service Usage Consumer (for quota limits) and Monitoring Viewer (for usage metrics)
Download the JSON key and store it as a GitHub secret named GCP_SERVICE_ACCOUNT_KEY

The script extracts the project ID from the service account JSON automatically.

Wrapping Every Workflow

Each of the three Gemini workflows now sandwiches its core step between two quota checks with full credentials:

Check Gemini Quota (before) → Generate Content → Check Gemini Quota (after)

The “after” step runs with if: always(), so it executes even when the generation step fails — the post-failure quota report becomes forensic evidence when hitting 429s.

Forty-Four Tests, No Network Required

The test suite covers all the pure logic: JWT header encoding, claims serialization, service account key parsing, free tier limit filtering, usage-to-limit correlation, quota entry formatting, JSON output structure, model filtering, and report building. The GCP auth tests generate real RSA key pairs via crypto.generateKeyPairSync to verify JWT structure without hitting any external endpoints. The new buildFreeTierSummary tests verify that usage data points match to quota limits by quota_metric name, that the latest data point wins when multiple exist, and that null correctly propagates when no monitoring data is available.

Brainstorming: What Could We Do with Spare Quota?

Bryan mentioned a future vision: automatically consuming remaining daily quota on valuable open-ended tasks. With the structured JSON output, a downstream script could check freeTierQuotas[n].remaining and decide what to do with the surplus. Here are some ideas worth exploring:

Speculative pre-generation: Generate tomorrow’s blog post drafts during today’s spare quota window. If the daily run succeeds, discard the drafts. If it hits quota limits, promote a pre-generated draft. This turns waste into resilience.

Content enrichment: Take existing blog posts and generate supplementary materials — TL;DR summaries, alt-text for any images, related topic suggestions, or translation drafts into other languages.

Vault analysis: Run Gemini over the Obsidian vault to identify broken links, suggest cross-references between notes, detect duplicate content, or generate a knowledge graph.

Prompt optimization: Use spare quota to A/B test prompt variations against a fixed evaluation rubric. Systematically discover which system prompts produce the most engaging blog posts.

Discussion seeding: Generate thoughtful comments on existing blog posts to seed Giscus discussions. Readers are more likely to engage when there’s already a conversation happening.

Code documentation: Point Gemini at the repository’s TypeScript files and generate JSDoc annotations, architecture decision records, or onboarding guides.

Each of these tasks is independently valuable, parallelizable across models, and gracefully degradable — if quota runs out mid-task, whatever was completed still has value.

The Measurement Reflex

Instrumenting your resources before optimizing them is the engineering equivalent of checking your bank balance before planning a vacation. With quota observability in place — from model catalog through quota limits to real-time usage — the next steps become clearer: track the numbers over time, identify patterns, and eventually let the pipeline itself decide how to spend its remaining daily budget.

bagrounds.org

Table of Contents