A published Claude Code plugin with 5 MCP tools, 8 slash commands, and 151 passing tests. Built in 3 days, 12 sessions, and 1.27 million tokens of development activity. The plugin bridges two AI systems that are better together than apart: Claude's precise code generation paired with Gemini's 1M-token context window, Google Search grounding, and multimodal processing.
Key Takeaways
- Production Claude Code plugin bridging Gemini CLI built in 3 days, 12 sessions, 1.27M tokens, spec-driven from minute one
- 37% of enterprises now use 5+ AI models (a16z, 2025); multi-model tooling is the direction, not the exception
- CodeRabbit caught 14 real bugs across 6 PRs including race conditions and data-corrupting subprocess lifecycle issues
- 151 tests passed but the plugin still broke. Packaging validation wasn't tested until the final day
The idea started with a question I couldn't shake: what if Claude Code could call Gemini? Not as a replacement. As a complement. Claude handles multi-file refactoring and deep tool integration better than anything I've used. Gemini brings real-time web search, video and audio analysis, Google Workspace automation, and a context window that swallows entire codebases whole. Separately, powerful. Together, they cover almost everything a developer needs.
This is the story of building that bridge. The specs that were wrong within ten minutes, the subprocess bugs that ate 30% of the budget, and the "final boss" that made 151 passing tests irrelevant.
If you want context on the broader plugin ecosystem this came from, see How I Built 30 AI Plugins in 6 Days Using Claude Code.
Why Would You Bridge Two AI Systems?
84% of developers now use or plan to use AI tools in their workflow, up from 76% the prior year (Stack Overflow Developer Survey, 2025). But here's the part most people miss: developers aren't picking one model. They're picking several. Claude sits at 45% adoption among professional developers. Gemini is at roughly 35%. GPT leads at 82%. The overlap tells you something. Developers want different tools for different jobs.
Multi-model isn't a hedge. It's specialization. Claude Code excels at precise code generation, multi-file refactoring, and agentic tool use through MCP. Gemini brings capabilities Claude doesn't have: a 1M-token context window for analyzing entire codebases at once, real-time Google Search grounding for up-to-the-minute research, native video and audio processing, and direct Google Workspace automation. You wouldn't use a screwdriver to hammer nails. Same principle.
The MCP ecosystem grew from 2 million to 97 million monthly downloads between November 2024 and March 2026 (MCP Manager, 2025-2026). That growth curve isn't slowing. Developers want composable AI tools that plug into their existing workflows, not walled gardens that force them to switch contexts.
37% of enterprises already use five or more AI models in production (a16z, 2025). That number was 29% the year before. The multi-model future isn't coming; it's here. This plugin is one implementation of that thesis: make Gemini available inside Claude Code with a single slash command.
How Did Writing Five Specs First Save a 3-Day Build?
AI-coauthored pull requests contain 1.7x more issues than human-only code, with up to 75% more logic and correctness errors in incident-prone areas (CodeRabbit, 2025). When you're building fast with AI assistance, quality guardrails aren't optional. They're the only thing standing between you and a pile of subtle bugs.
So before writing a single line of TypeScript, I wrote five formal specifications. All five were completed in a six-minute burst: 53,722 tokens of architectural thinking.
- Spec 01: Plugin scaffold and configuration
- Spec 02: Gemini CLI adapter, the core engine
- Spec 03: Session management for multi-turn continuity
- Spec 04: Authentication and credential detection
- Spec 05: Commands, agents, and the user-facing layer
Spec 02 alone consumed 10,934 tokens defining NDJSON event parsing, subprocess lifecycle management, and the exact CLI flags needed. Spec 04 described a four-method authentication priority chain. These weren't sketches. They were detailed enough to implement from.
And they were wrong within ten minutes.
Observation #9694, logged at 15:07 UTC on March 31, less than ten minutes after finishing the specs, recorded a discovery that reshaped the entire plugin: the manifest had to live at .claude-plugin/plugin.json, not as a top-level file. Three minutes later, another correction: hooks use hooks.json configuration, not individual markdown files.
Here's the thing that matters: the specs weren't wasted by being wrong. They gave the project architectural coherence across three intense days of building. When reality contradicted a spec, I corrected the spec and moved forward. Specs as hypotheses, not blueprints. That framing saved the build.
Does that mean you should skip specs if they'll just be wrong? No. The contrast with codebases that start by "just building" is stark. Those projects accumulate architectural debt from minute one. With specs, even wrong ones, you're correcting a coherent vision instead of discovering you never had one. I learned this the hard way on an earlier project where I threw away the first day of code because I hadn't specced the architecture first.
Why Did Subprocess Management Eat 30% of the Token Budget?
Experienced open-source developers took 19% longer on tasks when using AI tools, despite self-estimating a 20% speedup in a controlled study (METR, 2025). Complex, stateful work, the kind where edge cases compound, is exactly where AI assistance hits its limits. Subprocess management in Node.js is that kind of work.
The Gemini adapter's job sounds simple: spawn gemini as a subprocess, parse its NDJSON output, handle errors, manage timeouts. In practice, it meant wrestling with every edge case in Node.js's child_process API. The adapter consumed 249,016 tokens across design and bug-fixing, totaling 19.6% of the entire project budget in just two observations.
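The parsing half really is the easy part. As a sketch (the event shape here is assumed, not the plugin's actual schema), NDJSON is just newline-delimited JSON objects:

```typescript
// Minimal NDJSON parser sketch: each non-empty line is one JSON object.
// A real adapter must also buffer partial lines that span stream chunks.
function parseNdjsonChunk(chunk: string): any[] {
  return chunk
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
}
```

The hard part wasn't the parsing; it was everything around the process that produces the stream.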
Here are the five hardest bugs, all caught in a single CodeRabbit review round:
The Promise.race data-loss bug. The original timeout logic used Promise.race() to race event collection against a timeout promise. But a cleanly exiting process caused spawnErrorPromise to resolve via the .catch() attached to exitPromise, winning the race early and silently truncating responses. The fix: make spawnErrorPromise never-resolving on a clean exit.
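A minimal sketch of that fix, with names borrowed from the description above but the wiring simplified and assumed:

```typescript
// On a clean exit there is no spawn error to report, so the error promise
// must stay pending forever -- a settled promise would win Promise.race
// and truncate the response. (Simplified sketch, not the plugin's code.)
function spawnErrorPromiseFor(exitCode: number | null): Promise<never> {
  return new Promise<never>((_, reject) => {
    if (exitCode !== null && exitCode !== 0) {
      reject(new Error(`gemini exited with code ${exitCode}`));
    }
    // clean exit: intentionally never settle
  });
}
```

Because the pending promise never settles, only the event-collection promise or a real timeout can win the race.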
The SIGKILL escalation leak. After sending SIGTERM, a 5-second timer scheduled SIGKILL as escalation. If the process exited promptly, the timer wasn't cleared and fired unnecessarily. A resource leak hiding in plain sight.
Signal termination treated as success. When a process is killed by a signal, Node.js sets exitCode to null. The original code only checked for non-zero exit codes, allowing signal kills to silently return partial data.
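The check has to consider both values Node hands back. A sketch (helper name assumed):

```typescript
// Node reports (code, signal) on exit: a signal kill yields code === null,
// so a bare "code !== 0 means failure" check silently passes signal kills.
function isCleanExit(code: number | null, signal: string | null): boolean {
  return code === 0 && signal === null;
}
```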
The spawn ENOENT hang. When the Gemini binary isn't installed, spawn emits an ENOENT error event. No listener meant the exit promise hung forever. Users would see... nothing.
Delta message accumulation. Streaming assistant messages arrive with a delta flag indicating they should be appended. The original code always replaced instead of accumulating. Only the last chunk survived.
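A sketch of the corrected accumulation (the event shape is assumed, not the plugin's real types):

```typescript
interface AssistantChunk {
  text: string;
  delta?: boolean;
}

// Delta chunks append to the message in progress; a non-delta chunk
// replaces it. The original bug replaced on every chunk, so only the
// last one survived.
function foldChunks(chunks: AssistantChunk[]): string {
  let message = "";
  for (const chunk of chunks) {
    message = chunk.delta ? message + chunk.text : chunk.text;
  }
  return message;
}
```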
Every one of these bugs would have shipped without automated review. Every one would have caused real user-facing failures: silent data loss, hanging processes, corrupted responses.
If you're building a tool that spawns subprocesses, budget three times the time you think you'll need. The happy path is 10% of the work.
What Did Automated Review Catch That Tests Missed?
CodeRabbit achieved a 51.2% F1 score, the highest among all AI code review tools, with 49.2% precision and 53.5% recall across roughly 300,000 real-world pull requests (CodeRabbit, 2026). Those numbers matter because they mean automated review isn't just fast. It catches things humans miss.
Every pull request in this project went through CodeRabbit review. Six PRs, six reviews. Far from ceremonial, these reviews were load-bearing quality infrastructure. Would I have caught the same bugs manually? Some, maybe. All 14? Not a chance.
PR #1 (Plugin Scaffold): Caught a missing validation for the approval_mode configuration field. The model and sandbox fields had whitelist checks. approval_mode didn't. Fixed within the hour.
PR #2 (Gemini Adapter): The big one. First review round identified five critical bugs: the spawn error handler gap, delta accumulation, SIGKILL timer behavior, exit promise rejection handling, and missing argument arrays. Cost to process and fix: 132,660 tokens. Second round went deeper into subprocess lifecycle, catching four more process bugs. This single PR took 22 hours to merge.
PR #3 (Session Management): Flagged a version mismatch bug in SessionManager.load(). It detected an unexpected version but continued processing the incompatible data structure instead of returning an empty index. A time bomb.
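The fix itself is small. A sketch with assumed names (the real SessionManager's types differ):

```typescript
const INDEX_VERSION = 1;

interface SessionIndex {
  version: number;
  sessions: string[];
}

// An unrecognized version must yield a fresh, empty index. Continuing to
// read the old structure risks corrupting session state on the next save.
function loadIndex(raw: SessionIndex): string[] {
  if (raw.version !== INDEX_VERSION) return [];
  return raw.sessions;
}
```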
PR #5 (Review Findings): Caught the most operationally critical bug in the entire project. The check-gemini.sh script called exit 1 when Gemini CLI wasn't found. That single line would have prevented the entire plugin from loading for anyone who hadn't installed Gemini yet. The fix: change it to a warning with exit 0.
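A sketch of the corrected hook (the exact path and wording are assumed):

```shell
#!/usr/bin/env bash
# Warn when Gemini CLI is missing, but never block plugin loading:
# a non-zero exit from this hook would prevent the whole plugin from
# registering its commands.
if ! command -v gemini >/dev/null 2>&1; then
  echo "warning: Gemini CLI not found; /gemini commands will fail until it is installed" >&2
fi
exit 0
```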
One in every seven observations was fixing something broken. That 15% bugfix rate would have been much higher without automated review catching issues before they compounded.
Plugin Validation: The Final Boss
151 tests passing. Six PRs merged. Six CodeRabbit reviews survived. Zero packaging validation checks run. That's the number that mattered, and it was zero.
On April 2, Claude Code's plugin validator rejected the manifest entirely. Three structural violations, none caught by the test suite:
- The `skills` field in `plugin.json` was invalid. Unlike commands and agents, skills only auto-discover from a `./skills/` directory and can't be declared in the manifest. The validator treated the unknown field as an error that cascaded into corrupting the `agents` field parsing too.
- All component directories were nested inside `.claude-plugin/`. They should have been at the plugin root. Claude Code's specification mandates that only `plugin.json` lives in `.claude-plugin/`; commands, agents, and skills auto-discover from root-level directories.
- `hooks.json` used the wrong format: the user-settings format (direct event keys at root) instead of the plugin format (events wrapped in a `"hooks"` object).
The fix required restructuring the entire plugin layout, stripping invalid fields, and rewriting the test suite to validate root-level component discovery. This was humbling. The code was correct. The package was wrong. And no amount of unit tests would have caught it.
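For reference, the corrected structure looks roughly like this (directory names beyond those mentioned above are assumed):

```
geminicli-cc-plugin/
├── .claude-plugin/
│   └── plugin.json        # the only file that belongs here
├── commands/              # auto-discovered from the plugin root
├── agents/
├── skills/                # never declared in the manifest
└── hooks/
    └── hooks.json         # plugin format: events wrapped in a "hooks" object
```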
Here's the takeaway I keep coming back to: integration testing against the actual host platform should happen on Day 1. Not as a "final step." Not after everything else passes. Day 1. Make it a CI gate.
From Empty Directory to Published Plugin
Over 1.1 million public repositories on GitHub now use an LLM SDK, with 693,867 created in the past 12 months alone, a 178% year-over-year increase (GitHub Octoverse, 2025). The AI tooling ecosystem is exploding. This plugin is one small piece of that wave.
| Metric | Value |
|---|---|
| Development time | 3 days (March 31 - April 2, 2026) |
| Sessions | 12 |
| Total development tokens | 1,271,020 |
| Pull requests merged | 6 |
| Final test count | 151 across 10 test files |
| Slash commands | 8 |
| MCP tools | 5 |
| Router agents | 1 |
| Skills | 4 |
| Specifications written | 5 |
| Bugs caught by CodeRabbit | 14+ |
The plugin is open source: github.com/NaluForge/geminicli-cc-plugin.
Install it with claude plugin add @naluforge/gemini-cc-plugin or clone and build from source. Eight slash commands give you direct access to Gemini's capabilities without leaving Claude Code. /gemini-analyze handles 1M-token codebase analysis, /gemini-search does web-grounded research, and /gemini-workspace automates Google Workspace tasks. See the full command reference on GitHub.
What Would I Do Differently?
The biggest lesson from this project is counterintuitive: the specs were the most valuable thing I wrote, and they were wrong almost immediately. That's not a contradiction. It's the point.
1. Validate packaging on Day 1, not Day 3. The plugin validation failures were the most frustrating part of the entire build because they were structural, not logical. Run claude plugin validate before you write a single test. Make it a pre-commit hook. Don't wait.
2. Budget 3x for subprocess management. NDJSON parsing took an afternoon. The timeout handling, signal escalation, process cleanup, race conditions in Promise.race, spawn error detection, and delta accumulation took two days. If your tool spawns subprocesses, the happy path is 10% of the work.
3. Treat automated code review as load-bearing infrastructure. CodeRabbit wasn't a rubber stamp on this project. It caught the exit 1 that would have broken plugin loading, the Promise.race condition that silently dropped data, and the version mismatch that would have corrupted session state. Any one of those would have shipped to production without review.
Six of the ten fastest-growing open source projects by contributor count in 2025 were AI infrastructure or tooling projects (GitHub Octoverse, 2025). The tooling layer is where the action is. If you're thinking about building a plugin that connects AI systems, whether Claude and Gemini or any other combination, the 3-day timeline isn't aspirational. Spec it first, test the packaging early, let automated review catch what you miss, and ship.
For the full story of how this plugin fits into a larger 30-plugin ecosystem, see How I Built 30 AI Plugins in 6 Days.
Frequently Asked Questions
Can I use this plugin without Gemini CLI installed?
Yes. The plugin degrades gracefully. It loads and registers all commands, but execution returns a clear error message with installation instructions. The check-gemini.sh hook warns but doesn't block. This was actually the fix for one of CodeRabbit's catches: the original version called exit 1, which would have prevented the entire plugin from loading.
How much did this cost to build?
The direct cost was 1.27 million tokens of Claude development activity across 12 sessions over 3 days. At current API pricing, that's roughly $15-25 in token costs. The real investment was time: approximately 20 hours of focused work across three days. No team. No external dependencies beyond CodeRabbit for automated review, which is free for open source projects.
Would spec-driven development work for smaller projects?
Absolutely. It actually works better for smaller projects because the specs are shorter. Even a 500-token spec that defines your data model, API surface, and error handling strategy before you start coding prevents the "just build it" drift that turns small projects into messy ones. The five specs for this project totaled under 54,000 tokens. Most were written in under two minutes each.
What capabilities does the plugin add to Claude Code?
Eight slash commands covering Gemini's core strengths: /gemini for general prompts, /gemini-analyze for 1M-token codebase analysis, /gemini-search for Google Search-grounded research, /gemini-workspace for Gmail, Docs, Sheets, Calendar, and Drive automation, /gemini-media for video, audio, image, and PDF processing, /gemini-data for Python-based data analysis, /gemini-config for status checks, and /gemini-review for independent code review cross-checks.
How does session management work across Gemini calls?
The plugin tracks multi-turn conversations through a SessionManager that records each interaction during gemini_execute and updates context during gemini_resume. Sessions are stored locally with version-tagged indexes. The implementation uses dependency injection into existing tool handlers, which meant zero breaking changes to the adapter code. It was one of the cleanest integrations in the project: 38 new tests, zero modifications to src/.
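A sketch of the shape, with names and storage simplified (the real implementation persists a version-tagged index to disk rather than holding sessions in memory):

```typescript
// In-memory sketch of the session tracker. Because the manager is injected
// into the tool handlers, the adapter code itself needs no changes.
class SessionTracker {
  private turns = new Map<string, string[]>();

  record(sessionId: string, message: string): void {
    const history = this.turns.get(sessionId) ?? [];
    history.push(message);
    this.turns.set(sessionId, history);
  }

  resume(sessionId: string): string[] {
    return this.turns.get(sessionId) ?? [];
  }
}
```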
This case study was built from 241 observations captured across the gemini-cc-plugin development lifecycle, totaling 1,271,020 tokens of recorded development activity. The plugin is open source at github.com/NaluForge/geminicli-cc-plugin.
