Case Study - Gemini Competition Entry
A Chrome sidebar extension built for a Google Gemini competition, enabling page-aware chat, selective element querying, and voice-driven interaction directly within the browser.
- Date
- Roles: Developer
- Tech: Chrome Extension API, Gemini API, TypeScript, IndexedDB, Web Speech API
Core Capabilities
Page-aware conversational context: When the sidebar is opened on a page, a content script parses the page's structure and text content and injects them into the conversation context. This allowed users to ask questions such as 'Summarise this page', 'What should I do next?', 'Explain this section in simpler terms', or 'Translate this content'.
The competition explicitly encouraged leveraging Gemini's large context window, and this design focused on making the current page itself the primary source of context, rather than relying on manual copy/paste.
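As an illustration of how whole-page context can be gathered, here is a minimal TypeScript sketch of a content script that strips noisy nodes, collects the page text, and answers a context request from the rest of the extension. The PAGE_CONTEXT message type and the character cap are assumptions for this example, not the original implementation.

```typescript
// content-script.ts (illustrative sketch, not the original source)
// Extracts a simplified view of the page and forwards it on request.

interface PageContext {
  url: string;
  title: string;
  text: string;
}

function extractPageContext(maxChars = 100_000): PageContext {
  // Clone the body so noisy nodes can be stripped without touching the live DOM.
  const clone = document.body.cloneNode(true) as HTMLElement;
  clone.querySelectorAll("script, style, noscript, nav, footer").forEach((el) => el.remove());

  // Collapse whitespace and cap length to keep the prompt compact.
  const text = (clone.textContent ?? "").replace(/\s+/g, " ").trim().slice(0, maxChars);

  return { url: location.href, title: document.title, text };
}

// Reply with the extracted context when the sidebar or service worker asks for it.
chrome.runtime.onMessage.addListener((msg, _sender, sendResponse) => {
  if (msg?.type === "PAGE_CONTEXT") {
    sendResponse(extractPageContext());
  }
});
```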
Selective Interaction Mode
Beyond whole-page context, users could enter a selective mode via the sidebar action bar.
In this mode, users could hover to highlight individual elements on the page, then select text or image elements and pass them to Gemini. Responses were scoped tightly to the selected element.
This made it possible to ask targeted questions such as clarifying a single paragraph in a dense article, describing an image, or understanding a specific form field or UI element.
The selective mode significantly improved accuracy by reducing irrelevant context and enabled more assistive use cases, such as describing page elements for low-vision users.
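A hedged sketch of how such a selection mode can be wired up with capture-phase listeners: hover highlights the element under the cursor, and a click serialises just that element for the prompt. The SELECTED_ELEMENT message type and the outline styling are illustrative assumptions.

```typescript
// selection-mode.ts (illustrative sketch)
let highlighted: HTMLElement | null = null;

function onMouseOver(e: MouseEvent) {
  if (highlighted) highlighted.style.outline = "";
  highlighted = e.target as HTMLElement;
  highlighted.style.outline = "2px solid #4285f4"; // visual cue for the hovered element
}

function onClick(e: MouseEvent) {
  e.preventDefault();
  e.stopPropagation();
  const el = e.target as HTMLElement;

  // Pass only the selected element's content so the prompt stays tightly scoped.
  const payload =
    el instanceof HTMLImageElement
      ? { kind: "image", src: el.src, alt: el.alt }
      : { kind: "text", text: el.innerText.trim() };

  chrome.runtime.sendMessage({ type: "SELECTED_ELEMENT", payload });
  exitSelectionMode();
}

function enterSelectionMode() {
  document.addEventListener("mouseover", onMouseOver, true);
  document.addEventListener("click", onClick, true);
}

function exitSelectionMode() {
  document.removeEventListener("mouseover", onMouseOver, true);
  document.removeEventListener("click", onClick, true);
  if (highlighted) highlighted.style.outline = "";
}
```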
Voice Output with Synchronized Highlighting
All Gemini responses could be read aloud using text-to-speech.
Key features:
- Audio blobs generated per response and stored in IndexedDB
- Word-by-word transcript highlighting during playback
- Persistent audio tied to conversation history
This was intentionally designed to feel closer to an assistive reading experience rather than a generic 'read aloud' button.
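The sketch below shows one way to persist per-response audio and drive word highlighting during playback, assuming the TTS step yields an audio Blob. The database and store names, and the evenly spread word timings, are assumptions rather than the original code.

```typescript
// tts-playback.ts (illustrative sketch)
function openAudioDb(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open("gemini-sidebar", 1);
    req.onupgradeneeded = () => req.result.createObjectStore("audio"); // keyed by message id
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function saveAudio(messageId: string, blob: Blob): Promise<void> {
  const db = await openAudioDb();
  await new Promise<void>((resolve, reject) => {
    const tx = db.transaction("audio", "readwrite");
    tx.objectStore("audio").put(blob, messageId);
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}

// Approximate word timings by spreading words evenly over the audio duration,
// then mark the current word on each timeupdate tick.
function playWithHighlight(blob: Blob, words: HTMLElement[]) {
  const audio = new Audio(URL.createObjectURL(blob));
  audio.addEventListener("timeupdate", () => {
    const idx = Math.min(
      words.length - 1,
      Math.floor((audio.currentTime / audio.duration) * words.length),
    );
    words.forEach((w, i) => w.classList.toggle("speaking", i === idx));
  });
  audio.play();
}
```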
Voice Input for Contextual Queries
Users could also ask questions using voice input:
- Speech was converted to text
- The query was combined with the current page or selected-element context
- Gemini generated a response scoped to that context
- The response could then be read aloud using TTS
This created a fully voice-driven loop for interacting with web content, especially useful for accessibility-oriented scenarios.
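A simplified sketch of that loop: record a short clip, transcribe it with the user-supplied Whisper key, then ask Gemini with the page text folded into the prompt. The endpoints and model names shown (whisper-1, gemini-1.5-flash) are assumptions about the setup, and error handling is omitted.

```typescript
// voice-query.ts (illustrative sketch)
async function recordClip(ms = 5000): Promise<Blob> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks: Blob[] = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.start();
  await new Promise((r) => setTimeout(r, ms)); // fixed-length clip for simplicity
  recorder.stop();
  await new Promise((r) => (recorder.onstop = r));
  stream.getTracks().forEach((t) => t.stop());
  return new Blob(chunks, { type: "audio/webm" });
}

async function transcribe(clip: Blob, whisperKey: string): Promise<string> {
  const form = new FormData();
  form.append("file", clip, "query.webm");
  form.append("model", "whisper-1");
  const res = await fetch("https://api.openai.com/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${whisperKey}` },
    body: form,
  });
  return (await res.json()).text;
}

async function askGemini(question: string, pageText: string, geminiKey: string): Promise<string> {
  const prompt = `Page content:\n${pageText}\n\nQuestion: ${question}`;
  const res = await fetch(
    `https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=${geminiKey}`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
    },
  );
  const data = await res.json();
  return data.candidates[0].content.parts[0].text;
}
```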
Technical Implementation
- Custom sidebar UI with chat history and action bar
- Conversation memory and history persisted in IndexedDB
- Markdown-rendered Gemini outputs
- User-supplied API keys (Gemini for the LLM, Whisper for speech-to-text)
- Content scripts for DOM extraction and element selection
- Background/service worker coordination for request handling (sketched below)
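For the coordination piece, a minimal sketch of how an MV3 service worker might route a sidebar question: fetch context from the active tab's content script, call Gemini, and reply. The message types and the askGemini helper (from the earlier sketch) are assumptions, not the original code.

```typescript
// background.ts (illustrative sketch)
declare function askGemini(question: string, pageText: string, apiKey: string): Promise<string>;

chrome.runtime.onMessage.addListener((msg, _sender, sendResponse) => {
  if (msg?.type !== "ASK_ABOUT_PAGE") return;

  (async () => {
    const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });
    if (!tab?.id) {
      sendResponse({ error: "No active tab" });
      return;
    }

    // Ask the content script for the current page context.
    const context = await chrome.tabs.sendMessage(tab.id, { type: "PAGE_CONTEXT" });

    // Call Gemini with the user's question plus the page context.
    const answer = await askGemini(msg.question, context.text, msg.apiKey);
    sendResponse({ answer });
  })();

  return true; // keep the message channel open for the async response
});
```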
The entire system was built in a limited timeframe (a few weeks of part-time work), prioritizing core interaction flows over polish.
Constraints & Trade-offs
The project deliberately stopped short of becoming a full product due to Chrome extension constraints.
In particular:
- Sensitive API keys cannot safely live in extension client code
- Supporting first-party authentication would require a separate backend service
- Proxying requests through an external service would fundamentally change the architecture
Rather than overengineering around these limitations, I treated the extension as a prototype and capability exploration, not a production deployment.
Reflection
This project reinforced several practical lessons:
- Browser extensions are powerful but heavily constrained environments
- Context quality matters more than model complexity
- Selective scoping dramatically improves AI usefulness
- Accessibility-oriented features often emerge naturally from good interaction design
Google has since released native Gemini functionality in Chrome that overlaps with many of these ideas, validating the direction even if my implementation was intentionally lightweight and time-bound.