Architecture & Engineering Decisions
Every significant design decision, constraint, and tradeoff behind ASight — from LLM routing to RAG pipeline design to OAuth token management.
Tech Stack
System Overview
Next.js monolith · Chrome extension · five external ecosystems
```
Chrome Extension (React/Vite) ────────────────────────────┐
                                                          │
Browser · Next.js Web App                                 │
pages/ · webapp components · contexts                     │
          │                                               │
REST API Routes (/api/*)                                  │
          │                                               │
   ┌──────┼──────────────┐                                │
   │      │              │                                │
 LLMs   Vector DB    External APIs                        │
 Groq/Llama Pinecone  Google · Microsoft                  │
 OpenAI GPT-4         Slack · Discord · Zoom              │
          │                                               │
PostgreSQL (Prisma ORM) ◄─────────────────────────────────┘
```
Architecture layers
ASight is a full-stack AI productivity assistant built as a single Next.js application with an accompanying Chrome extension. The system integrates with five external ecosystems (email, calendar, documents, web, messaging) and orchestrates multiple AI models to provide natural-language interfaces over each.
One backend, two clients
The Chrome extension communicates directly with the same /api/* routes as the web app, using the same JWT auth. There is no separate extension backend — the extension is a thin UI layer over the same REST API. One codebase, one deploy, one auth system shared between both clients.
Dual LLM Strategy
Groq/Llama for privacy-sensitive ops · OpenAI GPT-4 for function calling
Why two models?
Every AI operation is assigned to a model based on three criteria: latency, function calling reliability, and privacy sensitivity. Groq runs Llama 3.3 70B at ~500–800 tokens/sec — roughly 10× faster than OpenAI for the same model class. Groq/Llama handles all privacy-sensitive operations (message content, uploaded documents). OpenAI GPT-4 is reserved exclusively for calendar operations that require reliable structured output via function calling.
Why not GPT-4 everywhere?
Privacy optics. Users connecting Gmail, Slack, and uploading private documents are making a trust decision. Routing their message contents through OpenAI is a harder sell than an open-source model. "We use open-source AI for your sensitive data" is a credible commitment. Calendar queries are different — they involve creating structured events, not reading sensitive content, so GPT-4's superior function calling is worth the tradeoff there.
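The routing rule above can be sketched as a single dispatch function. Names and operation labels here are illustrative, not ASight's actual identifiers:

```typescript
// Hypothetical sketch of the model-routing rule: privacy-sensitive operations
// go to the open-source model on Groq; only calendar goes to GPT-4.
type Operation = "inbox_summary" | "document_qa" | "tab_qa" | "calendar";

interface ModelRoute {
  provider: "groq" | "openai";
  model: string;
}

function routeModel(op: Operation): ModelRoute {
  // Calendar needs reliable structured output via function calling.
  if (op === "calendar") return { provider: "openai", model: "gpt-4" };
  // Everything that touches message or document content stays on Llama.
  return { provider: "groq", model: "llama-3.3-70b-versatile" };
}
```

Keeping the rule in one function means a new operation type must be explicitly assigned a route, rather than silently defaulting to the wrong model.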
RAG Pipeline — Document Q&A
Chunking · Pinecone vector search · Groq answer generation with citations
Full pipeline
Upload → extract text (pdf-parse / mammoth / tesseract OCR) → chunk into 1000-char segments with 200-char overlap → embed → Pinecone upsert. On query: embed question → Pinecone similarity search (topK=5) → retrieve raw chunk text from PostgreSQL → build LLM context → Groq generates answer with citations.
Why chunking with overlap?
200-char overlap prevents an answer from being missed when it spans a chunk boundary: without overlap, a sentence split exactly at a boundary is fragmented across two chunks, and neither fragment may match the query strongly enough to be retrieved. Chunk size 1000 chars (~200 words) balances retrieval precision against context completeness.
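A minimal sketch of the chunker, using the sizes from the pipeline above (the real implementation may additionally respect sentence boundaries):

```typescript
const CHUNK_SIZE = 1000;   // characters per chunk (~200 words)
const CHUNK_OVERLAP = 200; // characters shared between adjacent chunks

// Fixed-size chunking with overlap: each chunk starts 800 chars after the
// previous one, so every boundary region appears in two chunks.
function chunkText(
  text: string,
  size = CHUNK_SIZE,
  overlap = CHUNK_OVERLAP
): string[] {
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```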
Per-user Pinecone namespace
Each user's documents are isolated in their own Pinecone namespace. Without this, a similarity search could surface chunks from another user's documents. Namespacing by userId ensures all queries are scoped to the authenticated user only — security without the overhead of separate indexes per user.
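The scoping rule is easiest to see as a request builder: the namespace always comes from the authenticated session, never from the request body. A hypothetical sketch; the actual Pinecone client call is elided:

```typescript
// The namespace is derived from the authenticated user, so a crafted request
// can never search another user's vectors.
interface VectorQuery {
  namespace: string;
  topK: number;
  vector: number[];
  includeMetadata: boolean;
}

function buildUserQuery(
  authenticatedUserId: string, // from the verified JWT, not the request body
  embedding: number[]
): VectorQuery {
  return {
    namespace: authenticatedUserId, // per-user isolation
    topK: 5,
    vector: embedding,
    includeMetadata: true,
  };
}
```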
Why DocumentChunk in PostgreSQL too?
Pinecone stores vectors and metadata, not raw text. After Pinecone returns matching IDs, the answer pipeline fetches full chunk text from PostgreSQL with a single indexed query. Vector ops stay in Pinecone (fast ANN search), text storage stays in the relational DB (reliable, consistent retrieval).
Tab Q&A — Scraping Architecture
Chrome extension local extraction · Firecrawl for web app URL scraping
Two paths for tab content
The extension runs in the user's browser with DOM access. Content scripts extract text directly from live pages — no external service needed, and it works on authenticated pages (Gmail, banking) because the user is already logged in. The web app can't access browser tabs, so users paste URLs and the server fetches them via Firecrawl, which handles JS-heavy SPAs that return skeleton HTML on a plain fetch().
Smart domain detection & cache strategy
60+ domain patterns are classified as JS-heavy (SPAs, dynamic content) — Firecrawl auto-enabled for these, plain fetch for static sites. Cache TTL varies by content type: 0s for financial data (Bloomberg, Yahoo Finance — always fresh), 2–5min for news, 1h for documentation.
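A sketch of the TTL table; the domain lists here are illustrative examples from the categories above, not the full 60+ pattern set:

```typescript
// Cache TTL by content type: financial data is never cached, news briefly,
// everything else (documentation, static sites) for an hour.
const FINANCIAL_DOMAINS = ["bloomberg.com", "finance.yahoo.com"];
const NEWS_DOMAINS = ["reuters.com", "bbc.com"];

function cacheTtlSeconds(hostname: string): number {
  if (FINANCIAL_DOMAINS.some((d) => hostname.endsWith(d))) return 0; // always fresh
  if (NEWS_DOMAINS.some((d) => hostname.endsWith(d))) return 2 * 60; // 2–5 min bucket
  return 60 * 60; // documentation / static: 1 h
}
```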
40% speed improvement
Four changes: the Firecrawl timeout was raised from 30s to 45s (fewer silent failures on slow SPAs), the content limit from 20KB to 50KB (more context, better answers), scrapes now run 5 at a time via Promise.all instead of sequentially, and the LLM URL-selection pre-step was removed for the web app (saving 1–2s per request).
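The concurrency and timeout changes can be sketched together. Here `scrape` stands in for the Firecrawl call, which is not shown:

```typescript
// Race each scrape against a timeout so one slow SPA can't hang the batch.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
    ),
  ]);
}

// All scrapes in flight at once instead of sequentially.
async function scrapeAll(
  urls: string[],
  scrape: (url: string) => Promise<string>,
  timeoutMs = 45_000
): Promise<string[]> {
  return Promise.all(urls.map((u) => withTimeout(scrape(u), timeoutMs)));
}
```

Note that Promise.all rejects if any single scrape fails; a production version might prefer Promise.allSettled to return partial results.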
Multi-Provider OAuth
Google · Microsoft · Slack · Discord · Zoom — one token refresh pattern
Why not next-auth?
next-auth manages "who is logged in to ASight" — not "what APIs can ASight call on the user's behalf." These are separate concerns. App auth uses JWT (email + password, bcrypt). API tokens are per-service OAuth tokens in the DB for Gmail/Slack/etc API calls. Mixing them into next-auth sessions would require custom session callbacks per provider — more complex than the direct approach.
Transparent token refresh
All OAuth tokens are stored with expiresAt. MessageService.getAccessToken() checks expiry and transparently refreshes before every API call — callers never handle refresh themselves. Microsoft/Slack refresh tokens don't expire. Google expires after 6 months of non-use. Discord has a 30-day TTL.
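A sketch of the expiry check behind that pattern (field names assumed; the provider-specific refresh call is elided). Refreshing slightly before the deadline avoids a token expiring mid-request:

```typescript
interface StoredToken {
  accessToken: string;
  refreshToken: string;
  expiresAt: Date; // stored alongside the token in the DB
}

const REFRESH_SKEW_MS = 60_000; // refresh a minute early to avoid races

// True when the token is expired or about to expire; the caller then
// performs the provider-specific refresh and persists the new expiresAt.
function needsRefresh(token: StoredToken, now: Date = new Date()): boolean {
  return token.expiresAt.getTime() - now.getTime() < REFRESH_SKEW_MS;
}
```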
Inbox Aggregation
Gmail · Slack · Microsoft Teams · Discord — normalized into one interface
Normalization layer
Gmail, Slack, Teams, and Discord all return different message shapes. MessageService normalizes them into a unified Message interface before they reach any service layer: { id, service, from, subject?, channel?, body, receivedAt, isRead, threadId? }. The LLM summarization prompt receives this normalized array — it never sees platform-specific fields.
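One provider's adapter, sketched. The raw Slack shape here is simplified; the real API payload has many more fields:

```typescript
// The unified shape every provider normalizes into.
interface Message {
  id: string;
  service: "gmail" | "slack" | "teams" | "discord";
  from: string;
  subject?: string;
  channel?: string;
  body: string;
  receivedAt: Date;
  isRead: boolean;
  threadId?: string;
}

// Simplified stand-in for Slack's message payload.
interface RawSlackMessage {
  ts: string; // "seconds.microseconds"; also serves as the message id
  user: string;
  channel: string;
  text: string;
  thread_ts?: string;
}

function fromSlack(raw: RawSlackMessage): Message {
  return {
    id: raw.ts,
    service: "slack",
    from: raw.user,
    channel: raw.channel,
    body: raw.text,
    receivedAt: new Date(parseFloat(raw.ts) * 1000), // ts is in seconds
    isRead: true, // Slack exposes no per-message read state here; defaulted
    threadId: raw.thread_ts,
  };
}
```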
Why server-side sync (not streaming to frontend)?
An earlier design streamed messages from Gmail/Slack directly to the frontend. Two problems: (1) CORS — the frontend can't call Gmail's API with server-side tokens; (2) rate limits — every page load would trigger platform API calls. Current design fetches server-side, stores in MessageRollup, serves from DB. Frontend makes one API call to /api/inbox. Sync is triggered explicitly or on interval.
Calendar Agent with Function Calling
OpenAI GPT-4 function calling → Google Calendar API
Why function calling for calendar?
Calendar operations require precise structured output — ISO datetime, timezone, duration, attendees, recurrence rules. Free-text LLM output for these parameters is unreliable. GPT-4's function calling API produces reliable JSON tool calls that CalendarService passes directly to the Google Calendar API without any parsing step.
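The kind of tool definition this implies, sketched in the OpenAI tools format. The parameter set is illustrative, not ASight's actual schema:

```typescript
// A function/tool definition GPT-4 can target with a structured JSON call;
// the model fills these parameters instead of producing free text.
const createEventTool = {
  type: "function" as const,
  function: {
    name: "create_calendar_event",
    description: "Create a Google Calendar event",
    parameters: {
      type: "object",
      properties: {
        summary: { type: "string" },
        start: { type: "string", description: "ISO 8601 datetime with timezone" },
        end: { type: "string", description: "ISO 8601 datetime with timezone" },
        attendees: { type: "array", items: { type: "string", format: "email" } },
      },
      required: ["summary", "start", "end"],
    },
  },
};
```

Because the model's output must conform to this JSON Schema, the arguments can be passed to the Google Calendar API without a free-text parsing step.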
Multi-turn conversation + timezone
"What do I have tomorrow?" followed by "Move the 3pm one to 4pm" requires the agent to remember context. A conversation history array is passed with each GPT-4 call. The user's timezone from Intl.DateTimeFormat().resolvedOptions().timeZone is in the system prompt so relative expressions ("next Tuesday") resolve to correct ISO 8601 datetimes.
Chrome Extension Architecture
React/Vite · same REST API as web app · local DOM extraction
Why Vite (not Next.js)?
Chrome extensions require a static bundle referenced from an explicit manifest.json. Next.js builds a server-rendered app by default, not a static bundle a manifest can point at; Vite produces exactly that: a clean static React bundle the extension manifest references directly.
Session sharing & CORS isolation
The extension stores the JWT in localStorage (extension context). Both extension and web app call the same /api/* endpoints with the same JWT. CORS middleware explicitly allows chrome-extension://[id] as an origin while rejecting all others, preventing arbitrary websites from making credentialed requests with the user's session.
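The allowlist check, sketched (the extension id and web origin below are placeholders):

```typescript
// Only the web app's own origin and the published extension id may make
// credentialed requests; every other origin is rejected.
const ALLOWED_ORIGINS = new Set([
  "https://asight.example.com",          // placeholder web-app origin
  "chrome-extension://abcdefghijklmnop", // placeholder extension id
]);

function corsOriginAllowed(origin: string | undefined): boolean {
  return origin !== undefined && ALLOWED_ORIGINS.has(origin);
}
```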
Privacy-First Design
Open-source LLM for sensitive data · local tab extraction · 7-day document TTL
LLM choice as a privacy signal
Routing private messages and documents through Llama (open-source, via Groq) instead of OpenAI is a deliberate product decision. For a tool that accesses Gmail and private documents, "we use open-source AI for your sensitive data" is a credible, verifiable commitment — not just a policy claim.
Local tab extraction
Browser tab content is extracted locally by the content script and sent as text — not as a URL for the server to fetch. The server never receives session cookies or authenticated page URLs. Banking sites, social feeds, and healthcare portals are on a server-side blocklist that returns a warning instead of extracting content.
7-day document auto-deletion
DocumentCleanupService runs a daily cron to delete documents older than 7 days from both PostgreSQL and Pinecone. Enforced by code, not policy. OAuth tokens are stored server-side only — the extension never handles raw OAuth tokens, preventing exfiltration via XSS.
Stack Choices
Next.js monolith · Prisma · Pinecone · CSS modules
Why a single Next.js monolith
API routes live in app/api/ and share TypeScript types with the frontend. The Chrome extension calls the same /api/* endpoints as the web app — one codebase, one deploy, one auth system. No coordination overhead between two repos, no API contract drift.
Why CSS modules (not a component library)
The design system is custom — forest green / icy cyan palette, glass morphism cards, animated gradients. MUI or HeroUI would have required extensive theming overrides. CSS modules per page give precise control over animations without fighting a framework's opinions.
Why Pinecone over pgvector
Pinecone is managed — no index tuning, no extension to maintain, no I/O competition with SQL queries. pgvector co-locates vector and relational storage, meaning a large similarity search could compete for I/O with regular DB queries. Decoupling is the right call at this scale.
Database Design
PostgreSQL · Prisma ORM · UUID PKs · conversation threading
Schema overview
Organized around 5 use cases + core auth: User · OAuthToken · Conversation · TabQALog · MessageRollup · DraftMessage · CalendarAction · PeekWindow · DocumentSource · DocumentChunk · DocumentQALog. All Q&A interactions thread through Conversation with a useCase field — a single query powers the unified history sidebar.
UUID primary keys everywhere
For an application syncing with external systems (Gmail message IDs, Google Calendar event IDs), integer PKs risk ID collision or confusion between internal and external IDs. UUIDs have no such risk and can be generated before a DB round-trip — useful for optimistic UI updates.
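Generating the key before the insert is one line with Node's crypto module. A sketch of the optimistic-update shape this enables; the record field is hypothetical:

```typescript
import { randomUUID } from "node:crypto";

// The record id exists before any DB round-trip, so the UI can render the
// new row immediately and reconcile once the server confirms the write.
const draftId = randomUUID(); // RFC 4122 v4 UUID
const optimisticDraft = { id: draftId, status: "pending" as const };
```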
Why DocumentChunk in PostgreSQL (not only Pinecone)
Covered under the RAG pipeline above: Pinecone holds vectors and metadata, PostgreSQL holds the raw chunk text, and the answer pipeline joins the two with a single indexed query; each storage engine does what it's best at.
Deployment
VPS (Docker + Nginx) · Vercel · PostgreSQL · Pinecone
VPS for persistent processes
The web app runs on a VPS via Docker (Next.js + Nginx reverse proxy). Vercel serverless functions have a 10s timeout — insufficient for document indexing and inbox sync. The VPS handles long-running operations. Nginx provides SSL termination via Let's Encrypt and serves static assets.
Why cleanup cron runs in-process
The daily document cleanup cron runs inside the Next.js process via node-cron. A separate worker or external scheduler (Render cron, GitHub Actions) would add infrastructure complexity for one once-a-day task. If the server is down at cleanup time, documents are deleted the next day — a one-day delay has no user-visible impact.
Summary of Key Tradeoffs
Every significant decision had a real cost — here they are in one place.
| Decision | Benefit | Cost |
|---|---|---|
| Groq/Llama for sensitive ops | ✓ Privacy story + low latency + low cost | △ Less reliable structured output than GPT-4 |
| GPT-4 for calendar only | ✓ Reliable function calling for structured events | △ Higher cost per calendar query |
| Pinecone for vectors | ✓ Managed, fast, per-user namespace isolation | △ Another vendor dependency + ongoing cost |
| Direct OAuth (no library) | ✓ Full control per provider, no abstraction friction | △ More boilerplate per new provider |
| Tab extraction via content scripts | ✓ Works on authenticated pages, local privacy | △ Extension required; web app needs Firecrawl |
| 7-day document auto-deletion | ✓ Strong privacy default, enforced by code | △ Users must re-upload for persistent Q&A |
| CSS modules (no component lib) | ✓ Full design control, custom animations | △ More CSS to maintain |
| Single Next.js monolith | ✓ One codebase, shared types, one deploy | △ Long-running jobs need VPS (not Vercel) |
Want to see it in action?
All of this architecture, live in a working product.