Engineering Deep-Dive

Architecture & Engineering Decisions

Every significant design decision, constraint, and tradeoff behind ASight — from LLM routing to RAG pipeline design to OAuth token management.

Full-stack Next.js · Multi-LLM · Privacy-first · RAG pipeline
5 external integrations: Google · Microsoft · Slack · Discord · Zoom
2 LLM providers: Groq/Llama · OpenAI GPT-4
11 DB tables: PostgreSQL via Prisma ORM
60+ domain patterns: smart JS-heavy site detection
40% speed improvement: Tab Q&A scraping pipeline

Tech Stack

Next.js 15 · React 19 · TypeScript · Tailwind CSS · Node.js · Prisma ORM · JWT Auth · bcrypt · PostgreSQL · Pinecone · Groq / Llama 3.3 70B · OpenAI GPT-4 · LangChain · Firecrawl · Tesseract OCR · Chrome Extension · Vite · Docker · Nginx · Vercel

System Overview

Next.js monolith · Chrome extension · five external ecosystems

Key decision: One codebase, one deploy, two clients — extension reuses the same REST API as the web app.
Chrome Extension (React/Vite) ──────────────────────────┐
                                                        │
Browser  ·  Next.js Web App                             │
  pages/ · webapp components · contexts                 │
                    │                                   │
          REST API Routes  (/api/*)                     │
                    │                                   │
       ┌────────────┼────────────┐                      │
       │            │            │                      │
     LLMs      Vector DB    External APIs               │
  Groq/Llama   Pinecone     Google  ·  Microsoft        │
  OpenAI GPT-4              Slack  ·  Discord · Zoom    │
       │                                                │
 PostgreSQL  (Prisma ORM) ◄─────────────────────────────┘

Architecture layers

ASight is a full-stack AI productivity assistant built as a single Next.js application with an accompanying Chrome extension. The system integrates with five external ecosystems (email, calendar, documents, web, messaging) and orchestrates multiple AI models to provide natural-language interfaces over each.

One backend, two clients

The Chrome extension communicates directly with the same /api/* routes as the web app, using the same JWT auth. There is no separate extension backend — the extension is a thin UI layer over the same REST API. One codebase, one deploy, one auth system shared between both clients.

Dual LLM Strategy

Groq/Llama for privacy-sensitive ops · OpenAI GPT-4 for function calling

Key decision: ~500–800 tokens/sec on Groq vs ~50–80 on OpenAI — 10× faster where latency is user-visible.
Task | Model | Reason
Tab Q&A | Groq / Llama 3.3 70B | Fast inference, no function calls needed
Inbox summarization | Groq / Llama 3.3 70B | Privacy-sensitive · streaming
Draft reply generation | Groq / Llama 3.3 70B | Tone/style generation
Document Q&A | Groq / Llama 3.3 70B | Privacy-sensitive · doc contents
Calendar agent | OpenAI GPT-4 | Reliable multi-step function calling

Why two models?

Every AI operation is assigned to a model based on three criteria: latency, function calling reliability, and privacy sensitivity. Groq runs Llama 3.3 70B at ~500–800 tokens/sec — roughly 10× faster than OpenAI for the same model class. Groq/Llama handles all privacy-sensitive operations (message content, uploaded documents). OpenAI GPT-4 is reserved exclusively for calendar operations that require reliable structured output via function calling.
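The routing logic reduces to a small decision function. A minimal sketch — names and shape are illustrative, not ASight's actual code:

```typescript
type AiTask = {
  privacySensitive: boolean;     // message bodies, uploaded documents
  needsFunctionCalling: boolean; // structured tool calls (calendar)
};

// Privacy-sensitive work always stays on the open-source model; only
// structured function calling justifies GPT-4; everything else defaults
// to Groq for latency.
function pickModel(task: AiTask): string {
  if (task.privacySensitive) return "groq/llama-3.3-70b";
  if (task.needsFunctionCalling) return "openai/gpt-4";
  return "groq/llama-3.3-70b"; // default: fastest inference
}
```

Note the ordering: privacy wins over function calling, which is why the calendar agent (structured events, no sensitive content) is the only GPT-4 path.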

Why not GPT-4 everywhere?

Privacy optics. Users connecting Gmail, Slack, and uploading private documents are making a trust decision. Routing their message contents through OpenAI is a harder sell than an open-source model. "We use open-source AI for your sensitive data" is a credible commitment. Calendar queries are different — they involve creating structured events, not reading sensitive content, so GPT-4's superior function calling is worth the tradeoff there.

RAG Pipeline — Document Q&A

Chunking · Pinecone vector search · Groq answer generation with citations

Key decision: topK=5 retrieval + 200-char chunk overlap means answers never get lost at a chunk boundary.

Full pipeline

Upload → extract text (pdf-parse / mammoth / tesseract OCR) → chunk into 1000-char segments with 200-char overlap → embed → Pinecone upsert. On query: embed question → Pinecone similarity search (topK=5) → retrieve raw chunk text from PostgreSQL → build LLM context → Groq generates answer with citations.

Why chunking with overlap?

200-char overlap prevents an answer from spanning a chunk boundary and being missed by retrieval — without overlap, a sentence split exactly at a boundary would appear in neither chunk at full strength. Chunk size 1000 chars (~200 words) balances retrieval precision against context completeness.

Per-user Pinecone namespace

Each user's documents are isolated in their own Pinecone namespace. Without this, a similarity search could surface chunks from another user's documents. Namespacing by userId ensures all queries are scoped to the authenticated user only — security without the overhead of separate indexes per user.

Why DocumentChunk in PostgreSQL too?

Pinecone stores vectors and metadata, not raw text. After Pinecone returns matching IDs, the answer pipeline fetches full chunk text from PostgreSQL with a single indexed query. Vector ops stay in Pinecone (fast ANN search), text storage stays in the relational DB (reliable, consistent retrieval).
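One subtlety in that join: the relational `IN (...)` query does not preserve Pinecone's similarity ranking, so the results must be re-ordered by score. A sketch with hypothetical shapes for the Pinecone matches and the Prisma rows:

```typescript
type VectorMatch = { id: string; score: number }; // from Pinecone query
type ChunkRow = { id: string; text: string };     // from PostgreSQL

// Re-assemble LLM context: map DB rows by id, then walk the matches in
// descending similarity order so the most relevant chunks come first.
function buildContext(matches: VectorMatch[], rows: ChunkRow[]): string {
  const textById = new Map(rows.map((r) => [r.id, r.text]));
  return matches
    .slice()
    .sort((a, b) => b.score - a.score)
    .map((m) => textById.get(m.id))
    .filter((t): t is string => t !== undefined)
    .join("\n---\n");
}
```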

Tab Q&A — Scraping Architecture

Chrome extension local extraction · Firecrawl for web app URL scraping

Key decision: Extension extracts DOM locally — authenticated pages like Gmail never leave the browser as a URL.

Two paths for tab content

The extension runs in the user's browser with DOM access. Content scripts extract text directly from live pages — no external service needed, and it works on authenticated pages (Gmail, banking) because the user is already logged in. The web app can't access browser tabs, so users paste URLs and the server fetches them via Firecrawl, which handles JS-heavy SPAs that return skeleton HTML on a plain fetch().

Smart domain detection & cache strategy

60+ domain patterns are classified as JS-heavy (SPAs, dynamic content) — Firecrawl auto-enabled for these, plain fetch for static sites. Cache TTL varies by content type: 0s for financial data (Bloomberg, Yahoo Finance — always fresh), 2–5min for news, 1h for documentation.
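The two decisions — scrape strategy and cache TTL — can both be expressed as pattern lookups. A sketch with a tiny, hypothetical subset of the 60+ patterns:

```typescript
// Illustrative subset of the domain pattern lists (entries hypothetical).
const JS_HEAVY = [/twitter\.com/, /linkedin\.com/, /medium\.com/];
const NO_CACHE = [/bloomberg\.com/, /finance\.yahoo\.com/]; // financial: always fresh
const NEWS = [/reuters\.com/, /bbc\.co/];

// Scraping strategy: Firecrawl for JS-heavy SPAs, plain fetch otherwise.
function needsFirecrawl(url: string): boolean {
  return JS_HEAVY.some((p) => p.test(url));
}

// Cache TTL in seconds, varying by content type.
function cacheTtl(url: string): number {
  if (NO_CACHE.some((p) => p.test(url))) return 0; // financial data
  if (NEWS.some((p) => p.test(url))) return 300;   // news: 2–5 min range
  return 3600;                                     // documentation: 1 h
}
```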

40% speed improvement

Four changes produced the gain: Firecrawl timeout raised from 30s to 45s (fewer silent failures on slow SPAs), content limit raised from 20KB to 50KB (more context means better answers), 5 concurrent scrapes with Promise.all instead of sequential fetching, and removal of the LLM URL-selection pre-step for the web app (saves 1–2s per request).
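The concurrency change is the biggest structural one. A sketch with the scraper injected (so the same helper covers Firecrawl and plain fetch) and a per-URL catch so one failure doesn't sink the batch — helper name and shape are assumptions:

```typescript
// Scrape up to `limit` URLs concurrently instead of sequentially.
// Failed scrapes resolve to null rather than rejecting the whole batch.
async function scrapeAll(
  urls: string[],
  scrape: (url: string) => Promise<string>,
  limit = 5
): Promise<(string | null)[]> {
  const batch = urls.slice(0, limit);
  return Promise.all(batch.map((u) => scrape(u).catch(() => null)));
}
```

Total wall time becomes the slowest single scrape rather than the sum of all of them.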

Multi-Provider OAuth

Google · Microsoft · Slack · Discord · Zoom — one token refresh pattern

Key decision: App auth (JWT) and API tokens are intentionally separate concerns — next-auth conflates them.

Why not next-auth?

next-auth manages "who is logged in to ASight" — not "what APIs can ASight call on the user's behalf." These are separate concerns. App auth uses JWT (email + password, bcrypt). API tokens are per-service OAuth tokens in the DB for Gmail/Slack/etc API calls. Mixing them into next-auth sessions would require custom session callbacks per provider — more complex than the direct approach.

Transparent token refresh

All OAuth tokens are stored with an expiresAt timestamp. MessageService.getAccessToken() checks expiry and transparently refreshes before every API call, so callers never handle refresh themselves. Refresh-token lifetimes differ by provider: Microsoft and Slack refresh tokens don't expire, Google's expire after 6 months of non-use, and Discord's have a 30-day TTL.
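The pattern can be sketched as an expiry check plus a wrapper, with the provider-specific refresh call injected. The 60-second skew is an assumption here (refresh slightly early so a token never dies mid-request); names are illustrative:

```typescript
// Treat a token as expired slightly before its real expiry.
function isExpired(expiresAt: Date, now = new Date(), skewMs = 60_000): boolean {
  return expiresAt.getTime() - skewMs <= now.getTime();
}

// Transparent-refresh wrapper: callers just ask for a usable token.
async function getAccessToken(
  token: { accessToken: string; refreshToken: string; expiresAt: Date },
  refresh: (refreshToken: string) => Promise<{ accessToken: string; expiresAt: Date }>
): Promise<string> {
  if (!isExpired(token.expiresAt)) return token.accessToken;
  const fresh = await refresh(token.refreshToken); // provider-specific call
  token.accessToken = fresh.accessToken;           // persisted to the DB in real code
  token.expiresAt = fresh.expiresAt;
  return fresh.accessToken;
}
```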

Inbox Aggregation

Gmail · Slack · Microsoft Teams · Discord — normalized into one interface

Key decision: Four different message shapes normalized into one interface before any service layer sees them.

Normalization layer

Gmail, Slack, Teams, and Discord all return different message shapes. MessageService normalizes them into a unified Message interface before they reach any service layer: { id, service, from, subject?, channel?, body, receivedAt, isRead, threadId? }. The LLM summarization prompt receives this normalized array — it never sees platform-specific fields.
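The interface from the paragraph above, plus one adapter as an example. The Slack payload shape here is a simplified assumption (real fields come from Slack's conversations.history API):

```typescript
interface Message {
  id: string;
  service: "gmail" | "slack" | "teams" | "discord";
  from: string;
  subject?: string;   // email only
  channel?: string;   // chat platforms only
  body: string;
  receivedAt: Date;
  isRead: boolean;
  threadId?: string;
}

// One adapter per platform; this is the Slack one (simplified).
function fromSlack(m: { ts: string; user: string; text: string; channel: string }): Message {
  return {
    id: m.ts,
    service: "slack",
    from: m.user,
    channel: m.channel,
    body: m.text,
    receivedAt: new Date(parseFloat(m.ts) * 1000), // Slack ts is epoch seconds
    isRead: false,
  };
}
```

Downstream code — including the summarization prompt builder — only ever sees `Message[]`.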

Why server-side sync (not streaming to frontend)?

An earlier design streamed messages from Gmail/Slack directly to the frontend. Two problems: (1) CORS — the frontend can't call Gmail's API with server-side tokens; (2) rate limits — every page load would trigger platform API calls. Current design fetches server-side, stores in MessageRollup, serves from DB. Frontend makes one API call to /api/inbox. Sync is triggered explicitly or on interval.

Calendar Agent with Function Calling

OpenAI GPT-4 function calling → Google Calendar API

Key decision: "Schedule a team sync tomorrow at 2pm" → structured ISO datetime JSON, no parsing needed.

Why function calling for calendar?

Calendar operations require precise structured output — ISO datetime, timezone, duration, attendees, recurrence rules. Free-text LLM output for these parameters is unreliable. GPT-4's function calling API produces reliable JSON tool calls that CalendarService passes directly to the Google Calendar API without any parsing step.
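A tool definition in OpenAI's chat-completions function-calling format shows what "structured output" means concretely — the model's tool call comes back as JSON validated against this schema. Field names here are illustrative, not ASight's actual schema:

```typescript
// JSON Schema tool definition handed to GPT-4 (illustrative fields).
const createEventTool = {
  type: "function" as const,
  function: {
    name: "create_calendar_event",
    description: "Create a Google Calendar event",
    parameters: {
      type: "object",
      properties: {
        summary: { type: "string" },
        start: { type: "string", description: "ISO 8601 datetime with timezone" },
        end: { type: "string", description: "ISO 8601 datetime with timezone" },
        attendees: { type: "array", items: { type: "string" } },
      },
      required: ["summary", "start", "end"],
    },
  },
};
```

The resulting tool-call arguments can be passed to the Google Calendar API with no free-text parsing in between.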

Multi-turn conversation + timezone

"What do I have tomorrow?" followed by "Move the 3pm one to 4pm" requires the agent to remember context. A conversation history array is passed with each GPT-4 call. The user's timezone from Intl.DateTimeFormat().resolvedOptions().timeZone is in the system prompt so relative expressions ("next Tuesday") resolve to correct ISO 8601 datetimes.

Chrome Extension Architecture

React/Vite · same REST API as web app · local DOM extraction

Key decision: Vite, not Next.js — Chrome requires a static bundle; Next.js compiles to a server-rendered app.

Why Vite (not Next.js)?

Chrome extensions require a static bundle with an explicit manifest.json. Next.js compiles to a server-rendered app — not a Chrome-compatible static bundle. Vite builds a clean static bundle with React that the extension manifest can reference directly.

Session sharing & CORS isolation

The extension stores the JWT token in localStorage (extension context). Both extension and web app call the same /api/* endpoints with the same JWT. CORS middleware explicitly allows chrome-extension://[id] as an origin, preventing other websites from making credentialed requests using the user's session.
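The CORS check reduces to an origin allowlist that echoes only known origins back. A sketch — both origins here are hypothetical placeholders, and the real extension ID comes from the published extension:

```typescript
// Allowlist: the web app's own origin plus the installed extension.
const ALLOWED = new Set([
  "https://asight.example.com",          // hypothetical web origin
  "chrome-extension://abcdefghijklmnop", // hypothetical extension ID
]);

// Return CORS headers only for allowlisted origins; everyone else gets
// no grant, so arbitrary websites can't make credentialed requests.
function corsHeaders(origin: string | null): Record<string, string> {
  if (!origin || !ALLOWED.has(origin)) return {};
  return {
    "Access-Control-Allow-Origin": origin, // echo, never wildcard, with credentials
    "Access-Control-Allow-Credentials": "true",
    "Access-Control-Allow-Headers": "Authorization, Content-Type",
  };
}
```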

Privacy-First Design

Open-source LLM for sensitive data · local tab extraction · 7-day document TTL

Key decision: Privacy commitments enforced by code, not policy — cleanup cron, server-side tokens, LLM routing.

LLM choice as a privacy signal

Routing private messages and documents through Llama (open-source, via Groq) instead of OpenAI is a deliberate product decision. For a tool that accesses Gmail and private documents, "we use open-source AI for your sensitive data" is a credible, verifiable commitment — not just a policy claim.

Local tab extraction

Browser tab content is extracted locally by the content script and sent as text — not as a URL for the server to fetch. The server never receives session cookies or authenticated page URLs. Banking sites, social feeds, and healthcare portals are on a server-side blocklist that returns a warning instead of extracting content.

7-day document auto-deletion

DocumentCleanupService runs a daily cron to delete documents older than 7 days from both PostgreSQL and Pinecone. Enforced by code, not policy. OAuth tokens are stored server-side only — the extension never handles raw OAuth tokens, preventing exfiltration via XSS.

Stack Choices

Next.js monolith · Prisma · Pinecone · CSS modules

Key decision: No component library — the design system is custom, not themed.

Why a single Next.js monolith

API routes live in app/api/ and share TypeScript types with the frontend. The Chrome extension calls the same /api/* endpoints as the web app — one codebase, one deploy, one auth system. No coordination overhead between two repos, no API contract drift.

Why CSS modules (not a component library)

The design system is custom — forest green / icy cyan palette, glass morphism cards, animated gradients. MUI or HeroUI would have required extensive theming overrides. CSS modules per page give precise control over animations without fighting a framework's opinions.

Why Pinecone over pgvector

Pinecone is managed — no index tuning, no extension to maintain, no I/O competition with SQL queries. pgvector co-locates vector and relational storage, meaning a large similarity search could compete for I/O with regular DB queries. Decoupling is the right call at this scale.

Database Design

PostgreSQL · Prisma ORM · UUID PKs · conversation threading

Key decision: UUIDs allow client-side ID generation before any DB round-trip — useful for optimistic UI.

Schema overview

Organized around 5 use cases + core auth: User · OAuthToken · Conversation · TabQALog · MessageRollup · DraftMessage · CalendarAction · PeekWindow · DocumentSource · DocumentChunk · DocumentQALog. All Q&A interactions thread through Conversation with a useCase field — a single query powers the unified history sidebar.

UUID primary keys everywhere

For an application syncing with external systems (Gmail message IDs, Google Calendar event IDs), integer PKs risk ID collision or confusion between internal and external IDs. UUIDs have no such risk and can be generated before a DB round-trip — useful for optimistic UI updates.
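The optimistic-UI benefit in one line: mint the primary key before the INSERT, render immediately, send the same ID to the server. A minimal sketch:

```typescript
import { randomUUID } from "node:crypto";

// The client generates the PK itself (UUID v4), renders the row
// optimistically, then the server inserts with the same id — no
// round-trip needed before the UI updates.
const draft = {
  id: randomUUID(),
  body: "Thanks, see you then!",
  createdAt: new Date(),
};
```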

Why DocumentChunk in PostgreSQL (not only Pinecone)

Pinecone stores vector embeddings and metadata, not raw text. After Pinecone returns matching IDs, the answer pipeline fetches full chunk text from PostgreSQL in a single indexed query. Vector ops in Pinecone, text in the relational DB — each storage engine does what it's best at.

Deployment

VPS (Docker + Nginx) · Vercel · PostgreSQL · Pinecone

Key decision: VPS for long-running jobs (Vercel 10s timeout is too short for document indexing and inbox sync).

VPS for persistent processes

The web app runs on a VPS via Docker (Next.js + Nginx reverse proxy). Vercel serverless functions have a 10s timeout — insufficient for document indexing and inbox sync. The VPS handles long-running operations. Nginx provides SSL termination via Let's Encrypt and serves static assets.

Why cleanup cron runs in-process

The daily document cleanup cron runs inside the Next.js process via node-cron. A separate worker or external scheduler (Render cron, GitHub Actions) would add infrastructure complexity for one once-a-day task. If the server is down at cleanup time, documents are deleted the next day — a one-day delay has no user-visible impact.
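The job itself is a pure eligibility check wrapped in a schedule. A sketch — the service name and the cron expression are assumptions:

```typescript
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// A document is eligible for deletion (from both PostgreSQL and
// Pinecone) once it is older than 7 days.
function isExpiredDocument(uploadedAt: Date, now = new Date()): boolean {
  return now.getTime() - uploadedAt.getTime() > SEVEN_DAYS_MS;
}

// In-process scheduling with node-cron would look roughly like:
//   cron.schedule("0 3 * * *", () => documentCleanupService.run());
```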

Summary of Key Tradeoffs

Every significant decision had a real cost — here they are in one place.

Decision | Benefit | Cost
Groq/Llama for sensitive ops | Privacy story + low latency + low cost | Less reliable structured output than GPT-4
GPT-4 for calendar only | Reliable function calling for structured events | Higher cost per calendar query
Pinecone for vectors | Managed, fast, per-user namespace isolation | Another vendor dependency + ongoing cost
Direct OAuth (no library) | Full control per provider, no abstraction friction | More boilerplate per new provider
Tab extraction via content scripts | Works on authenticated pages, local privacy | Extension required; web app needs Firecrawl
7-day document auto-deletion | Strong privacy default, enforced by code | Users must re-upload for persistent Q&A
CSS modules (no component lib) | Full design control, custom animations | More CSS to maintain
Single Next.js monolith | One codebase, shared types, one deploy | Long-running jobs need VPS (not Vercel)

Want to see it in action?

All of this architecture, live in a working product.
