Granola Audio Transcription System Architecture
Granola v7.65.0 — Complete Audio/Transcription Architecture
End-to-End Data Flow
1. Audio Capture (Native Layer)
The native module granola.node (Obj-C/C++) handles all audio capture through CombinedAudioCapture:
- ScreenCaptureKitListener (SCStream API) — captures system audio (what you hear), i.e. other participants' voices
- AudioListener (AVFAudio) — captures your microphone input
- WebRTC AEC3 — echo cancellation that separates your voice from system playback, removing speaker bleed from the mic signal
Audio format: 16-bit PCM, 480-sample chunks (~10 ms frames, consistent with a 48 kHz sample rate). Each frame callback delivers a pair of raw PCM buffers: {microphoneBuffer, systemAudioBuffer}.
2. Audio Processing (Worker Thread)
audio_process/index.js runs in a worker thread and receives buffers via parentPort.on("message").
Responsibilities:
- Calculates RMS volume per buffer
- Applies auto-gain compensation
- Forwards buffers to the main process
- Logs metrics every 100K frames
Events handled: start-audio-capture, stop-audio-capture, request-microphone-permission, request-system-audio-permission.
Output: postMessage({event: "audio-capture-buffer", microphoneBuffer, systemAudioBuffer})
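The per-buffer math above (RMS volume, auto-gain) can be sketched as plain functions. This is a hedged illustration, not Granola's actual code: the function names and the `targetRms`/`maxGain` values are assumptions.

```javascript
// Sketch of the worker-thread metrics described above.
// computeRms/applyAutoGain and their defaults are illustrative assumptions.
function computeRms(samples) {            // samples: Int16Array
  let sumSquares = 0;
  for (const s of samples) {
    const norm = s / 32768;               // scale Int16 into [-1, 1)
    sumSquares += norm * norm;
  }
  return Math.sqrt(sumSquares / samples.length);
}

function applyAutoGain(samples, targetRms = 0.1, maxGain = 4) {
  const rms = computeRms(samples);
  if (rms === 0) return samples;          // silent frame: nothing to do
  const gain = Math.min(targetRms / rms, maxGain);
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // apply gain, then clamp back into the Int16 range
    out[i] = Math.max(-32768, Math.min(32767, Math.round(samples[i] * gain)));
  }
  return out;
}
```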
3. Main Process (Electron)
main/index-D9Gey83_.js receives audio buffers via IPC and routes them down two parallel paths:
Path A — Real-time Streaming: Selects a transcription provider, obtains an auth token from the backend, and opens a WebSocket connection to the provider.
Path B — Audio File Upload:
Requests a presigned S3 URL via POST /v1/request-audio-upload, then PUTs the audio directly to S3.
4A. AWS S3 (Audio Storage)
Raw audio is uploaded via a presigned URL signed with SigV4 credentials. Content type is audio/wav or MP4; presigned URLs expire after 1 hour.
This upload is always active when recording — it is not optional. Audio is stored on Granola's S3.
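The two-step flow above can be sketched as follows. Only the endpoint path (`/v1/request-audio-upload`) and the PUT-to-presigned-URL pattern come from this document; the request/response field names and the injected `fetchImpl`/`base` parameters are assumptions for illustration.

```javascript
// Hedged sketch of the presigned-URL upload flow.
// fetchImpl is injected so the flow can be exercised without a network;
// the {document_id} body and {url} response fields are assumptions.
async function uploadAudio(documentId, audioBytes, fetchImpl, base) {
  // Step 1: ask the backend for a presigned S3 URL
  const res = await fetchImpl(`${base}/v1/request-audio-upload`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ document_id: documentId }),
  });
  const { url } = await res.json();       // presigned URL (~1 h expiry)

  // Step 2: PUT the audio bytes directly to S3
  const put = await fetchImpl(url, {
    method: "PUT",
    headers: { "Content-Type": "audio/wav" },
    body: audioBytes,
  });
  if (!put.ok) throw new Error(`upload failed: ${put.status}`);
  return url;
}
```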
4B. Transcription Providers (Real-time WebSocket)
All providers receive raw Int16 PCM binary frames and return partial + final transcript JSON. No client-side encryption is applied to audio before transmission.
Provider Fallback Chain
| Priority | Provider | Endpoint | Auth Method |
|---|---|---|---|
| Primary | Deepgram (Nova-2) | wss://api.deepgram.com/v1/listen / wss://d315810d.api.deepgram.com/v1/listen (dedicated) | Bearer token in headers; refreshed every 4 hr via GET /v1/get-deepgram-token |
| Secondary | AssemblyAI | wss://streaming.assemblyai.com/v3/ws | Bearer token from GET /v1/get-transcription-auth; cached 5 min with 30s refresh buffer |
| Tertiary | Speechmatics | wss://eu2.rt.speechmatics.com/v2 | JWT passed as ?jwt= query parameter (in the URL) |
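The fallback behavior in the table above can be sketched as a simple loop. The `connect()` callback and the return shape are illustrative assumptions; in the real client each provider name would map to the WebSocket endpoint and auth method listed above.

```javascript
// Sketch of the provider fallback chain: try each provider in priority
// order and return the first successful connection.
const PROVIDERS = ["deepgram", "assemblyai", "speechmatics"];

async function connectWithFallback(connect, providers = PROVIDERS) {
  const errors = [];
  for (const name of providers) {
    try {
      return { provider: name, socket: await connect(name) };
    } catch (err) {
      errors.push(`${name}: ${err.message}`);  // fall through to the next
    }
  }
  throw new Error(`all providers failed: ${errors.join("; ")}`);
}
```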
Deepgram response shape:

{"type": "Results", "is_final": true, "speech_final": true, "alternatives": [{"transcript": "...", "confidence": 0.98}]}

AssemblyAI response shape:

{"message_type": "...", "transcript": "...", "confidence": 0.97, "words": [...]}

If the primary provider fails, the system falls through to the next in the chain.
5. Transcript Handling (Main Process)
The main process parses each provider's response into a normalized shape:
{"text": "...", "isFinal": true, "isUtteranceComplete": true, "confidence": 0.98, "words": []}

Two parallel output paths:
- Path A → Renderer (IPC): webContents.send("granola-talk:transcription", {text, isFinal}) for real-time UI updates.
- Path B → Granola Backend: POST /v1/insert-transcriptions with {document_id, chunks: [{text, start_time, end_time, speaker, confidence}]}. Failed chunks are batched and retried automatically.
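The normalization step can be sketched by mapping each provider payload shown earlier into the common shape. Field names follow this document's JSON samples; the AssemblyAI `"FinalTranscript"` message_type value is an assumption, not confirmed by the source.

```javascript
// Hedged sketch of per-provider response normalization.
function normalizeTranscript(provider, msg) {
  if (provider === "deepgram") {
    const alt = (msg.alternatives && msg.alternatives[0]) || {};
    return {
      text: alt.transcript || "",
      isFinal: Boolean(msg.is_final),
      isUtteranceComplete: Boolean(msg.speech_final),
      confidence: alt.confidence || 0,
      words: alt.words || [],
    };
  }
  if (provider === "assemblyai") {
    // "FinalTranscript" as the final-message marker is an assumption
    const final = msg.message_type === "FinalTranscript";
    return {
      text: msg.transcript || "",
      isFinal: final,
      isUtteranceComplete: final,
      confidence: msg.confidence || 0,
      words: msg.words || [],
    };
  }
  throw new Error(`unknown provider: ${provider}`);
}
```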
6. Renderer (UI Display)
granolatalk-CZslIZY1.js receives events via IPC:
- granola-talk:transcription — live text updates
- granola-talk:connection-state — connection status indicator
Displays partial transcripts (live), final transcripts (committed), and connection state.
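A minimal sketch of how a renderer might fold those IPC events into display state: each new partial replaces the previous one, and a final result is committed. The state shape here is an assumption for illustration, not Granola's actual renderer code.

```javascript
// Hedged sketch of renderer-side transcript state handling.
const initialState = { committed: [], partial: "", connection: "disconnected" };

function reduceTranscript(state, channel, payload) {
  if (channel === "granola-talk:connection-state") {
    return { ...state, connection: payload };
  }
  if (channel === "granola-talk:transcription") {
    return payload.isFinal
      ? { ...state, committed: [...state.committed, payload.text], partial: "" }
      : { ...state, partial: payload.text };  // live partial overwrites
  }
  return state;                               // ignore unknown channels
}
```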
The Granola Backend WebSocket
Separate from the transcription WebSocket, a ReconnectingWebSocket connects to Granola's own backend:
- Primary: wss://5p69hiii4m.execute-api.us-east-1.amazonaws.com/prod
- Public: wss://n71xi8mtih.execute-api.us-east-1.amazonaws.com/prod (shared/public document access)
Backed by AWS API Gateway → Lambda → DynamoDB/RDS.
This WebSocket is not used for audio. It handles:
- Document collaboration (Y.js)
- Cursor positions / presence
- Live transcript chunk sync
- Chat message streaming
Reconnect strategy: exponential backoff, max 10s, 1.3× multiplier.
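The stated reconnect schedule can be expressed as a pure function: exponential backoff with a 1.3× multiplier, capped at 10 s. The 1 s initial delay is an assumption; only the multiplier and cap are given above.

```javascript
// Sketch of the reconnect delay schedule: initialMs * 1.3^attempt,
// capped at 10,000 ms. The initial delay is an assumed value.
function reconnectDelayMs(attempt, initialMs = 1000) {
  return Math.min(initialMs * Math.pow(1.3, attempt), 10000);
}
```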
Where Your Data Ends Up
| Data Type | Destination | Persisted? | User Controls Deletion? |
|---|---|---|---|
| Raw audio stream | Deepgram / AssemblyAI / Speechmatics | Unclear — depends on provider retention policies | No (provider-controlled) |
| Raw audio file | AWS S3 (Granola-controlled bucket) | Yes | Via hard-delete-document |
| Transcript chunks | Granola PostgreSQL DB | Yes | Via delete-transcription-chunks |
| Transcript embeddings | Turbopuffer vector DB (semantic search) | Yes | Via hard-delete-document |
| Document/notes | Granola DB | Yes | Via hard-delete-document |
| Session replay | LogRocket | Yes | No (Granola-controlled) |
| Behavioral analytics | Amplitude | Yes | No (Granola-controlled) |
| Device fingerprint | FingerprintJS | Yes | No |
Key Questions & Answers
How does real-time transcription work? Raw PCM audio is streamed over WebSocket to Deepgram (primary). Deepgram returns partial transcript JSON every ~100–300ms and final results at utterance boundaries. These are sent to the renderer via Electron IPC for immediate display. Latency is typically sub-second.
Is audio processed locally? Only echo cancellation (WebRTC AEC3) runs locally. All speech-to-text is cloud-based. Raw audio leaves your machine unencrypted over WSS.
Who has your audio? At minimum: the active transcription provider (Deepgram by default) and Granola (via S3 upload). Audio upload appears to be standard behavior, not optional.
Is there a privacy mode?
The API exposes privacy-mode getter/setter endpoints (e.g. GET /v1/get-privacy-mode) plus consent endpoints, but based on the architecture, these likely control sharing/collaboration rather than preventing audio upload.
What about the Speechmatics JWT-in-URL issue?
Speechmatics auth tokens are passed as query parameters (?jwt=...), which means they appear in server access logs, proxy logs, and potentially browser history. This is a security concern compared to Deepgram's header-based auth.