Granola Audio Transcription System Architecture
Granola v7.65.0 — Complete Audio/Transcription Architecture
End-to-End Data Flow
1. Audio Capture (Native Layer)
The native module granola.node (Obj-C/C++) handles all audio capture through CombinedAudioCapture:
- ScreenCaptureKitListener (SCStream API) — captures system audio (what you hear), i.e. other participants' voices
- AudioListener (AVFAudio) — captures your microphone input
- WebRTC AEC3 — echo cancellation that separates your voice from system playback, removing speaker bleed from the mic signal
Audio format: 16-bit PCM, 480-sample chunks (~10 ms frames, consistent with a 48 kHz sample rate). Each frame callback delivers a pair of raw PCM buffers: {microphoneBuffer, systemAudioBuffer}.
2. Audio Processing (Worker Thread)
audio_process/index.js runs in a worker thread and receives buffers via parentPort.on("message").
Responsibilities:
- Calculates RMS volume per buffer
- Applies auto-gain compensation
- Forwards buffers to the main process
- Logs metrics every 100K frames
Events handled: start-audio-capture, stop-audio-capture, request-microphone-permission, request-system-audio-permission.
Output: postMessage({event: "audio-capture-buffer", microphoneBuffer, systemAudioBuffer})
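The per-buffer math above (RMS volume, auto-gain) can be sketched as plain functions. This is a hedged illustration, not Granola's actual code: the function names and the `targetRms`/`maxGain` values are assumptions.

```javascript
// Sketch of the worker-thread metrics described above.
// computeRms/applyAutoGain and their defaults are illustrative assumptions.
function computeRms(samples) {            // samples: Int16Array
  let sumSquares = 0;
  for (const s of samples) {
    const norm = s / 32768;               // scale Int16 into [-1, 1)
    sumSquares += norm * norm;
  }
  return Math.sqrt(sumSquares / samples.length);
}

function applyAutoGain(samples, targetRms = 0.1, maxGain = 4) {
  const rms = computeRms(samples);
  if (rms === 0) return samples;          // silent frame: nothing to do
  const gain = Math.min(targetRms / rms, maxGain);
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // apply gain, then clamp back into the Int16 range
    out[i] = Math.max(-32768, Math.min(32767, Math.round(samples[i] * gain)));
  }
  return out;
}
```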
3. Main Process (Electron)
main/index-D9Gey83_.js receives audio buffers via IPC and routes them down two parallel paths:
Path A — Real-time Streaming: Selects a transcription provider, obtains an auth token from the backend, and opens a WebSocket connection to the provider.
Path B — Audio File Upload:
Requests a presigned S3 URL via POST /v1/request-audio-upload, then PUTs the audio directly to S3.
4A. AWS S3 (Audio Storage)
Raw audio is uploaded via a presigned URL signed with SigV4 credentials. Content type is audio/wav or MP4; presigned URLs expire after 1 hour.
This upload is always active when recording — it is not optional. Audio is stored on Granola's S3.
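The two-step flow above can be sketched as follows. Only the endpoint path (`/v1/request-audio-upload`) and the PUT-to-presigned-URL pattern come from this document; the request/response field names and the injected `fetchImpl`/`base` parameters are assumptions for illustration.

```javascript
// Hedged sketch of the presigned-URL upload flow.
// fetchImpl is injected so the flow can be exercised without a network;
// the {document_id} body and {url} response fields are assumptions.
async function uploadAudio(documentId, audioBytes, fetchImpl, base) {
  // Step 1: ask the backend for a presigned S3 URL
  const res = await fetchImpl(`${base}/v1/request-audio-upload`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ document_id: documentId }),
  });
  const { url } = await res.json();       // presigned URL (~1 h expiry)

  // Step 2: PUT the audio bytes directly to S3
  const put = await fetchImpl(url, {
    method: "PUT",
    headers: { "Content-Type": "audio/wav" },
    body: audioBytes,
  });
  if (!put.ok) throw new Error(`upload failed: ${put.status}`);
  return url;
}
```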
4B. Transcription Providers (Real-time WebSocket)
All providers receive raw Int16 PCM binary frames and return partial + final transcript JSON. No client-side encryption is applied to audio before transmission.
Provider Fallback Chain
| Priority | Provider | Endpoint | Auth Method |
|---|---|---|---|
| Primary | Deepgram (Nova-2) | wss://api.deepgram.com/v1/listen / wss://d315810d.api.deepgram.com/v1/listen (dedicated) | Bearer token in headers; refreshed every 4 hr via GET /v1/get-deepgram-token |
| Secondary | AssemblyAI | wss://streaming.assemblyai.com/v3/ws | Bearer token from GET /v1/get-transcription-auth; cached 5 min with 30s refresh buffer |
| Tertiary | Speechmatics | wss://eu2.rt.speechmatics.com/v2 | JWT passed as ?jwt= query parameter (in the URL) |
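The fallback behavior in the table above can be sketched as a simple loop. The `connect()` callback and the return shape are illustrative assumptions; in the real client each provider name would map to the WebSocket endpoint and auth method listed above.

```javascript
// Sketch of the provider fallback chain: try each provider in priority
// order and return the first successful connection.
const PROVIDERS = ["deepgram", "assemblyai", "speechmatics"];

async function connectWithFallback(connect, providers = PROVIDERS) {
  const errors = [];
  for (const name of providers) {
    try {
      return { provider: name, socket: await connect(name) };
    } catch (err) {
      errors.push(`${name}: ${err.message}`);  // fall through to the next
    }
  }
  throw new Error(`all providers failed: ${errors.join("; ")}`);
}
```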
Deepgram response shape:

{"type": "Results", "is_final": true, "speech_final": true, "alternatives": [{"transcript": "...", "confidence": 0.98}]}

AssemblyAI response shape:

{"message_type": "...", "transcript": "...", "confidence": 0.97, "words": [...]}

If the primary provider fails, the system falls through to the next in the chain.
5. Transcript Handling (Main Process)
The main process parses each provider's response into a normalized shape:
{"text": "...", "isFinal": true, "isUtteranceComplete": true, "confidence": 0.98, "words": []}

Two parallel output paths:
- Path A → Renderer (IPC): webContents.send("granola-talk:transcription", {text, isFinal}) for real-time UI updates.
- Path B → Granola Backend: POST /v1/insert-transcriptions with {document_id, chunks: [{text, start_time, end_time, speaker, confidence}]}. Failed chunks are batched and retried automatically.
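The normalization step can be sketched by mapping each provider payload shown earlier into the common shape. Field names follow this document's JSON samples; the AssemblyAI `"FinalTranscript"` message_type value is an assumption, not confirmed by the source.

```javascript
// Hedged sketch of per-provider response normalization.
function normalizeTranscript(provider, msg) {
  if (provider === "deepgram") {
    const alt = (msg.alternatives && msg.alternatives[0]) || {};
    return {
      text: alt.transcript || "",
      isFinal: Boolean(msg.is_final),
      isUtteranceComplete: Boolean(msg.speech_final),
      confidence: alt.confidence || 0,
      words: alt.words || [],
    };
  }
  if (provider === "assemblyai") {
    // "FinalTranscript" as the final-message marker is an assumption
    const final = msg.message_type === "FinalTranscript";
    return {
      text: msg.transcript || "",
      isFinal: final,
      isUtteranceComplete: final,
      confidence: msg.confidence || 0,
      words: msg.words || [],
    };
  }
  throw new Error(`unknown provider: ${provider}`);
}
```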
6. Renderer (UI Display)
granolatalk-CZslIZY1.js receives events via IPC:
- granola-talk:transcription — live text updates
- granola-talk:connection-state — connection status indicator
Displays partial transcripts (live), final transcripts (committed), and connection state.
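A minimal sketch of how a renderer might fold those IPC events into display state: each new partial replaces the previous one, and a final result is committed. The state shape here is an assumption for illustration, not Granola's actual renderer code.

```javascript
// Hedged sketch of renderer-side transcript state handling.
const initialState = { committed: [], partial: "", connection: "disconnected" };

function reduceTranscript(state, channel, payload) {
  if (channel === "granola-talk:connection-state") {
    return { ...state, connection: payload };
  }
  if (channel === "granola-talk:transcription") {
    return payload.isFinal
      ? { ...state, committed: [...state.committed, payload.text], partial: "" }
      : { ...state, partial: payload.text };  // live partial overwrites
  }
  return state;                               // ignore unknown channels
}
```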
The Granola Backend WebSocket
Separate from the transcription WebSocket, a ReconnectingWebSocket connects to Granola's own backend:
- Primary: wss://5p69hiii4m.execute-api.us-east-1.amazonaws.com/prod
- Public: wss://n71xi8mtih.execute-api.us-east-1.amazonaws.com/prod (shared/public document access)
Backed by AWS API Gateway → Lambda → DynamoDB/RDS.
This WebSocket is not used for audio. It handles:
- Document collaboration (Y.js)
- Cursor positions / presence
- Live transcript chunk sync
- Chat message streaming
Reconnect strategy: exponential backoff, max 10s, 1.3× multiplier.
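The stated reconnect schedule can be expressed as a pure function: exponential backoff with a 1.3× multiplier, capped at 10 s. The 1 s initial delay is an assumption; only the multiplier and cap are given above.

```javascript
// Sketch of the reconnect delay schedule: initialMs * 1.3^attempt,
// capped at 10,000 ms. The initial delay is an assumed value.
function reconnectDelayMs(attempt, initialMs = 1000) {
  return Math.min(initialMs * Math.pow(1.3, attempt), 10000);
}
```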
Where Your Data Ends Up
| Data Type | Destination | Persisted? | User Controls Deletion? |
|---|---|---|---|
| Raw audio stream | Deepgram / AssemblyAI / Speechmatics | Unclear — depends on provider retention policies | No (provider-controlled) |
| Raw audio file | AWS S3 (Granola-controlled bucket) | Yes | Via hard-delete-document |
| Transcript chunks | Granola PostgreSQL DB | Yes | Via delete-transcription-chunks |
| Transcript embeddings | Turbopuffer vector DB (semantic search) | Yes | Via hard-delete-document |
| Document/notes | Granola DB | Yes | Via hard-delete-document |
| Session replay | LogRocket | Yes | No (Granola-controlled) |
| Behavioral analytics | Amplitude | Yes | No (Granola-controlled) |
| Device fingerprint | FingerprintJS | Yes | No |
Key Questions & Answers
How does real-time transcription work? Raw PCM audio is streamed over WebSocket to Deepgram (primary). Deepgram returns partial transcript JSON every ~100–300ms and final results at utterance boundaries. These are sent to the renderer via Electron IPC for immediate display. Latency is typically sub-second.
Is audio processed locally? Only echo cancellation (WebRTC AEC3) runs locally. All speech-to-text is cloud-based. Raw audio leaves your machine unencrypted over WSS.
Who has your audio? At minimum: the active transcription provider (Deepgram by default) and Granola (via S3 upload). Audio upload appears to be standard behavior, not optional.
Is there a privacy mode?
The API exposes privacy-mode getter/setter endpoints (e.g. GET /v1/get-privacy-mode) plus consent endpoints, but based on the architecture, these likely control sharing/collaboration rather than preventing audio upload.
What about the Speechmatics JWT-in-URL issue?
Speechmatics auth tokens are passed as query parameters (?jwt=...), which means they appear in server access logs, proxy logs, and potentially browser history. This is a security concern compared to Deepgram's header-based auth.