MDX Limo
Granola Audio Transcription System Architecture

Granola v7.65.0 — Complete Audio/Transcription Architecture


End-to-End Data Flow

1. Audio Capture (Native Layer)

The native module granola.node (Obj-C/C++) handles all audio capture through CombinedAudioCapture:

  • ScreenCaptureKitListener (SCStream API) — captures system audio (what you hear), i.e. other participants' voices
  • AudioListener (AVFAudio) — captures your microphone input
  • WebRTC AEC3 — echo cancellation that separates your voice from system playback, removing speaker bleed from the mic signal

Audio format: 16-bit PCM, 480-sample chunks (~10ms frames). Each frame callback delivers a pair of raw PCM buffers: {microphoneBuffer, systemAudioBuffer}.
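As a quick sanity check on those numbers, the sketch below computes the duration and byte size of one frame. The 48 kHz sample rate and single-channel layout are assumptions inferred from "480 samples per ~10 ms"; the source does not state either.

```javascript
// From the source: 16-bit PCM, 480 samples per frame, roughly 10 ms per frame.
// ASSUMPTION: 48 kHz, one channel per buffer (inferred, not confirmed).
const SAMPLES_PER_FRAME = 480;
const BYTES_PER_SAMPLE = 2; // 16-bit PCM
const ASSUMED_SAMPLE_RATE_HZ = 48000;

// Duration of a frame in milliseconds.
function frameDurationMs(samples, sampleRateHz) {
  return (samples * 1000) / sampleRateHz;
}

// Size of one single-channel frame in bytes.
function frameSizeBytes(samples, bytesPerSample) {
  return samples * bytesPerSample;
}

const durationMs = frameDurationMs(SAMPLES_PER_FRAME, ASSUMED_SAMPLE_RATE_HZ);
const sizeBytes = frameSizeBytes(SAMPLES_PER_FRAME, BYTES_PER_SAMPLE);
console.log(`${durationMs} ms per frame, ${sizeBytes} bytes per buffer`);
```

Under these assumptions each callback carries two 960-byte buffers (microphone and system audio), about 100 times per second.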


2. Audio Processing (Worker Thread)

audio_process/index.js runs in a worker thread and receives buffers via parentPort.on("message").

Responsibilities:

  • Calculates RMS volume per buffer
  • Applies auto-gain compensation
  • Forwards buffers to the main process
  • Logs metrics every 100K frames
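The first two responsibilities can be sketched as follows. This is an illustrative version only: the function names, the target level, and the gain cap are hypothetical, and the source does not describe how Granola's auto-gain is actually computed.

```javascript
// Root-mean-square level of a frame of signed 16-bit samples, normalized to 0..1.
function rmsLevel(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    const s = samples[i] / 32768; // scale Int16 to [-1, 1)
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / samples.length);
}

// HYPOTHETICAL auto-gain: scale the frame toward a target RMS, cap the gain,
// and clip to the Int16 range. targetRms and maxGain are illustrative defaults.
function applyAutoGain(samples, targetRms = 0.1, maxGain = 4) {
  const rms = rmsLevel(samples);
  const gain = rms > 0 ? Math.min(maxGain, targetRms / rms) : 1;
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const scaled = Math.round(samples[i] * gain);
    out[i] = Math.max(-32768, Math.min(32767, scaled));
  }
  return out;
}

// Example: a quiet frame is boosted toward the target level.
const quiet = new Int16Array([1000, -1000, 1000, -1000]);
const boosted = applyAutoGain(quiet);
console.log(rmsLevel(quiet).toFixed(4), rmsLevel(boosted).toFixed(4));
```

A per-frame gain like this is the simplest possible form; a production implementation would normally smooth the gain across frames to avoid audible pumping.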

Events handled: start-audio-capture, stop-audio-capture, request-microphone-permission, request-system-audio-permission.

Output: postMessage({event: "audio-capture-buffer", microphoneBuffer, systemAudioBuffer})


3. Main Process (Electron)

main/index-D9Gey83_.js receives audio buffers via IPC and routes them down two parallel paths:

Path A — Real-time Streaming: Selects a transcription provider, obtains an auth token from the backend, and opens a WebSocket connection to the provider.

Path B — Audio File Upload: Requests a presigned S3 URL via POST /v1/request-audio-upload, then PUTs the audio directly to S3.


4A. AWS S3 (Audio Storage)

Raw audio is uploaded via presigned URL with SigV4 credentials. Content type is audio/wav or mp4; presigned URLs expire in 1 hour.

This upload is always active when recording — it is not optional. Audio is stored on Granola's S3.


4B. Transcription Providers (Real-time WebSocket)

All providers receive raw Int16 PCM binary frames and return partial + final transcript JSON. No client-side encryption is applied to audio before transmission.

Provider Fallback Chain

| Priority | Provider | Endpoint | Auth Method |
|---|---|---|---|
| Primary | Deepgram (Nova-2) | wss://api.deepgram.com/v1/listen or wss://d315810d.api.deepgram.com/v1/listen (dedicated) | Bearer token in headers; refreshed every 4 hr via GET /v1/get-deepgram-token |
| Secondary | AssemblyAI | wss://streaming.assemblyai.com/v3/ws | Bearer token from GET /v1/get-transcription-auth; cached 5 min with 30s refresh buffer |
| Tertiary | Speechmatics | wss://eu2.rt.speechmatics.com/v2 | JWT passed as ?jwt= query parameter (in the URL) |
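The "cached 5 min with 30s refresh buffer" behavior for the AssemblyAI token could look roughly like the sketch below. The class and method names are hypothetical, and the interpretation of the buffer (treat the token as stale 30 seconds before it expires) is an assumption.

```javascript
const TTL_MS = 5 * 60 * 1000;        // "cached 5 min" (from the source)
const REFRESH_BUFFER_MS = 30 * 1000; // "30s refresh buffer" (from the source)

// HYPOTHETICAL cache: get() returns null once the token is within the refresh
// buffer of its expiry, which tells the caller to fetch a new one
// (e.g. via GET /v1/get-transcription-auth) and call set() again.
class TokenCache {
  constructor(ttlMs = TTL_MS, refreshBufferMs = REFRESH_BUFFER_MS) {
    this.ttlMs = ttlMs;
    this.refreshBufferMs = refreshBufferMs;
    this.entry = null;
  }

  set(token, nowMs = Date.now()) {
    this.entry = { token, expiresAt: nowMs + this.ttlMs };
  }

  get(nowMs = Date.now()) {
    if (!this.entry) return null;
    const usableUntil = this.entry.expiresAt - this.refreshBufferMs;
    return nowMs < usableUntil ? this.entry.token : null;
  }
}

// Example with explicit timestamps so the behavior is easy to follow.
const cache = new TokenCache();
cache.set("example-token", 0);
console.log(cache.get(60 * 1000));  // still cached after 1 minute
console.log(cache.get(280 * 1000)); // inside the 30s buffer, so null
```

Refreshing slightly early avoids opening a WebSocket with a token that expires during the handshake.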

Deepgram response shape:

{"type": "Results", "is_final": true, "speech_final": true, "alternatives": [{"transcript": "...", "confidence": 0.98}]}

AssemblyAI response shape:

{"message_type": "...", "transcript": "...", "confidence": 0.97, "words": [...]}

If the primary provider fails, the system falls through to the next in the chain.
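A minimal sketch of that fall-through logic is shown below. It is synchronous for brevity (a real connection attempt would be asynchronous), and `tryConnect` is a hypothetical callback standing in for whatever opens the provider's WebSocket.

```javascript
// Order taken from the fallback chain table.
const PROVIDER_CHAIN = ["deepgram", "assemblyai", "speechmatics"];

// Try each provider in order and return the first that connects.
// tryConnect(provider) is HYPOTHETICAL: it returns a connection or throws.
function connectWithFallback(chain, tryConnect) {
  const errors = [];
  for (const provider of chain) {
    try {
      const connection = tryConnect(provider);
      return { provider, connection, errors };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push({ provider, message });
    }
  }
  const tried = errors.map((e) => e.provider).join(", ");
  throw new Error(`All transcription providers failed: ${tried}`);
}

// Example: the primary is unavailable, so the secondary is selected.
const result = connectWithFallback(PROVIDER_CHAIN, (provider) => {
  if (provider === "deepgram") throw new Error("simulated outage");
  return { id: `${provider}-socket` };
});
console.log(result.provider);
```

Collecting the per-provider errors rather than discarding them makes it possible to report why the chain advanced.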


5. Transcript Handling (Main Process)

The main process parses each provider's response into a normalized shape:

{"text": "...", "isFinal": true, "isUtteranceComplete": true, "confidence": 0.98, "words": []}
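For the Deepgram response shape shown in section 4B, the normalization might look like the sketch below. The field mapping (`is_final` to `isFinal`, `speech_final` to `isUtteranceComplete`) is an assumption based on the field names, and the function name is hypothetical. Deepgram's documented streaming payload nests `alternatives` under a `channel` object, so the sketch accepts both that layout and the flattened one shown in this document. AssemblyAI is omitted because the `message_type` values are elided in the source.

```javascript
// HYPOTHETICAL normalizer for a parsed Deepgram "Results" message.
// Returns null for messages that carry no transcript.
function normalizeDeepgram(msg) {
  if (!msg || msg.type !== "Results") return null;
  // Accept both the documented nested layout and the flattened one above.
  const alternatives =
    (msg.channel && msg.channel.alternatives) || msg.alternatives || [];
  const best = alternatives[0];
  if (!best) return null;
  return {
    text: best.transcript || "",
    isFinal: Boolean(msg.is_final),
    isUtteranceComplete: Boolean(msg.speech_final), // assumed mapping
    confidence: typeof best.confidence === "number" ? best.confidence : 0,
    words: Array.isArray(best.words) ? best.words : [],
  };
}

const sample = {
  type: "Results",
  is_final: true,
  speech_final: true,
  alternatives: [{ transcript: "hello world", confidence: 0.98 }],
};
console.log(normalizeDeepgram(sample));
```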

Two parallel output paths:

  • Path A → Renderer (IPC): webContents.send("granola-talk:transcription", {text, isFinal}) for real-time UI updates.
  • Path B → Granola Backend: POST /v1/insert-transcriptions with {document_id, chunks: [{text, start_time, end_time, speaker, confidence}]}. Failed chunks are batched and retried automatically.
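The batch-and-retry behavior on Path B can be illustrated with a small queue. Everything here apart from the `{document_id, chunks}` payload shape is hypothetical, including the class name and the batch size; the source does not describe the actual retry policy.

```javascript
// HYPOTHETICAL retry queue: chunks whose upload failed are collected and
// re-sent later in batches shaped like the /v1/insert-transcriptions body.
class ChunkRetryQueue {
  constructor(documentId, maxBatchSize = 50) {
    this.documentId = documentId;
    this.maxBatchSize = maxBatchSize; // illustrative limit
    this.pending = [];
  }

  // Record chunks that failed to upload.
  addFailed(chunks) {
    this.pending.push(...chunks);
  }

  // Remove and return the next batch as a request body, or null if empty.
  // If sending this batch fails again, the caller passes its chunks back
  // to addFailed().
  nextBatch() {
    if (this.pending.length === 0) return null;
    const chunks = this.pending.splice(0, this.maxBatchSize);
    return { document_id: this.documentId, chunks };
  }
}

const queue = new ChunkRetryQueue("doc-123", 2);
queue.addFailed([
  { text: "first", start_time: 0.0, end_time: 1.2, speaker: "me", confidence: 0.9 },
  { text: "second", start_time: 1.2, end_time: 2.0, speaker: "me", confidence: 0.8 },
  { text: "third", start_time: 2.0, end_time: 3.1, speaker: "them", confidence: 0.95 },
]);
const firstBatch = queue.nextBatch();
console.log(firstBatch.chunks.length, queue.pending.length);
```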

6. Renderer (UI Display)

granolatalk-CZslIZY1.js receives events via IPC:

  • granola-talk:transcription — live text updates
  • granola-talk:connection-state — connection status indicator

Displays partial transcripts (live), final transcripts (committed), and connection state.
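One way to model the partial/final distinction in the renderer is a small reducer over `{text, isFinal}` events, sketched below. The state shape and function names are hypothetical; the source only tells us which events arrive and what is displayed.

```javascript
// HYPOTHETICAL display state: committed lines plus the current live partial.
const initialState = { committed: [], partial: "" };

// A final event commits its text and clears the partial; a non-final event
// replaces the live partial (each partial supersedes the previous one).
function applyTranscription(state, event) {
  if (event.isFinal) {
    return { committed: [...state.committed, event.text], partial: "" };
  }
  return { committed: state.committed, partial: event.text };
}

// Text as it would appear on screen: committed lines, then the live partial.
function renderText(state) {
  return [...state.committed, state.partial].filter(Boolean).join(" ");
}

const events = [
  { text: "hello", isFinal: false },
  { text: "hello every", isFinal: false },
  { text: "Hello everyone.", isFinal: true },
  { text: "let's", isFinal: false },
];
const finalState = events.reduce(applyTranscription, initialState);
console.log(renderText(finalState));
```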


The Granola Backend WebSocket

Separate from the transcription WebSocket, a ReconnectingWebSocket connects to Granola's own backend:

  • Primary: wss://5p69hiii4m.execute-api.us-east-1.amazonaws.com/prod
  • Public: wss://n71xi8mtih.execute-api.us-east-1.amazonaws.com/prod (shared/public document access)

Backed by AWS API Gateway → Lambda → DynamoDB/RDS.

This WebSocket is not used for audio. It handles:

  • Document collaboration (Y.js)
  • Cursor positions / presence
  • Live transcript chunk sync
  • Chat message streaming

Reconnect strategy: exponential backoff, max 10s, 1.3× multiplier.
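The resulting delay schedule can be sketched as below. The 1.3× multiplier and 10 s cap come from the text above; the 1 s starting delay is an assumption, since the source does not give the initial value.

```javascript
const GROWTH_FACTOR = 1.3;       // from the source
const MAX_DELAY_MS = 10000;      // from the source ("max 10s")
const ASSUMED_MIN_DELAY_MS = 1000; // ASSUMPTION: initial delay is not stated

// Delay before reconnect attempt number `attempt` (0-based), in milliseconds.
function reconnectDelayMs(attempt, minDelayMs = ASSUMED_MIN_DELAY_MS) {
  const raw = minDelayMs * Math.pow(GROWTH_FACTOR, attempt);
  return Math.min(MAX_DELAY_MS, Math.round(raw));
}

// Print the schedule for the first twelve attempts.
const schedule = [];
for (let attempt = 0; attempt < 12; attempt++) {
  schedule.push(reconnectDelayMs(attempt));
}
console.log(schedule.join(", "));
```

With a 1.3× multiplier the delay grows slowly; under the assumed 1 s start it takes nine failed attempts before the 10 s cap applies.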


Where Your Data Ends Up

| Data Type | Destination | Persisted? | User Controls Deletion? |
|---|---|---|---|
| Raw audio stream | Deepgram / AssemblyAI / Speechmatics | Unclear; depends on provider retention policies | No (provider-controlled) |
| Raw audio file | AWS S3 (Granola-controlled bucket) | Yes | Via hard-delete-document |
| Transcript chunks | Granola PostgreSQL DB | Yes | Via delete-transcription-chunks |
| Transcript embeddings | Turbopuffer vector DB (semantic search) | Yes | Via hard-delete-document |
| Document/notes | Granola DB | Yes | Via hard-delete-document |
| Session replay | LogRocket | Yes | No (Granola-controlled) |
| Behavioral analytics | Amplitude | Yes | No (Granola-controlled) |
| Device fingerprint | FingerprintJS | Yes | No |

Key Questions & Answers

How does real-time transcription work? Raw PCM audio is streamed over WebSocket to Deepgram (primary). Deepgram returns partial transcript JSON every ~100–300ms and final results at utterance boundaries. These are sent to the renderer via Electron IPC for immediate display. Latency is typically sub-second.

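Is audio processed locally? Only capture-side processing runs locally: echo cancellation (WebRTC AEC3), RMS volume metering, and auto-gain. All speech-to-text is cloud-based. Raw audio leaves your machine with no client-side encryption; WSS (TLS) protects it in transit, but the transcription provider receives it as plain PCM.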

Who has your audio? At minimum: the active transcription provider (Deepgram by default) and Granola (via S3 upload). Audio upload appears to be standard behavior, not optional.

Is there a privacy mode? The API exposes privacy-mode endpoints (GET/SET, e.g. /v1/get-privacy-mode) and consent endpoints, but based on the architecture these likely control sharing/collaboration rather than preventing audio upload.

What about the Speechmatics JWT-in-URL issue? Speechmatics auth tokens are passed as a query parameter (?jwt=...), so they can end up wherever full request URLs are recorded: server access logs, logs of any TLS-terminating proxy, and client-side debug logs. This is a weaker pattern than Deepgram's header-based auth, where the token never appears in the URL.
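To illustrate why a token in the URL is easy to leak, and one common mitigation on the client side, the sketch below redacts the `jwt` parameter before a URL is written to a log. The helper name is hypothetical and nothing in the source indicates that Granola does this.

```javascript
// HYPOTHETICAL helper: mask a sensitive query parameter before logging a URL.
// Uses the standard WHATWG URL API available in Node.js and browsers.
function redactQueryParam(rawUrl, param = "jwt") {
  const url = new URL(rawUrl);
  if (url.searchParams.has(param)) {
    url.searchParams.set(param, "REDACTED");
  }
  return url.toString();
}

// The token value here is a placeholder, not a real credential.
const connectionUrl = "wss://eu2.rt.speechmatics.com/v2?jwt=abc.def.ghi";
console.log(redactQueryParam(connectionUrl));
```

Redaction only protects logs the client controls; the full URL, including the token, is still visible to the server and to anything that terminates TLS in front of it.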
