MDX Limo
Content Machine - Addendum 1

ContentEngine — Technical Addendum #1

LangExtract Implementation & Scope

Author: Alton Wells
Date: March 2026
Status: Draft for Review
Parent Spec: ContentEngine Technical Specification v3
Scope: LangExtract extraction layer: integration, schemas, triggers, cost, QA

1. Purpose & Scope

This addendum defines the exact scope, integration pattern, extraction schemas, trigger logic, quality assurance strategy, and cost model for LangExtract within the ContentEngine MVP. It supersedes any conflicting detail in the parent specification and constrains implementation to the minimum viable extraction layer.

1.1 What This Addendum Covers

  • Integration architecture: Python sidecar vs. Node SDK decision and rationale
  • Six extraction class definitions with full attribute schemas and few-shot example structure
  • Processing parameters: passes, chunking, concurrency per document type
  • Trigger model: when extraction runs and what initiates it
  • Quality assurance: validation, drift monitoring, regression testing
  • Cost model: token-level estimates based on actual document volumes

1.2 What This Addendum Does Not Cover

  • Filesystem-as-context architecture (deferred per project decision)
  • Hierarchical summary generation (separate addendum)
  • Graph relationship builder logic (separate addendum)
  • Agent prompt engineering (separate addendum)

1.3 Key MVP Constraint

Competitor content is extracted once on discovery. Weekly monitoring detects new/changed pages only. Re-extraction of unchanged content is out of scope for MVP.


2. Integration Architecture

2.1 Decision: Python Sidecar (Not Node SDK)

The parent spec calls for a FastAPI sidecar running LangExtract in Python. An unofficial Node.js SDK exists. This addendum recommends the Python sidecar for the following reasons:

  • The official LangExtract library is Python-only (google/langextract, Apache 2.0, 34.4k GitHub stars). The Node SDK is unofficial, community-maintained, and has no guaranteed feature parity.
  • Critical features used in ContentEngine are Python-only: multi-pass extraction (extraction_passes), cross-chunk coreference resolution, Vertex AI batch processing, controlled generation via Gemini schema constraints.
  • The Node SDK documentation explicitly states it requires fence_output=True and use_schema_constraints=False for non-Gemini models, suggesting incomplete schema enforcement.
  • LangExtract version 1.1.1 is current. The cross-chunk context awareness feature (coreference resolution) shipped 3 months ago and is not present in the Node SDK.

Risk acceptance: The Python–TypeScript bridge is a single point of failure. Mitigation is defined in Section 6 (health checks, circuit breaker, graceful degradation).

2.2 Sidecar Service Design

The LangExtract service runs as a standalone FastAPI application deployed on Railway alongside the Mastra agent workers. It exposes a minimal HTTP API that Mastra tools call via fetch.

Endpoints

| Endpoint | Method | Purpose |
|---|---|---|
| /extract | POST | Single document extraction with class-specific config |
| /extract/batch | POST | Multi-document batch extraction |
| /health | GET | Liveness + model connectivity check |
| /schemas | GET | Returns current extraction class definitions (for debugging) |

Request Schema: POST /extract

```
{
  "document_type": "competitor_page" | "our_page" | "serp" |
                   "ai_overview" | "brand_voice" | "keyword_data",
  "text": "string (raw text content)",
  "url": "string (optional, source URL for provenance)",
  "extraction_overrides": {
    "extraction_passes": "number (default per doc type)",
    "max_char_buffer": "number (default per doc type)",
    "max_workers": "number (default: 10)"
  }
}
```

The service selects the correct prompt_description, examples, and processing parameters based on document_type. Overrides allow per-request tuning without redeploying the service.

Response Schema

```
{
  "extractions": [
    {
      "extraction_class": "string",
      "extraction_text": "string (verbatim from source)",
      "attributes": { "key": "value" },
      "source_location": { "start": 0, "end": 42 }
    }
  ],
  "document_length": 14523,
  "extraction_count": 37,
  "processing_time_ms": 2840,
  "passes_completed": 2,
  "model_id": "gemini-2.5-flash"
}
```

3. Extraction Class Definitions

ContentEngine defines six extraction classes. Each class has a fixed schema, a dedicated prompt_description, and a minimum of three few-shot examples. Few-shot examples are versioned in the repository under /langextract/examples/ and are loaded at service startup.

3.1 competitor_page

Trigger: New or changed competitor page detected by weekly Firecrawl sitemap diff.

Input: Raw text extracted by Firecrawl (HTML stripped, JS rendered). Typical length 1,500–10,000 words.

Extraction entities per page:

  • topic. Source: section heading or topic phrase. Attributes: depth (shallow|deep), section_position, parent_topic
  • claim. Source: specific factual or statistical claim. Attributes: claim_type (stat|opinion|fact), source_cited (bool), specificity (high|medium|low)
  • keyword_signal. Source: phrase appearing in H1/H2/title/meta. Attributes: placement (h1|h2|title|meta|body), estimated_intent (info|nav|transact|commercial)
  • content_structure. Source: structural element description. Attributes: element_type (table|list|image|video|code|calculator|tool), purpose, word_count_estimate
  • cta. Source: call-to-action text. Attributes: cta_type (link|button|form|download), target_action, position (above_fold|inline|footer)
  • entity_reference. Source: named entity (product, person, brand). Attributes: entity_type (product|person|brand|tool), sentiment (pos|neutral|neg), context

Design note: We extract structural elements (tables, calculators, videos) as entities rather than ignoring them. This lets the Strategy Agent know WHAT competitors are doing, not just what they wrote. A competitor page with an interactive ROI calculator is strategically different from one with only text.
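To make the few-shot structure concrete, here is one example for this class expressed as plain Python data. The sample page text, attribute values, and `COMPETITOR_PAGE_EXAMPLE` name are all invented for illustration; the real files under /langextract/examples/ would carry structures in this shape (which mirrors what LangExtract's example objects expect: source text plus a list of extractions with verbatim extraction_text).

```python
# Illustrative few-shot example for the competitor_page class, as plain data.
# All content here is invented; each extraction_text must appear verbatim
# in the example text, matching the source-grounding rule in Section 6.1.

COMPETITOR_PAGE_EXAMPLE = {
    "text": (
        "## Fleet Pricing Calculator\n"
        "Estimate your per-mile cost in seconds. Over 4,000 operators "
        "switched to AcmeDispatch last year. Book a demo today."
    ),
    "extractions": [
        {
            "extraction_class": "content_structure",
            "extraction_text": "Fleet Pricing Calculator",
            "attributes": {"element_type": "calculator", "purpose": "lead_capture",
                           "word_count_estimate": 20},
        },
        {
            "extraction_class": "claim",
            "extraction_text": "Over 4,000 operators switched to AcmeDispatch last year",
            "attributes": {"claim_type": "stat", "source_cited": False,
                           "specificity": "high"},
        },
        {
            "extraction_class": "cta",
            "extraction_text": "Book a demo today",
            "attributes": {"cta_type": "link", "target_action": "demo_booking",
                           "position": "inline"},
        },
    ],
}
```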

3.2 our_page

Trigger: Post-publish pipeline (new content) or initial system bootstrap (existing content inventory).

Uses the same entity classes as competitor_page, plus one additional:

  • internal_link. Source: anchor text of outbound internal link. Attributes: target_url, context_sentence, anchor_type (exact_match|partial|branded|generic)

The internal_link extraction feeds directly into the content relationship graph (links_to edges) and validates against the SEO check for anchor text diversity.

3.3 serp

Trigger: Weekly SERP snapshot job for tracked keywords.

  • serp_result. Source: title + snippet of organic result. Attributes: position, url, domain, result_type (organic|featured_snippet|paa|video|image_pack)
  • serp_feature. Source: feature element on SERP. Attributes: feature_type (ai_overview|featured_snippet|paa|knowledge_panel|local_pack), our_site_present (bool)
  • paa_question. Source: People Also Ask question text. Attributes: position_in_paa, related_to_primary_keyword (bool)

3.4 ai_overview

Trigger: Detected during SERP snapshot when serp_feature with feature_type=ai_overview is present.

  • aio_claim. Source: individual claim within AI Overview. Attributes: cited_source_url, cited_source_domain, claim_type (stat|fact|recommendation), our_site_cited (bool)
  • aio_structure. Source: structural pattern of the overview. Attributes: format (paragraph|list|table|steps), length_estimate (short|medium|long), source_count

Note: The SEO research notes show 97% of AI Overview citations come from pages already ranking in the top 20. Extracting cited sources lets the Strategy Agent prioritize pages that are citation-eligible based on current rank.

3.5 brand_voice

Trigger: On strategy creation or update. Run once against a human-curated sample set (5–10 representative pieces).

  • tone_marker. Source: sentence or phrase exemplifying tone. Attributes: tone_quality (authoritative|conversational|technical|witty|empathetic), intensity (strong|moderate|subtle)
  • vocabulary_preference. Source: distinctive word or phrase choice. Attributes: category (jargon|branded_term|colloquial|formal), frequency (always|often|sometimes), avoid (bool)
  • sentence_pattern. Source: representative sentence structure. Attributes: pattern_type (short_declarative|compound|rhetorical_question|list_intro), typical_length_words

Brand voice extraction runs infrequently and on small document sets. Single-pass extraction at max_char_buffer=2000 is sufficient. This is the lowest-cost extraction class.

3.6 keyword_data

This class may not require LangExtract at all. Semrush/Ahrefs API responses are already structured JSON. LangExtract is only used here if we need to extract keyword intent and topical clustering from unstructured keyword research notes or analyst reports. For MVP, this class is deferred — keyword data enters the system directly from API responses via Mastra tools.


4. Processing Parameters by Document Type

Each document type has default processing parameters tuned to its typical length, complexity, and extraction density. These are configurable per-request via extraction_overrides but defaults should be correct for 90%+ of cases.

| Document Type | extraction_passes | max_char_buffer | max_workers | Rationale |
|---|---|---|---|---|
| competitor_page | 2 | 1500 | 10 | Web content has mixed structure (nav, CTAs, sidebars). A 1500-char buffer avoids splitting mid-section. Two passes balance recall vs. cost. |
| our_page | 1 | 1500 | 10 | Our content is cleaner (no nav/sidebar noise in CMS body). A single pass is sufficient; we control the source quality. |
| serp | 1 | 500 | 5 | SERP snapshots are short and highly structured. A small buffer keeps each result isolated. |
| ai_overview | 2 | 1000 | 5 | AI Overviews are concise but citation-dense. Two passes improve citation recall. |
| brand_voice | 1 | 2000 | 3 | Small corpus; longer context helps preserve sentence-level patterns. Low concurrency is fine. |

4.1 Why 1500 Characters for Web Content

The parent spec and LangExtract's Romeo & Juliet example use max_char_buffer=1000. For literary text, this works because paragraphs are self-contained. Web content is different: a single section with an H2 heading, introductory paragraph, and supporting table can easily span 1,200–1,800 characters. At 1000-char chunks, the heading is separated from its content, breaking the extraction context.

Testing against actual competitor pages in our target verticals is required before finalizing this parameter. The acceptance criterion: topic extraction must preserve the association between section headings and their content in at least 95% of cases.

Action item: Before implementation, run LangExtract against 10 representative competitor pages at buffer sizes of 1000, 1500, and 2000. Measure topic extraction accuracy (heading-content association) at each size. Document results in a testing log.
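The scoring half of that experiment can be a small pure function. This is a sketch under two assumptions not stated in the spec: each hand-labeled gold record names the heading a topic belongs under, and each extracted record carries the heading LangExtract associated (e.g. via parent_topic). The function name is illustrative.

```python
# Sketch: scoring heading-content association for the buffer-size experiment.
# Assumes gold and extracted records both expose "heading" and "topic" fields;
# these field names are assumptions of this sketch, not the spec.

def heading_association_accuracy(gold: list[dict], extracted: list[dict]) -> float:
    """Fraction of gold (heading, topic) pairs recovered with the right heading."""
    gold_pairs = {(g["heading"], g["topic"]) for g in gold}
    hits = sum(1 for e in extracted if (e.get("heading"), e.get("topic")) in gold_pairs)
    return hits / len(gold_pairs) if gold_pairs else 0.0
```

Run this once per buffer size (1000, 1500, 2000) over the 10 test pages and accept the smallest buffer that clears the 95% criterion.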

4.2 Cross-Chunk Coreference Resolution

LangExtract v1.1.0 added cross-chunk context awareness for coreference resolution. This feature is critical for competitor page extraction: a page may introduce "our platform" in the first paragraph and reference "it" for the next 3,000 words. Without coreference, chunks 2+ lose entity context.

For MVP, enable cross-chunk coreference for competitor_page and our_page document types. Leave disabled for serp, ai_overview, and brand_voice (these are short or independently structured documents where cross-chunk context adds cost without benefit).


5. Trigger Model

Extraction does not run on a schedule. It runs in response to specific events. This is the core MVP constraint: extract once on discovery, not continuously.

| Trigger Event | Extraction Class | Frequency | Initiated By |
|---|---|---|---|
| New competitor page detected | competitor_page | On discovery (weekly scan) | Trigger.dev job: weekly-sitemap-diff |
| Changed competitor page detected | competitor_page | On detection (weekly scan) | Trigger.dev job: weekly-sitemap-diff |
| New content published (ours) | our_page | On publish event | Publishing Agent post-publish pipeline |
| System bootstrap (existing content) | our_page | Once at system initialization | Manual script / Trigger.dev one-time job |
| Weekly SERP snapshot | serp + ai_overview | Weekly per tracked keyword | Trigger.dev job: weekly-serp-snapshot |
| Strategy created/updated | brand_voice | On strategy change event | Content Strategy settings UI save action |

5.1 The Weekly Sitemap Diff Job

This is the only recurring extraction trigger for competitor content. The job logic:

  • Firecrawl crawls each competitor's sitemap (or site structure if no sitemap).
  • Compare returned URLs + content hashes against competitor_pages table.
  • New URLs (not in table): mark as new, queue for extraction.
  • Existing URLs with changed content_hash: mark as changed, queue for re-extraction. Old extraction rows are soft-deleted (retained for historical comparison), new extraction replaces them.
  • Existing URLs with unchanged content_hash: skip entirely. No extraction cost.
  • Removed URLs (in table but not in sitemap): mark as removed in competitor_changes. No extraction needed.
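The classification above reduces to a pure diff over url-to-content_hash maps. A minimal sketch (`diff_sitemap` is an illustrative name; the real job reads from the competitor_pages table and Firecrawl output):

```python
# Sketch of the weekly diff decision. "stored" holds url -> content_hash from
# competitor_pages; "crawled" holds the fresh Firecrawl crawl of one competitor.

def diff_sitemap(stored: dict[str, str], crawled: dict[str, str]) -> dict[str, list[str]]:
    """Classify URLs into new / changed / unchanged / removed buckets."""
    return {
        "new":       [u for u in crawled if u not in stored],
        "changed":   [u for u in crawled if u in stored and stored[u] != crawled[u]],
        "unchanged": [u for u in crawled if stored.get(u) == crawled[u]],
        "removed":   [u for u in stored if u not in crawled],
    }
```

Only the "new" and "changed" buckets are queued for extraction; "unchanged" incurs zero cost and "removed" is recorded in competitor_changes.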

Cost implication: If a competitor publishes 5 new pages per week and updates 3 existing ones, that's 8 extraction jobs per competitor per week. At 10 competitors, that's ~80 extractions/week — well within budget.

5.2 The Post-Publish Extraction Pipeline

When the Publishing Agent pushes content to the CMS, the following extraction chain fires:

  1. CMS confirms publish success (HTTP 200/201).
  2. Trigger.dev job fires: fetch published page raw text via Firecrawl or CMS API.
  3. POST /extract with document_type=our_page.
  4. On extraction success: write rows to our_page_extractions table.
  5. Emit event: extraction_complete with page_id. Downstream consumers (summary generation, graph builder) subscribe to this event.

Expected latency from publish to extraction_complete: 30–90 seconds depending on page length. Summary regeneration and graph updates are separate downstream jobs that do not block the extraction pipeline.


6. Quality Assurance & Reliability

6.1 Extraction Validation

Every extraction response from the sidecar is validated before being persisted to Postgres. Validation is deterministic code, not LLM-based.

  • Schema validation: Every extraction must have a valid extraction_class, non-empty extraction_text, and attributes matching the required schema for that class. Malformed extractions are logged and dropped.
  • Source grounding check: extraction_text must appear verbatim in the source document (within the text field of the request). LangExtract returns source_location; we verify it. Extractions that fail grounding are flagged for review.
  • Duplicate detection: If two extractions in the same response have identical extraction_class + extraction_text, keep the one with richer attributes and drop the other.
  • Minimum extraction threshold: If a document of 2,000+ words produces fewer than 5 extractions, flag it as a potential extraction failure. Do not persist; queue for re-extraction with extraction_passes + 1.
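The first three checks can be sketched as one deterministic pass. Assumptions of this sketch: the required-attribute schema is a class-to-attribute-names mapping (shown as an excerpt), "richer attributes" means more attribute keys, and the function name is illustrative.

```python
# Sketch of the deterministic validation pass from Section 6.1.
# REQUIRED_ATTRS is an excerpt; the real table covers all extraction classes.

REQUIRED_ATTRS = {"claim": {"claim_type", "source_cited", "specificity"}}

def validate_extractions(extractions: list[dict], source_text: str) -> dict:
    valid, flagged = [], []
    seen = {}  # (class, text) -> index into valid, for duplicate resolution
    for ex in extractions:
        cls, text = ex.get("extraction_class"), ex.get("extraction_text", "")
        # Schema check: known class, non-empty text, required attributes present.
        required = REQUIRED_ATTRS.get(cls)
        if required is None or not text or not required <= set(ex.get("attributes", {})):
            continue  # malformed: log and drop
        # Source grounding: extraction_text must appear verbatim in the input.
        if text not in source_text:
            flagged.append(ex)
            continue
        # Duplicate detection: keep the copy with richer attributes.
        key = (cls, text)
        if key in seen:
            if len(ex["attributes"]) > len(valid[seen[key]]["attributes"]):
                valid[seen[key]] = ex
            continue
        seen[key] = len(valid)
        valid.append(ex)
    return {"valid": valid, "flagged_for_review": flagged}
```

The minimum-extraction threshold (fewer than 5 extractions from a 2,000+ word document) is applied after this pass, before persistence, and routes the document to re-extraction instead.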

6.2 Sidecar Health & Circuit Breaker

The Mastra tool that calls the LangExtract sidecar implements a circuit breaker pattern:

  • Health check: GET /health every 60 seconds. Health check verifies FastAPI is responding AND Gemini API key is valid (makes a minimal extraction call).
  • Circuit states: CLOSED (normal), OPEN (failing — all requests return cached fallback or error), HALF-OPEN (testing recovery with single request).
  • Threshold: 3 consecutive failures or >50% failure rate in a 5-minute window opens the circuit.
  • Recovery: After 60 seconds in OPEN, move to HALF-OPEN. One successful request closes the circuit.
  • Fallback when open: Extraction jobs are queued in Trigger.dev for retry when the circuit closes. No data is lost; publication does not proceed without extraction.
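A minimal breaker matching these thresholds can be sketched as follows (the real caller is a Mastra/TypeScript tool; Python is used here for consistency with the rest of this addendum, and clock injection is an illustrative choice to keep the logic testable):

```python
# Circuit-breaker sketch using this section's thresholds: 3 consecutive
# failures open the circuit, 60 s cooldown, one HALF_OPEN success closes it.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 60.0,
                 clock=time.monotonic):
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock

    def allow_request(self) -> bool:
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.cooldown_s:
            self.state = "HALF_OPEN"  # probe recovery with a single request
        return self.state in ("CLOSED", "HALF_OPEN")

    def record_success(self):
        self.state, self.failures = "CLOSED", 0

    def record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state, self.opened_at = "OPEN", self.clock()
```

The second opening condition (>50% failure rate in a 5-minute window) would need a sliding window of request outcomes layered on top of this sketch.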

6.3 Gemini Model Version Drift

LangExtract uses Gemini 2.5 Flash. Google retires model versions on a defined lifecycle. When a model version changes, extraction output may differ even with identical inputs and examples.

Mitigation strategy:

  • Pin model version explicitly in the sidecar config (model_id="gemini-2.5-flash" with version suffix when available).
  • Maintain a regression test suite: 10 documents (2 per extraction class) with known-good extraction outputs. Run the suite on every sidecar deployment and on a weekly schedule.
  • Regression threshold: If >15% of expected extractions are missing or >10% of attributes have changed, block deployment and alert the team.
  • Model migration process: When a new Gemini version is available, run the regression suite against it, compare outputs, adjust few-shot examples if needed, then cut over.
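The regression thresholds above translate into a small comparison gate. This sketch assumes extractions are matched by (extraction_class, extraction_text), which is one reasonable matching key but not mandated by the spec:

```python
# Sketch of the regression gate: >15% missing extractions or >10% changed
# attributes blocks deployment. Matching by (class, text) is an assumption.

def regression_gate(expected: list[dict], actual: list[dict]) -> dict:
    actual_by_key = {(a["extraction_class"], a["extraction_text"]): a["attributes"]
                     for a in actual}
    missing = changed = 0
    for exp in expected:
        key = (exp["extraction_class"], exp["extraction_text"])
        if key not in actual_by_key:
            missing += 1
        elif actual_by_key[key] != exp["attributes"]:
            changed += 1
    n = len(expected) or 1
    return {
        "missing_rate": missing / n,
        "changed_rate": changed / n,
        "blocked": missing / n > 0.15 or changed / n > 0.10,
    }
```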

The few-shot examples ARE the specification. If extraction quality drifts, the first response is always to review and improve examples, not to add more extraction passes or change parameters.

6.4 Few-Shot Example Management

Few-shot examples are the most important artifact in the extraction pipeline. They are treated as code: versioned in git, reviewed in PRs, and tested against the regression suite.

  • Minimum 3 examples per extraction class. Target 5 for competitor_page (highest variety of input structures).
  • Examples must use real data from our actual competitor landscape, not synthetic text. Anonymize if needed, but structure and complexity must be representative.
  • Each example must include at least one edge case (e.g., a page with no H2 headings, a claim with no citation, a CTA buried in body text).
  • Examples are stored as Python data structures in /langextract/examples/{class_name}.py. The sidecar loads them at startup.
  • Any change to examples requires running the regression suite before merge.

7. Cost Model

All costs use Gemini 2.5 Flash pricing: $0.15 per 1M input tokens, $0.60 per 1M output tokens (as of March 2026). Verify current pricing before implementation.

7.1 Per-Extraction Cost Estimates

| Document Type | Avg Input Tokens | Passes | Avg Output Tokens | Cost Per Doc |
|---|---|---|---|---|
| competitor_page | ~6,000 | 2 | ~1,500 | $0.0036 |
| our_page | ~5,000 | 1 | ~1,200 | $0.0015 |
| serp | ~1,500 | 1 | ~800 | $0.0007 |
| ai_overview | ~1,000 | 2 | ~600 | $0.0010 |
| brand_voice | ~4,000 | 1 | ~1,000 | $0.0012 |

Note: Input tokens include the prompt_description + few-shot examples + document chunk. Few-shot examples add ~500–1,500 tokens per request depending on class. This overhead is multiplied per chunk.
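The per-document figures follow from simple arithmetic, assuming the table's token counts are per pass (that interpretation reproduces the table exactly; function and constant names are illustrative):

```python
# Cost arithmetic behind the table, at Gemini 2.5 Flash pricing of
# $0.15 / $0.60 per 1M input / output tokens (March 2026 figures).
PRICE_IN, PRICE_OUT = 0.15 / 1e6, 0.60 / 1e6  # USD per token

def cost_per_doc(input_tokens: int, output_tokens: int, passes: int) -> float:
    """Total cost, assuming each pass re-sends input and produces output."""
    return passes * (input_tokens * PRICE_IN + output_tokens * PRICE_OUT)
```

For example, competitor_page at ~6,000 input and ~1,500 output tokens over 2 passes yields the table's $0.0036.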

7.2 Monthly Volume Estimates (MVP Scale)

Assumptions: 10 competitors, ~8 new/changed pages per competitor per week, 50 pieces published by us per month, 100 tracked keywords.

| Extraction Class | Monthly Volume | Cost | Notes |
|---|---|---|---|
| competitor_page | ~320 docs | $1.15 | 80/week × 4 weeks |
| our_page | ~50 docs | $0.08 | 50 published pieces |
| serp | ~400 snapshots | $0.28 | 100 keywords × 4 weeks |
| ai_overview | ~100 extractions | $0.10 | Estimated 25% of SERPs have an AIO |
| brand_voice | ~2 runs | $0.002 | Rare; only on strategy change |
| TOTAL | | $1.62/month | Gemini extraction only; excludes Firecrawl, Semrush, hosting |

The parent spec estimated $50–100/month for LangExtract + summary generation. At $1.62/month for extraction alone, extraction is a negligible cost center. The majority of that $50–100 budget will be consumed by hierarchical summary generation (a separate, more token-intensive process defined in a future addendum).


8. Open Questions for Review

The following decisions are flagged for team review before implementation begins:

  1. Should extraction_passes=2 be the default for competitor_page, or should we start with 1 and upgrade only if recall is insufficient? Impact: cost (2×) vs. recall. Recommended default: start with 1, measure recall on the first 50 pages, upgrade if topic coverage is below 85%.
  2. The max_char_buffer of 1500 is a hypothesis. Do we run the buffer size experiment (Section 4.1) before or during implementation? Impact: blocks schema finalization. Recommended default: before; allocate 2 days for testing against 10 real competitor pages.
  3. Should we extract from Firecrawl's raw text output or from a cleaned version that strips navigation, footers, and sidebars? Impact: extraction noise level. Recommended default: use Firecrawl's main content extraction (not raw HTML); test the quality of Firecrawl's content extraction first.
  4. The brand_voice extraction runs against 5–10 curated samples. Who selects those samples, and what are the selection criteria? Impact: voice extraction quality. Recommended default: the content lead selects; the criteria are pieces that best represent the target voice, not the highest-performing pieces.
  5. Do we need a separate extraction class for FAQ/PAA-style content on competitor pages, or does the existing claim + topic schema cover it? Impact: schema complexity. Recommended default: defer; the claim class with claim_type=fact covers FAQ-style content adequately for MVP.
  6. Cross-chunk coreference adds latency. What is the acceptable threshold for extraction time per document? Impact: user experience / pipeline speed. Recommended default: 90 seconds max per document; if coreference pushes beyond this, disable it and accept lower entity resolution.

End of Addendum #1

Next addendum: Hierarchical Summary Generation — defining how LangExtract outputs are aggregated into navigable Level 0–3 summaries for agent consumption.
