ContentEngine — Technical Addendum #1
LangExtract Implementation & Scope
| Field | Value |
|---|---|
| Author | Alton Wells |
| Date | March 2026 |
| Status | Draft for Review |
| Parent Spec | ContentEngine Technical Specification v3 |
| Scope | LangExtract extraction layer: integration, schemas, triggers, cost, QA |
1. Purpose & Scope
This addendum defines the exact scope, integration pattern, extraction schemas, trigger logic, quality assurance strategy, and cost model for LangExtract within the ContentEngine MVP. It supersedes any conflicting detail in the parent specification and constrains implementation to the minimum viable extraction layer.
1.1 What This Addendum Covers
- Integration architecture: Python sidecar vs. Node SDK decision and rationale
- Six extraction class definitions with full attribute schemas and few-shot example structure
- Processing parameters: passes, chunking, concurrency per document type
- Trigger model: when extraction runs and what initiates it
- Quality assurance: validation, drift monitoring, regression testing
- Cost model: token-level estimates based on actual document volumes
1.2 What This Addendum Does Not Cover
- Filesystem-as-context architecture (deferred per project decision)
- Hierarchical summary generation (separate addendum)
- Graph relationship builder logic (separate addendum)
- Agent prompt engineering (separate addendum)
1.3 Key MVP Constraint
Competitor content is extracted once on discovery. Weekly monitoring detects new/changed pages only. Re-extraction of unchanged content is out of scope for MVP.
2. Integration Architecture
2.1 Decision: Python Sidecar (Not Node SDK)
The parent spec calls for a FastAPI sidecar running LangExtract in Python. An unofficial Node.js SDK exists. This addendum recommends the Python sidecar for the following reasons:
- The official LangExtract library is Python-only (google/langextract, Apache 2.0, 34.4k GitHub stars). The Node SDK is unofficial, community-maintained, and has no guaranteed feature parity.
- Critical features used in ContentEngine are Python-only: multi-pass extraction (`extraction_passes`), cross-chunk coreference resolution, Vertex AI batch processing, and controlled generation via Gemini schema constraints.
- The Node SDK documentation explicitly states it requires `fence_output=True` and `use_schema_constraints=False` for non-Gemini models, suggesting incomplete schema enforcement.
- LangExtract version 1.1.1 is current. The cross-chunk context awareness feature (coreference resolution) shipped 3 months ago and is not present in the Node SDK.
Risk acceptance: The Python–TypeScript bridge is a single point of failure. Mitigation is defined in Section 6 (health checks, circuit breaker, graceful degradation).
2.2 Sidecar Service Design
The LangExtract service runs as a standalone FastAPI application deployed on Railway alongside the Mastra agent workers. It exposes a minimal HTTP API that Mastra tools call via fetch.
Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| POST /extract | POST | Single document extraction with class-specific config |
| POST /extract/batch | POST | Multi-document batch extraction |
| GET /health | GET | Liveness + model connectivity check |
| GET /schemas | GET | Returns current extraction class definitions (for debugging) |
Request Schema: POST /extract
```json
{
  "document_type": "competitor_page" | "our_page" | "serp" | "ai_overview" | "brand_voice" | "keyword_data",
  "text": "string (raw text content)",
  "url": "string (optional, source URL for provenance)",
  "extraction_overrides": {
    "extraction_passes": "number (default per doc type)",
    "max_char_buffer": "number (default per doc type)",
    "max_workers": "number (default: 10)"
  }
}
```
The service selects the correct `prompt_description`, examples, and processing parameters based on `document_type`. Overrides allow per-request tuning without redeploying the service.
Response Schema
```json
{
  "extractions": [
    {
      "extraction_class": "string",
      "extraction_text": "string (verbatim from source)",
      "attributes": { "key": "value" },
      "source_location": { "start": 0, "end": 42 }
    }
  ],
  "document_length": 14523,
  "extraction_count": 37,
  "processing_time_ms": 2840,
  "passes_completed": 2,
  "model_id": "gemini-2.5-flash"
}
```
3. Extraction Class Definitions
ContentEngine defines six extraction classes. Each class has a fixed schema, a dedicated prompt_description, and a minimum of three few-shot examples. Few-shot examples are versioned in the repository under /langextract/examples/ and are loaded at service startup.
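The shape of one such example file can be sketched as follows. Plain dicts are shown purely to illustrate the structure; the real files would construct the langextract library's example objects, and the sample text and attribute values here are invented placeholders.

```python
# Illustrative shape of a few-shot example file, e.g. /langextract/examples/competitor_page.py.
# Plain dicts shown for clarity only; the real files wrap these in the library's
# example types. Text and attribute values below are placeholders, not real data.
EXAMPLES = [
    {
        "text": "Our ROI calculator shows teams save 12 hours per week on reporting.",
        "extractions": [
            {
                "extraction_class": "claim",
                "extraction_text": "teams save 12 hours per week on reporting",
                "attributes": {
                    "claim_type": "stat",
                    "source_cited": False,
                    "specificity": "high",
                },
            },
            {
                "extraction_class": "content_structure",
                "extraction_text": "ROI calculator",
                "attributes": {
                    "element_type": "calculator",
                    "purpose": "lead capture",
                    "word_count_estimate": 0,
                },
            },
        ],
    },
]
```

Note the invariant worth enforcing in review: every `extraction_text` in an example must appear verbatim in that example's `text`, mirroring the source grounding check applied to real extractions (Section 6.1).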
3.1 competitor_page
Trigger: New or changed competitor page detected by weekly Firecrawl sitemap diff.
Input: Raw text extracted by Firecrawl (HTML stripped, JS rendered). Typical length 1,500–10,000 words.
Extraction entities per page:
| Entity Class | extraction_text Source | Required Attributes |
|---|---|---|
| topic | Section heading or topic phrase | depth (shallow\|deep), section_position, parent_topic |
| claim | Specific factual or statistical claim | claim_type (stat\|opinion\|fact), source_cited (bool), specificity (high\|medium\|low) |
| keyword_signal | Phrase appearing in H1/H2/title/meta | placement (h1\|h2\|title\|meta\|body), estimated_intent (info\|nav\|transact\|commercial) |
| content_structure | Structural element description | element_type (table\|list\|image\|video\|code\|calculator\|tool), purpose, word_count_estimate |
| cta | Call-to-action text | cta_type (link\|button\|form\|download), target_action, position (above_fold\|inline\|footer) |
| entity_reference | Named entity (product, person, brand) | entity_type (product\|person\|brand\|tool), sentiment (pos\|neutral\|neg), context |
Design note: We extract structural elements (tables, calculators, videos) as entities rather than ignoring them. This lets the Strategy Agent know WHAT competitors are doing, not just what they wrote. A competitor page with an interactive ROI calculator is strategically different from one with only text.
3.2 our_page
Trigger: Post-publish pipeline (new content) or initial system bootstrap (existing content inventory).
Uses the same entity classes as competitor_page, plus one additional:
| Entity Class | extraction_text Source | Required Attributes |
|---|---|---|
| internal_link | Anchor text of outbound internal link | target_url, context_sentence, anchor_type (exact_match\|partial\|branded\|generic) |
The internal_link extraction feeds directly into the content relationship graph (links_to edges) and validates against the SEO check for anchor text diversity.
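One way the `anchor_type` attribute could be derived deterministically, for validation against the extraction output, is sketched below. The classification rules, brand vocabulary, and function name are all assumptions for illustration; in the real system they would come from strategy configuration.

```python
# Illustrative anchor_type classifier for internal_link extractions (Section 3.2).
# Rules and brand terms below are assumptions, not a spec.
BRAND_TERMS = {"contentengine"}                      # hypothetical brand vocabulary
GENERIC_ANCHORS = {"click here", "learn more", "read more", "this article"}

def classify_anchor(anchor_text: str, target_keyword: str) -> str:
    anchor = anchor_text.lower().strip()
    keyword = target_keyword.lower().strip()
    if anchor in GENERIC_ANCHORS:
        return "generic"
    if any(term in anchor for term in BRAND_TERMS):
        return "branded"
    if anchor == keyword:
        return "exact_match"
    if keyword in anchor or any(w in anchor.split() for w in keyword.split()):
        return "partial"
    return "generic"
```

A deterministic classifier like this makes the anchor-text-diversity SEO check reproducible: diversity can be measured as the distribution of `anchor_type` values across a page's `internal_link` extractions.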
3.3 serp
Trigger: Weekly SERP snapshot job for tracked keywords.
| Entity Class | extraction_text Source | Required Attributes |
|---|---|---|
| serp_result | Title + snippet of organic result | position, url, domain, result_type (organic\|featured_snippet\|paa\|video\|image_pack) |
| serp_feature | Feature element on SERP | feature_type (ai_overview\|featured_snippet\|paa\|knowledge_panel\|local_pack), our_site_present (bool) |
| paa_question | People Also Ask question text | position_in_paa, related_to_primary_keyword (bool) |
3.4 ai_overview
Trigger: Detected during SERP snapshot when serp_feature with feature_type=ai_overview is present.
| Entity Class | extraction_text Source | Required Attributes |
|---|---|---|
| aio_claim | Individual claim within AI Overview | cited_source_url, cited_source_domain, claim_type (stat\|fact\|recommendation), our_site_cited (bool) |
| aio_structure | Structural pattern of the overview | format (paragraph\|list\|table\|steps), length_estimate (short\|medium\|long), source_count |
Note: The SEO research notes show 97% of AI Overview citations come from pages already ranking in the top 20. Extracting cited sources lets the Strategy Agent prioritize pages that are citation-eligible based on current rank.
3.5 brand_voice
Trigger: On strategy creation or update. Run once against a human-curated sample set (5–10 representative pieces).
| Entity Class | extraction_text Source | Required Attributes |
|---|---|---|
| tone_marker | Sentence or phrase exemplifying tone | tone_quality (authoritative\|conversational\|technical\|witty\|empathetic), intensity (strong\|moderate\|subtle) |
| vocabulary_preference | Distinctive word or phrase choice | category (jargon\|branded_term\|colloquial\|formal), frequency (always\|often\|sometimes), avoid (bool) |
| sentence_pattern | Representative sentence structure | pattern_type (short_declarative\|compound\|rhetorical_question\|list_intro), typical_length_words |
Brand voice extraction runs infrequently and on small document sets. Single-pass extraction at max_char_buffer=2000 is sufficient. This is the lowest-cost extraction class.
3.6 keyword_data
This class may not require LangExtract at all. Semrush/Ahrefs API responses are already structured JSON. LangExtract is only used here if we need to extract keyword intent and topical clustering from unstructured keyword research notes or analyst reports. For MVP, this class is deferred — keyword data enters the system directly from API responses via Mastra tools.
4. Processing Parameters by Document Type
Each document type has default processing parameters tuned to its typical length, complexity, and extraction density. These are configurable per-request via extraction_overrides but defaults should be correct for 90%+ of cases.
| Document Type | extraction_passes | max_char_buffer | max_workers | Rationale |
|---|---|---|---|---|
| competitor_page | 2 | 1500 | 10 | Web content has mixed structure (nav, CTAs, sidebars). A 1500-char buffer avoids splitting mid-section. Two passes balance recall vs. cost. |
| our_page | 1 | 1500 | 10 | Our content is cleaner (no nav/sidebar noise in the CMS body). A single pass is sufficient; we control the source quality. |
| serp | 1 | 500 | 5 | SERP snapshots are short and highly structured. A small buffer keeps each result isolated. |
| ai_overview | 2 | 1000 | 5 | AI Overviews are concise but citation-dense. Two passes improve citation recall. |
| brand_voice | 1 | 2000 | 3 | Small corpus; a longer context helps preserve sentence-level patterns. Low concurrency is fine. |
4.1 Why 1500 Characters for Web Content
The parent spec and LangExtract's Romeo & Juliet example use max_char_buffer=1000. For literary text, this works because paragraphs are self-contained. Web content is different: a single section with an H2 heading, introductory paragraph, and supporting table can easily span 1,200–1,800 characters. At 1000-char chunks, the heading is separated from its content, breaking the extraction context.
Testing against actual competitor pages in our target verticals is required before finalizing this parameter. The acceptance criteria: extraction of topic entities must preserve the association between section headings and their content in 95%+ of cases.
Action item: Before implementation, run LangExtract against 10 representative competitor pages at buffer sizes of 1000, 1500, and 2000. Measure topic extraction accuracy (heading-content association) at each size. Document results in a testing log.
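The heading-separation failure mode can be illustrated with a naive fixed-size chunker. This is a deliberate simplification: LangExtract's real chunking respects sentence boundaries, but the underlying problem is the same, since a section longer than the buffer is split regardless.

```python
# Naive fixed-size chunker illustrating the buffer-size tradeoff in Section 4.1.
# LangExtract's actual chunking is sentence-aware; this simplification only
# demonstrates the failure mode when a section exceeds the buffer.
def chunk(text: str, max_char_buffer: int) -> list[str]:
    return [text[i : i + max_char_buffer] for i in range(0, len(text), max_char_buffer)]

# A synthetic ~1,600-character section: H2 heading plus supporting body text.
section = "## Pricing Comparison\n" + ("Competitor A charges more per seat. " * 44)

at_1000 = chunk(section, 1000)
at_2000 = chunk(section, 2000)

# At a 1000-char buffer the heading lands in chunk 1 while much of the body
# falls into chunk 2, so heading-content association is lost without
# cross-chunk context. At 2000 the whole section stays together.
assert len(at_1000) == 2 and "Pricing Comparison" not in at_1000[1]
assert len(at_2000) == 1
```

The buffer-size experiment above is essentially this check run against real pages, with extraction accuracy measured instead of raw chunk counts.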
4.2 Cross-Chunk Coreference Resolution
LangExtract v1.1.0 added cross-chunk context awareness for coreference resolution. This feature is critical for competitor page extraction: a page may introduce "our platform" in the first paragraph and reference "it" for the next 3,000 words. Without coreference, chunks 2+ lose entity context.
For MVP, enable cross-chunk coreference for competitor_page and our_page document types. Leave disabled for serp, ai_overview, and brand_voice (these are short or independently structured documents where cross-chunk context adds cost without benefit).
5. Trigger Model
Extraction does not run on a schedule. It runs in response to specific events. This is the core MVP constraint: extract once on discovery, not continuously.
| Trigger Event | Extraction Class | Frequency | Initiated By |
|---|---|---|---|
| New competitor page detected | competitor_page | On discovery (weekly scan) | Trigger.dev job: weekly-sitemap-diff |
| Changed competitor page detected | competitor_page | On detection (weekly scan) | Trigger.dev job: weekly-sitemap-diff |
| New content published (ours) | our_page | On publish event | Publishing Agent post-publish pipeline |
| System bootstrap (existing content) | our_page | Once at system initialization | Manual script / Trigger.dev one-time job |
| Weekly SERP snapshot | serp + ai_overview | Weekly per tracked keyword | Trigger.dev job: weekly-serp-snapshot |
| Strategy created/updated | brand_voice | On strategy change event | Content Strategy settings UI save action |
5.1 The Weekly Sitemap Diff Job
This is the only recurring extraction trigger for competitor content. The job logic:
- Firecrawl crawls each competitor's sitemap (or site structure if no sitemap).
- Compare returned URLs + content hashes against the `competitor_pages` table.
- New URLs (not in table): mark as `new`, queue for extraction.
- Existing URLs with a changed `content_hash`: mark as `changed`, queue for re-extraction. Old extraction rows are soft-deleted (retained for historical comparison); the new extraction replaces them.
- Existing URLs with an unchanged `content_hash`: skip entirely. No extraction cost.
- Removed URLs (in table but not in sitemap): mark as `removed` in `competitor_changes`. No extraction needed.
Cost implication: If a competitor publishes 5 new pages per week and updates 3 existing ones, that's 8 extraction jobs per competitor per week. At 10 competitors, that's ~80 extractions/week — well within budget.
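The classification step above can be sketched as a pure function over two URL-to-hash maps: one from the fresh Firecrawl crawl, one from the `competitor_pages` table. The function name and return shape are illustrative assumptions.

```python
# Sketch of the weekly sitemap diff classification (Section 5.1).
# Inputs are URL -> content_hash maps; names are illustrative.
def classify_pages(crawled: dict[str, str], stored: dict[str, str]) -> dict[str, list[str]]:
    result: dict[str, list[str]] = {"new": [], "changed": [], "unchanged": [], "removed": []}
    for url, content_hash in crawled.items():
        if url not in stored:
            result["new"].append(url)          # queue for extraction
        elif stored[url] != content_hash:
            result["changed"].append(url)      # queue for re-extraction
        else:
            result["unchanged"].append(url)    # skip entirely: no extraction cost
    result["removed"] = [url for url in stored if url not in crawled]
    return result
```

Only the `new` and `changed` buckets ever generate extraction cost, which is what keeps the weekly job within the budget described above.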
5.2 The Post-Publish Extraction Pipeline
When the Publishing Agent pushes content to the CMS, the following extraction chain fires:
- CMS confirms publish success (HTTP 200/201).
- Trigger.dev job fires: fetch published page raw text via Firecrawl or CMS API.
- `POST /extract` with `document_type=our_page`.
- On extraction success: write rows to the `our_page_extractions` table.
- Emit event: `extraction_complete` with `page_id`. Downstream consumers (summary generation, graph builder) subscribe to this event.
Expected latency from publish to extraction_complete: 30–90 seconds depending on page length. Summary regeneration and graph updates are separate downstream jobs that do not block the extraction pipeline.
6. Quality Assurance & Reliability
6.1 Extraction Validation
Every extraction response from the sidecar is validated before being persisted to Postgres. Validation is deterministic code, not LLM-based.
- Schema validation: every extraction must have a valid `extraction_class`, a non-empty `extraction_text`, and attributes matching the required schema for that class. Malformed extractions are logged and dropped.
- Source grounding check: `extraction_text` must appear verbatim in the source document (within the `text` field of the request). LangExtract returns `source_location`; we verify it. Extractions that fail grounding are flagged for review.
- Duplicate detection: if two extractions in the same response have an identical `extraction_class` + `extraction_text`, keep the one with richer attributes and drop the other.
- Minimum extraction threshold: if a document of 2,000+ words produces fewer than 5 extractions, flag it as a potential extraction failure. Do not persist; queue for re-extraction with `extraction_passes + 1`.
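The four checks can be sketched as one deterministic pass. Helper and field names are illustrative; the thresholds (5 extractions, 2,000 words) come from this section.

```python
# Sketch of the deterministic validation pass (Section 6.1). Names are illustrative.
def validate_extractions(source_text: str, extractions: list[dict], word_count: int) -> dict:
    flagged: list[dict] = []
    seen: dict = {}  # (class, text) -> extraction with the richest attributes
    for ext in extractions:
        # Schema validation: class and non-empty text are required.
        if not ext.get("extraction_class") or not ext.get("extraction_text"):
            continue  # malformed: log and drop
        # Source grounding: extraction_text must appear verbatim in the source.
        loc = ext.get("source_location", {})
        grounded = source_text[loc.get("start", 0):loc.get("end", 0)] == ext["extraction_text"]
        if not grounded and ext["extraction_text"] not in source_text:
            flagged.append(ext)  # failed grounding: flag for review
            continue
        # Duplicate detection: keep the extraction with richer attributes.
        key = (ext["extraction_class"], ext["extraction_text"])
        if key not in seen or len(ext.get("attributes", {})) > len(seen[key].get("attributes", {})):
            seen[key] = ext
    valid = list(seen.values())
    # Minimum extraction threshold: 2,000+ word docs with <5 extractions are suspect.
    needs_retry = word_count >= 2000 and len(valid) < 5
    return {"valid": valid, "flagged": flagged, "needs_retry": needs_retry}
```

Because this is plain code with no LLM in the loop, it can run synchronously between the sidecar response and the Postgres write without adding meaningful latency.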
6.2 Sidecar Health & Circuit Breaker
The Mastra tool that calls the LangExtract sidecar implements a circuit breaker pattern:
- Health check: `GET /health` every 60 seconds. The health check verifies that FastAPI is responding AND that the Gemini API key is valid (it makes a minimal extraction call).
- Circuit states: CLOSED (normal), OPEN (failing: all requests return a cached fallback or an error), HALF-OPEN (testing recovery with a single request).
- Threshold: 3 consecutive failures or >50% failure rate in a 5-minute window opens the circuit.
- Recovery: After 60 seconds in OPEN, move to HALF-OPEN. One successful request closes the circuit.
- Fallback when open: Extraction jobs are queued in Trigger.dev for retry when the circuit closes. No data is lost; publication does not proceed without extraction.
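The state machine above can be sketched as follows. The real implementation lives in the Mastra (TypeScript) tool; Python is used here only to show the logic, the consecutive-failure path is implemented, and the 50%-failure-rate window variant is omitted for brevity.

```python
# Sketch of the circuit breaker in Section 6.2. Consecutive-failure threshold only;
# the 5-minute failure-rate window described in the text is omitted for brevity.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, recovery_seconds: float = 60.0,
                 clock=time.monotonic):
        self.state = "CLOSED"
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.opened_at = 0.0
        self.clock = clock  # injectable for testing

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.recovery_seconds:
                self.state = "HALF_OPEN"   # test recovery with a single request
                return True
            return False                   # caller queues the job for retry instead
        return True

    def record_success(self) -> None:
        self.failures = 0
        self.state = "CLOSED"              # one successful request closes the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

Injecting the clock keeps the OPEN-to-HALF-OPEN transition testable without real 60-second waits.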
6.3 Gemini Model Version Drift
LangExtract uses Gemini 2.5 Flash. Google retires model versions on a defined lifecycle. When a model version changes, extraction output may differ even with identical inputs and examples.
Mitigation strategy:
- Pin the model version explicitly in the sidecar config (`model_id="gemini-2.5-flash"`, with a version suffix when available).
- Maintain a regression test suite: 10 documents (2 per extraction class) with known-good extraction outputs. Run the suite on every sidecar deployment and on a weekly schedule.
- Regression threshold: If >15% of expected extractions are missing or >10% of attributes have changed, block deployment and alert the team.
- Model migration process: When a new Gemini version is available, run the regression suite against it, compare outputs, adjust few-shot examples if needed, then cut over.
The few-shot examples ARE the specification. If extraction quality drifts, the first response is always to review and improve examples, not to add more extraction passes or change parameters.
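The regression gate can be sketched as a comparison of a new run against the known-good outputs, blocking deployment past either threshold. Function and field names are illustrative; the 15% and 10% thresholds come from this section.

```python
# Sketch of the regression gate in Section 6.3. Names are illustrative;
# thresholds (15% missing, 10% attributes changed) are from the text.
def regression_check(expected: list[dict], actual: list[dict],
                     missing_limit: float = 0.15, changed_limit: float = 0.10) -> dict:
    actual_by_key = {(e["extraction_class"], e["extraction_text"]): e for e in actual}
    missing = changed = 0
    for exp in expected:
        got = actual_by_key.get((exp["extraction_class"], exp["extraction_text"]))
        if got is None:
            missing += 1                                     # expected extraction absent
        elif got.get("attributes") != exp.get("attributes"):
            changed += 1                                     # attributes drifted
    total = max(len(expected), 1)
    missing_rate = missing / total
    changed_rate = changed / total
    return {
        "missing_rate": missing_rate,
        "changed_rate": changed_rate,
        "block_deployment": missing_rate > missing_limit or changed_rate > changed_limit,
    }
```

Running this on every deployment and weekly makes model-version drift visible as a rate, rather than as anecdotal quality complaints.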
6.4 Few-Shot Example Management
Few-shot examples are the most important artifact in the extraction pipeline. They are treated as code: versioned in git, reviewed in PRs, and tested against the regression suite.
- Minimum 3 examples per extraction class. Target 5 for `competitor_page` (highest variety of input structures).
- Examples must use real data from our actual competitor landscape, not synthetic text. Anonymize if needed, but structure and complexity must be representative.
- Each example must include at least one edge case (e.g., a page with no H2 headings, a claim with no citation, a CTA buried in body text).
- Examples are stored as Python data structures in `/langextract/examples/{class_name}.py`. The sidecar loads them at startup.
- Any change to examples requires running the regression suite before merge.
7. Cost Model
All costs use Gemini 2.5 Flash pricing: $0.60 per 1M output tokens (as of March 2026). Verify current pricing before implementation.
7.1 Per-Extraction Cost Estimates
| Document Type | Avg Input Tokens | Passes | Avg Output Tokens | Cost Per Doc |
|---|---|---|---|---|
| competitor_page | ~6,000 | 2 | ~1,500 | $0.0036 |
| our_page | ~5,000 | 1 | ~1,200 | $0.0015 |
| serp | ~1,500 | 1 | ~800 | $0.0007 |
| ai_overview | ~1,000 | 2 | ~600 | $0.0010 |
| brand_voice | ~4,000 | 1 | ~1,000 | $0.0012 |
Note: Input tokens include the `prompt_description` + few-shot examples + the document chunk. Few-shot examples add ~500–1,500 tokens per request depending on class. This overhead is multiplied per chunk.
7.2 Monthly Volume Estimates (MVP Scale)
Assumptions: 10 competitors, ~8 new/changed pages per competitor per week, 50 pieces published by us per month, 100 tracked keywords.
| Extraction Class | Monthly Volume | Cost | Notes |
|---|---|---|---|
| competitor_page | ~320 docs | $1.15 | 80/week × 4 weeks |
| our_page | ~50 docs | $0.08 | 50 published pieces |
| serp | ~400 snapshots | $0.28 | 100 keywords × 4 weeks |
| ai_overview | ~100 extractions | $0.10 | Estimated 25% of SERPs have an AIO |
| brand_voice | ~2 runs | $0.002 | Rare; only on strategy change |
| TOTAL | | $1.62/month | Gemini extraction only; excludes Firecrawl, Semrush, hosting |
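The volume column follows directly from the stated assumptions, which can be verified with simple arithmetic:

```python
# Reproducing the monthly volume estimates in Section 7.2 from the stated assumptions.
COMPETITORS = 10
PAGES_PER_COMPETITOR_PER_WEEK = 8   # new + changed pages
PUBLISHED_PER_MONTH = 50
TRACKED_KEYWORDS = 100
WEEKS_PER_MONTH = 4
AIO_RATE = 0.25                     # estimated share of SERPs with an AI Overview

competitor_docs = COMPETITORS * PAGES_PER_COMPETITOR_PER_WEEK * WEEKS_PER_MONTH  # 320
serp_snapshots = TRACKED_KEYWORDS * WEEKS_PER_MONTH                              # 400
aio_extractions = int(serp_snapshots * AIO_RATE)                                 # 100
```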
The parent spec estimated $50–100/month for total LLM costs. At $1.62/month for extraction alone, extraction is a negligible cost center. The majority of that $50–100 budget will be consumed by hierarchical summary generation (a separate, more token-intensive process defined in a future addendum).
8. Open Questions for Review
The following decisions are flagged for team review before implementation begins:
| # | Question | Impact | Recommended Default |
|---|---|---|---|
| 1 | Should extraction_passes=2 be the default for competitor_page, or should we start with 1 and upgrade only if recall is insufficient? | Cost (2×) vs. recall | Start with 1, measure recall on first 50 pages, upgrade if <85% topic coverage |
| 2 | The max_char_buffer of 1500 is a hypothesis. Do we run the buffer size experiment (Section 4.1) before or during implementation? | Blocks schema finalization | Before. Allocate 2 days for testing against 10 real competitor pages |
| 3 | Should we extract from Firecrawl's raw text output or from a cleaned version that strips navigation, footers, and sidebars? | Extraction noise level | Use Firecrawl's main content extraction (not raw HTML). Test quality of Firecrawl's content extraction first |
| 4 | The brand_voice extraction runs against 5–10 curated samples. Who selects those samples, and what's the selection criteria? | Voice extraction quality | Content lead selects. Criteria: pieces that best represent the target voice, not the highest-performing pieces |
| 5 | Do we need a separate extraction class for FAQ/PAA-style content on competitor pages, or does the existing claim + topic schema cover it? | Schema complexity | Defer. The claim class with claim_type=fact covers FAQ-style content adequately for MVP |
| 6 | Cross-chunk coreference adds latency. Acceptable threshold for extraction time per document? | User experience / pipeline speed | 90 seconds max per document. If coreference pushes beyond this, disable and accept lower entity resolution |
End of Addendum #1
Next addendum: Hierarchical Summary Generation — defining how LangExtract outputs are aggregated into navigable Level 0–3 summaries for agent consumption.