ContentEngine — Technical Addendum #1
LangExtract Implementation & Scope
| Field | Value |
|---|---|
| Author | Alton Wells |
| Date | March 2026 |
| Status | Draft for Review |
| Parent Spec | ContentEngine Technical Specification v3 |
| Scope | LangExtract extraction layer: integration, schemas, triggers, cost, QA |
1. Purpose & Scope
This addendum defines the exact scope, integration pattern, extraction schemas, trigger logic, quality assurance strategy, and cost model for LangExtract within the ContentEngine MVP. It supersedes any conflicting detail in the parent specification and constrains implementation to the minimum viable extraction layer.
1.1 What This Addendum Covers
- Integration architecture: Python sidecar vs. Node SDK decision and rationale
- Six extraction class definitions with full attribute schemas and few-shot example structure
- Processing parameters: passes, chunking, concurrency per document type
- Trigger model: when extraction runs and what initiates it
- Quality assurance: validation, drift monitoring, regression testing
- Cost model: token-level estimates based on actual document volumes
1.2 What This Addendum Does Not Cover
- Filesystem-as-context architecture (deferred per project decision)
- Hierarchical summary generation (separate addendum)
- Graph relationship builder logic (separate addendum)
- Agent prompt engineering (separate addendum)
1.3 Key MVP Constraint
Competitor content is extracted once on discovery. Weekly monitoring detects new/changed pages only. Re-extraction of unchanged content is out of scope for MVP.
2. Integration Architecture
2.1 Decision: Python Sidecar (Not Node SDK)
The parent spec calls for a FastAPI sidecar running LangExtract in Python. An unofficial Node.js SDK exists. This addendum recommends the Python sidecar for the following reasons:
- The official LangExtract library is Python-only (google/langextract, Apache 2.0, 34.4k GitHub stars). The Node SDK is unofficial, community-maintained, and has no guaranteed feature parity.
- Critical features used in ContentEngine are Python-only: multi-pass extraction (`extraction_passes`), cross-chunk coreference resolution, Vertex AI batch processing, and controlled generation via Gemini schema constraints.
- The Node SDK documentation explicitly states it requires `fence_output=True` and `use_schema_constraints=False` for non-Gemini models, suggesting incomplete schema enforcement.
- LangExtract version 1.1.1 is current. The cross-chunk context awareness feature (coreference resolution) shipped 3 months ago and is not present in the Node SDK.
Risk acceptance: The Python–TypeScript bridge is a single point of failure. Mitigation is defined in Section 6 (health checks, circuit breaker, graceful degradation).
2.2 Sidecar Service Design
The LangExtract service runs as a standalone FastAPI application deployed on Railway alongside the Mastra agent workers. It exposes a minimal HTTP API that Mastra tools call via fetch.
Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| POST /extract | POST | Single document extraction with class-specific config |
| POST /extract/batch | POST | Multi-document batch extraction |
| GET /health | GET | Liveness + model connectivity check |
| GET /schemas | GET | Returns current extraction class definitions (for debugging) |
Request Schema: POST /extract
```json
{
  "document_type": "competitor_page" | "our_page" | "serp" | "ai_overview" | "brand_voice" | "keyword_data",
  "text": "string (raw text content)",
  "url": "string (optional, source URL for provenance)",
  "extraction_overrides": {
    "extraction_passes": "number (default per doc type)",
    "max_char_buffer": "number (default per doc type)",
    "max_workers": "number (default: 10)"
  }
}
```
The service selects the correct `prompt_description`, examples, and processing parameters based on `document_type`. Overrides allow per-request tuning without redeploying the service.
Response Schema
```json
{
  "extractions": [
    {
      "extraction_class": "string",
      "extraction_text": "string (verbatim from source)",
      "attributes": { "key": "value" },
      "source_location": { "start": 0, "end": 42 }
    }
  ],
  "document_length": 14523,
  "extraction_count": 37,
  "processing_time_ms": 2840,
  "passes_completed": 2,
  "model_id": "gemini-2.5-flash"
}
```
3. Extraction Class Definitions
ContentEngine defines six extraction classes. Each class has a fixed schema, a dedicated prompt_description, and a minimum of three few-shot examples. Few-shot examples are versioned in the repository under /langextract/examples/ and are loaded at service startup.
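The shape of one such example file can be sketched as follows. Plain dicts are shown purely to illustrate the structure; the real files would construct the langextract library's example objects, and the sample text and attribute values here are invented placeholders.

```python
# Illustrative shape of a few-shot example file, e.g. /langextract/examples/competitor_page.py.
# Plain dicts shown for clarity only; the real files wrap these in the library's
# example types. Text and attribute values below are placeholders, not real data.
EXAMPLES = [
    {
        "text": "Our ROI calculator shows teams save 12 hours per week on reporting.",
        "extractions": [
            {
                "extraction_class": "claim",
                "extraction_text": "teams save 12 hours per week on reporting",
                "attributes": {
                    "claim_type": "stat",
                    "source_cited": False,
                    "specificity": "high",
                },
            },
            {
                "extraction_class": "content_structure",
                "extraction_text": "ROI calculator",
                "attributes": {
                    "element_type": "calculator",
                    "purpose": "lead capture",
                    "word_count_estimate": 0,
                },
            },
        ],
    },
]
```

Note the invariant worth enforcing in review: every `extraction_text` in an example must appear verbatim in that example's `text`, mirroring the source grounding check applied to real extractions (Section 6.1).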
3.1 competitor_page
Trigger: New or changed competitor page detected by weekly Firecrawl sitemap diff.
Input: Raw text extracted by Firecrawl (HTML stripped, JS rendered). Typical length 1,500–10,000 words.
Extraction entities per page:
| Entity Class | extraction_text Source | Required Attributes |
|---|---|---|
| topic | Section heading or topic phrase | depth (shallow\|deep), section_position, parent_topic |
| claim | Specific factual or statistical claim | claim_type (stat\|opinion\|fact), source_cited (bool), specificity (high\|medium\|low) |
| keyword_signal | Phrase appearing in H1/H2/title/meta | placement (h1\|h2\|title\|meta\|body), estimated_intent (info\|nav\|transact\|commercial) |
| content_structure | Structural element description | element_type (table\|list\|image\|video\|code\|calculator\|tool), purpose, word_count_estimate |
| cta | Call-to-action text | cta_type (link\|button\|form\|download), target_action, position (above_fold\|inline\|footer) |
| entity_reference | Named entity (product, person, brand) | entity_type (product\|person\|brand\|tool), sentiment (pos\|neutral\|neg), context |
Design note: We extract structural elements (tables, calculators, videos) as entities rather than ignoring them. This lets the Strategy Agent know WHAT competitors are doing, not just what they wrote. A competitor page with an interactive ROI calculator is strategically different from one with only text.
3.2 our_page
Trigger: Post-publish pipeline (new content) or initial system bootstrap (existing content inventory).
Uses the same entity classes as competitor_page, plus one additional:
| Entity Class | extraction_text Source | Required Attributes |
|---|---|---|
| internal_link | Anchor text of outbound internal link | target_url, context_sentence, anchor_type (exact_match\|partial\|branded\|generic) |
The internal_link extraction feeds directly into the content relationship graph (links_to edges) and validates against the SEO check for anchor text diversity.
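One way the `anchor_type` attribute could be derived deterministically, for validation against the extraction output, is sketched below. The classification rules, brand vocabulary, and function name are all assumptions for illustration; in the real system they would come from strategy configuration.

```python
# Illustrative anchor_type classifier for internal_link extractions (Section 3.2).
# Rules and brand terms below are assumptions, not a spec.
BRAND_TERMS = {"contentengine"}                      # hypothetical brand vocabulary
GENERIC_ANCHORS = {"click here", "learn more", "read more", "this article"}

def classify_anchor(anchor_text: str, target_keyword: str) -> str:
    anchor = anchor_text.lower().strip()
    keyword = target_keyword.lower().strip()
    if anchor in GENERIC_ANCHORS:
        return "generic"
    if any(term in anchor for term in BRAND_TERMS):
        return "branded"
    if anchor == keyword:
        return "exact_match"
    if keyword in anchor or any(w in anchor.split() for w in keyword.split()):
        return "partial"
    return "generic"
```

A deterministic classifier like this makes the anchor-text-diversity SEO check reproducible: diversity can be measured as the distribution of `anchor_type` values across a page's `internal_link` extractions.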
3.3 serp
Trigger: Weekly SERP snapshot job for tracked keywords.
| Entity Class | extraction_text Source | Required Attributes |
|---|---|---|
| serp_result | Title + snippet of organic result | position, url, domain, result_type (organic\|featured_snippet\|paa\|video\|image_pack) |
| serp_feature | Feature element on SERP | feature_type (ai_overview\|featured_snippet\|paa\|knowledge_panel\|local_pack), our_site_present (bool) |
| paa_question | People Also Ask question text | position_in_paa, related_to_primary_keyword (bool) |
3.4 ai_overview
Trigger: Detected during SERP snapshot when serp_feature with feature_type=ai_overview is present.
| Entity Class | extraction_text Source | Required Attributes |
|---|---|---|
| aio_claim | Individual claim within AI Overview | cited_source_url, cited_source_domain, claim_type (stat\|fact\|recommendation), our_site_cited (bool) |
| aio_structure | Structural pattern of the overview | format (paragraph\|list\|table\|steps), length_estimate (short\|medium\|long), source_count |
Note: The SEO research notes show 97% of AI Overview citations come from pages already ranking in the top 20. Extracting cited sources lets the Strategy Agent prioritize pages that are citation-eligible based on current rank.
3.5 brand_voice
Trigger: On strategy creation or update. Run once against a human-curated sample set (5–10 representative pieces).
| Entity Class | extraction_text Source | Required Attributes |
|---|---|---|
| tone_marker | Sentence or phrase exemplifying tone | tone_quality (authoritative\|conversational\|technical\|witty\|empathetic), intensity (strong\|moderate\|subtle) |
| vocabulary_preference | Distinctive word or phrase choice | category (jargon\|branded_term\|colloquial\|formal), frequency (always\|often\|sometimes), avoid (bool) |
| sentence_pattern | Representative sentence structure | pattern_type (short_declarative\|compound\|rhetorical_question\|list_intro), typical_length_words |
Brand voice extraction runs infrequently and on small document sets. Single-pass extraction at max_char_buffer=2000 is sufficient. This is the lowest-cost extraction class.
3.6 keyword_data
This class may not require LangExtract at all. Semrush/Ahrefs API responses are already structured JSON. LangExtract is only used here if we need to extract keyword intent and topical clustering from unstructured keyword research notes or analyst reports. For MVP, this class is deferred — keyword data enters the system directly from API responses via Mastra tools.
4. Processing Parameters by Document Type
Each document type has default processing parameters tuned to its typical length, complexity, and extraction density. These are configurable per-request via extraction_overrides but defaults should be correct for 90%+ of cases.
| Document Type | extraction_passes | max_char_buffer | max_workers | Rationale |
|---|---|---|---|---|
| competitor_page | 2 | 1500 | 10 | Web content has mixed structure (nav, CTAs, sidebars). A 1500-char buffer avoids splitting mid-section. Two passes balance recall vs. cost. |
| our_page | 1 | 1500 | 10 | Our content is cleaner (no nav/sidebar noise in the CMS body). A single pass is sufficient; we control the source quality. |
| serp | 1 | 500 | 5 | SERP snapshots are short and highly structured. A small buffer keeps each result isolated. |
| ai_overview | 2 | 1000 | 5 | AI Overviews are concise but citation-dense. Two passes improve citation recall. |
| brand_voice | 1 | 2000 | 3 | Small corpus; a longer context helps preserve sentence-level patterns. Low concurrency is fine. |
4.1 Why 1500 Characters for Web Content
The parent spec and LangExtract's Romeo & Juliet example use max_char_buffer=1000. For literary text, this works because paragraphs are self-contained. Web content is different: a single section with an H2 heading, introductory paragraph, and supporting table can easily span 1,200–1,800 characters. At 1000-char chunks, the heading is separated from its content, breaking the extraction context.
Testing against actual competitor pages in our target verticals is required before finalizing this parameter. The acceptance criteria: extraction of topic entities must preserve the association between section headings and their content in 95%+ of cases.
Action item: Before implementation, run LangExtract against 10 representative competitor pages at buffer sizes of 1000, 1500, and 2000. Measure topic extraction accuracy (heading-content association) at each size. Document results in a testing log.
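The heading-separation failure mode can be illustrated with a naive fixed-size chunker. This is a deliberate simplification: LangExtract's real chunking respects sentence boundaries, but the underlying problem is the same, since a section longer than the buffer is split regardless.

```python
# Naive fixed-size chunker illustrating the buffer-size tradeoff in Section 4.1.
# LangExtract's actual chunking is sentence-aware; this simplification only
# demonstrates the failure mode when a section exceeds the buffer.
def chunk(text: str, max_char_buffer: int) -> list[str]:
    return [text[i : i + max_char_buffer] for i in range(0, len(text), max_char_buffer)]

# A synthetic ~1,600-character section: H2 heading plus supporting body text.
section = "## Pricing Comparison\n" + ("Competitor A charges more per seat. " * 44)

at_1000 = chunk(section, 1000)
at_2000 = chunk(section, 2000)

# At a 1000-char buffer the heading lands in chunk 1 while much of the body
# falls into chunk 2, so heading-content association is lost without
# cross-chunk context. At 2000 the whole section stays together.
assert len(at_1000) == 2 and "Pricing Comparison" not in at_1000[1]
assert len(at_2000) == 1
```

The buffer-size experiment above is essentially this check run against real pages, with extraction accuracy measured instead of raw chunk counts.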
4.2 Cross-Chunk Coreference Resolution
LangExtract v1.1.0 added cross-chunk context awareness for coreference resolution. This feature is critical for competitor page extraction: a page may introduce "our platform" in the first paragraph and reference "it" for the next 3,000 words. Without coreference, chunks 2+ lose entity context.
For MVP, enable cross-chunk coreference for competitor_page and our_page document types. Leave disabled for serp, ai_overview, and brand_voice (these are short or independently structured documents where cross-chunk context adds cost without benefit).
5. Trigger Model
Extraction does not run on a schedule. It runs in response to specific events. This is the core MVP constraint: extract once on discovery, not continuously.
| Trigger Event | Extraction Class | Frequency | Initiated By |
|---|---|---|---|
| New competitor page detected | competitor_page | On discovery (weekly scan) | Trigger.dev job: weekly-sitemap-diff |
| Changed competitor page detected | competitor_page | On detection (weekly scan) | Trigger.dev job: weekly-sitemap-diff |
| New content published (ours) | our_page | On publish event | Publishing Agent post-publish pipeline |
| System bootstrap (existing content) | our_page | Once at system initialization | Manual script / Trigger.dev one-time job |
| Weekly SERP snapshot | serp + ai_overview | Weekly per tracked keyword | Trigger.dev job: weekly-serp-snapshot |
| Strategy created/updated | brand_voice | On strategy change event | Content Strategy settings UI save action |
5.1 The Weekly Sitemap Diff Job
This is the only recurring extraction trigger for competitor content. The job logic:
- Firecrawl crawls each competitor's sitemap (or site structure if no sitemap).
- Compare returned URLs + content hashes against the `competitor_pages` table.
- New URLs (not in table): mark as `new`, queue for extraction.
- Existing URLs with a changed `content_hash`: mark as `changed`, queue for re-extraction. Old extraction rows are soft-deleted (retained for historical comparison); the new extraction replaces them.
- Existing URLs with an unchanged `content_hash`: skip entirely. No extraction cost.
- Removed URLs (in table but not in sitemap): mark as `removed` in `competitor_changes`. No extraction needed.
Cost implication: If a competitor publishes 5 new pages per week and updates 3 existing ones, that's 8 extraction jobs per competitor per week. At 10 competitors, that's ~80 extractions/week — well within budget.
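The classification step above can be sketched as a pure function over two URL-to-hash maps: one from the fresh Firecrawl crawl, one from the `competitor_pages` table. The function name and return shape are illustrative assumptions.

```python
# Sketch of the weekly sitemap diff classification (Section 5.1).
# Inputs are URL -> content_hash maps; names are illustrative.
def classify_pages(crawled: dict[str, str], stored: dict[str, str]) -> dict[str, list[str]]:
    result: dict[str, list[str]] = {"new": [], "changed": [], "unchanged": [], "removed": []}
    for url, content_hash in crawled.items():
        if url not in stored:
            result["new"].append(url)          # queue for extraction
        elif stored[url] != content_hash:
            result["changed"].append(url)      # queue for re-extraction
        else:
            result["unchanged"].append(url)    # skip entirely: no extraction cost
    result["removed"] = [url for url in stored if url not in crawled]
    return result
```

Only the `new` and `changed` buckets ever generate extraction cost, which is what keeps the weekly job within the budget described above.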
5.2 The Post-Publish Extraction Pipeline
When the Publishing Agent pushes content to the CMS, the following extraction chain fires:
- CMS confirms publish success (HTTP 200/201).
- Trigger.dev job fires: fetch published page raw text via Firecrawl or CMS API.
- `POST /extract` with `document_type=our_page`.
- On extraction success: write rows to the `our_page_extractions` table.
- Emit event: `extraction_complete` with `page_id`. Downstream consumers (summary generation, graph builder) subscribe to this event.
Expected latency from publish to extraction_complete: 30–90 seconds depending on page length. Summary regeneration and graph updates are separate downstream jobs that do not block the extraction pipeline.
6. Quality Assurance & Reliability
6.1 Extraction Validation
Every extraction response from the sidecar is validated before being persisted to Postgres. Validation is deterministic code, not LLM-based.
- Schema validation: every extraction must have a valid `extraction_class`, a non-empty `extraction_text`, and attributes matching the required schema for that class. Malformed extractions are logged and dropped.
- Source grounding check: `extraction_text` must appear verbatim in the source document (within the `text` field of the request). LangExtract returns `source_location`; we verify it. Extractions that fail grounding are flagged for review.
- Duplicate detection: if two extractions in the same response have an identical `extraction_class` + `extraction_text`, keep the one with richer attributes and drop the other.
- Minimum extraction threshold: if a document of 2,000+ words produces fewer than 5 extractions, flag it as a potential extraction failure. Do not persist; queue for re-extraction with `extraction_passes + 1`.
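The four checks can be sketched as one deterministic pass. Helper and field names are illustrative; the thresholds (5 extractions, 2,000 words) come from this section.

```python
# Sketch of the deterministic validation pass (Section 6.1). Names are illustrative.
def validate_extractions(source_text: str, extractions: list[dict], word_count: int) -> dict:
    flagged: list[dict] = []
    seen: dict = {}  # (class, text) -> extraction with the richest attributes
    for ext in extractions:
        # Schema validation: class and non-empty text are required.
        if not ext.get("extraction_class") or not ext.get("extraction_text"):
            continue  # malformed: log and drop
        # Source grounding: extraction_text must appear verbatim in the source.
        loc = ext.get("source_location", {})
        grounded = source_text[loc.get("start", 0):loc.get("end", 0)] == ext["extraction_text"]
        if not grounded and ext["extraction_text"] not in source_text:
            flagged.append(ext)  # failed grounding: flag for review
            continue
        # Duplicate detection: keep the extraction with richer attributes.
        key = (ext["extraction_class"], ext["extraction_text"])
        if key not in seen or len(ext.get("attributes", {})) > len(seen[key].get("attributes", {})):
            seen[key] = ext
    valid = list(seen.values())
    # Minimum extraction threshold: 2,000+ word docs with <5 extractions are suspect.
    needs_retry = word_count >= 2000 and len(valid) < 5
    return {"valid": valid, "flagged": flagged, "needs_retry": needs_retry}
```

Because this is plain code with no LLM in the loop, it can run synchronously between the sidecar response and the Postgres write without adding meaningful latency.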
6.2 Sidecar Health & Circuit Breaker
The Mastra tool that calls the LangExtract sidecar implements a circuit breaker pattern:
- Health check: `GET /health` every 60 seconds. The health check verifies that FastAPI is responding AND that the Gemini API key is valid (it makes a minimal extraction call).
- Circuit states: CLOSED (normal), OPEN (failing: all requests return a cached fallback or an error), HALF-OPEN (testing recovery with a single request).
- Threshold: 3 consecutive failures or >50% failure rate in a 5-minute window opens the circuit.
- Recovery: After 60 seconds in OPEN, move to HALF-OPEN. One successful request closes the circuit.
- Fallback when open: Extraction jobs are queued in Trigger.dev for retry when the circuit closes. No data is lost; publication does not proceed without extraction.
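The state machine above can be sketched as follows. The real implementation lives in the Mastra (TypeScript) tool; Python is used here only to show the logic, the consecutive-failure path is implemented, and the 50%-failure-rate window variant is omitted for brevity.

```python
# Sketch of the circuit breaker in Section 6.2. Consecutive-failure threshold only;
# the 5-minute failure-rate window described in the text is omitted for brevity.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, recovery_seconds: float = 60.0,
                 clock=time.monotonic):
        self.state = "CLOSED"
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.opened_at = 0.0
        self.clock = clock  # injectable for testing

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.recovery_seconds:
                self.state = "HALF_OPEN"   # test recovery with a single request
                return True
            return False                   # caller queues the job for retry instead
        return True

    def record_success(self) -> None:
        self.failures = 0
        self.state = "CLOSED"              # one successful request closes the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

Injecting the clock keeps the OPEN-to-HALF-OPEN transition testable without real 60-second waits.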
6.3 Gemini Model Version Drift
LangExtract uses Gemini 2.5 Flash. Google retires model versions on a defined lifecycle. When a model version changes, extraction output may differ even with identical inputs and examples.
Mitigation strategy:
- Pin the model version explicitly in the sidecar config (`model_id="gemini-2.5-flash"`, with a version suffix when available).
- Maintain a regression test suite: 10 documents (2 per extraction class) with known-good extraction outputs. Run the suite on every sidecar deployment and on a weekly schedule.
- Regression threshold: If >15% of expected extractions are missing or >10% of attributes have changed, block deployment and alert the team.
- Model migration process: When a new Gemini version is available, run the regression suite against it, compare outputs, adjust few-shot examples if needed, then cut over.
The few-shot examples ARE the specification. If extraction quality drifts, the first response is always to review and improve examples, not to add more extraction passes or change parameters.
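The regression gate can be sketched as a comparison of a new run against the known-good outputs, blocking deployment past either threshold. Function and field names are illustrative; the 15% and 10% thresholds come from this section.

```python
# Sketch of the regression gate in Section 6.3. Names are illustrative;
# thresholds (15% missing, 10% attributes changed) are from the text.
def regression_check(expected: list[dict], actual: list[dict],
                     missing_limit: float = 0.15, changed_limit: float = 0.10) -> dict:
    actual_by_key = {(e["extraction_class"], e["extraction_text"]): e for e in actual}
    missing = changed = 0
    for exp in expected:
        got = actual_by_key.get((exp["extraction_class"], exp["extraction_text"]))
        if got is None:
            missing += 1                                     # expected extraction absent
        elif got.get("attributes") != exp.get("attributes"):
            changed += 1                                     # attributes drifted
    total = max(len(expected), 1)
    missing_rate = missing / total
    changed_rate = changed / total
    return {
        "missing_rate": missing_rate,
        "changed_rate": changed_rate,
        "block_deployment": missing_rate > missing_limit or changed_rate > changed_limit,
    }
```

Running this on every deployment and weekly makes model-version drift visible as a rate, rather than as anecdotal quality complaints.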
6.4 Few-Shot Example Management
Few-shot examples are the most important artifact in the extraction pipeline. They are treated as code: versioned in git, reviewed in PRs, and tested against the regression suite.
- Minimum 3 examples per extraction class. Target 5 for `competitor_page` (highest variety of input structures).
- Examples must use real data from our actual competitor landscape, not synthetic text. Anonymize if needed, but structure and complexity must be representative.
- Each example must include at least one edge case (e.g., a page with no H2 headings, a claim with no citation, a CTA buried in body text).
- Examples are stored as Python data structures in `/langextract/examples/{class_name}.py`. The sidecar loads them at startup.
- Any change to examples requires running the regression suite before merge.
7. Cost Model
All costs use Gemini 2.5 Flash pricing: $0.60 per 1M output tokens (as of March 2026). Verify current pricing before implementation.
7.1 Per-Extraction Cost Estimates
| Document Type | Avg Input Tokens | Passes | Avg Output Tokens | Cost Per Doc |
|---|---|---|---|---|
| competitor_page | ~6,000 | 2 | ~1,500 | $0.0036 |
| our_page | ~5,000 | 1 | ~1,200 | $0.0015 |
| serp | ~1,500 | 1 | ~800 | $0.0007 |
| ai_overview | ~1,000 | 2 | ~600 | $0.0010 |
| brand_voice | ~4,000 | 1 | ~1,000 | $0.0012 |
Note: Input tokens include the `prompt_description` + few-shot examples + the document chunk. Few-shot examples add ~500–1,500 tokens per request depending on class. This overhead is multiplied per chunk.
7.2 Monthly Volume Estimates (MVP Scale)
Assumptions: 10 competitors, ~8 new/changed pages per competitor per week, 50 pieces published by us per month, 100 tracked keywords.
| Extraction Class | Monthly Volume | Cost | Notes |
|---|---|---|---|
| competitor_page | ~320 docs | $1.15 | 80/week × 4 weeks |
| our_page | ~50 docs | $0.08 | 50 published pieces |
| serp | ~400 snapshots | $0.28 | 100 keywords × 4 weeks |
| ai_overview | ~100 extractions | $0.10 | Estimated 25% of SERPs have an AIO |
| brand_voice | ~2 runs | $0.002 | Rare; only on strategy change |
| TOTAL | | $1.62/month | Gemini extraction only; excludes Firecrawl, Semrush, hosting |
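The volume column follows directly from the stated assumptions, which can be verified with simple arithmetic:

```python
# Reproducing the monthly volume estimates in Section 7.2 from the stated assumptions.
COMPETITORS = 10
PAGES_PER_COMPETITOR_PER_WEEK = 8   # new + changed pages
PUBLISHED_PER_MONTH = 50
TRACKED_KEYWORDS = 100
WEEKS_PER_MONTH = 4
AIO_RATE = 0.25                     # estimated share of SERPs with an AI Overview

competitor_docs = COMPETITORS * PAGES_PER_COMPETITOR_PER_WEEK * WEEKS_PER_MONTH  # 320
serp_snapshots = TRACKED_KEYWORDS * WEEKS_PER_MONTH                              # 400
aio_extractions = int(serp_snapshots * AIO_RATE)                                 # 100
```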
The parent spec estimated $50–100/month for total LLM costs. At $1.62/month for extraction alone, extraction is a negligible cost center. The majority of that $50–100 budget will be consumed by hierarchical summary generation (a separate, more token-intensive process defined in a future addendum).
8. Open Questions for Review
The following decisions are flagged for team review before implementation begins:
| # | Question | Impact | Recommended Default |
|---|---|---|---|
| 1 | Should extraction_passes=2 be the default for competitor_page, or should we start with 1 and upgrade only if recall is insufficient? | Cost (2×) vs. recall | Start with 1, measure recall on first 50 pages, upgrade if <85% topic coverage |
| 2 | The max_char_buffer of 1500 is a hypothesis. Do we run the buffer size experiment (Section 4.1) before or during implementation? | Blocks schema finalization | Before. Allocate 2 days for testing against 10 real competitor pages |
| 3 | Should we extract from Firecrawl's raw text output or from a cleaned version that strips navigation, footers, and sidebars? | Extraction noise level | Use Firecrawl's main content extraction (not raw HTML). Test quality of Firecrawl's content extraction first |
| 4 | The brand_voice extraction runs against 5–10 curated samples. Who selects those samples, and what's the selection criteria? | Voice extraction quality | Content lead selects. Criteria: pieces that best represent the target voice, not the highest-performing pieces |
| 5 | Do we need a separate extraction class for FAQ/PAA-style content on competitor pages, or does the existing claim + topic schema cover it? | Schema complexity | Defer. The claim class with claim_type=fact covers FAQ-style content adequately for MVP |
| 6 | Cross-chunk coreference adds latency. Acceptable threshold for extraction time per document? | User experience / pipeline speed | 90 seconds max per document. If coreference pushes beyond this, disable and accept lower entity resolution |
End of Addendum #1
Next addendum: Hierarchical Summary Generation — defining how LangExtract outputs are aggregated into navigable Level 0–3 summaries for agent consumption.