MDX Limo
Vendor Ingestion Architecture

Vendor Ingestion Architecture

Recommended file: /Users/wells/Documents/GitHub/peptide-portal/docs/vendor-ingestion-architecture.md

Summary

Build a review-first vendor ingestion pipeline that fits the current Next.js + Supabase app without putting scraping work inside the web app. The system will use a scheduled Bun worker for ingestion, Supabase Vault for credentials and browser sessions, and staged Supabase tables before anything reaches public marketplace tables.

The architecture supports slow vendor access lead time by starting with manual CSV/export/public sources, then layering in authenticated portal crawling as credentials arrive.

No AI is used. Product matching, parsing, normalization, and publishing must be deterministic. Ambiguous records go to human review.

Current Architecture Snapshot

Verified through local repo inspection and Supabase MCP on May 12, 2026.

  • App stack: Next.js 16.2.4, React 19.2.4, App Router, Supabase Auth/DB, Tailwind 4, shadcn/Radix UI, pnpm root package.
  • Live Supabase project: Peptides / eorheiigbzfqbkjizmox.
  • Core catalog tables:
    • vendors: 14 rows.
    • peptides: 58 rows.
    • categories: 9 rows.
    • vendor_products: 36 rows.
    • vial_options: 75 rows.
    • vendor_listings: 75 visible listing rows.
  • Existing Finnrick tables exist live:
    • finnrick_vendor_mapping: 14 rows.
    • finnrick_peptide_mapping: 58 rows.
    • finnrick_vendor_stats: 12 rows.
    • finnrick_product_tests: 102 rows.
  • Supabase extensions available:
    • supabase_vault 0.3.1.
    • pg_cron 1.6.4.
    • pg_net 0.20.0.
  • Supabase Vault is installed. MCP confirmed vault.secrets exists and contains one row, but secret values were not read.
  • Public RLS policies allow reads for active catalog data, but there are no public write policies for core catalog tables. This is good: ingestion should write with service-role/server-only access.

Existing Catalog Contract

The current app displays marketplace data through vendor_listings, which joins:

  • vendor_products.vendor_id -> vendors.id
  • vendor_products.peptide_id -> peptides.id
  • vial_options.vendor_product_id -> vendor_products.id

A listing is visible only when:

  • vendors.is_active = true
  • vendor_products.is_active = true
  • vial_options.is_available = true

Publishing must therefore produce or update:

  • vendors only when onboarding a new vendor.
  • vendor_products for product-level data: vendor_id, peptide_id, name, category, purity, salt, product_url, is_active.
  • vial_options for sellable variants: vendor_product_id, size, price, price_per_mg, is_available, sort_order.

Important: app-level peptide routes use slugs, but database relationships require peptides.id UUIDs. The worker must map scraped product names to peptide UUIDs before publishing.

Architecture

The Next app remains the reader/reviewer UI. The Bun worker owns crawling, parsing, normalization, retries, and staged writes.

Source Ladder

Every vendor adapter should try sources in this order:

  1. Manual seed files: existing CSVs in /Users/wells/Documents/GitHub/peptide-portal/docs/vendors.
  2. Vendor-provided exports: CSV, XLSX, JSON, XML, affiliate feeds.
  3. Partner/API access: Shopify Admin API, WooCommerce REST API, custom vendor APIs.
  4. Public structured crawl: sitemap, product JSON-LD, embedded product JSON, static HTML.
  5. Authenticated browser crawl: Playwright through Crawlee for login-gated catalogs.
  6. Managed browser infrastructure later if local browser jobs become flaky or blocked.

Playwright is the fallback for gated, JavaScript-heavy, or session-dependent portals. It should not be the default for sources that can be fetched directly.

Runtime Decision

Use a scheduled Bun worker, not an always-on sidecar.

  • Worker path: /Users/wells/Documents/GitHub/peptide-portal/workers/vendor-sync.
  • Runtime: Bun.
  • Browser framework: Crawlee with Playwright adapters only where required.
  • Deployment target: container host or scheduled job runner, not Vercel request handlers.
  • Trigger modes:
    • Manual CLI for development.
    • Scheduled cron on the worker host for production.
    • Optional later: Supabase pg_cron + pg_net can call a worker webhook, but the database must not run browser jobs.

Do not run authenticated browser automation inside:

  • Next.js route handlers.
  • Vercel serverless functions.
  • Supabase Edge Functions.

Proposed Database Additions

Create new private-by-policy tables in public so Supabase JS service-role access works without extra exposed-schema configuration. Enable RLS on all new tables and add no anon/auth public policies. Access should happen only through service-role worker code and server/admin review paths.

vendor_sync_sources

Stores one configured source per vendor per ingestion method.

Required columns:

  • id uuid primary key default gen_random_uuid()
  • vendor_id uuid references vendors(id)
  • display_name text not null
  • adapter_key text not null
  • source_type text not null
    • allowed: manual_csv, manual_export, public_http, platform_api, authenticated_browser, finnrick, other
  • base_url text
  • login_url text
  • catalog_url text
  • credential_secret_name text
  • session_secret_name text
  • config jsonb default '{}'
  • priority integer default 100
  • enabled boolean default false
  • publish_mode text default 'review'
    • allowed: dry_run, review, auto_safe
  • last_success_at timestamptz
  • created_at timestamptz default now()
  • updated_at timestamptz default now()

Indexes:

  • unique (vendor_id, adapter_key)
  • index (enabled, priority)
  • index (source_type)

vendor_sync_runs

Tracks every ingestion attempt.

Columns:

  • id uuid primary key default gen_random_uuid()
  • source_id uuid references vendor_sync_sources(id)
  • vendor_id uuid references vendors(id)
  • status text not null
    • allowed: queued, running, succeeded, partial, failed, cancelled
  • trigger_type text not null
    • allowed: manual, schedule, webhook, retry
  • mode text not null
    • allowed: dry_run, review, publish
  • started_at timestamptz default now()
  • completed_at timestamptz
  • duration_ms integer
  • products_seen integer default 0
  • products_extracted integer default 0
  • products_normalized integer default 0
  • variants_seen integer default 0
  • needs_review integer default 0
  • approved integer default 0
  • published integer default 0
  • failed integer default 0
  • error_summary text
  • error_details jsonb
  • artifact_prefix text
  • created_at timestamptz default now()

vendor_ingested_products

Stores raw and normalized product-level records for each run.

Columns:

  • id uuid primary key default gen_random_uuid()
  • run_id uuid references vendor_sync_runs(id)
  • source_id uuid references vendor_sync_sources(id)
  • vendor_id uuid references vendors(id)
  • external_product_key text not null
  • source_url text
  • source_sku text
  • raw_name text not null
  • raw_payload jsonb default '{}'
  • raw_payload_hash text not null
  • raw_html_hash text
  • normalized_name text
  • product_url text
  • peptide_candidate_name text
  • peptide_id uuid references peptides(id)
  • peptide_match_status text not null default 'unknown'
    • allowed: matched, ambiguous, unknown, ignored, rejected
  • peptide_match_rule text
  • peptide_match_notes text
  • category text
  • purity_text text
  • purity_percent numeric
  • salt text
  • coa_url text
  • stock_status text
  • normalized_hash text
  • review_status text not null default 'pending'
    • allowed: pending, approved, rejected, published, superseded
  • warnings jsonb default '[]'
  • created_at timestamptz default now()

Indexes:

  • unique (run_id, external_product_key)
  • index (source_id, external_product_key)
  • index (vendor_id, review_status)
  • index (peptide_id)

vendor_ingested_variants

Stores vial/variant records under ingested products.

Columns:

  • id uuid primary key default gen_random_uuid()
  • ingested_product_id uuid references vendor_ingested_products(id)
  • run_id uuid references vendor_sync_runs(id)
  • source_id uuid references vendor_sync_sources(id)
  • vendor_id uuid references vendors(id)
  • external_variant_key text not null
  • label text
  • size_text text not null
  • amount_mg numeric
  • vial_count integer default 1
  • total_mg numeric
  • price numeric
  • currency text default 'USD'
  • price_per_mg numeric
  • is_available boolean default true
  • sku text
  • raw_payload jsonb default '{}'
  • warnings jsonb default '[]'
  • created_at timestamptz default now()

Indexes:

  • unique (ingested_product_id, external_variant_key)
  • index (run_id)
  • index (vendor_id)

vendor_peptide_alias_rules

Human-maintained deterministic matching rules. This avoids AI matching.

Columns:

  • id uuid primary key default gen_random_uuid()
  • source_id uuid references vendor_sync_sources(id)
  • alias_normalized text not null
  • peptide_id uuid references peptides(id)
  • status text default 'active'
    • allowed: active, disabled
  • notes text
  • created_at timestamptz default now()
  • updated_at timestamptz default now()

Indexes:

  • unique (source_id, alias_normalized)
  • index (peptide_id)

Matching precedence:

  1. Vendor-specific alias rule from vendor_peptide_alias_rules.
  2. Exact normalized peptides.name.
  3. Exact normalized peptides.slug.
  4. Exact normalized values from peptides.aliases.
  5. Existing Finnrick mapping when source is Finnrick-derived.
  6. Otherwise unknown or ambiguous, requiring review.

Maps a vendor source’s stable external product key to the public vendor_products row. This prevents duplicate public products without adding risky uniqueness constraints to existing catalog tables.

Columns:

  • id uuid primary key default gen_random_uuid()
  • source_id uuid references vendor_sync_sources(id)
  • external_product_key text not null
  • vendor_product_id uuid references vendor_products(id)
  • first_seen_run_id uuid references vendor_sync_runs(id)
  • last_seen_run_id uuid references vendor_sync_runs(id)
  • link_status text default 'active'
    • allowed: active, inactive, needs_review
  • created_at timestamptz default now()
  • updated_at timestamptz default now()

Indexes:

  • unique (source_id, external_product_key)
  • index (vendor_product_id)

Maps external variant keys to public vial_options.

Columns:

  • id uuid primary key default gen_random_uuid()
  • source_id uuid references vendor_sync_sources(id)
  • external_variant_key text not null
  • vial_option_id uuid references vial_options(id)
  • vendor_product_id uuid references vendor_products(id)
  • first_seen_run_id uuid references vendor_sync_runs(id)
  • last_seen_run_id uuid references vendor_sync_runs(id)
  • link_status text default 'active'
  • created_at timestamptz default now()
  • updated_at timestamptz default now()

Indexes:

  • unique (source_id, external_variant_key)
  • index (vial_option_id)

vendor_product_observations

Keeps historical price/availability observations separate from public listing state.

Columns:

  • id uuid primary key default gen_random_uuid()
  • run_id uuid references vendor_sync_runs(id)
  • vendor_product_id uuid references vendor_products(id)
  • vial_option_id uuid references vial_options(id)
  • ingested_product_id uuid references vendor_ingested_products(id)
  • ingested_variant_id uuid references vendor_ingested_variants(id)
  • price numeric
  • price_per_mg numeric
  • is_available boolean
  • observed_at timestamptz default now()
  • source_url text
  • payload_hash text

Indexes:

  • index (vendor_product_id, observed_at desc)
  • index (vial_option_id, observed_at desc)

vendor_publish_events

Audits every change made to public catalog tables by the ingestion workflow.

Columns:

  • id uuid primary key default gen_random_uuid()
  • run_id uuid references vendor_sync_runs(id)
  • source_id uuid references vendor_sync_sources(id)
  • vendor_id uuid references vendors(id)
  • ingested_product_id uuid references vendor_ingested_products(id)
  • action text not null
    • allowed: create_product, update_product, create_variant, update_variant, mark_unavailable, mark_inactive, skip
  • target_table text not null
  • target_id uuid
  • before_json jsonb
  • after_json jsonb
  • actor_type text not null
    • allowed: worker, admin
  • actor_user_id uuid references profiles(id)
  • created_at timestamptz default now()

Vault Design

Use Supabase Vault for credentials and browser sessions.

Secret naming convention:

  • Credentials: vendor-sync:{vendor_slug}:{adapter_key}:credentials
  • Browser session state: vendor-sync:{vendor_slug}:{adapter_key}:playwright-session
  • Optional TOTP seed only if explicitly provided and legally allowed: vendor-sync:{vendor_slug}:{adapter_key}:totp

Credential secret JSON shape:

1{ 2 "username": "vendor-login@example.com", 3 "password": "redacted", 4 "notes": "Authorized vendor portal credentials", 5 "otpMode": "none" 6}

Playwright session secret JSON shape:

1{ 2 "storageState": {}, 3 "savedAt": "2026-05-12T00:00:00.000Z", 4 "expiresAt": null, 5 "loginUrl": "https://vendor.example.com/login" 6}

Access pattern:

  • The worker never stores secrets in normal tables.
  • vendor_sync_sources stores only Vault secret names.
  • Add service-role-only RPCs:
    • get_vendor_sync_secret(secret_name text) returns text
    • upsert_vendor_sync_secret(secret_name text, secret_value text, description text) returns uuid
  • Revoke execution from public, anon, and authenticated.
  • Grant execution only to service_role.
  • Never log secret values, cookies, authorization headers, or full browser storage state.

Worker Design

Suggested folder structure:

1workers/vendor-sync/ 2 package.json 3 bun.lock 4 src/ 5 index.ts 6 cli.ts 7 config/env.ts 8 db/supabase.ts 9 db/vault.ts 10 db/runs.ts 11 adapters/base.ts 12 adapters/manual-csv.ts 13 adapters/public-http.ts 14 adapters/platform-api.ts 15 adapters/authenticated-browser.ts 16 adapters/vendors/ 17 normalize/product.ts 18 normalize/price.ts 19 normalize/peptide-match.ts 20 publish/plan.ts 21 publish/apply.ts 22 artifacts/store.ts 23 logging.ts 24 fixtures/ 25 tests/

Core commands:

1bun run sync -- --source=<source-id> --mode=dry_run 2bun run sync -- --source=<source-id> --mode=review 3bun run sync:vendor -- --vendor=<vendor-slug> --mode=review 4bun run publish -- --run=<run-id> --approved-only 5bun run audit -- --vendor=<vendor-slug>

Environment variables:

  • SUPABASE_URL
  • SUPABASE_SERVICE_ROLE_KEY
  • VENDOR_SYNC_ARTIFACT_BUCKET
  • VENDOR_SYNC_DEFAULT_TIMEOUT_MS
  • VENDOR_SYNC_CONCURRENCY
  • VENDOR_SYNC_USER_AGENT

Adapter Interface

Every adapter returns deterministic extracted records.

1type VendorAdapterContext = { 2 sourceId: string 3 vendorId: string 4 sourceConfig: Record<string, unknown> 5 baseUrl?: string 6 loginUrl?: string 7 catalogUrl?: string 8 getSecret(name: string): Promise<string | null> 9 saveSecret(name: string, value: string): Promise<void> 10 artifactPrefix: string 11} 12 13type ExtractedVendorProduct = { 14 externalProductKey: string 15 rawName: string 16 sourceUrl?: string 17 productUrl?: string 18 sourceSku?: string 19 category?: string 20 purityText?: string 21 salt?: string 22 coaUrl?: string 23 stockStatus?: string 24 rawPayload: Record<string, unknown> 25 variants: ExtractedVendorVariant[] 26} 27 28type ExtractedVendorVariant = { 29 externalVariantKey: string 30 label?: string 31 sizeText: string 32 amountMg?: number 33 vialCount?: number 34 totalMg?: number 35 price?: number 36 currency?: "USD" 37 isAvailable?: boolean 38 sku?: string 39 rawPayload?: Record<string, unknown> 40}

Authenticated Browser Flow

  1. Load vendor_sync_sources row.
  2. Fetch credential secret from Vault.
  3. Fetch existing Playwright storageState secret if available.
  4. Start browser context with saved session.
  5. Visit catalog URL.
  6. If login gate appears, run adapter-specific login.
  7. If captcha, unsupported 2FA, or account approval appears, stop and mark run partial or failed.
  8. After successful login, save updated storageState back to Vault.
  9. Crawl product list and product detail pages with strict rate limits.
  10. Save screenshots/HTML hashes on failures.
  11. Emit extracted records.

Rules:

  • Do not bypass captchas.
  • Do not circumvent access controls.
  • Do not automate around forbidden 2FA flows.
  • Use only authorized credentials.
  • Respect vendor terms, crawl delays, and rate limits.
  • Treat cookies/session state as secrets.

Normalization Rules

Product normalization must be deterministic:

  • Normalize whitespace, punctuation, unicode fractions, and case.
  • Preserve original vendor names in raw_name.
  • Extract size patterns like 5mg, 10 mg, 2 x 5mg, 10mg vial, 30 mg kit.
  • Compute:
    • amount_mg
    • vial_count
    • total_mg
    • price_per_mg = price / total_mg
  • Parse purity only from explicit values like 99%, >99%, 99.5%.
  • Preserve uncertain purity as purity_text and warn instead of inventing values.
  • Map salts only from explicit terms such as acetate, tfa, hcl, base.
  • Normalize availability into in_stock, out_of_stock, backorder, unknown.
  • Unknown peptide matches stay staged and never publish.

Review-First Publishing

V1 publishing policy: review required before public listing changes.

Review queue shows:

  • Vendor/source/run.
  • Raw vendor product name.
  • Normalized name.
  • Candidate peptide match.
  • Parsed variants and prices.
  • COA/product URLs.
  • Warnings.
  • Diff against existing public listing, if linked.

Reviewer actions:

  • Approve product.
  • Reject product.
  • Assign peptide.
  • Add vendor-specific alias rule.
  • Link to existing vendor_products.
  • Create new public product on publish.
  • Approve specific variants.
  • Reject specific variants.

Publishing behavior:

  • Approved product creates or updates vendor_products.
  • Approved variants create or update vial_options.
  • Missing products are not immediately deactivated.
  • A product or variant should only be marked inactive/unavailable after a configurable number of consecutive misses, default 3.
  • Every public-table mutation writes a vendor_publish_events row.

Admin UI Integration

Add a small internal admin surface after schema and worker basics exist.

Recommended route:

  • /admin/vendor-ingestion

Access control:

  • Server-side only.
  • Supabase-authenticated user required.
  • Admin allowlist via INGESTION_ADMIN_EMAILS env var for v1.
  • Later replace with an admin_users or role table if needed.

Views:

  • Sync source list.
  • Run history.
  • Run detail.
  • Product review queue.
  • Product diff preview.
  • Publish event audit log.

Server actions:

  • approveIngestedProduct
  • rejectIngestedProduct
  • assignPeptideMatch
  • createAliasRule
  • linkExistingVendorProduct
  • publishApprovedRun

The client must not receive Vault secrets or service-role credentials.

Rollout Plan

Phase 1: Schema and type alignment

  • Add Supabase migrations for ingestion tables, indexes, RLS, and service-role-only Vault RPCs.
  • Regenerate /Users/wells/Documents/GitHub/peptide-portal/lib/supabase/database.types.ts.
  • Confirm live Finnrick tables are represented locally so generated types match MCP reality.

Phase 2: Worker skeleton

  • Create workers/vendor-sync.
  • Add Bun scripts and TypeScript config.
  • Add Supabase service-role client.
  • Add run creation/update helpers.
  • Add deterministic hashing and artifact naming.
  • Add dry-run mode.

Phase 3: Manual/CSV MVP

  • Build manual-csv adapter for current vendor CSVs.
  • Stage rows into ingestion tables.
  • Prove normalization and review flow without browser complexity.

Phase 4: Publish path

  • Build publish planner that produces diffs without applying them.
  • Build approved-only publisher.
  • Verify vendor_listings reflects published rows only after approval.

Phase 5: Public HTTP adapters

  • Add sitemap/JSON-LD/static HTML crawling through Crawlee HTTP/Cheerio flows.
  • Add per-vendor config for selectors and URL patterns.

Phase 6: Authenticated portal adapter

  • Add Playwright login/session handling.
  • Store credentials and sessions in Vault.
  • Run one gated vendor proof of concept.
  • Capture screenshots and failure artifacts for login drift.

Phase 7: Admin UI

  • Add internal review dashboard and server actions.
  • Keep all writes server-only.
  • Add audit visibility for publish events.

Testing Plan

Unit tests:

  • Price parser.
  • Size parser.
  • price_per_mg calculation.
  • Peptide alias normalization.
  • Deterministic match precedence.
  • Product hash stability.
  • Publish diff generation.

Fixture tests:

  • CSV import fixture.
  • Public HTML product grid fixture.
  • Product detail page fixture.
  • JSON-LD product fixture.
  • Authenticated portal fixture using saved mock HTML, not live credentials.

Integration tests:

  • Dry-run sync creates one vendor_sync_runs row.
  • Dry-run sync creates ingested products and variants.
  • Dry-run sync does not mutate vendors, vendor_products, or vial_options.
  • Approved publish creates or updates public catalog rows.
  • Re-running the same source does not create duplicate public products.
  • Missing products are not deactivated until the consecutive-miss threshold is reached.

Operational tests:

  • Failed login marks run failed/partial with sanitized error.
  • Expired session retries login once.
  • Captcha/2FA gate stops the run without bypass.
  • Secrets do not appear in logs.
  • Worker can be run locally and in a container.

Compliance and Safety

  • Use only authorized access.
  • Do not bypass technical access controls.
  • Do not scrape sources that prohibit the intended use.
  • Add vendor-specific rate limits.
  • Keep affiliate/commercial relationships auditable.
  • Preserve source provenance for every published listing.
  • Do not infer medical claims from scraped product pages.
  • Legal review is recommended before publishing automated vendor-derived catalog data at scale.

Non-Goals

  • No AI matching or AI extraction.
  • No direct public writes from client components.
  • No browser automation inside Next.js request handlers.
  • No automatic full publish in v1.
  • No captcha bypass.
  • No replacing the current marketplace UX as part of ingestion v1.

References

Assumptions

  • The plan will later be stored at /Users/wells/Documents/GitHub/peptide-portal/docs/vendor-ingestion-architecture.md.
  • Bun is used for the ingestion worker, while the existing Next app remains pnpm-managed.
  • Supabase Vault is the credential/session store.
  • Publishing is review-first in v1.
  • The first useful implementation target is CSV/manual ingestion, followed by one public HTTP adapter, then one authenticated Playwright adapter.
Vendor Ingestion Architecture | MDX Limo