Vendor Ingestion Architecture

Recommended file: /Users/wells/Documents/GitHub/peptide-portal/docs/vendor-ingestion-architecture.md

Summary

Build a review-first vendor ingestion pipeline that fits the current Next.js + Supabase app without putting scraping work inside the web app. The system will use a scheduled Bun worker for ingestion, Supabase Vault for credentials and browser sessions, and staged Supabase tables before anything reaches public marketplace tables.

The architecture supports slow vendor access lead time by starting with manual CSV/export/public sources, then layering in authenticated portal crawling as credentials arrive.

No AI is used. Product matching, parsing, normalization, and publishing must be deterministic. Ambiguous records go to human review.

Current Architecture Snapshot

Verified through local repo inspection and Supabase MCP on May 12, 2026.

App stack: Next.js 16.2.4, React 19.2.4, App Router, Supabase Auth/DB, Tailwind 4, shadcn/Radix UI, pnpm root package.
Live Supabase project: Peptides / eorheiigbzfqbkjizmox.
Core catalog tables:
- vendors: 14 rows.
- peptides: 58 rows.
- categories: 9 rows.
- vendor_products: 36 rows.
- vial_options: 75 rows.
- vendor_listings: 75 visible listing rows.
Existing Finnrick tables exist live:
- finnrick_vendor_mapping: 14 rows.
- finnrick_peptide_mapping: 58 rows.
- finnrick_vendor_stats: 12 rows.
- finnrick_product_tests: 102 rows.
Supabase extensions available:
- supabase_vault 0.3.1.
- pg_cron 1.6.4.
- pg_net 0.20.0.
Supabase Vault is installed. MCP confirmed vault.secrets exists and contains one row, but secret values were not read.
Public RLS policies allow reads for active catalog data, but there are no public write policies for core catalog tables. This is good: ingestion should write with service-role/server-only access.

Existing Catalog Contract

The current app displays marketplace data through vendor_listings, which joins:

vendor_products.vendor_id -> vendors.id
vendor_products.peptide_id -> peptides.id
vial_options.vendor_product_id -> vendor_products.id

A listing is visible only when:

vendors.is_active = true
vendor_products.is_active = true
vial_options.is_available = true

Publishing must therefore produce or update:

vendors only when onboarding a new vendor.
vendor_products for product-level data: vendor_id, peptide_id, name, category, purity, salt, product_url, is_active.
vial_options for sellable variants: vendor_product_id, size, price, price_per_mg, is_available, sort_order.

Important: app-level peptide routes use slugs, but database relationships require peptides.id UUIDs. The worker must map scraped product names to peptide UUIDs before publishing.

Architecture

The Next app remains the reader/reviewer UI. The Bun worker owns crawling, parsing, normalization, retries, and staged writes.

Source Ladder

Every vendor adapter should try sources in this order:

Manual seed files: existing CSVs in /Users/wells/Documents/GitHub/peptide-portal/docs/vendors.
Vendor-provided exports: CSV, XLSX, JSON, XML, affiliate feeds.
Partner/API access: Shopify Admin API, WooCommerce REST API, custom vendor APIs.
Public structured crawl: sitemap, product JSON-LD, embedded product JSON, static HTML.
Authenticated browser crawl: Playwright through Crawlee for login-gated catalogs.
Managed browser infrastructure later if local browser jobs become flaky or blocked.

Playwright is the fallback for gated, JavaScript-heavy, or session-dependent portals. It should not be the default for sources that can be fetched directly.

Runtime Decision

Use a scheduled Bun worker, not an always-on sidecar.

Worker path: /Users/wells/Documents/GitHub/peptide-portal/workers/vendor-sync.
Runtime: Bun.
Browser framework: Crawlee with Playwright adapters only where required.
Deployment target: container host or scheduled job runner, not Vercel request handlers.
Trigger modes:
- Manual CLI for development.
- Scheduled cron on the worker host for production.
- Optional later: Supabase pg_cron + pg_net can call a worker webhook, but the database must not run browser jobs.

Do not run authenticated browser automation inside:

Next.js route handlers.
Vercel serverless functions.
Supabase Edge Functions.

Proposed Database Additions

Create new private-by-policy tables in public so Supabase JS service-role access works without extra exposed-schema configuration. Enable RLS on all new tables and add no anon/auth public policies. Access should happen only through service-role worker code and server/admin review paths.

`vendor_sync_sources`

Stores one configured source per vendor per ingestion method.

Required columns:

id uuid primary key default gen_random_uuid()
vendor_id uuid references vendors(id)
display_name text not null
adapter_key text not null
source_type text not null
- allowed: manual_csv, manual_export, public_http, platform_api, authenticated_browser, finnrick, other
base_url text
login_url text
catalog_url text
credential_secret_name text
session_secret_name text
config jsonb default '{}'
priority integer default 100
enabled boolean default false
publish_mode text default 'review'
- allowed: dry_run, review, auto_safe
last_success_at timestamptz
created_at timestamptz default now()
updated_at timestamptz default now()

Indexes:

unique (vendor_id, adapter_key)
index (enabled, priority)
index (source_type)

`vendor_sync_runs`

Tracks every ingestion attempt.

Columns:

id uuid primary key default gen_random_uuid()
source_id uuid references vendor_sync_sources(id)
vendor_id uuid references vendors(id)
status text not null
- allowed: queued, running, succeeded, partial, failed, cancelled
trigger_type text not null
- allowed: manual, schedule, webhook, retry
mode text not null
- allowed: dry_run, review, publish
started_at timestamptz default now()
completed_at timestamptz
duration_ms integer
products_seen integer default 0
products_extracted integer default 0
products_normalized integer default 0
variants_seen integer default 0
needs_review integer default 0
approved integer default 0
published integer default 0
failed integer default 0
error_summary text
error_details jsonb
artifact_prefix text
created_at timestamptz default now()

`vendor_ingested_products`

Stores raw and normalized product-level records for each run.

Columns:

id uuid primary key default gen_random_uuid()
run_id uuid references vendor_sync_runs(id)
source_id uuid references vendor_sync_sources(id)
vendor_id uuid references vendors(id)
external_product_key text not null
source_url text
source_sku text
raw_name text not null
raw_payload jsonb default '{}'
raw_payload_hash text not null
raw_html_hash text
normalized_name text
product_url text
peptide_candidate_name text
peptide_id uuid references peptides(id)
peptide_match_status text not null default 'unknown'
- allowed: matched, ambiguous, unknown, ignored, rejected
peptide_match_rule text
peptide_match_notes text
category text
purity_text text
purity_percent numeric
salt text
coa_url text
stock_status text
normalized_hash text
review_status text not null default 'pending'
- allowed: pending, approved, rejected, published, superseded
warnings jsonb default '[]'
created_at timestamptz default now()

Indexes:

unique (run_id, external_product_key)
index (source_id, external_product_key)
index (vendor_id, review_status)
index (peptide_id)

`vendor_ingested_variants`

Stores vial/variant records under ingested products.

Columns:

id uuid primary key default gen_random_uuid()
ingested_product_id uuid references vendor_ingested_products(id)
run_id uuid references vendor_sync_runs(id)
source_id uuid references vendor_sync_sources(id)
vendor_id uuid references vendors(id)
external_variant_key text not null
label text
size_text text not null
amount_mg numeric
vial_count integer default 1
total_mg numeric
price numeric
currency text default 'USD'
price_per_mg numeric
is_available boolean default true
sku text
raw_payload jsonb default '{}'
warnings jsonb default '[]'
created_at timestamptz default now()

Indexes:

unique (ingested_product_id, external_variant_key)
index (run_id)
index (vendor_id)

`vendor_peptide_alias_rules`

Human-maintained deterministic matching rules. This avoids AI matching.

Columns:

id uuid primary key default gen_random_uuid()
source_id uuid references vendor_sync_sources(id)
alias_normalized text not null
peptide_id uuid references peptides(id)
status text default 'active'
- allowed: active, disabled
notes text
created_at timestamptz default now()
updated_at timestamptz default now()

Indexes:

unique (source_id, alias_normalized)
index (peptide_id)

Matching precedence:

Vendor-specific alias rule from vendor_peptide_alias_rules.
Exact normalized peptides.name.
Exact normalized peptides.slug.
Exact normalized values from peptides.aliases.
Existing Finnrick mapping when source is Finnrick-derived.
Otherwise unknown or ambiguous, requiring review.

`vendor_product_source_links`

Maps a vendor source’s stable external product key to the public vendor_products row. This prevents duplicate public products without adding risky uniqueness constraints to existing catalog tables.

Columns:

id uuid primary key default gen_random_uuid()
source_id uuid references vendor_sync_sources(id)
external_product_key text not null
vendor_product_id uuid references vendor_products(id)
first_seen_run_id uuid references vendor_sync_runs(id)
last_seen_run_id uuid references vendor_sync_runs(id)
link_status text default 'active'
- allowed: active, inactive, needs_review
created_at timestamptz default now()
updated_at timestamptz default now()

Indexes:

unique (source_id, external_product_key)
index (vendor_product_id)

`vendor_variant_source_links`

Maps external variant keys to public vial_options.

Columns:

id uuid primary key default gen_random_uuid()
source_id uuid references vendor_sync_sources(id)
external_variant_key text not null
vial_option_id uuid references vial_options(id)
vendor_product_id uuid references vendor_products(id)
first_seen_run_id uuid references vendor_sync_runs(id)
last_seen_run_id uuid references vendor_sync_runs(id)
link_status text default 'active'
created_at timestamptz default now()
updated_at timestamptz default now()

Indexes:

unique (source_id, external_variant_key)
index (vial_option_id)

`vendor_product_observations`

Keeps historical price/availability observations separate from public listing state.

Columns:

id uuid primary key default gen_random_uuid()
run_id uuid references vendor_sync_runs(id)
vendor_product_id uuid references vendor_products(id)
vial_option_id uuid references vial_options(id)
ingested_product_id uuid references vendor_ingested_products(id)
ingested_variant_id uuid references vendor_ingested_variants(id)
price numeric
price_per_mg numeric
is_available boolean
observed_at timestamptz default now()
source_url text
payload_hash text

Indexes:

index (vendor_product_id, observed_at desc)
index (vial_option_id, observed_at desc)

`vendor_publish_events`

Audits every change made to public catalog tables by the ingestion workflow.

Columns:

id uuid primary key default gen_random_uuid()
run_id uuid references vendor_sync_runs(id)
source_id uuid references vendor_sync_sources(id)
vendor_id uuid references vendors(id)
ingested_product_id uuid references vendor_ingested_products(id)
action text not null
- allowed: create_product, update_product, create_variant, update_variant, mark_unavailable, mark_inactive, skip
target_table text not null
target_id uuid
before_json jsonb
after_json jsonb
actor_type text not null
- allowed: worker, admin
actor_user_id uuid references profiles(id)
created_at timestamptz default now()

Vault Design

Use Supabase Vault for credentials and browser sessions.

Secret naming convention:

Credentials: vendor-sync:{vendor_slug}:{adapter_key}:credentials
Browser session state: vendor-sync:{vendor_slug}:{adapter_key}:playwright-session
Optional TOTP seed only if explicitly provided and legally allowed: vendor-sync:{vendor_slug}:{adapter_key}:totp

Credential secret JSON shape:

{
  "username": "vendor-login@example.com",
  "password": "redacted",
  "notes": "Authorized vendor portal credentials",
  "otpMode": "none"
}

Playwright session secret JSON shape:

{
  "storageState": {},
  "savedAt": "2026-05-12T00:00:00.000Z",
  "expiresAt": null,
  "loginUrl": "https://vendor.example.com/login"
}

Access pattern:

The worker never stores secrets in normal tables.
vendor_sync_sources stores only Vault secret names.
Add service-role-only RPCs:
- get_vendor_sync_secret(secret_name text) returns text
- upsert_vendor_sync_secret(secret_name text, secret_value text, description text) returns uuid
Revoke execution from public, anon, and authenticated.
Grant execution only to service_role.
Never log secret values, cookies, authorization headers, or full browser storage state.

Worker Design

Suggested folder structure:

workers/vendor-sync/
  package.json
  bun.lock
  src/
    index.ts
    cli.ts
    config/env.ts
    db/supabase.ts
    db/vault.ts
    db/runs.ts
    adapters/base.ts
    adapters/manual-csv.ts
    adapters/public-http.ts
    adapters/platform-api.ts
    adapters/authenticated-browser.ts
    adapters/vendors/
    normalize/product.ts
    normalize/price.ts
    normalize/peptide-match.ts
    publish/plan.ts
    publish/apply.ts
    artifacts/store.ts
    logging.ts
  fixtures/
  tests/

Core commands:

bun run sync -- --source=<source-id> --mode=dry_run
bun run sync -- --source=<source-id> --mode=review
bun run sync:vendor -- --vendor=<vendor-slug> --mode=review
bun run publish -- --run=<run-id> --approved-only
bun run audit -- --vendor=<vendor-slug>

Environment variables:

SUPABASE_URL
SUPABASE_SERVICE_ROLE_KEY
VENDOR_SYNC_ARTIFACT_BUCKET
VENDOR_SYNC_DEFAULT_TIMEOUT_MS
VENDOR_SYNC_CONCURRENCY
VENDOR_SYNC_USER_AGENT

Adapter Interface

Every adapter returns deterministic extracted records.

type VendorAdapterContext = {
  sourceId: string
  vendorId: string
  sourceConfig: Record<string, unknown>
  baseUrl?: string
  loginUrl?: string
  catalogUrl?: string
  getSecret(name: string): Promise<string | null>
  saveSecret(name: string, value: string): Promise<void>
  artifactPrefix: string
}

type ExtractedVendorProduct = {
  externalProductKey: string
  rawName: string
  sourceUrl?: string
  productUrl?: string
  sourceSku?: string
  category?: string
  purityText?: string
  salt?: string
  coaUrl?: string
  stockStatus?: string
  rawPayload: Record<string, unknown>
  variants: ExtractedVendorVariant[]
}

type ExtractedVendorVariant = {
  externalVariantKey: string
  label?: string
  sizeText: string
  amountMg?: number
  vialCount?: number
  totalMg?: number
  price?: number
  currency?: "USD"
  isAvailable?: boolean
  sku?: string
  rawPayload?: Record<string, unknown>
}

Authenticated Browser Flow

Load vendor_sync_sources row.
Fetch credential secret from Vault.
Fetch existing Playwright storageState secret if available.
Start browser context with saved session.
Visit catalog URL.
If login gate appears, run adapter-specific login.
If captcha, unsupported 2FA, or account approval appears, stop and mark run partial or failed.
After successful login, save updated storageState back to Vault.
Crawl product list and product detail pages with strict rate limits.
Save screenshots/HTML hashes on failures.
Emit extracted records.

Rules:

Do not bypass captchas.
Do not circumvent access controls.
Do not automate around forbidden 2FA flows.
Use only authorized credentials.
Respect vendor terms, crawl delays, and rate limits.
Treat cookies/session state as secrets.

Normalization Rules

Product normalization must be deterministic:

Normalize whitespace, punctuation, unicode fractions, and case.
Preserve original vendor names in raw_name.
Extract size patterns like 5mg, 10 mg, 2 x 5mg, 10mg vial, 30 mg kit.
Compute:
- amount_mg
- vial_count
- total_mg
- price_per_mg = price / total_mg
Parse purity only from explicit values like 99%, >99%, 99.5%.
Preserve uncertain purity as purity_text and warn instead of inventing values.
Map salts only from explicit terms such as acetate, tfa, hcl, base.
Normalize availability into in_stock, out_of_stock, backorder, unknown.
Unknown peptide matches stay staged and never publish.

Review-First Publishing

V1 publishing policy: review required before public listing changes.

Review queue shows:

Vendor/source/run.
Raw vendor product name.
Normalized name.
Candidate peptide match.
Parsed variants and prices.
COA/product URLs.
Warnings.
Diff against existing public listing, if linked.

Reviewer actions:

Approve product.
Reject product.
Assign peptide.
Add vendor-specific alias rule.
Link to existing vendor_products.
Create new public product on publish.
Approve specific variants.
Reject specific variants.

Publishing behavior:

Approved product creates or updates vendor_products.
Approved variants create or update vial_options.
Missing products are not immediately deactivated.
A product or variant should only be marked inactive/unavailable after a configurable number of consecutive misses, default 3.
Every public-table mutation writes a vendor_publish_events row.

Admin UI Integration

Add a small internal admin surface after schema and worker basics exist.

Recommended route:

/admin/vendor-ingestion

Access control:

Server-side only.
Supabase-authenticated user required.
Admin allowlist via INGESTION_ADMIN_EMAILS env var for v1.
Later replace with an admin_users or role table if needed.

Views:

Sync source list.
Run history.
Run detail.
Product review queue.
Product diff preview.
Publish event audit log.

Server actions:

approveIngestedProduct
rejectIngestedProduct
assignPeptideMatch
createAliasRule
linkExistingVendorProduct
publishApprovedRun

The client must not receive Vault secrets or service-role credentials.

Rollout Plan

Phase 1: Schema and type alignment

Add Supabase migrations for ingestion tables, indexes, RLS, and service-role-only Vault RPCs.
Regenerate /Users/wells/Documents/GitHub/peptide-portal/lib/supabase/database.types.ts.
Confirm live Finnrick tables are represented locally so generated types match MCP reality.

Phase 2: Worker skeleton

Create workers/vendor-sync.
Add Bun scripts and TypeScript config.
Add Supabase service-role client.
Add run creation/update helpers.
Add deterministic hashing and artifact naming.
Add dry-run mode.

Phase 3: Manual/CSV MVP

Build manual-csv adapter for current vendor CSVs.
Stage rows into ingestion tables.
Prove normalization and review flow without browser complexity.

Phase 4: Publish path

Build publish planner that produces diffs without applying them.
Build approved-only publisher.
Verify vendor_listings reflects published rows only after approval.

Phase 5: Public HTTP adapters

Add sitemap/JSON-LD/static HTML crawling through Crawlee HTTP/Cheerio flows.
Add per-vendor config for selectors and URL patterns.

Phase 6: Authenticated portal adapter

Add Playwright login/session handling.
Store credentials and sessions in Vault.
Run one gated vendor proof of concept.
Capture screenshots and failure artifacts for login drift.

Phase 7: Admin UI

Add internal review dashboard and server actions.
Keep all writes server-only.
Add audit visibility for publish events.

Testing Plan

Unit tests:

Price parser.
Size parser.
price_per_mg calculation.
Peptide alias normalization.
Deterministic match precedence.
Product hash stability.
Publish diff generation.

Fixture tests:

CSV import fixture.
Public HTML product grid fixture.
Product detail page fixture.
JSON-LD product fixture.
Authenticated portal fixture using saved mock HTML, not live credentials.

Integration tests:

Dry-run sync creates one vendor_sync_runs row.
Dry-run sync creates ingested products and variants.
Dry-run sync does not mutate vendors, vendor_products, or vial_options.
Approved publish creates or updates public catalog rows.
Re-running the same source does not create duplicate public products.
Missing products are not deactivated until the consecutive-miss threshold is reached.

Operational tests:

Failed login marks run failed/partial with sanitized error.
Expired session retries login once.
Captcha/2FA gate stops the run without bypass.
Secrets do not appear in logs.
Worker can be run locally and in a container.

Compliance and Safety

Use only authorized access.
Do not bypass technical access controls.
Do not scrape sources that prohibit the intended use.
Add vendor-specific rate limits.
Keep affiliate/commercial relationships auditable.
Preserve source provenance for every published listing.
Do not infer medical claims from scraped product pages.
Legal review is recommended before publishing automated vendor-derived catalog data at scale.

Non-Goals

No AI matching or AI extraction.
No direct public writes from client components.
No browser automation inside Next.js request handlers.
No automatic full publish in v1.
No captcha bypass.
No replacing the current marketplace UX as part of ingestion v1.

References

Supabase Vault: https://supabase.com/docs/guides/database/vault
Bun docs: https://bun.sh/docs
Crawlee docs: https://crawlee.dev/docs/introduction
Playwright browser/Docker docs: https://playwright.dev/docs/docker
Shopify Admin GraphQL API: https://shopify.dev/docs/api/admin-graphql
WooCommerce REST API: https://developer.woocommerce.com/docs/apis/rest-api/
Sitemaps protocol: https://www.sitemaps.org/en_GB/protocol.html
Schema.org Product: https://schema.org/Product

Assumptions

The plan will later be stored at /Users/wells/Documents/GitHub/peptide-portal/docs/vendor-ingestion-architecture.md.
Bun is used for the ingestion worker, while the existing Next app remains pnpm-managed.
Supabase Vault is the credential/session store.
Publishing is review-first in v1.
The first useful implementation target is CSV/manual ingestion, followed by one public HTTP adapter, then one authenticated Playwright adapter.