Vendor Ingestion Architecture
Recommended file: /Users/wells/Documents/GitHub/peptide-portal/docs/vendor-ingestion-architecture.md
Summary
Build a review-first vendor ingestion pipeline that fits the current Next.js + Supabase app without putting scraping work inside the web app. The system will use a scheduled Bun worker for ingestion, Supabase Vault for credentials and browser sessions, and staged Supabase tables before anything reaches public marketplace tables.
The architecture supports slow vendor access lead time by starting with manual CSV/export/public sources, then layering in authenticated portal crawling as credentials arrive.
No AI is used. Product matching, parsing, normalization, and publishing must be deterministic. Ambiguous records go to human review.
Current Architecture Snapshot
Verified through local repo inspection and Supabase MCP on May 12, 2026.
- App stack: Next.js 16.2.4, React 19.2.4, App Router, Supabase Auth/DB, Tailwind 4, shadcn/Radix UI, pnpm root package.
- Live Supabase project:
Peptides/eorheiigbzfqbkjizmox. - Core catalog tables:
vendors: 14 rows.peptides: 58 rows.categories: 9 rows.vendor_products: 36 rows.vial_options: 75 rows.vendor_listings: 75 visible listing rows.
- Existing Finnrick tables exist live:
finnrick_vendor_mapping: 14 rows.finnrick_peptide_mapping: 58 rows.finnrick_vendor_stats: 12 rows.finnrick_product_tests: 102 rows.
- Supabase extensions available:
supabase_vault0.3.1.pg_cron1.6.4.pg_net0.20.0.
- Supabase Vault is installed. MCP confirmed
vault.secretsexists and contains one row, but secret values were not read. - Public RLS policies allow reads for active catalog data, but there are no public write policies for core catalog tables. This is good: ingestion should write with service-role/server-only access.
Existing Catalog Contract
The current app displays marketplace data through vendor_listings, which joins:
vendor_products.vendor_id -> vendors.idvendor_products.peptide_id -> peptides.idvial_options.vendor_product_id -> vendor_products.id
A listing is visible only when:
vendors.is_active = truevendor_products.is_active = truevial_options.is_available = true
Publishing must therefore produce or update:
vendorsonly when onboarding a new vendor.vendor_productsfor product-level data:vendor_id,peptide_id,name,category,purity,salt,product_url,is_active.vial_optionsfor sellable variants:vendor_product_id,size,price,price_per_mg,is_available,sort_order.
Important: app-level peptide routes use slugs, but database relationships require peptides.id UUIDs. The worker must map scraped product names to peptide UUIDs before publishing.
Architecture
The Next app remains the reader/reviewer UI. The Bun worker owns crawling, parsing, normalization, retries, and staged writes.
Source Ladder
Every vendor adapter should try sources in this order:
- Manual seed files: existing CSVs in
/Users/wells/Documents/GitHub/peptide-portal/docs/vendors. - Vendor-provided exports: CSV, XLSX, JSON, XML, affiliate feeds.
- Partner/API access: Shopify Admin API, WooCommerce REST API, custom vendor APIs.
- Public structured crawl: sitemap, product JSON-LD, embedded product JSON, static HTML.
- Authenticated browser crawl: Playwright through Crawlee for login-gated catalogs.
- Managed browser infrastructure later if local browser jobs become flaky or blocked.
Playwright is the fallback for gated, JavaScript-heavy, or session-dependent portals. It should not be the default for sources that can be fetched directly.
Runtime Decision
Use a scheduled Bun worker, not an always-on sidecar.
- Worker path:
/Users/wells/Documents/GitHub/peptide-portal/workers/vendor-sync. - Runtime: Bun.
- Browser framework: Crawlee with Playwright adapters only where required.
- Deployment target: container host or scheduled job runner, not Vercel request handlers.
- Trigger modes:
- Manual CLI for development.
- Scheduled cron on the worker host for production.
- Optional later: Supabase
pg_cron+pg_netcan call a worker webhook, but the database must not run browser jobs.
Do not run authenticated browser automation inside:
- Next.js route handlers.
- Vercel serverless functions.
- Supabase Edge Functions.
Proposed Database Additions
Create new private-by-policy tables in public so Supabase JS service-role access works without extra exposed-schema configuration. Enable RLS on all new tables and add no anon/auth public policies. Access should happen only through service-role worker code and server/admin review paths.
vendor_sync_sources
Stores one configured source per vendor per ingestion method.
Required columns:
id uuid primary key default gen_random_uuid()vendor_id uuid references vendors(id)display_name text not nulladapter_key text not nullsource_type text not null- allowed:
manual_csv,manual_export,public_http,platform_api,authenticated_browser,finnrick,other
- allowed:
base_url textlogin_url textcatalog_url textcredential_secret_name textsession_secret_name textconfig jsonb default '{}'priority integer default 100enabled boolean default falsepublish_mode text default 'review'- allowed:
dry_run,review,auto_safe
- allowed:
last_success_at timestamptzcreated_at timestamptz default now()updated_at timestamptz default now()
Indexes:
- unique
(vendor_id, adapter_key) - index
(enabled, priority) - index
(source_type)
vendor_sync_runs
Tracks every ingestion attempt.
Columns:
id uuid primary key default gen_random_uuid()source_id uuid references vendor_sync_sources(id)vendor_id uuid references vendors(id)status text not null- allowed:
queued,running,succeeded,partial,failed,cancelled
- allowed:
trigger_type text not null- allowed:
manual,schedule,webhook,retry
- allowed:
mode text not null- allowed:
dry_run,review,publish
- allowed:
started_at timestamptz default now()completed_at timestamptzduration_ms integerproducts_seen integer default 0products_extracted integer default 0products_normalized integer default 0variants_seen integer default 0needs_review integer default 0approved integer default 0published integer default 0failed integer default 0error_summary texterror_details jsonbartifact_prefix textcreated_at timestamptz default now()
vendor_ingested_products
Stores raw and normalized product-level records for each run.
Columns:
id uuid primary key default gen_random_uuid()run_id uuid references vendor_sync_runs(id)source_id uuid references vendor_sync_sources(id)vendor_id uuid references vendors(id)external_product_key text not nullsource_url textsource_sku textraw_name text not nullraw_payload jsonb default '{}'raw_payload_hash text not nullraw_html_hash textnormalized_name textproduct_url textpeptide_candidate_name textpeptide_id uuid references peptides(id)peptide_match_status text not null default 'unknown'- allowed:
matched,ambiguous,unknown,ignored,rejected
- allowed:
peptide_match_rule textpeptide_match_notes textcategory textpurity_text textpurity_percent numericsalt textcoa_url textstock_status textnormalized_hash textreview_status text not null default 'pending'- allowed:
pending,approved,rejected,published,superseded
- allowed:
warnings jsonb default '[]'created_at timestamptz default now()
Indexes:
- unique
(run_id, external_product_key) - index
(source_id, external_product_key) - index
(vendor_id, review_status) - index
(peptide_id)
vendor_ingested_variants
Stores vial/variant records under ingested products.
Columns:
id uuid primary key default gen_random_uuid()ingested_product_id uuid references vendor_ingested_products(id)run_id uuid references vendor_sync_runs(id)source_id uuid references vendor_sync_sources(id)vendor_id uuid references vendors(id)external_variant_key text not nulllabel textsize_text text not nullamount_mg numericvial_count integer default 1total_mg numericprice numericcurrency text default 'USD'price_per_mg numericis_available boolean default truesku textraw_payload jsonb default '{}'warnings jsonb default '[]'created_at timestamptz default now()
Indexes:
- unique
(ingested_product_id, external_variant_key) - index
(run_id) - index
(vendor_id)
vendor_peptide_alias_rules
Human-maintained deterministic matching rules. This avoids AI matching.
Columns:
id uuid primary key default gen_random_uuid()source_id uuid references vendor_sync_sources(id)alias_normalized text not nullpeptide_id uuid references peptides(id)status text default 'active'- allowed:
active,disabled
- allowed:
notes textcreated_at timestamptz default now()updated_at timestamptz default now()
Indexes:
- unique
(source_id, alias_normalized) - index
(peptide_id)
Matching precedence:
- Vendor-specific alias rule from
vendor_peptide_alias_rules. - Exact normalized
peptides.name. - Exact normalized
peptides.slug. - Exact normalized values from
peptides.aliases. - Existing Finnrick mapping when source is Finnrick-derived.
- Otherwise
unknownorambiguous, requiring review.
vendor_product_source_links
Maps a vendor source’s stable external product key to the public vendor_products row. This prevents duplicate public products without adding risky uniqueness constraints to existing catalog tables.
Columns:
id uuid primary key default gen_random_uuid()source_id uuid references vendor_sync_sources(id)external_product_key text not nullvendor_product_id uuid references vendor_products(id)first_seen_run_id uuid references vendor_sync_runs(id)last_seen_run_id uuid references vendor_sync_runs(id)link_status text default 'active'- allowed:
active,inactive,needs_review
- allowed:
created_at timestamptz default now()updated_at timestamptz default now()
Indexes:
- unique
(source_id, external_product_key) - index
(vendor_product_id)
vendor_variant_source_links
Maps external variant keys to public vial_options.
Columns:
id uuid primary key default gen_random_uuid()source_id uuid references vendor_sync_sources(id)external_variant_key text not nullvial_option_id uuid references vial_options(id)vendor_product_id uuid references vendor_products(id)first_seen_run_id uuid references vendor_sync_runs(id)last_seen_run_id uuid references vendor_sync_runs(id)link_status text default 'active'created_at timestamptz default now()updated_at timestamptz default now()
Indexes:
- unique
(source_id, external_variant_key) - index
(vial_option_id)
vendor_product_observations
Keeps historical price/availability observations separate from public listing state.
Columns:
id uuid primary key default gen_random_uuid()run_id uuid references vendor_sync_runs(id)vendor_product_id uuid references vendor_products(id)vial_option_id uuid references vial_options(id)ingested_product_id uuid references vendor_ingested_products(id)ingested_variant_id uuid references vendor_ingested_variants(id)price numericprice_per_mg numericis_available booleanobserved_at timestamptz default now()source_url textpayload_hash text
Indexes:
- index
(vendor_product_id, observed_at desc) - index
(vial_option_id, observed_at desc)
vendor_publish_events
Audits every change made to public catalog tables by the ingestion workflow.
Columns:
id uuid primary key default gen_random_uuid()run_id uuid references vendor_sync_runs(id)source_id uuid references vendor_sync_sources(id)vendor_id uuid references vendors(id)ingested_product_id uuid references vendor_ingested_products(id)action text not null- allowed:
create_product,update_product,create_variant,update_variant,mark_unavailable,mark_inactive,skip
- allowed:
target_table text not nulltarget_id uuidbefore_json jsonbafter_json jsonbactor_type text not null- allowed:
worker,admin
- allowed:
actor_user_id uuid references profiles(id)created_at timestamptz default now()
Vault Design
Use Supabase Vault for credentials and browser sessions.
Secret naming convention:
- Credentials:
vendor-sync:{vendor_slug}:{adapter_key}:credentials - Browser session state:
vendor-sync:{vendor_slug}:{adapter_key}:playwright-session - Optional TOTP seed only if explicitly provided and legally allowed:
vendor-sync:{vendor_slug}:{adapter_key}:totp
Credential secret JSON shape:
1{
2 "username": "vendor-login@example.com",
3 "password": "redacted",
4 "notes": "Authorized vendor portal credentials",
5 "otpMode": "none"
6}Playwright session secret JSON shape:
1{
2 "storageState": {},
3 "savedAt": "2026-05-12T00:00:00.000Z",
4 "expiresAt": null,
5 "loginUrl": "https://vendor.example.com/login"
6}Access pattern:
- The worker never stores secrets in normal tables.
vendor_sync_sourcesstores only Vault secret names.- Add service-role-only RPCs:
get_vendor_sync_secret(secret_name text) returns textupsert_vendor_sync_secret(secret_name text, secret_value text, description text) returns uuid
- Revoke execution from
public,anon, andauthenticated. - Grant execution only to
service_role. - Never log secret values, cookies, authorization headers, or full browser storage state.
Worker Design
Suggested folder structure:
1workers/vendor-sync/
2 package.json
3 bun.lock
4 src/
5 index.ts
6 cli.ts
7 config/env.ts
8 db/supabase.ts
9 db/vault.ts
10 db/runs.ts
11 adapters/base.ts
12 adapters/manual-csv.ts
13 adapters/public-http.ts
14 adapters/platform-api.ts
15 adapters/authenticated-browser.ts
16 adapters/vendors/
17 normalize/product.ts
18 normalize/price.ts
19 normalize/peptide-match.ts
20 publish/plan.ts
21 publish/apply.ts
22 artifacts/store.ts
23 logging.ts
24 fixtures/
25 tests/Core commands:
1bun run sync -- --source=<source-id> --mode=dry_run
2bun run sync -- --source=<source-id> --mode=review
3bun run sync:vendor -- --vendor=<vendor-slug> --mode=review
4bun run publish -- --run=<run-id> --approved-only
5bun run audit -- --vendor=<vendor-slug>Environment variables:
SUPABASE_URLSUPABASE_SERVICE_ROLE_KEYVENDOR_SYNC_ARTIFACT_BUCKETVENDOR_SYNC_DEFAULT_TIMEOUT_MSVENDOR_SYNC_CONCURRENCYVENDOR_SYNC_USER_AGENT
Adapter Interface
Every adapter returns deterministic extracted records.
1type VendorAdapterContext = {
2 sourceId: string
3 vendorId: string
4 sourceConfig: Record<string, unknown>
5 baseUrl?: string
6 loginUrl?: string
7 catalogUrl?: string
8 getSecret(name: string): Promise<string | null>
9 saveSecret(name: string, value: string): Promise<void>
10 artifactPrefix: string
11}
12
13type ExtractedVendorProduct = {
14 externalProductKey: string
15 rawName: string
16 sourceUrl?: string
17 productUrl?: string
18 sourceSku?: string
19 category?: string
20 purityText?: string
21 salt?: string
22 coaUrl?: string
23 stockStatus?: string
24 rawPayload: Record<string, unknown>
25 variants: ExtractedVendorVariant[]
26}
27
28type ExtractedVendorVariant = {
29 externalVariantKey: string
30 label?: string
31 sizeText: string
32 amountMg?: number
33 vialCount?: number
34 totalMg?: number
35 price?: number
36 currency?: "USD"
37 isAvailable?: boolean
38 sku?: string
39 rawPayload?: Record<string, unknown>
40}Authenticated Browser Flow
- Load
vendor_sync_sourcesrow. - Fetch credential secret from Vault.
- Fetch existing Playwright
storageStatesecret if available. - Start browser context with saved session.
- Visit catalog URL.
- If login gate appears, run adapter-specific login.
- If captcha, unsupported 2FA, or account approval appears, stop and mark run
partialorfailed. - After successful login, save updated
storageStateback to Vault. - Crawl product list and product detail pages with strict rate limits.
- Save screenshots/HTML hashes on failures.
- Emit extracted records.
Rules:
- Do not bypass captchas.
- Do not circumvent access controls.
- Do not automate around forbidden 2FA flows.
- Use only authorized credentials.
- Respect vendor terms, crawl delays, and rate limits.
- Treat cookies/session state as secrets.
Normalization Rules
Product normalization must be deterministic:
- Normalize whitespace, punctuation, unicode fractions, and case.
- Preserve original vendor names in
raw_name. - Extract size patterns like
5mg,10 mg,2 x 5mg,10mg vial,30 mg kit. - Compute:
amount_mgvial_counttotal_mgprice_per_mg = price / total_mg
- Parse purity only from explicit values like
99%,>99%,99.5%. - Preserve uncertain purity as
purity_textand warn instead of inventing values. - Map salts only from explicit terms such as
acetate,tfa,hcl,base. - Normalize availability into
in_stock,out_of_stock,backorder,unknown. - Unknown peptide matches stay staged and never publish.
Review-First Publishing
V1 publishing policy: review required before public listing changes.
Review queue shows:
- Vendor/source/run.
- Raw vendor product name.
- Normalized name.
- Candidate peptide match.
- Parsed variants and prices.
- COA/product URLs.
- Warnings.
- Diff against existing public listing, if linked.
Reviewer actions:
- Approve product.
- Reject product.
- Assign peptide.
- Add vendor-specific alias rule.
- Link to existing
vendor_products. - Create new public product on publish.
- Approve specific variants.
- Reject specific variants.
Publishing behavior:
- Approved product creates or updates
vendor_products. - Approved variants create or update
vial_options. - Missing products are not immediately deactivated.
- A product or variant should only be marked inactive/unavailable after a configurable number of consecutive misses, default
3. - Every public-table mutation writes a
vendor_publish_eventsrow.
Admin UI Integration
Add a small internal admin surface after schema and worker basics exist.
Recommended route:
/admin/vendor-ingestion
Access control:
- Server-side only.
- Supabase-authenticated user required.
- Admin allowlist via
INGESTION_ADMIN_EMAILSenv var for v1. - Later replace with an
admin_usersor role table if needed.
Views:
- Sync source list.
- Run history.
- Run detail.
- Product review queue.
- Product diff preview.
- Publish event audit log.
Server actions:
approveIngestedProductrejectIngestedProductassignPeptideMatchcreateAliasRulelinkExistingVendorProductpublishApprovedRun
The client must not receive Vault secrets or service-role credentials.
Rollout Plan
Phase 1: Schema and type alignment
- Add Supabase migrations for ingestion tables, indexes, RLS, and service-role-only Vault RPCs.
- Regenerate
/Users/wells/Documents/GitHub/peptide-portal/lib/supabase/database.types.ts. - Confirm live Finnrick tables are represented locally so generated types match MCP reality.
Phase 2: Worker skeleton
- Create
workers/vendor-sync. - Add Bun scripts and TypeScript config.
- Add Supabase service-role client.
- Add run creation/update helpers.
- Add deterministic hashing and artifact naming.
- Add dry-run mode.
Phase 3: Manual/CSV MVP
- Build
manual-csvadapter for current vendor CSVs. - Stage rows into ingestion tables.
- Prove normalization and review flow without browser complexity.
Phase 4: Publish path
- Build publish planner that produces diffs without applying them.
- Build approved-only publisher.
- Verify
vendor_listingsreflects published rows only after approval.
Phase 5: Public HTTP adapters
- Add sitemap/JSON-LD/static HTML crawling through Crawlee HTTP/Cheerio flows.
- Add per-vendor config for selectors and URL patterns.
Phase 6: Authenticated portal adapter
- Add Playwright login/session handling.
- Store credentials and sessions in Vault.
- Run one gated vendor proof of concept.
- Capture screenshots and failure artifacts for login drift.
Phase 7: Admin UI
- Add internal review dashboard and server actions.
- Keep all writes server-only.
- Add audit visibility for publish events.
Testing Plan
Unit tests:
- Price parser.
- Size parser.
price_per_mgcalculation.- Peptide alias normalization.
- Deterministic match precedence.
- Product hash stability.
- Publish diff generation.
Fixture tests:
- CSV import fixture.
- Public HTML product grid fixture.
- Product detail page fixture.
- JSON-LD product fixture.
- Authenticated portal fixture using saved mock HTML, not live credentials.
Integration tests:
- Dry-run sync creates one
vendor_sync_runsrow. - Dry-run sync creates ingested products and variants.
- Dry-run sync does not mutate
vendors,vendor_products, orvial_options. - Approved publish creates or updates public catalog rows.
- Re-running the same source does not create duplicate public products.
- Missing products are not deactivated until the consecutive-miss threshold is reached.
Operational tests:
- Failed login marks run failed/partial with sanitized error.
- Expired session retries login once.
- Captcha/2FA gate stops the run without bypass.
- Secrets do not appear in logs.
- Worker can be run locally and in a container.
Compliance and Safety
- Use only authorized access.
- Do not bypass technical access controls.
- Do not scrape sources that prohibit the intended use.
- Add vendor-specific rate limits.
- Keep affiliate/commercial relationships auditable.
- Preserve source provenance for every published listing.
- Do not infer medical claims from scraped product pages.
- Legal review is recommended before publishing automated vendor-derived catalog data at scale.
Non-Goals
- No AI matching or AI extraction.
- No direct public writes from client components.
- No browser automation inside Next.js request handlers.
- No automatic full publish in v1.
- No captcha bypass.
- No replacing the current marketplace UX as part of ingestion v1.
References
- Supabase Vault: https://supabase.com/docs/guides/database/vault
- Bun docs: https://bun.sh/docs
- Crawlee docs: https://crawlee.dev/docs/introduction
- Playwright browser/Docker docs: https://playwright.dev/docs/docker
- Shopify Admin GraphQL API: https://shopify.dev/docs/api/admin-graphql
- WooCommerce REST API: https://developer.woocommerce.com/docs/apis/rest-api/
- Sitemaps protocol: https://www.sitemaps.org/en_GB/protocol.html
- Schema.org Product: https://schema.org/Product
Assumptions
- The plan will later be stored at
/Users/wells/Documents/GitHub/peptide-portal/docs/vendor-ingestion-architecture.md. - Bun is used for the ingestion worker, while the existing Next app remains pnpm-managed.
- Supabase Vault is the credential/session store.
- Publishing is review-first in v1.
- The first useful implementation target is CSV/manual ingestion, followed by one public HTTP adapter, then one authenticated Playwright adapter.