Onboarding Text Verification Stabilization

Summary

  • The current onboarding flow is a code-claim flow, not a standard OTP flow: POST /api/phone/verify/initiate creates a CONSUL-XXXXXX code in phone_verification_codes with a 15-minute TTL, the app opens a text deeplink, inbound messaging calls POST /api/phone/verify/claim, and a successful claim sets profiles.phone_number, profiles.phone_verified = true, and advances onboarding.
  • Production impact is real and measurable: public.profiles currently has 22 users with onboarding_completed = false or phone_verified = false, 7 users in awaiting_phone_verification, and 6 expired unconsumed onboarding codes. Chloe is in that cohort with phone_verified = false, onboarding_state.status = awaiting_phone_verification, no phone number, and an expired code.
  • Confirmed / strongly supported failure points:
    • The onboarding API uses an env-backed number, but other UI surfaces still hard-code a different phone number, so env fixes can still leave stale user-visible numbers.
    • Expired and invalid code claims are collapsed into the same generic error.
    • Unknown or malformed inbound verification texts do not get an instructional reply and can fall through prospect routing.
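    • The current sms:${number}&body=${code} deeplink is fragile and should not be relied on for body prefill: the &body= form is not the ?body= syntax defined in RFC 5724, and messaging apps handle body parameters inconsistently.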
  • Execute in this order: config hardening, verification lifecycle fixes, inbound fallback handling, stuck-user recovery, then SMS OTP evaluation.

Implementation Changes

1. Canonicalize the Consul phone number

  • Add a single shared server-side helper for the verification number and use it everywhere user-facing.
  • Make CONSUL_VERIFY_NUMBER the canonical env var for onboarding. Keep PHOTON_NUMBER and CONSUL_IMESSAGE_NUMBER as temporary aliases for one release, but log an error if multiple values are set and differ.
  • Remove hard-coded number usage from authenticated UI surfaces and route all displayed/copied numbers through the canonical helper or server-provided data.
  • During execution, verify the live Vercel project generative-inc/consul-agent has the same number in production, preview, and development. Treat the local .vercel/project.json as stale for this audit.
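
The alias handling described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual helper: the function name resolveVerifyNumber, the EnvSource type, and the normalization rule are assumptions. Only the three env var names come from the plan.

```typescript
// Hypothetical helper: names and normalization rule are illustrative assumptions.
type EnvSource = Record<string, string | undefined>;

// Canonical variable first, then the temporary aliases named in the plan.
const NUMBER_VARS = ["CONSUL_VERIFY_NUMBER", "PHOTON_NUMBER", "CONSUL_IMESSAGE_NUMBER"];

// Strip formatting so "+1 (555) 010-0000" and "+15550100000" compare equal.
function normalizeNumber(raw: string): string {
  const digits = raw.replace(/\D/g, "");
  return raw.trim().startsWith("+") ? `+${digits}` : digits;
}

function resolveVerifyNumber(
  env: EnvSource,
  logError: (msg: string) => void = console.error,
): string {
  const found: { name: string; value: string }[] = [];
  for (const name of NUMBER_VARS) {
    const value = env[name]?.trim();
    if (value) found.push({ name, value });
  }
  if (found.length === 0) {
    throw new Error("No verification number configured (expected CONSUL_VERIFY_NUMBER)");
  }
  // Log, but do not fail, when the set variables disagree.
  const distinct = new Set(found.map((entry) => normalizeNumber(entry.value)));
  if (distinct.size > 1) {
    const names = found.map((entry) => entry.name).join(", ");
    logError(`Verification number mismatch across env vars: ${names}`);
  }
  // The first entry wins; CONSUL_VERIFY_NUMBER is first whenever it is set.
  return normalizeNumber(found[0].value);
}
```

On the server this would be called as resolveVerifyNumber(process.env). Taking the env as a parameter keeps the helper testable without touching real configuration.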

2. Fix the onboarding verification lifecycle

  • Keep the current 15-minute TTL for the hotfix.
  • Update POST /api/phone/verify/initiate so that every onboarding restart:
    • expires any prior unconsumed onboarding codes for the current user,
    • creates one fresh code,
    • sets onboarding state to step = 2, status = awaiting_phone_verification,
    • returns { code, consulNumber, deepLink, expiresAt }.
  • Change the deeplink strategy to recipient-only launch. Do not depend on body prefill. The UI should always display the exact CONSUL-XXXXXX code and the destination number with explicit copy/manual-send instructions.
  • Update GET /api/phone/verify/status to return a structured state for the current user: idle, pending, expired, or verified, plus phoneNumber and expiresAt when relevant.
  • Update POST /api/phone/verify/claim to return a structured result instead of only "Invalid or expired verification code". Use at least the failure reasons expired, invalid, already_consumed, and phone_already_bound, plus verified for a successful claim.
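
One way to picture the structured status and claim results is as pure functions over the latest code row. The sketch below is illustrative only: the CodeRow shape, the function names, and the order in which failure reasons are checked are assumptions, and verified is modeled as the success outcome, not as a failure reason.

```typescript
// Hypothetical row shape; field names here are assumptions for illustration.
interface CodeRow {
  code: string;
  expiresAt: Date;
  consumedAt: Date | null;
}

type VerifyStatus =
  | { state: "idle" }
  | { state: "pending"; expiresAt: string }
  | { state: "expired"; expiresAt: string }
  | { state: "verified"; phoneNumber: string };

// Derive the GET /status payload from the profile and the user's latest code.
function deriveStatus(
  profile: { phoneVerified: boolean; phoneNumber: string | null },
  latestCode: CodeRow | null,
  now: Date,
): VerifyStatus {
  if (profile.phoneVerified && profile.phoneNumber) {
    return { state: "verified", phoneNumber: profile.phoneNumber };
  }
  // No code yet, or the latest code was already consumed: nothing is in flight.
  if (!latestCode || latestCode.consumedAt) return { state: "idle" };
  const expiresAt = latestCode.expiresAt.toISOString();
  return latestCode.expiresAt.getTime() <= now.getTime()
    ? { state: "expired", expiresAt }
    : { state: "pending", expiresAt };
}

type ClaimReason = "expired" | "invalid" | "already_consumed" | "phone_already_bound";
type ClaimResult = { ok: true; status: "verified" } | { ok: false; reason: ClaimReason };

// Decide the POST /claim outcome; the check order is an assumption.
function evaluateClaim(
  row: CodeRow | null,
  phoneBoundToOtherUser: boolean,
  now: Date,
): ClaimResult {
  if (!row) return { ok: false, reason: "invalid" };
  if (row.consumedAt) return { ok: false, reason: "already_consumed" };
  if (row.expiresAt.getTime() <= now.getTime()) return { ok: false, reason: "expired" };
  if (phoneBoundToOtherUser) return { ok: false, reason: "phone_already_bound" };
  return { ok: true, status: "verified" };
}
```

Keeping these decisions free of database calls makes each reason straightforward to cover in the API tests.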

3. Add user-facing fallback handling

  • In the onboarding step, replace the current generic timeout UX with explicit expired-state handling. When polling sees expired, stop polling, show “Code expired after 15 minutes,” and expose a Get new code action that calls initiate again.
  • Keep step 2 as the phone gate. On successful verification, continue advancing to step 3 exactly as today.
  • Update product copy to match the actual format. The app should tell users to send the exact CONSUL-XXXXXX code, not a bare 6-digit code.
  • Improve the step 2 fallback UI so it works even if the deeplink does nothing: visible number, visible code, copy actions, retry/regenerate action, and no reliance on the messaging app opening successfully.
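
The step 2 behavior above can be expressed as a small mapping from polled status to what the screen should do. This is a sketch under assumptions: the Step2View shape, both function names, and all copy other than "Code expired after 15 minutes" are invented for illustration.

```typescript
type PollState = "idle" | "pending" | "expired" | "verified";

// Hypothetical view model for the step 2 screen.
interface Step2View {
  keepPolling: boolean;
  showRegenerate: boolean;
  advanceToStep3: boolean;
  message: string;
}

// Recipient-only launch: no body parameter, so nothing depends on prefill.
function recipientOnlyDeepLink(consulNumber: string): string {
  return `sms:${consulNumber}`;
}

function viewForStatus(state: PollState): Step2View {
  switch (state) {
    case "verified":
      return { keepPolling: false, showRegenerate: false, advanceToStep3: true, message: "Phone verified." };
    case "expired":
      // Stop polling and offer a fresh code instead of a generic timeout.
      return { keepPolling: false, showRegenerate: true, advanceToStep3: false, message: "Code expired after 15 minutes" };
    case "pending":
      return { keepPolling: true, showRegenerate: true, advanceToStep3: false, message: "Send the exact code shown here to the number shown here." };
    case "idle":
      return { keepPolling: false, showRegenerate: true, advanceToStep3: false, message: "Get a code to start verification." };
  }
}
```

The number and code stay visible with copy actions in every non-verified state, so the flow still works if the deeplink does nothing.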

4. Handle malformed and stale inbound texts

  • In the messaging resolver, keep the claim attempt as the first step for code-like messages from unknown numbers.
  • If claim fails with expired, send an instructional reply instead of falling through: tell the user to return to the app and generate a new code.
  • If the message looks like a verification attempt but is malformed or invalid, send an instructional reply telling the user to send the exact CONSUL-XXXXXX code from the app.
  • Preserve existing prospect routing only for non-onboarding unknown-number conversations. Do not send prospect replies for short code-like or verification-like messages.
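
The routing rules above might look like the following sketch. Several things here are assumptions, not facts about the codebase: that the XXXXXX part is six uppercase alphanumerics, that matching may ignore case, the heuristic for "verification-like" text, the reply wording, and the handling of failure reasons other than expired and invalid. The claim call is injected so the routing logic can be tested without a backend.

```typescript
type ClaimOutcome = "verified" | "expired" | "invalid" | "already_consumed" | "phone_already_bound";

type Action =
  | { type: "verified" }
  | { type: "reply"; text: string }
  | { type: "prospect_routing" };

type Parsed =
  | { kind: "exact_code"; code: string }
  | { kind: "verification_like" }
  | { kind: "other" };

// Assumed code alphabet: six uppercase letters or digits after the prefix.
const EXACT_CODE = /^CONSUL-[A-Z0-9]{6}$/;

const REPLY_EXPIRED = "That code has expired. Return to the app and generate a new code.";
const REPLY_SEND_EXACT = "Please send the exact CONSUL-XXXXXX code shown in the app.";
// Assumed fallback copy for reasons the plan does not give a reply for.
const REPLY_OTHER = "We could not verify that code. Return to the app to check your status.";

function classifyInbound(body: string): Parsed {
  const text = body.trim();
  const upper = text.toUpperCase();
  if (EXACT_CODE.test(upper)) return { kind: "exact_code", code: upper };
  // Deliberately broad heuristic: short text that starts with the prefix,
  // or a short bare token that contains a digit.
  const short = text.length <= 24;
  const startsWithPrefix = /^consul\b/i.test(text);
  const bareToken = /^[A-Z0-9][A-Z0-9\s-]{3,11}$/i.test(text) && /\d/.test(text);
  if (short && (startsWithPrefix || bareToken)) return { kind: "verification_like" };
  return { kind: "other" };
}

function routeUnknownSender(body: string, claim: (code: string) => ClaimOutcome): Action {
  const parsed = classifyInbound(body);
  if (parsed.kind === "other") return { type: "prospect_routing" };
  if (parsed.kind === "verification_like") return { type: "reply", text: REPLY_SEND_EXACT };
  switch (claim(parsed.code)) {
    case "verified":
      return { type: "verified" }; // short-circuits prospect routing
    case "expired":
      return { type: "reply", text: REPLY_EXPIRED };
    case "invalid":
      return { type: "reply", text: REPLY_SEND_EXACT };
    default:
      return { type: "reply", text: REPLY_OTHER };
  }
}
```

Only messages classified as other reach prospect routing, which matches the rule that code-like texts never receive prospect replies.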

5. Recover the currently stuck users

  • Use public.profiles as the source of truth, not an app-level users table.
  • Add an admin recovery script or runbook that can:
    • list stuck users with email, name, onboarding state, phone verification state, and latest code status,
    • expire any unconsumed onboarding codes for selected users,
    • reset selected users to step = 2, status = awaiting_phone_verification,
    • leave phone_verified = false,
    • leave any existing phone_number untouched for this first recovery pass.
  • Run that recovery path for the 7 users currently in awaiting_phone_verification, including Chloe, after the hotfix is deployed.
  • Treat the separate issue of unverified numbers being written outside the verification flow as follow-up hardening, not the blocker for the immediate fix.
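
The recovery steps above can be planned as pure functions before any database write happens. The sketch below assumes in-memory row shapes (Profile, VerificationCode) and a "stuck" filter of unverified users in awaiting_phone_verification. The real script would read public.profiles and phone_verification_codes, and its column names may differ.

```typescript
// Hypothetical row shapes for illustration only.
interface Profile {
  id: string;
  email: string;
  name: string;
  phoneVerified: boolean;
  phoneNumber: string | null;
  onboardingState: { step: number; status: string };
}

interface VerificationCode {
  id: string;
  userId: string;
  createdAt: Date;
  expiresAt: Date;
  consumedAt: Date | null;
}

type CodeStatus = "none" | "active" | "expired" | "consumed";

function latestCodeStatus(codes: VerificationCode[], userId: string, now: Date): CodeStatus {
  const latest = codes
    .filter((c) => c.userId === userId)
    .sort((a, b) => b.createdAt.getTime() - a.createdAt.getTime())[0];
  if (!latest) return "none";
  if (latest.consumedAt) return "consumed";
  return latest.expiresAt.getTime() <= now.getTime() ? "expired" : "active";
}

// Report rows for the admin listing.
function listStuckUsers(profiles: Profile[], codes: VerificationCode[], now: Date) {
  return profiles
    .filter((p) => !p.phoneVerified && p.onboardingState.status === "awaiting_phone_verification")
    .map((p) => ({
      email: p.email,
      name: p.name,
      onboardingState: p.onboardingState,
      phoneVerified: p.phoneVerified,
      latestCode: latestCodeStatus(codes, p.id, now),
    }));
}

// The patch resets onboarding state only; phone_verified and phone_number
// are intentionally absent so they stay untouched in this first pass.
function planRecovery(profile: Profile, codes: VerificationCode[]) {
  return {
    expireCodeIds: codes.filter((c) => c.userId === profile.id && !c.consumedAt).map((c) => c.id),
    profilePatch: { onboardingState: { step: 2, status: "awaiting_phone_verification" } },
  };
}
```

Printing the plan for each selected user before applying it gives the runbook a dry-run step.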

6. Evaluate SMS OTP as the long-term replacement

  • Mark this as a product decision, not part of the hotfix branch.
  • Produce a short implementation proposal for a standard SMS OTP flow:
    • user enters phone number,
    • backend sends OTP through a provider,
    • user enters OTP in-app,
    • backend verifies and sets profiles.phone_verified = true.
  • Compare tradeoffs explicitly: per-SMS cost and provider dependency versus much higher reliability and no dependence on deeplinks or messaging-app behavior.
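
To ground the proposal, the backend half of a standard OTP flow could be sketched as below, using Node's built-in crypto module. Everything specific here is an assumption for discussion: the 6-digit length, the 5-minute TTL, the attempt limit, the OtpRecord shape, and the function names. Sending the code through an SMS provider is left out because no provider has been chosen.

```typescript
import { createHash, randomInt, timingSafeEqual } from "node:crypto";

// Hypothetical stored record; the plaintext code is never persisted.
interface OtpRecord {
  phoneNumber: string;
  codeHash: Buffer;
  expiresAt: number; // epoch milliseconds
  attemptsLeft: number;
}

type OtpResult = "verified" | "invalid" | "expired" | "locked";

function hashCode(code: string): Buffer {
  return createHash("sha256").update(code).digest();
}

// Issue a 6-digit code. TTL and attempt limit are placeholder defaults.
function issueOtp(
  phoneNumber: string,
  now: number,
  ttlMs: number = 5 * 60_000,
  maxAttempts: number = 5,
): { record: OtpRecord; code: string } {
  const code = randomInt(0, 1_000_000).toString().padStart(6, "0");
  const record = { phoneNumber, codeHash: hashCode(code), expiresAt: now + ttlMs, attemptsLeft: maxAttempts };
  return { record, code }; // the code goes to the SMS provider, the record to storage
}

// Check a submitted code. Mutates attemptsLeft so repeated guesses lock out.
function verifyOtp(record: OtpRecord, submitted: string, now: number): OtpResult {
  if (now >= record.expiresAt) return "expired";
  if (record.attemptsLeft <= 0) return "locked";
  record.attemptsLeft -= 1;
  // Both hashes are 32 bytes, so timingSafeEqual never sees a length mismatch.
  if (!timingSafeEqual(record.codeHash, hashCode(submitted))) return "invalid";
  record.attemptsLeft = 0; // single use
  return "verified"; // caller would then set profiles.phone_verified = true
}
```

Compared with the code-claim flow, nothing here depends on a deeplink or on the messaging app, which is the reliability side of the tradeoff. The cost side is the per-message provider charge that this sketch does not model.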

Test Plan

  • API tests:
    • onboarding initiate expires prior active codes and returns a fresh expiresAt,
    • status transitions idle -> pending -> verified and pending -> expired,
    • claim returns each structured failure reason and updates profiles/onboarding on success.
  • Messaging tests:
    • successful claim from an unknown phone verifies and short-circuits routing,
    • expired and malformed verification texts get instructional replies,
    • ordinary unknown-number messages still follow prospect behavior.
  • UI tests:
    • step 2 shows code and number after initiation,
    • expired status surfaces a regenerate action,
    • successful verification advances onboarding from step 2 to step 3.
  • Live smoke tests:
    • one preview and one production end-to-end run with a real device and the real Consul number,
    • confirm inbound processing, phone_verification_codes.consumed_at, and profiles.phone_verified = true,
    • run the recovery script on one known-stuck user and verify the regenerated code path works.

Assumptions and Defaults

  • Keep the current 15-minute onboarding TTL for now.
  • Keep the current CONSUL-XXXXXX code format for the hotfix.
  • Use the live Vercel project generative-inc/consul-agent as the environment source of truth; the repo-local Vercel link is not the live project for this flow.
  • Direct env confirmation in Vercel still requires authenticated access during execution.
  • [Product decision] Standard SMS OTP is a follow-up proposal, not part of the immediate unblock.