Skip to content

fix(harvest): retry transient network errors + secure deploy plan#3

Open
yashksaini-coder wants to merge 6 commits into
masterfrom
fix/harvest-retry
Open

fix(harvest): retry transient network errors + secure deploy plan#3
yashksaini-coder wants to merge 6 commits into
masterfrom
fix/harvest-retry

Conversation

@yashksaini-coder

Copy link
Copy Markdown
Owner

Summary

Two related changes:

  1. CI fix — wrap the top-level GitHub GraphQL harvest in p-retry so a transient ERR_STREAM_PREMATURE_CLOSE (or any network-class blip) doesn't kill the weekly workflow. This was the root cause of run 28319315389 failing on 2026-06-28.

  2. Plan docdocs/plans/secure-public-deploy.md covers the 6-phase migration to a Vercel-read + Render-write split with GitHub OAuth replacing the scrypt PIN. No code from the plan is in this PR — that lands in follow-up phased PRs.

Why the retry is scoped to the top-level call

enrichWithCommitDetails already routes through githubLimiter (which has its own pRetry) and is wrapped in best-effort try/catch — per-commit failures are non-fatal. But the top-level client.request had zero retry, so one TCP hiccup = whole pipeline dead. This PR closes that gap and leaves the rest of the call graph alone.

What the classifier retries vs. aborts

  • Retries: ECONNRESET, ETIMEDOUT, ENOTFOUND, EAI_AGAIN, ERR_STREAM_PREMATURE_CLOSE, UND_ERR_SOCKET, HTTP 5xx
  • Aborts via AbortError: HTTP 4xx (bad token, missing user), unknown codes, generic Error

Avoids burning the workflow's 10-minute budget retrying a hard auth failure.

Confirmed decisions for the secure-deploy plan (locked this session)

Decision Choice
Write host Keep Render, add Vercel read-only (pipeline can't move to serverless)
Owner auth GitHub OAuth + OWNER_GITHUB_ID check (no password to crack)
Public domain Free *.vercel.app URL for now
API token Keep separate DEVNOTION_API_TOKEN Bearer for scripts/CI
Vercel framework Next.js + ISR (revalidate on push to master)

Test plan

  • pnpm test — 78/78 pass (11 new in github-retry.test.ts)
  • pnpm lint — 0 errors (5 pre-existing any warnings, unrelated)
  • After merge: gh run rerun 28319315389 to verify the retry actually catches the flake on replay
  • After merge: watch the next scheduled Sunday dispatch to confirm no regressions

Follow-ups (separate PRs, in order per the plan doc)

  1. Phase 1 — Route reshape to /admin/* (mechanical, ~1h)
  2. Phase 2 — GitHub OAuth + OWNER_GITHUB_ID check (~2-3h)
  3. Phase 3 — HMAC-signed session cookie (~30min, deferrable)
  4. Phase 4 — CSRF + Origin check on mutations (~1h)
  5. Phase 5 — Next.js Vercel project (~2-4h)
  6. Phase 6 — Cutover + smoke test (~30min)

The top-level fetchWeeklyContributions had zero retry, so one TCP hiccup
during the response gzip-decode (ERR_STREAM_PREMATURE_CLOSE) killed the
whole weekly workflow — that's what knocked out run 28319315389.

Wrap client.request in p-retry (already a dep). Retry network-class codes
(ECONNRESET, ETIMEDOUT, ENOTFOUND, EAI_AGAIN, ERR_STREAM_PREMATURE_CLOSE,
UND_ERR_SOCKET) and HTTP 5xx. Abort via AbortError on 4xx so a bad token
doesn't burn the workflow's 10-min budget on three pointless retries.

3 attempts, exponential backoff (1s → 10s). Per-commit enrichment path
already retries via githubLimiter; this just protects the top.
6-phase plan to ship a public read-only blog viewer on Vercel while
keeping the long-running pipeline + owner-only admin on Render. Replaces
the scrypt PIN with GitHub OAuth + OWNER_GITHUB_ID check — no credential
exists to brute-force; GitHub's 2FA/passkey stack handles the auth.

Pipeline can't move to Vercel (60s function budget vs. ~30-90s narration
+ harvest + publish), so the split keeps the right host for each job.
@bolt-new-by-stackblitz

Copy link
Copy Markdown

Review PR in StackBlitz Codeflow Run & review this pull request in StackBlitz Codeflow.

Mirrors the Node 22 + pnpm setup used by weekly-dispatch.yml. Runs:
- pnpm lint
- pnpm exec tsc --noEmit  (catches type errors in non-test code)
- pnpm test

concurrency cancels in-flight runs on a rapid push so we don't queue
stale jobs. Read-only permissions (no need to write anything from CI).

Closes the gap where PRs got zero automated feedback — every PR from
here on gets a green/red signal before merge.
Per docs/plans/secure-public-deploy.md Phase 1: move every authenticated
route under /admin/* so the public/private boundary is purely a URL
prefix. Sets up Phase 2 (GitHub OAuth) to swap auth at one place without
touching route definitions.

Path changes:
- /runs    → /admin/runs
- /run     → /admin/run
- /preview → /admin/preview
- /login   → /admin/login
- /logout  → /admin/logout

Kept public: /, /health, /generated/*

Adds a regex-based 301 redirect middleware for the legacy paths (~1 week
grace period — remove after 2026-07-06). GET/HEAD only; POST callers
(Bearer scripts) need to update to the new paths explicitly.

Updates every internal link (forms, redirects, header nav, tokenGate),
safeInternalPath default, auth-middleware login redirect, and the
existing tests that reference the old paths. No auth-scheme changes
in this phase — scrypt + session cookie still gate /admin/*.

Verified: pnpm test (78/78), pnpm exec tsc --noEmit clean, pnpm lint
(0 errors), manual smoke confirmed legacy 301s and new /admin/* routes
respond correctly with auth disabled.
CI on PR #3 failed because validateEnv() runs at module import time and
calls process.exit(1) when required env vars are missing. CI has no
.env.local, so every test file that transitively imports config/env.ts
crashed at the import phase — vitest reported these as "(0 tests)"
rather than failures, which made the symptom easy to miss locally.

Same pattern bit narrator.agent.ts: it constructs a Google model at
module load via createGoogleModel → nextKey, which throws on empty
GOOGLE_API_KEYS.

Fix: when process.env.VITEST is set (auto by the runner), short-circuit
validation with stub values for the five required fields. Real env vars
from process.env override the stubs via spread order, so .env.local
values still take effect locally. The strict production path is
unchanged — failed validation still exits.

Verified: pnpm test passes 78/78 both with AND without .env.local
(simulating the CI runner). tsc clean, lint clean.
Adds passwordless owner authentication via GitHub OAuth + an owner-id
gate. Strictly additive — scrypt password stays as a fallback so the
existing Render deploy keeps working until OAuth env vars are wired up.

Flow:
  GET  /admin/auth/github           → mint random state, set state cookie,
                                       redirect to github.com authorize
  GET  /admin/auth/github/callback  → verify state (constant-time), exchange
                                       code for token, fetch /user, check
                                       id === OWNER_GITHUB_ID, mint session

New env vars (all four required together, or none):
  GITHUB_OAUTH_CLIENT_ID
  GITHUB_OAUTH_CLIENT_SECRET
  OWNER_GITHUB_ID         — numeric (gh api /user --jq .id)
  OAUTH_CALLBACK_URL      — must match the OAuth app's callback URL exactly

Cross-field check in validateEnv() refuses partial config. .env.example
and render.yaml updated with the new section. Login page now shows a
"Sign in with GitHub" button when OAuth is on; both schemes mint the
same in-memory session cookie via createSession().

Security:
- read:user scope only (just enough to fetch the user id)
- state cookie HttpOnly + Secure + SameSite=Lax + Path=/admin/auth + 5min TTL
- constant-time state compare via safeEqual()
- non-owner user id → 403 (credentials valid, just not authorized)
- no PKCE — server-side confidential client with a real client_secret
- token exchange + /user responses parsed via zod at the boundary
- token-exchange internals never leak to the browser

Tests: 9 new (87 total). Verified end-to-end via smoke:
- with no OAuth env: /admin/auth/* → 404, login page falls back to password
- with OAuth env: login shows GitHub button, /admin/auth/github → 302 to
  github.com, callback with bad state → 400.

Phase 3 (HMAC session cookie) and Phase 4 (CSRF) follow.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant