Case study · Solo build · Agentic AI · PM tooling · Live at spec-check-nine.vercel.app

Spec‑Check

An AI review system that stress-tests a PRD across five specialist agents — before a line of code gets written.

[Spec-Check product shot]
Role: Solo PM & builder
Stack: Next.js · Claude · OpenAI
Agents: 5 specialists, 1 orchestrator
Status: Live beta

01 / Why I built this

Every bad sprint I've shipped started with a spec I thought was done.

After seven years at Amazon writing PRDs for Trust & Safety, I got tired of the same pattern: engineering reads the spec, nods, builds it, and halfway through discovers a missing metric, a mis-scoped edge case, or a stakeholder we forgot to loop in. Costly. Avoidable. Structural.

The best PM tradition for catching this is the "red-team review" — pull in a peer who'll try to break your spec before engineering does. The problem: peers are expensive, slow, and often too polite.

Agents are neither polite nor tired. They're the ideal red-teamer for a pre-flight spec check.

02 / The system

Five specialists, one orchestrator, one readiness score.

Spec-Check runs your PRD through five parallel agents, each with a single sharp mandate. The orchestrator collects their findings, deduplicates, ranks severity, and returns a unified Readiness Score from 0–100 along with actionable gaps.
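For concreteness, here's a minimal TypeScript sketch of that fan-out and aggregate loop. The agent identifiers, the dedupe key, and the subtract-from-100 scoring are illustrative stand-ins, not the production logic:

```ts
type Finding = { axis: string; severity: number; summary: string };

// The five single-mandate specialists.
const AGENTS = [
  "technical-feasibility",
  "stakeholder",
  "edge-cases",
  "data-privacy",
  "metric-clarity",
] as const;
type Agent = (typeof AGENTS)[number];

// Stand-in for the real model call; each specialist returns findings
// tagged with its own axis.
async function runAgent(agent: Agent, prd: string): Promise<Finding[]> {
  return []; // prompt + model call elided in this sketch
}

// Merge step: drop near-duplicates reported by more than one specialist.
function dedupe(findings: Finding[]): Finding[] {
  const seen = new Set<string>();
  return findings.filter((f) => {
    const key = `${f.axis}:${f.summary.trim().toLowerCase()}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

export async function reviewPrd(prd: string) {
  // Fan out: all five specialists read the same PRD in parallel.
  const perAgent = await Promise.all(AGENTS.map((a) => runAgent(a, prd)));

  // Aggregate: merge, rank by severity, roll up into one 0-100 number.
  const findings = dedupe(perAgent.flat()).sort((a, b) => b.severity - a.severity);
  const penalty = findings.reduce((sum, f) => sum + f.severity, 0);
  const readiness = Math.max(0, 100 - penalty); // illustrative scoring only

  return { readiness, findings };
}
```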

03 / A quick demo

Here's the core output pattern — radar across the five axes, a composite Readiness Score, and the top issues ranked by severity.

[Radar chart: Technical Feasibility, Stakeholder, Edge Cases, Data Privacy, and Metric Clarity, with a composite Readiness of 62]
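For reference, a hypothetical TypeScript shape for one review result: the five axis scores behind the radar, the composite number, and the ranked issues beneath it. The field names are mine; only the axes, the 0-100 scale, and the four band names come from Spec-Check itself:

```ts
type Axis =
  | "technical-feasibility"
  | "stakeholder"
  | "edge-cases"
  | "data-privacy"
  | "metric-clarity";

// One typed finding, using the four bands findings actually ship in.
interface Issue {
  band: "Mental Model" | "Metric Gap" | "Stakeholder Alert" | "Edge Case";
  severity: number; // drives the ordering of the list
  summary: string;
  fix: string; // the actionable gap, phrased as a next step
}

// What the demo renders: radar input, composite score, ranked issues.
interface ReviewResult {
  axes: Record<Axis, number>; // 0-100 per spoke of the radar
  readiness: number; // composite 0-100; 62 in the scenario above
  issues: Issue[]; // sorted most severe first
}
```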

04 / Design decisions

Specialist > generalist. I initially tried a single agent with a long rubric. It produced mush — every axis scored 60-ish, no sharp findings. Splitting into five single-mandate agents dropped variance and surfaced findings sharp enough to act on. The orchestrator became thinner, not thicker.
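As a sketch of what a single mandate looks like in practice, here's one specialist written against the Anthropic SDK. The prompt wording, JSON contract, and model id are placeholders, not Spec-Check's production prompts:

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the env

// One specialist = one system prompt with exactly one mandate.
const EDGE_CASE_MANDATE = `You review PRDs for missing edge cases only.
Ignore style, metrics, and stakeholders. Return a JSON array of findings:
[{ "summary": string, "severity": 1-5, "fix": string }]. Return [] if none.`;

export async function runEdgeCaseAgent(prd: string) {
  const msg = await client.messages.create({
    model: "claude-sonnet-4-20250514", // placeholder model id
    max_tokens: 1024,
    system: EDGE_CASE_MANDATE,
    messages: [{ role: "user", content: prd }],
  });
  const block = msg.content[0];
  const text = block && block.type === "text" ? block.text : "[]";
  return JSON.parse(text) as { summary: string; severity: number; fix: string }[];
}
```

The narrow system prompt is the point: the agent never has to trade depth on its own axis against coverage of the other four.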

A score, not a verdict. The Readiness Score is a diagnostic, not a gatekeeper. It's graded by design: a 62 means "go, but with these three things fixed first." A binary pass/fail would have been easier to build and worse to use.
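A toy illustration of the difference, with thresholds and copy that are assumptions rather than Spec-Check's actual behavior:

```ts
// Diagnostic reading of the score: pair the number with the fixes,
// instead of collapsing it to pass/fail. The bands here are invented.
function readinessGuidance(score: number, topFixes: string[]): string {
  if (score >= 80) return "Ready: hand off to engineering.";
  if (score >= 50)
    return `Go, but fix these first: ${topFixes.slice(0, 3).join("; ")}`;
  return "Not ready: rework the spec before asking for review.";
}
```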

Severity bands, not stars. Findings ship in four bands — Mental Model, Metric Gap, Stakeholder Alert, Edge Case — because that's how engineers triage. The findings got more useful when I stopped being clever with the names.

05 / Outcomes

5 specialist agents, 1 orchestrator.
~45s median review latency for a 4-page PRD.
100% of my own PRDs now routed through it.
0→1, built solo and live on Vercel.

Using Spec-Check on my own work caught three shippable-looking PRDs with critical gaps before they ever hit engineering — one of which would have cost a week of rework.

06 / What this project taught me