How to Write an AI Feature Spec That Engineers Won't Push Back On

Wednesday 13/05/2026

·11 min read

You write a one-pager for the new AI feature. "When the user clicks Summarize, the AI generates a 3-paragraph summary of the document." You drop it in the engineering channel. Within an hour you have eleven questions back: What model? What if the doc is 500 pages? What if it's empty? What about non-English? What's our latency budget? What happens when Anthropic's API is down? What counts as a "good" summary? Are we A/B testing this? Who pays the token bill? How do we handle PII?

Every one of those questions is fair. Most of them you can answer - you just didn't think to write them down. An AI feature spec, or product requirements document, isn't harder to write than a regular PRD. It just has more required fields. This post is a template you can copy, plus a real before/after showing what engineers wish PMs would include before kicking off the project.

Why your AI feature spec keeps getting bounced back

Regular product specs answer one question: what is the user-visible behavior. An AI feature spec has to answer that plus four more:

What does "good enough" mean? Output quality is fuzzy. Without a definition, engineering will optimize for the wrong thing.
What happens when the AI fails? API outages, rate limits, refusals, hallucinations. Every AI feature has a 1-5% failure rate. You need to design for it.
What does this cost per use? Token costs are real and they scale with usage. A 30¢-per-call feature on a free tier kills your unit economics.
How will we know it's working in production? AI features drift. The same prompt with the same model can degrade after a model update. You need eval coverage.

Engineers push back on AI specs not because they're being difficult but because shipping without these answers means they will own the decisions you didn't make. And they don't have your context for the business tradeoffs.

The eight sections every AI feature spec needs

Steal this structure. Every section is required. If you can't fill one in, that's the conversation to have with engineering before a sprint starts.

1. Success metrics with thresholds

Not "improve user satisfaction." Specific numbers, with how you'll measure them.

Success metrics:
- Task completion rate: ≥ 70% of users keep the AI output without editing it
- User satisfaction: ≥ 4/5 thumbs-up rate on the in-product feedback widget
- Repeat usage: ≥ 40% of users who tried it use it again within 7 days
- Cost per successful interaction: < $0.05

Kill criteria (if hit at week 4 post-launch):
- Task completion rate < 50%
- User satisfaction < 3.5/5
- Cost per successful interaction > $0.20

If you don't define kill criteria upfront, the feature becomes politically un-killable. Set them now.

If you're not sure what numbers to pick, I wrote a whole post on how to measure if your AI feature is actually working with the four metrics that matter and instrumentation code.

2. Fallback behavior

What happens when the AI call fails? "Show an error" is not enough. There are at least four failure modes you need to handle separately:

Failure modes:
- API timeout (no response in 30s) → show "Taking longer than usual" + retry button
- API error (500, 503) → fall back to non-AI version of the feature (search instead of AI summary)
- Content refusal (model refuses to answer) → show neutral message + option to rephrase
- Rate limit (429) → queue the request, show "We're a bit busy" + ETA

Engineers will write some kind of fallback regardless. You want them writing the one you designed, not the one they made up under deadline pressure.

3. Latency and cost budgets

Two numbers. Without them, engineering has no way to choose between model options or to know when to optimize.

Latency budget:
- P50 time-to-first-token: < 800ms
- P95 total response time: < 6s
- Hard timeout: 30s

Cost budget:
- Average cost per interaction: ≤ $0.03
- Hard ceiling per interaction: $0.10 (above this, fail closed)
- Monthly budget for the feature: $500 (at projected 50k uses/month)

If your latency budget is 800ms but you're using a model with a 3-second cold start, that's a constraint conversation before sprint planning, not after.

4. Edge cases

This is the section that catches the most pushback. Be specific:

Edge cases:
- Empty input → button is disabled with tooltip "Add content first"
- Input over 100k tokens → show truncation warning, summarize first 100k
- Non-English input → detect language, use multilingual prompt, log for analysis
- Adversarial prompts (prompt injection attempts) → strip system-prompt-like patterns, log for security review
- User-controlled input embedded in system context → use delimiter pattern, never trust verbatim
- Repeated rapid clicks → debounce 500ms, show last result
- User loses connection mid-stream → save partial output, resume on reconnect

You don't need to specify the implementation of each - you need to specify the behavior the user sees. Engineers will figure out the how.

5. Eval set

This is the section most specs skip and shouldn't. Provide 15-30 example inputs with expected outcomes (not exact outputs - outcomes).

Eval cases (full set in eval/summary-v1.yaml):
- Short article (500 words) → 2-3 sentence summary, covers all main points
- Long article (10k words) → 3-paragraph summary, prioritizes intro and conclusion
- Technical paper with formulas → preserves key terms, doesn't fabricate equations
- News article with quotes → attributes quotes to correct sources
- Opinion piece → marks it as opinion, doesn't restate as fact
- Empty/garbage input → returns "Not enough content to summarize"
- Prompt injection ("Ignore previous instructions...") → produces a summary anyway

This becomes the regression test set engineers use to validate the prompt. If you provide it, you control the definition of "working." If you don't, the engineer's vibes become the definition.

6. Failure modes and observability

What needs to be logged so we know what's happening in production?

Observability requirements:
- Every call logs: input length, output length, latency, model, cost, user ID (hashed)
- Failures log: error type, input hash, retry count
- Sample 1% of successful calls for quality review (input + output)
- Daily dashboard: success rate, P95 latency, cost trend, top error types
- Alert: success rate drops below 90% for 1 hour
- Alert: P95 latency exceeds budget for 30 minutes

If you don't specify this, you'll be flying blind a week after launch. Tools like Langfuse or Helicone make this easy if you're using TypeScript (here's how to wire Langfuse into a TypeScript AI app).

7. Data scope and privacy

This is the one legal will eventually ask about. Better to answer it in the spec.

Data scope:
- Input data: user-uploaded documents, max 100k tokens per request
- Sent to provider (Anthropic): document content, no PII fields
- Retention: zero retention agreement with Anthropic (no training, 0-day retention)
- PII handling: redact emails/phone numbers via middleware before sending
- User opt-out: account setting "Don't use my data for AI features" disables the feature entirely
- Audit log: every call logged for 90 days for compliance review

8. Rollout plan

How does this get to 100% of users without breaking everything?

Rollout:
- Week 1: Internal users only, gather eval feedback
- Week 2: 5% of free-tier users, A/B against control (no feature)
- Week 3: 25% of all users if metrics hold
- Week 4: 100% if no regressions in success rate or cost
- Kill switch: feature flag, removable in < 5 minutes

A real before/after: the "AI summary" spec

Here's a redacted version of an actual spec I helped rewrite for an AI summary feature. The first version got bounced back twice.

The "before" spec (bad)

Feature: AI Summary
What: Add a "Summarize" button to documents.
When clicked, AI generates a 3-paragraph summary.
Use the Anthropic API.

Acceptance criteria:
- Button appears at the top of every document
- Clicking it shows a summary
- Summary is accurate

Open questions:
- TBD

The engineer's response (paraphrased):

"Accurate by what definition? What if the doc is empty? What if the user has 500 of them? What model - Haiku, Sonnet, Opus? They're 4x different in cost. What's our latency target? What happens if Anthropic goes down - does the button disappear or show an error? Who pays for it? Can a free user use it? How do we test it before we ship it? How do we know if it's working after we ship it?"

Eleven questions. The PM thought the engineer was being a pain. The engineer was trying to figure out what to actually build.

The "after" spec (good)

Here's a sketch of the rewrite - abbreviated for the post. Same feature, far less ambiguity:

Feature: AI Document Summary

User-visible behavior:
- "Summarize" button on every document > 500 words
- Click → modal opens → summary streams in
- "Copy", "Regenerate", and feedback (thumbs up/down) buttons
- If document < 500 words, button is disabled with tooltip
- If document > 100k tokens, summarize first 100k, show truncation banner

Success metrics: [see template above]
Fallback behavior: [see template above]
Latency budget: P95 < 6s, P50 first token < 800ms
Cost budget: < $0.03 avg, $0.10 ceiling, $500/month
Edge cases: [10 enumerated cases]
Eval set: 25 cases in eval/summary-v1.yaml
Observability: [logging + dashboard requirements]
Data scope: zero retention, no PII sent
Rollout: 4-week phased, with kill switch

Model selection: Claude Haiku 4.5 (within budget for 95% of docs)
Tier access: paid users only initially; free users see upgrade prompt

Open questions for engineering:
- Do we cache identical inputs? (default: yes, 7-day TTL)
- Do we stream or wait for full response? (default: stream)

The engineering team estimated this in 90 minutes. The previous version generated three meetings.

The template you can copy

Save this as .spec.md for any AI feature:

# AI Feature Spec: [Feature Name]

## User-visible behavior
[What the user sees, step by step. Include UI states, error states, empty states.]

## Success metrics
- Primary: [metric] ≥ [threshold]
- Secondary: [metric] ≥ [threshold]
- Kill criteria: [metric] < [threshold] at [time post-launch]

## Fallback behavior
- API timeout (>30s): [behavior]
- API error: [behavior]
- Content refusal: [behavior]
- Rate limit: [behavior]

## Latency and cost budgets
- P50 latency: < [ms]
- P95 latency: < [ms]
- Avg cost per call: < $[amount]
- Hard ceiling per call: $[amount]
- Monthly budget: $[amount]

## Edge cases
- Empty input: [behavior]
- Oversize input: [behavior]
- Non-English input: [behavior]
- Adversarial input: [behavior]
- [Other domain-specific cases]

## Eval set
Location: `eval/[feature]-v1.yaml`
Coverage: [N] cases across happy path, edge cases, adversarial.

## Observability
- Log per call: [fields]
- Sample for quality review: [%]
- Dashboard metrics: [list]
- Alerts: [conditions]

## Data scope and privacy
- Data sent to provider: [scope]
- Retention agreement: [details]
- PII handling: [approach]
- User opt-out: [behavior]

## Rollout plan
- Phase 1: [audience, duration, success criteria]
- Phase 2: [audience, duration, success criteria]
- Phase 3: [full rollout conditions]
- Kill switch: [how, by whom]

## Open questions for engineering
[List the decisions you want engineering input on, with your default suggestion]

A real instance of this template runs 2-3 pages. That's not bloat - every section saves a meeting.

What engineers actually wish PMs would include

I asked five engineers who build AI features what they wish appeared in specs. Three things came up every time:

A real eval set, not "it should be good." The eval set is the contract. Without it, "good" is whatever the loudest stakeholder thinks at the moment.
The cost number. Not "be cost-efficient" - the actual dollar budget per call. This determines which model to use, whether to cache, whether to batch.
The kill criteria. Engineers want to know how to win, and how to walk away. Specs without exit criteria turn into zombie features.

If your spec hits all eight sections above, you'll get fewer questions, faster estimates, and feature work that ships closer to what you actually wanted.

What's next

Once you've got the spec right, the next sharp edge is what happens when adversarial users find your feature. The next post in this series is prompt injection defense for JavaScript apps - covering the edge case section of the spec template above with real defense middleware you can drop into a Next.js app.

For features where the agent does multi-step work and a human should be in the loop at specific moments (approving an action, reviewing generated text before send), the spec's fallback behavior and human approval sections deserve their own thinking. I cover the pattern in Build a human-in-the-loop AI agent with Vercel AI SDK - pairs naturally with the spec template above.