GPT-4 can write a user story in three seconds. But is it any good? We ran 200 AI-generated stories through a panel of senior product managers and tracked what they changed, approved, and rejected. The results are more nuanced than the hype suggests.
We gave the same set of 40 client briefs to two groups:
Each group produced user stories, acceptance criteria, and complexity estimates. Then we swapped the outputs: Group A reviewed the AI output, and a fresh PM reviewed Group B's stories.
Here's what we found.
Speed: The AI produced all 40 stories in under 4 minutes. The PM panel took a combined 14 hours across all three people.
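For a rough sense of scale, the figures above can be turned into a ratio. This is a back-of-the-envelope sketch in Python: it treats "under 4 minutes" as exactly 4 minutes, so the AI-side numbers are bounds rather than measurements, and the variable names are chosen just for this example.

```python
# Figures quoted above; "under 4 minutes" is treated as an upper bound of 4.
STORIES = 40
AI_MINUTES = 4   # upper bound on AI time for all 40 stories
PM_HOURS = 14    # combined PM panel effort across three people

pm_minutes = PM_HOURS * 60                         # 840 minutes of PM effort
speedup = pm_minutes / AI_MINUTES                  # lower bound on the ratio
pm_minutes_per_story = pm_minutes / STORIES        # average PM effort per story
ai_seconds_per_story = AI_MINUTES * 60 / STORIES   # upper bound per story

print(f"PM panel: {pm_minutes_per_story:.0f} min/story")
print(f"AI: at most {ai_seconds_per_story:.0f} s/story")
print(f"Speed-up: at least {speedup:.0f}x")
```

Keep in mind that the 14 hours is combined effort across three people, so this compares effort spent, not elapsed time.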
Consistency: AI stories followed the same format every single time. PM-written stories varied significantly in structure, level of detail, and acceptance criteria quality — even within the same person's output across a day.
Completeness: AI stories included acceptance criteria in 100% of cases. PM-written stories included them in 71%. This matters more than it sounds: stories without acceptance criteria create ambiguity downstream.
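A completeness check like this is easy to automate. The sketch below is a minimal Python illustration, assuming each story is a dict with a hypothetical `acceptance_criteria` list. The field names and sample stories are invented for the example and are not data from the study.

```python
from typing import Any


def has_acceptance_criteria(story: dict[str, Any]) -> bool:
    """Return True if the story has at least one non-blank criterion."""
    criteria = story.get("acceptance_criteria") or []
    return any(isinstance(c, str) and c.strip() for c in criteria)


def completeness_rate(stories: list[dict[str, Any]]) -> float:
    """Share of stories (0.0 to 1.0) that include acceptance criteria."""
    if not stories:
        return 0.0
    return sum(has_acceptance_criteria(s) for s in stories) / len(stories)


# Invented sample data, for illustration only.
sample = [
    {"title": "Reset password", "acceptance_criteria": ["Email is sent"]},
    {"title": "Export report", "acceptance_criteria": []},
    {"title": "Filter orders"},  # field missing entirely
    {"title": "Edit profile", "acceptance_criteria": ["  ", "Name is saved"]},
]

rate = completeness_rate(sample)
print(f"{rate:.0%} of stories include acceptance criteria")
```

A check of this kind only confirms that criteria exist; it says nothing about whether they are any good, which is where reviewer judgement still comes in.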
Ambiguity detection: The AI flagged potential ambiguities (via an `isAmbiguous` field and a clarifying question) in 34% of stories. The PM panel flagged similar issues in 41%. Close enough to be useful.
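As a rough illustration of what such a flag might look like in a story record, here is a Python sketch. Only the `isAmbiguous` field name comes from the description above; `clarifyingQuestion`, the `title` key, the helper functions, and the sample data are assumptions made for this example.

```python
from typing import Optional, TypedDict


class Story(TypedDict):
    title: str
    isAmbiguous: bool                   # field named in the text above
    clarifyingQuestion: Optional[str]   # hypothetical field name


def needs_follow_up(story: Story) -> bool:
    """A story needs follow-up if it is flagged and carries a question."""
    return story["isAmbiguous"] and bool(story["clarifyingQuestion"])


def flag_rate(stories: list[Story]) -> float:
    """Share of stories (0.0 to 1.0) flagged as ambiguous."""
    if not stories:
        return 0.0
    return sum(s["isAmbiguous"] for s in stories) / len(stories)


# Invented sample data, for illustration only.
stories: list[Story] = [
    {"title": "Export report", "isAmbiguous": True,
     "clarifyingQuestion": "Which file formats must be supported?"},
    {"title": "Reset password", "isAmbiguous": False,
     "clarifyingQuestion": None},
    {"title": "Edit profile", "isAmbiguous": False,
     "clarifyingQuestion": None},
]

questions = [s["clarifyingQuestion"] for s in stories if needs_follow_up(s)]
print(f"Flag rate: {flag_rate(stories):.0%}")
print(questions)
```

Keeping the question next to the flag means a reviewer can see why a story was flagged without rereading the brief.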
Domain context: The AI had no knowledge of the client's existing system, tech stack, or past decisions. Several stories were technically correct but strategically wrong — they described features that already existed in a different form, or that conflicted with known constraints.
Prioritisation: AI cannot tell you whether a story matters relative to everything else in the backlog. It can estimate complexity but not value.
Nuance in edge cases: For highly technical or domain-specific briefs (e.g., a financial compliance feature), AI stories were noticeably shallower. They described the "what" correctly but missed important "how" and "why" details.
Confidence calibration: The AI flagged 34% of stories as ambiguous. But our PM reviewers found that roughly 58% needed some clarification. The AI was systematically under-flagging.
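The size of that gap can be bounded with a little arithmetic. The Python sketch below assumes the best case for the AI, namely that every one of its flags landed on a story that really did need clarification. The figures above don't say how many flags were correct, so treat the results as bounds, not measurements.

```python
AI_FLAGGED = 0.34            # share of stories the AI flagged as ambiguous
NEEDED_CLARIFICATION = 0.58  # share reviewers said needed clarification

# Best case: every AI flag landed on a story that truly needed clarification.
max_recall = AI_FLAGGED / NEEDED_CLARIFICATION

# Even then, this share of the stories needing clarification went unflagged.
min_missed_share = 1 - max_recall

# The same gap expressed as a fraction of all stories.
min_missed_points = NEEDED_CLARIFICATION - AI_FLAGGED

print(f"Recall is at most {max_recall:.0%}")
print(f"At least {min_missed_share:.0%} of unclear stories went unflagged")
print(f"That is at least {min_missed_points * 100:.0f} points of all stories")
```

Under that best-case assumption, the AI caught at most about 59% of the stories that needed clarification. If some of its flags were false alarms, the true figure is lower.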
When reviewing AI output, PMs made the following edits:
| Change type | Frequency |
|---|---|
| Rewrote acceptance criteria | 38% |
| Added missing context | 29% |
| Changed complexity estimate | 22% |
| Rewrote user story entirely | 11% |
| Approved with no changes | 31% |
That 31% is the number that stands out. Nearly a third of AI-generated stories were approved as-is by experienced PMs. That's not a replacement for product thinking — but it's a significant reduction in the lowest-value work.
The "AI vs. human" debate over user stories is a false dichotomy. The better question is: what's the right division of labour?
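AI is good at:

- Producing first drafts in seconds
- Keeping a consistent format across stories
- Always including acceptance criteria
- Flagging some (though not all) ambiguities

Humans are good at:

- Applying knowledge of the client's system, tech stack, and past decisions
- Prioritising stories by value
- Handling nuance in technical or domain-specific briefs
- Judging when a story needs clarification
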
A PM who uses AI to generate first drafts and then applies their expertise to review and refine them is more productive than one who relies on either alone.
Most PM tools still treat story writing as a manual activity. You open a Jira ticket, type into a blank field, and hope you remember the acceptance criteria format.
A small number of tools — Storygate included — are starting to treat the intake → story → backlog pipeline as something worth automating. The shift is early but the direction is clear.
The teams that adapt fastest won't be the ones with the best AI prompts. They'll be the ones who redesign their process around what AI is actually good at.
Try AI-generated user stories in your own workflow. [Start with Storygate for free →](/register)