
We Deployed a "Small Fix" and Took Down Production — Here's What Actually Happened
A minor backend change caused a production outage, high CPU usage, and API failures. Here's how it happened, what we missed, and how we fixed it.
It started as a simple task.
Just add one more field to the API response.
No major logic change. No risky deployment. Just a small enhancement. We deployed it to production and within minutes:
API response time jumped from 120ms to 5s
CPU usage hit 95%
Some endpoints started timing out
Users began reporting failures
At first, nothing made sense.
What Changed?
Here's the actual change:
// Before
const users = await User.find({ isActive: true });
// After
const users = await User.find({ isActive: true })
.populate("orders");
Looks harmless, right? That .populate("orders") was the killer.
The Real Problem
Each user had multiple orders.
So instead of: 1 query
We now had: 1 query + N additional queries (for each user)
This is called:
N+1 Query Problem
With approximately 2,000 active users, that turned into 2,001 database queries per request.
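The blowup is easy to reproduce without a database. The sketch below uses a hypothetical in-memory store that simply counts "queries", comparing what .populate() effectively does per parent document against the batched alternative:

```javascript
// Simulated store: a stand-in for MongoDB used only to count queries.
let queryCount = 0;

const db = {
  findUsers() {
    queryCount++;
    return Array.from({ length: 2000 }, (_, i) => ({ _id: i }));
  },
  findOrdersByUser(userId) {
    queryCount++; // one extra query per user — this is the "+N"
    return [{ userId, total: 10 }];
  },
  findOrdersByUsers(userIds) {
    queryCount++; // one batched query covering every user
    return userIds.map(id => ({ userId: id, total: 10 }));
  },
};

// N+1 pattern: one parent query, then one child query per parent
function nPlusOne() {
  queryCount = 0;
  const users = db.findUsers();
  for (const u of users) db.findOrdersByUser(u._id);
  return queryCount; // 1 + 2000 = 2001
}

// Batched pattern: the two-query fix described below
function batched() {
  queryCount = 0;
  const users = db.findUsers();
  db.findOrdersByUsers(users.map(u => u._id));
  return queryCount; // 2
}
```

Same data, same result shape: 2,001 queries versus 2.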
Why It Broke Production
MongoDB connections got saturated
CPU usage spiked due to excessive queries
API latency exploded
Node.js stalled under the load of hydrating thousands of Mongoose documents per request
Even worse: this endpoint powered the dashboard, so every page load triggered the heavy query.
Why We Didn't Catch It
Local data was small (10-20 users)
No load testing
No query monitoring in staging
No performance checks before deploy
Everything worked fine locally.
The Fix
We replaced .populate() with a controlled query:
// Fetch plain JS objects instead of full Mongoose documents
const users = await User.find({ isActive: true }).lean();
const userIds = users.map(u => u._id);

// One batched query for all orders instead of one per user
const orders = await Order.find({
  userId: { $in: userIds }
}).lean();

// Group orders by user id (String() so ObjectId keys compare reliably)
const ordersMap = orders.reduce((acc, order) => {
  const key = String(order.userId);
  acc[key] = acc[key] || [];
  acc[key].push(order);
  return acc;
}, {});

const result = users.map(user => ({
  ...user,
  orders: ordersMap[String(user._id)] || []
}));
Result After Fix
API response time: 5s → 180ms
DB queries: 2,000+ → 2
CPU usage normalized
System stable again
Lessons Learned
1. Never trust .populate() blindly
It looks simple but can be expensive at scale.
2. Always think in queries
Ask yourself: How many DB calls will this line generate?
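One way to make that question enforceable is a query budget in tests. The helper below is a hypothetical sketch (not a Mongoose API): it wraps any async query function and counts calls, so a test can fail when an endpoint suddenly issues thousands of queries instead of two:

```javascript
// Hypothetical helper: wrap an async query function and count invocations,
// so tests can assert a maximum "query budget" per request.
function withQueryCounter(queryFn) {
  let count = 0;
  const counted = async (...args) => {
    count++;
    return queryFn(...args);
  };
  counted.queryCount = () => count; // inspect how many queries ran
  return counted;
}

// Usage sketch: pass the counted function into your handler, then assert
// that handling one request stayed within budget, e.g.
//   assert(find.queryCount() <= 2);
```

A real setup could hook Mongoose's debug output instead, but a plain wrapper like this keeps the test self-contained.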
3. Test with realistic data
Your local environment lies.
4. Add performance monitoring
Track: query count, response time, and CPU usage
5. Use .lean() when possible
It reduces memory overhead and improves performance.
What I Should Have Done
Looking back, this wasn’t just a “small fix gone wrong” — it was a gap in process.
Here’s what would have prevented the issue:
1. Proper Code Review (Even for Small Changes)
Even small fixes deserve a second pair of eyes.
A quick review could have caught the hidden impact early.
2. Use Feature Flags Instead of Direct Changes
Instead of pushing changes directly to production, wrapping them behind a feature flag would allow safe testing and instant rollback.
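A minimal sketch of that idea, assuming a flag named ORDERS_POPULATE_V2 read from an environment variable (a real project would likely use a flag service such as LaunchDarkly or Unleash):

```javascript
// Minimal feature-flag sketch. The flag name and env-var mechanism are
// assumptions for illustration only.
const flags = {
  isEnabled(name) {
    return process.env[name] === "true";
  },
};

// deps carries the two code paths so this sketch stays self-contained.
async function getActiveUsers(deps) {
  if (flags.isEnabled("ORDERS_POPULATE_V2")) {
    return deps.usersWithOrdersBatched(); // new two-query path
  }
  return deps.usersLegacy(); // old behavior — rollback is just flipping the flag
}
```

The point is not the mechanism but the shape: both paths live in production at once, and switching between them requires no deploy.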
3. Canary Deployment
Rolling out the change to a small percentage of users first would have exposed the issue without affecting everyone.
4. Monitoring & Alerts
Better monitoring (logs, metrics, alerts) could have helped detect the issue immediately instead of after impact.
5. Don’t Trust “Small Data” Assumptions
Just because the change involves limited or local data doesn’t mean the impact is limited.
Production systems behave differently under real conditions.
Lesson: There’s no such thing as a “safe small fix” in production.
Bonus: Safer Alternative Pattern
For large datasets: use aggregation pipelines, paginate results, limit populated fields, and cache frequently accessed data.
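For example, a $lookup aggregation can do the user–orders join in a single server-side query, combined with pagination and field trimming. The collection name "orders", the projected fields, and the page size below are assumptions for illustration:

```javascript
// Sketch: replace .populate() with one server-side $lookup join.
// Would run as: await User.aggregate(pipeline);
const pageSize = 50;
const page = 0;

const pipeline = [
  { $match: { isActive: true } },
  { $skip: page * pageSize },   // paginate before the join, not after
  { $limit: pageSize },
  {
    $lookup: {
      from: "orders",           // joined collection (assumed name)
      localField: "_id",
      foreignField: "userId",
      as: "orders",
    },
  },
  // Return only the fields the dashboard actually renders
  { $project: { name: 1, email: 1, "orders.total": 1, "orders.createdAt": 1 } },
];
```

Ordering matters here: skipping and limiting before the $lookup means the join runs on one page of users, not on the whole collection.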
Final Thought
Most production outages don't come from big changes. They come from small changes that scale badly.