
Jan 04, 2026 • 6 min read • Backend, Integrations, APIs, Reliability, SaaS
Most production systems depend on APIs they don’t control. This is what it actually takes to build reliability when your dependencies are unpredictable.
Modern software is built on APIs: payments, messaging, auth, analytics, CRMs, and cloud services.
Very few products are truly standalone.
And yet, one reality becomes obvious the moment you ship to production:
External APIs are unreliable - even when they’re “enterprise-grade”.
Building reliable systems on top of them is one of the hardest backend problems - and one of the least talked about.
Most API documentation implies stability, but production tells a different story.
| Expectation (The Myth) | Production Reality |
|---|---|
| Clear request/response contracts | Requests time out without explanation |
| Defined rate limits | Rate limits change without notice |
| Promised uptime | Errors appear with no actionable context |
Even good APIs fail - just not always in predictable ways.
The mistake isn’t trusting APIs. The mistake is designing systems that assume they won’t fail.
Treat external systems as permanently unreliable, even when they usually work.
Most developers build for the happy path. But building for reliability requires moving past the binary of success and failure. You have to stop asking what happens when things work, and start planning for the grey area of production.
| The Happy Path Question | The Reliability Question |
|---|---|
| “What happens when this API succeeds?” | “What happens when it partially fails?” |
| “How fast is the response?” | “What happens when it responds late?” |
| “Did the request go through?” | “What happens when it responds twice?” |
| “Is the service up?” | “What happens when it never responds?” |
If those questions don’t have architectural answers, the system isn’t reliable - it’s lucky.
One of the first painful lessons in integrations is duplication. In a distributed system, you cannot guarantee that a request will only be sent once. Networks glitch, and when they do, the default behavior of almost every system is to try again.
Retries happen. Webhooks repeat. Clients resend requests. Without a strict idempotency strategy, your system is vulnerable to critical failures:
| Scenario | The Failure (Without Idempotency) |
|---|---|
| Payment Processing | Users get charged twice for the same order. |
| Notification Service | Duplicate emails or SMS messages annoy the user. |
| Resource Updates | State becomes inconsistent across services. |
Idempotency is not an optimization. It’s a requirement. If your system can’t safely process the same event twice, it’s fragile.
A common pattern is to derive an idempotency key from a stable business identifier (such as `order_id`) to ensure that even if a new request is generated, the underlying intent is recognized as a duplicate.
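Here’s a minimal sketch of that pattern, using an in-memory SQLite table with a unique constraint as a stand-in for a real database; the schema and `charge_order` are illustrative, not a definitive implementation:

```python
import hashlib
import sqlite3

# In-memory stand-in for a real table with a UNIQUE constraint on the key.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed (idempotency_key TEXT PRIMARY KEY, result TEXT)")

def idempotency_key(order_id: str, action: str) -> str:
    # Derive the key from stable business identifiers, not request-level
    # randomness, so every retry of the same intent maps to the same key.
    return hashlib.sha256(f"{action}:{order_id}".encode()).hexdigest()

def charge_order(order_id: str) -> str:
    key = idempotency_key(order_id, "charge")
    try:
        # The UNIQUE constraint makes the claim atomic: only one request
        # per key ever reaches the real side effect. (A production version
        # would also record success/failure status, not just the claim.)
        db.execute("INSERT INTO processed (idempotency_key, result) VALUES (?, ?)",
                   (key, "charged"))
        db.commit()
        return "charged"          # first delivery: perform the charge
    except sqlite3.IntegrityError:
        return "duplicate"        # retry or webhook replay: safely ignored

print(charge_order("order-42"))   # -> charged
print(charge_order("order-42"))   # -> duplicate
```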
Webhooks feel like a silver bullet for real-time updates. The mental model is simple: an external service tells you when something happens. Most developers start with a dangerous assumption:

“We’ll just listen for events and update our database.”
In production, that simplicity evaporates. Webhooks arrive out of order, late, or multiple times. Sometimes, they never arrive at all. If your system treats an event as an absolute command, one dropped or delayed packet can corrupt your entire state.
| The Webhook Myth | The Reliable Reality |
|---|---|
| A command to change state. | A 'hint' that something changed. |
| Process it immediately. | Queue it, then process. |
| Trust the payload. | Verify via API before acting. |
Webhooks are notifications, not commands. Design your handlers to be 'Verify-then-Act' systems.
To survive the chaos, decouple ingestion from processing. Acknowledge the webhook with a 200 OK immediately, push the event to a queue, and let a background worker fetch the current 'source of truth' via the API. This ensures you never act on stale or out-of-order data.
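A minimal sketch of that decoupling, assuming Flask for ingestion and an in-memory queue standing in for a durable one; `fetch_order_from_api` is a hypothetical call to the provider:

```python
import queue
import threading
from flask import Flask, request

app = Flask(__name__)
events = queue.Queue()  # stand-in for a durable queue (SQS, Kafka, ...)

@app.route("/webhooks/orders", methods=["POST"])
def receive_webhook():
    # Ingest only: accept, enqueue, and acknowledge fast.
    events.put(request.get_json(force=True))
    return "", 200  # processing happens elsewhere

def fetch_order_from_api(order_id):
    # Hypothetical call to the provider's API -- the source of truth.
    # The webhook said *something* changed; this says *what is true now*.
    ...

def worker():
    while True:
        event = events.get()
        order = fetch_order_from_api(event["order_id"])
        # Act on the fetched state, not the (possibly stale) payload.
        events.task_done()

threading.Thread(target=worker, daemon=True).start()
# app.run(port=8080)
```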
One of the most underrated reliability tools is the timeout. In an integrated system, waiting is contagious. If an external API hangs for 30 seconds, your application threads hang for 30 seconds. Do that enough times, and your entire service grinds to a halt.
| Strategy | System Impact |
|---|---|
| No/Long Timeouts | Blocked threads, backed-up queues, and cascading total failure. |
| Aggressive Timeouts | Contained failures, faster feedback, and graceful degradation. |
A slow failure is often worse than a fast one. It consumes resources and hides the root cause behind a wall of hanging connections.
But a timeout alone isn't enough. It must be paired with a Circuit Breaker to stop the bleeding and a clear fallback path to keep the user experience alive while the dependency recovers.
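Here’s a minimal sketch of that pairing, with illustrative thresholds and a hand-rolled breaker (in practice a library like `pybreaker` plays this role):

```python
import time
import requests

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after     # cool-down before a retry probe
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        # Open circuit: reject calls until the cool-down elapses.
        if self.opened_at and time.monotonic() - self.opened_at < self.reset_after:
            return False
        return True

    def record(self, ok: bool):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

breaker = CircuitBreaker()

def call_dependency(url: str):
    if not breaker.allow():
        return None  # fail fast: fall back instead of queueing behind a dead service
    try:
        # (connect timeout, read timeout): fail in seconds, not minutes.
        resp = requests.get(url, timeout=(3.05, 5))
        breaker.record(resp.ok)
        return resp
    except requests.RequestException:
        breaker.record(ok=False)
        return None
```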
Retries sound like a safety net, but in production, they are often a self-inflicted DDoS attack. When an external service struggles, thousands of your application instances retrying simultaneously will crush whatever remaining capacity that service has left. This is the Retry Storm.
| The Wrong Way | The Reliable Way |
|---|---|
| Immediate retries (tight loops). | Exponential backoff + Jitter. |
| Retry everything (4xx and 5xx). | Only retry transient errors (503/504). |
| Infinite retries. | Strict retry limits and Dead Letter Queues. |
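A minimal sketch of the reliable column, assuming `requests` for the call; the dead-letter handler is a hypothetical stub:

```python
import random
import time
import requests

RETRYABLE = {429, 502, 503, 504}  # transient by nature; 4xx logic errors are not

def call_with_backoff(url: str, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, timeout=5)
            if resp.status_code not in RETRYABLE:
                return resp           # success or a permanent error: stop retrying
        except requests.RequestException:
            pass                      # network glitch: treat as transient
        # Full jitter: sleep a random slice of an exponentially growing window,
        # so thousands of clients don't hammer the service back in lockstep.
        time.sleep(random.uniform(0, min(30, 2 ** attempt)))
    send_to_dead_letter_queue(url)    # hypothetical: park it for async repair
    return None

def send_to_dead_letter_queue(url: str):
    ...
```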
Sometimes the most reliable thing you can do is stop trying. Not every failure should be hidden from the user or the system.
“Fail and surface the issue.”
If an API returns a 400 (Bad Request) or a 401 (Unauthorized), no amount of retrying will fix the problem. Recognizing the difference between a transient network glitch and a permanent logic error is the difference between a resilient system and a broken one.
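In code, that distinction can be a single explicit function; the groupings below are illustrative, not exhaustive:

```python
def classify(status: int) -> str:
    if status in (408, 429, 502, 503, 504):
        return "transient"   # worth a bounded retry with backoff
    if 400 <= status < 500:
        return "permanent"   # our bug or our credentials: surface it, don't retry
    if status >= 500:
        return "unknown"     # possibly transient: retry cautiously, alert on repeat
    return "success"
```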
You don’t control external APIs, but you do control how well you understand their failures. Without deep visibility, debugging a failed payment or a dropped webhook becomes a game of finger-pointing between your team and the provider.
| The Blind Spot | The Observable Solution |
|---|---|
| Did the API fail or was it our network? | Structured logs with egress timing. |
| Which user request caused this error? | Correlation IDs across boundaries. |
| Is this a one-off or a provider outage? | Error categorization and dashboards. |
If an integration fails and you can’t answer What, Why, and How Often immediately, your reliability strategy is just guesswork.
Observability is not a luxury - it’s the only way to debug systems you don’t own. By investing in request tracing and clear error classification, you stop treating external APIs as black boxes and start treating them as measurable components of your infrastructure.
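A minimal sketch of what that looks like: one structured log line per egress call, carrying timing and a correlation ID minted at the edge (the field names are illustrative):

```python
import logging
import time
import uuid
from contextvars import ContextVar

correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")
log = logging.getLogger("egress")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def traced_call(provider: str, fn, *args):
    # One structured line per egress call: who, how long, and which user
    # request it belongs to -- enough to separate "their outage" from "our bug".
    start = time.monotonic()
    outcome = "error"
    try:
        result = fn(*args)
        outcome = "ok"
        return result
    finally:
        log.info("provider=%s outcome=%s duration_ms=%.0f correlation_id=%s",
                 provider, outcome, (time.monotonic() - start) * 1000,
                 correlation_id.get())

# At the edge of the system, mint one ID per inbound request and propagate it
# on every outgoing call (e.g., as an X-Correlation-ID header).
correlation_id.set(str(uuid.uuid4()))
```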
In a laboratory environment, systems should always be correct. But in production, reliability is often a measure of usability under pressure. When a dependency fails, your system shouldn't just roll over and die; it should shrink its footprint to protect the core experience.
| Scenario | Graceful Degradation Path |
|---|---|
| Price/Stock API is down | Show stale cached data with a 'Last updated' hint. |
| Analytics provider is lagging | Fire-and-forget or delay ingestion; don't block the UI. |
| Secondary feature failure | Hide the widget entirely while keeping the checkout flow alive. |
Users tolerate partial failure far better than a blank screen. Graceful degradation keeps trust intact by keeping the system usable.
This approach requires you to categorize your features into 'Critical' and 'Non-critical'. If your auth service is up, but your avatar-generation API is down, the user should still be able to log in. Reliability is the art of failing in a way that the user barely notices.
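A minimal sketch of the stale-cache fallback, with a dict standing in for Redis and a hypothetical `fetch_price_from_api` upstream call:

```python
import time

cache = {}  # price_id -> (value, fetched_at); stand-in for Redis or similar

def get_price(price_id: str):
    try:
        value = fetch_price_from_api(price_id)   # hypothetical upstream call
        cache[price_id] = (value, time.time())
        return {"price": value, "stale": False}
    except Exception:
        if price_id in cache:
            value, fetched_at = cache[price_id]
            # Degrade, don't die: stale data plus an honest "last updated" hint.
            return {"price": value, "stale": True,
                    "last_updated": time.ctime(fetched_at)}
        raise  # no cached copy either: only now is failing the right move

def fetch_price_from_api(price_id: str) -> float:
    ...
```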
One of the hardest truths in backend engineering is that technical excellence has a price tag. Reliability is not an accidental byproduct of good code; it is a deliberate product choice.
Engineering can build a system that never fails, but Product might not be able to afford the complexity or the bill.
| The Reliability Feature | The Trade-off |
|---|---|
| Aggressive Retries | Increased infrastructure costs and potential for 'retry storms'. |
| Redundancy / Multi-Region | Significant architectural complexity and maintenance overhead. |
| Complex Fallbacks | More design and QA effort to ensure UX stays consistent. |
Instead of assuming 100% uptime, teams must decide: What failures are acceptable? What latency is tolerable? What guarantees actually matter to our users?
Backend engineers are the ones who implement these safeguards, but the best outcomes happen when reliability is discussed explicitly at the product level. When everyone understands the cost of 'perfection', you can build a system that is robust enough for reality without being over-engineered for theory.
Building reliable systems on top of unreliable APIs isn’t about perfection.
It’s about:
- assuming failure instead of hoping it won’t happen
- making every operation safe to repeat
- treating webhooks as hints, not commands
- failing fast with timeouts, circuit breakers, and bounded retries
- observing the systems you don’t own
- degrading gracefully instead of going dark
Reliability emerges when systems are designed for reality - not for ideal conditions. Most production issues aren’t caused by bad code; they’re caused by optimistic assumptions.
If you are integrating with external APIs, assume they will fail. Once you move past the 'Happy Path' and start designing for the grey areas - the timeouts, the duplicates, and the partial outages - you stop building lucky systems and start building resilient ones.
Build something that survives the chaos.
Thanks for reading.
Written by Sanket Dofe
Full-stack engineer & system architect. I build scalable products and write about engineering clarity.