
Jan 04, 2026 • 6 min read • Backend, Integrations, APIs, Reliability, SaaS
Most production systems depend on APIs they don’t control. This is what it actually takes to build reliability when your dependencies are unpredictable.
Modern software is built on APIs: payments, messaging, auth, analytics, CRMs, and cloud services.
Very few products are truly standalone.
And yet, one reality becomes obvious the moment you ship to production:
External APIs are unreliable - even when they’re “enterprise-grade”.
Building reliable systems on top of them is one of the hardest backend problems - and one of the least talked about.
Most API documentation implies stability, but production tells a different story.
| Expectation (The Myth) | Production Reality |
|---|---|
| Clear request/response contracts | Requests time out without explanation |
| Defined rate limits | Rate limits change without notice |
| Promised uptime | Errors appear with no actionable context |
Even good APIs fail - just not always in predictable ways.
The mistake isn’t trusting APIs. The mistake is designing systems that assume they won’t fail.
Treat external systems as permanently unreliable, even when they usually work.
Most developers build for the happy path. But building for reliability requires moving past the binary of success and failure. You have to stop asking what happens when things work, and start planning for the grey area of production.
| The Happy Path Question | The Reliability Question |
|---|---|
| “What happens when this API succeeds?” | “What happens when it partially fails?” |
| “How fast is the response?” | “What happens when it responds late?” |
| “Did the request go through?” | “What happens when it responds twice?” |
| “Is the service up?” | “What happens when it never responds?” |
If those questions don’t have architectural answers, the system isn’t reliable - it’s lucky.
One of the first painful lessons in integrations is duplication. In a distributed system, you cannot guarantee that a request will only be sent once. Networks glitch, and when they do, the default behavior of almost every system is to try again.
Retries happen. Webhooks repeat. Clients resend requests. Without a strict idempotency strategy, your system is vulnerable to critical failures:
| Scenario | The Failure (Without Idempotency) |
|---|---|
| Payment Processing | Users get charged twice for the same order. |
| Notification Service | Duplicate emails or SMS messages annoy the user. |
| Resource Updates | State becomes inconsistent across services. |
Idempotency is not an optimization. It’s a requirement. If your system can’t safely process the same event twice, it’s fragile.
A common pattern is to derive an idempotency key from a stable business identifier (such as `order_id`) to ensure that even if a new request is generated, the underlying intent is recognized as a duplicate.
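Here’s a minimal sketch of that pattern, using an in-memory SQLite table with a unique constraint as a stand-in for a real database; the schema and `charge_order` are illustrative, not a definitive implementation:

```python
import hashlib
import sqlite3

# In-memory stand-in for a real table with a UNIQUE constraint on the key.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed (idempotency_key TEXT PRIMARY KEY, result TEXT)")

def idempotency_key(order_id: str, action: str) -> str:
    # Derive the key from stable business identifiers, not request-level
    # randomness, so every retry of the same intent maps to the same key.
    return hashlib.sha256(f"{action}:{order_id}".encode()).hexdigest()

def charge_order(order_id: str) -> str:
    key = idempotency_key(order_id, "charge")
    try:
        # The UNIQUE constraint makes the claim atomic: only one request
        # per key ever reaches the real side effect. (A production version
        # would also record success/failure status, not just the claim.)
        db.execute("INSERT INTO processed (idempotency_key, result) VALUES (?, ?)",
                   (key, "charged"))
        db.commit()
        return "charged"          # first delivery: perform the charge
    except sqlite3.IntegrityError:
        return "duplicate"        # retry or webhook replay: safely ignored

print(charge_order("order-42"))   # -> charged
print(charge_order("order-42"))   # -> duplicate
```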
Webhooks feel like a silver bullet for real-time updates. The mental model is simple: an external service tells you when something happens. Most developers start with a dangerous assumption:

“We’ll just listen for events and update our database.”
In production, that simplicity evaporates. Webhooks arrive out of order, late, or multiple times. Sometimes, they never arrive at all. If your system treats an event as an absolute command, one dropped or delayed packet can corrupt your entire state.
| The Webhook Myth | The Reliable Reality |
|---|---|
| A command to change state. | A 'hint' that something changed. |
| Process it immediately. | Queue it, then process. |
| Trust the payload. | Verify via API before acting. |
Webhooks are notifications, not commands. Design your handlers to be 'Verify-then-Act' systems.
To survive the chaos, decouple ingestion from processing. Acknowledge the webhook with a 200 OK immediately, push the event to a queue, and let a background worker fetch the current 'source of truth' via the API. This ensures you never act on stale or out-of-order data.
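A minimal sketch of that decoupling, assuming Flask for ingestion and an in-memory queue standing in for a durable one; `fetch_order_from_api` is a hypothetical call to the provider:

```python
import queue
import threading
from flask import Flask, request

app = Flask(__name__)
events = queue.Queue()  # stand-in for a durable queue (SQS, Kafka, ...)

@app.route("/webhooks/orders", methods=["POST"])
def receive_webhook():
    # Ingest only: accept, enqueue, and acknowledge fast.
    events.put(request.get_json(force=True))
    return "", 200  # processing happens elsewhere

def fetch_order_from_api(order_id):
    # Hypothetical call to the provider's API -- the source of truth.
    # The webhook said *something* changed; this says *what is true now*.
    ...

def worker():
    while True:
        event = events.get()
        order = fetch_order_from_api(event["order_id"])
        # Act on the fetched state, not the (possibly stale) payload.
        events.task_done()

threading.Thread(target=worker, daemon=True).start()
# app.run(port=8080)
```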
One of the most underrated reliability tools is the timeout. In an integrated system, waiting is contagious. If an external API hangs for 30 seconds, your application threads hang for 30 seconds. Do that enough times, and your entire service grinds to a halt.
| Strategy | System Impact |
|---|---|
| No/Long Timeouts | Blocked threads, backed-up queues, and cascading total failure. |
| Aggressive Timeouts | Contained failures, faster feedback, and graceful degradation. |
A slow failure is often worse than a fast one. It consumes resources and hides the root cause behind a wall of hanging connections.
But a timeout alone isn't enough. It must be paired with a Circuit Breaker to stop the bleeding and a clear fallback path to keep the user experience alive while the dependency recovers.
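Here’s a minimal sketch of that pairing, with illustrative thresholds and a hand-rolled breaker (in practice a library like `pybreaker` plays this role):

```python
import time
import requests

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after     # cool-down before a retry probe
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        # Open circuit: reject calls until the cool-down elapses.
        if self.opened_at and time.monotonic() - self.opened_at < self.reset_after:
            return False
        return True

    def record(self, ok: bool):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

breaker = CircuitBreaker()

def call_dependency(url: str):
    if not breaker.allow():
        return None  # fail fast: fall back instead of queueing behind a dead service
    try:
        # (connect timeout, read timeout): fail in seconds, not minutes.
        resp = requests.get(url, timeout=(3.05, 5))
        breaker.record(resp.ok)
        return resp
    except requests.RequestException:
        breaker.record(ok=False)
        return None
```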
Retries sound like a safety net, but in production, they are often a self-inflicted DDoS attack. When an external service struggles, thousands of your application instances retrying simultaneously will crush whatever remaining capacity that service has left. This is the Retry Storm.
| The Wrong Way | The Reliable Way |
|---|---|
| Immediate retries (tight loops). | Exponential backoff + Jitter. |
| Retry everything (4xx and 5xx). | Only retry transient errors (503/504). |
| Infinite retries. | Strict retry limits and Dead Letter Queues. |
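A minimal sketch of the reliable column, assuming `requests` for the call; the dead-letter handler is a hypothetical stub:

```python
import random
import time
import requests

RETRYABLE = {429, 502, 503, 504}  # transient by nature; 4xx logic errors are not

def call_with_backoff(url: str, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, timeout=5)
            if resp.status_code not in RETRYABLE:
                return resp           # success or a permanent error: stop retrying
        except requests.RequestException:
            pass                      # network glitch: treat as transient
        # Full jitter: sleep a random slice of an exponentially growing window,
        # so thousands of clients don't hammer the service back in lockstep.
        time.sleep(random.uniform(0, min(30, 2 ** attempt)))
    send_to_dead_letter_queue(url)    # hypothetical: park it for async repair
    return None

def send_to_dead_letter_queue(url: str):
    ...
```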
Sometimes the most reliable thing you can do is stop trying. Not every failure should be hidden from the user or the system.
“Fail and surface the issue.”
If an API returns a 400 (Bad Request) or a 401 (Unauthorized), no amount of retrying will fix the problem. Recognizing the difference between a transient network glitch and a permanent logic error is the difference between a resilient system and a broken one.
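In code, that distinction can be a single explicit function; the groupings below are illustrative, not exhaustive:

```python
def classify(status: int) -> str:
    if status in (408, 429, 502, 503, 504):
        return "transient"   # worth a bounded retry with backoff
    if 400 <= status < 500:
        return "permanent"   # our bug or our credentials: surface it, don't retry
    if status >= 500:
        return "unknown"     # possibly transient: retry cautiously, alert on repeat
    return "success"
```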
You don’t control external APIs, but you do control how well you understand their failures. Without deep visibility, debugging a failed payment or a dropped webhook becomes a game of finger-pointing between your team and the provider.
| The Blind Spot | The Observable Solution |
|---|---|
| Did the API fail or was it our network? | Structured logs with egress timing. |
| Which user request caused this error? | Correlation IDs across boundaries. |
| Is this a one-off or a provider outage? | Error categorization and dashboards. |
If an integration fails and you can’t answer What, Why, and How Often immediately, your reliability strategy is just guesswork.
Observability is not a luxury - it’s the only way to debug systems you don’t own. By investing in request tracing and clear error classification, you stop treating external APIs as black boxes and start treating them as measurable components of your infrastructure.
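A minimal sketch of what that looks like: one structured log line per egress call, carrying timing and a correlation ID minted at the edge (the field names are illustrative):

```python
import logging
import time
import uuid
from contextvars import ContextVar

correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")
log = logging.getLogger("egress")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def traced_call(provider: str, fn, *args):
    # One structured line per egress call: who, how long, and which user
    # request it belongs to -- enough to separate "their outage" from "our bug".
    start = time.monotonic()
    outcome = "error"
    try:
        result = fn(*args)
        outcome = "ok"
        return result
    finally:
        log.info("provider=%s outcome=%s duration_ms=%.0f correlation_id=%s",
                 provider, outcome, (time.monotonic() - start) * 1000,
                 correlation_id.get())

# At the edge of the system, mint one ID per inbound request and propagate it
# on every outgoing call (e.g., as an X-Correlation-ID header).
correlation_id.set(str(uuid.uuid4()))
```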
In a laboratory environment, systems should always be correct. But in production, reliability is often a measure of usability under pressure. When a dependency fails, your system shouldn't just roll over and die; it should shrink its footprint to protect the core experience.
| Scenario | Graceful Degradation Path |
|---|---|
| Price/Stock API is down | Show stale cached data with a 'Last updated' hint. |
| Analytics provider is lagging | Fire-and-forget or delay ingestion; don't block the UI. |
| Secondary feature failure | Hide the widget entirely while keeping the checkout flow alive. |
Users tolerate partial failure far better than a blank screen. Graceful degradation keeps trust intact by keeping the system usable.
This approach requires you to categorize your features into 'Critical' and 'Non-critical'. If your auth service is up, but your avatar-generation API is down, the user should still be able to log in. Reliability is the art of failing in a way that the user barely notices.
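A minimal sketch of the stale-cache fallback, with a dict standing in for Redis and a hypothetical `fetch_price_from_api` upstream call:

```python
import time

cache = {}  # price_id -> (value, fetched_at); stand-in for Redis or similar

def get_price(price_id: str):
    try:
        value = fetch_price_from_api(price_id)   # hypothetical upstream call
        cache[price_id] = (value, time.time())
        return {"price": value, "stale": False}
    except Exception:
        if price_id in cache:
            value, fetched_at = cache[price_id]
            # Degrade, don't die: stale data plus an honest "last updated" hint.
            return {"price": value, "stale": True,
                    "last_updated": time.ctime(fetched_at)}
        raise  # no cached copy either: only now is failing the right move

def fetch_price_from_api(price_id: str) -> float:
    ...
```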
One of the hardest truths in backend engineering is that technical excellence has a price tag. Reliability is not an accidental byproduct of good code; it is a deliberate product choice.
Engineering can build a system that never fails, but Product might not be able to afford the complexity or the bill.
| The Reliability Feature | The Trade-off |
|---|---|
| Aggressive Retries | Increased infrastructure costs and potential for 'retry storms'. |
| Redundancy / Multi-Region | Significant architectural complexity and maintenance overhead. |
| Complex Fallbacks | More design and QA effort to ensure UX stays consistent. |
Instead of assuming 100% uptime, teams must decide: What failures are acceptable? What latency is tolerable? What guarantees actually matter to our users?
Backend engineers are the ones who implement these safeguards, but the best outcomes happen when reliability is discussed explicitly at the product level. When everyone understands the cost of 'perfection', you can build a system that is robust enough for reality without being over-engineered for theory.
Building reliable systems on top of unreliable APIs isn’t about perfection.
It’s about:
- assuming failure instead of hoping it won’t happen
- making every operation safe to repeat
- treating webhooks as hints, not commands
- failing fast with timeouts, circuit breakers, and bounded retries
- observing the systems you don’t own
- degrading gracefully instead of going dark
Reliability emerges when systems are designed for reality - not for ideal conditions. Most production issues aren’t caused by bad code; they’re caused by optimistic assumptions.
If you are integrating with external APIs, assume they will fail. Once you move past the 'Happy Path' and start designing for the grey areas - the timeouts, the duplicates, and the partial outages - you stop building lucky systems and start building resilient ones.
Build something that survives the chaos.
Thanks for reading.
Written by Sanket Dofe
Full-stack engineer & system architect. I build scalable products and write about engineering clarity.