
Why DNS Failover Doesn't Work (And What to Do Instead)

February 2025

Every infrastructure engineer learns the theory: point your application at a DNS hostname, set a low TTL, and when you need to fail over, update the DNS record. Traffic shifts to the new database within minutes.

In practice, it rarely works that cleanly. Connection pools hold onto resolved IPs. Load balancers cache DNS responses. Applications have their own resolver behavior. When you actually need to fail over, you discover that "update the DNS" means touching 24 servers' hosts files at 3 AM.

This post explains why DNS-based failover fails and what actually works.

The Theory vs. Reality

The DNS failover pattern assumes:

  1. Applications resolve DNS on every connection
  2. DNS TTLs are respected by all components
  3. TCP connections are re-established when DNS changes

None of these assumptions hold in production systems.

Connection Pooling Breaks Assumption 1

Modern applications maintain connection pools to databases. A pool of 10 connections to db.company.com resolves the DNS once when the pool is created. Those connections persist for hours, days, or until the application restarts.

When you update DNS, the pool still has connections to the old IP. New connections will go to the new IP, but with connection reuse, new connections are rare. The application effectively ignores the DNS change.
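
To make this concrete, here is a minimal Python sketch of what a pool effectively does (not any real driver's implementation): resolve the hostname once, then hold sockets to whatever IP came back.

import socket

# Illustrative "pool": the hostname is resolved exactly once, at creation time.
HOST, PORT, POOL_SIZE = "db.company.com", 5432, 10

ip = socket.gethostbyname(HOST)          # DNS lookup happens here, and only here
pool = [socket.create_connection((ip, PORT)) for _ in range(POOL_SIZE)]

# Every query from now on reuses a socket from `pool`. Updating the DNS
# record changes nothing: these sockets already point at the old IP and
# stay open until they error out or the application restarts.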

Resolver Caching Breaks Assumption 2

Even with a 60-second TTL, multiple layers cache DNS responses:

  • Operating system resolver cache
  • Application-level resolver cache
  • Load balancer DNS cache
  • Network appliance DNS cache
  • Intermediate DNS servers

Each layer may have its own TTL behavior. Some respect the authoritative TTL. Some impose minimums (commonly 30 seconds or 5 minutes). Some cache until explicitly cleared.

A 60-second TTL does not mean every client sees the change within 60 seconds. Under ideal conditions it means cached entries start expiring within 60 seconds; every layer that ignores or extends the TTL adds its own delay on top.
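
If you want to see what a resolver is actually handing back, a short sketch like this (assuming the third-party dnspython package is installed) prints the answer and the TTL the local resolver reports, which is often not the authoritative 60 seconds:

import dns.resolver  # third-party package: dnspython

# Query the local resolver and print the TTL it reports. A caching layer
# returns its remaining cache time, which can differ from the 60-second
# TTL configured on the authoritative zone.
answer = dns.resolver.resolve("db.company.com", "A")
for record in answer:
    print(record.address, "TTL remaining:", answer.rrset.ttl)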

TCP Keepalive Breaks Assumption 3

Database connections often use TCP keepalive to maintain idle connections. A connection can stay open for hours with periodic keepalive packets. The connection does not care that DNS has changed—it has a TCP socket to an IP address.

For the connection to re-resolve DNS, it must first close. But healthy connections do not close. Your failover depends on connections failing, but if the old database is still responding to TCP packets (even if the service is degraded), connections persist.
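
As an illustration, this is roughly the kind of keepalive configuration a driver applies to a socket. The option names are Linux-specific and the values are hypothetical; the point is that a healthy idle connection stays open indefinitely and never re-resolves DNS on its own.

import socket

# Sketch of driver-style keepalive settings on a database connection.
sock = socket.create_connection(("10.0.1.100", 5432))
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Linux-specific tuning (option names vary by platform):
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)    # first probe after 60s idle
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)   # then every 10s
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 6)      # give up after 6 failed probes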

What Actually Happens in a Failover

You have 24 application servers connected to db.company.com, which resolves to your primary database at 10.0.1.100. You need to fail over to the replica at 10.0.1.200.

T+0: You update DNS. db.company.com now resolves to 10.0.1.200.

T+1 minute: DNS propagation begins. Some resolvers have the new IP.

T+5 minutes: Most resolvers have the new IP. But applications still have connection pools pointing at 10.0.1.100.

T+30 minutes: Connections slowly age out. New connections go to 10.0.1.200. Old connections still hitting 10.0.1.100.

T+2 hours: Still seeing traffic to 10.0.1.100. Connection pool minimum sizes keep old connections alive.

Eventually someone gives up and edits hosts files on all 24 servers. Failover complete—two and a half hours later.

Solutions That Actually Work

Application-Level Failover

The cleanest solution is to make the application aware of failover. Instead of relying on DNS, the application explicitly knows about primary and replica endpoints.

Connection string approach:

Server=primary.db.company.com;Failover Partner=replica.db.company.com;MultiSubnetFailover=True

Many database drivers support failover partners, multi-subnet failover, or similar features. The driver maintains awareness of both endpoints and switches when the primary becomes unavailable.
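
For PostgreSQL, libpq-based drivers support the same idea through multi-host connection parameters. A hedged sketch with psycopg2, reusing the hostnames from above (the database name "app" is a placeholder):

import psycopg2  # any libpq-based driver exposes the same parameters

# libpq tries the hosts in order; target_session_attrs=read-write skips
# hosts that are not currently accepting writes (requires libpq 10+).
conn = psycopg2.connect(
    host="primary.db.company.com,replica.db.company.com",
    port="5432,5432",
    dbname="app",                          # placeholder database name
    target_session_attrs="read-write",
)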

Service discovery:
Applications query a service discovery system (Consul, etcd, or your cloud provider's service discovery) for the current database endpoint. When the endpoint changes, the application receives the update and reconnects.
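
A rough sketch of the Consul variant: the application asks the local agent for healthy instances of the service and reconnects when the answer changes. The service name "postgres-primary" is a placeholder, and a Consul client library would work just as well as raw HTTP.

import requests

# Ask the local Consul agent for healthy instances of the service that
# currently represents the writable database.
resp = requests.get(
    "http://127.0.0.1:8500/v1/health/service/postgres-primary",
    params={"passing": "true"},
    timeout=2,
)
resp.raise_for_status()
instance = resp.json()[0]["Service"]
print("current database endpoint:", instance["Address"], instance["Port"])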

Database Proxy Layer

Put a proxy between applications and databases. The proxy maintains a stable endpoint while handling failover internally.

Options:

  • ProxySQL (MySQL)
  • PgBouncer with failover scripts (PostgreSQL)
  • HAProxy with health checks
  • Cloud-managed proxies (AWS RDS Proxy and similar offerings)

The applications connect to the proxy's stable IP. When failover occurs, the proxy redirects connections. Applications do not need to change anything.

Trade-offs:

  • Additional latency (typically 1-3ms)
  • New failure mode (proxy itself can fail)
  • Operational overhead to manage proxy infrastructure

For many organizations, these trade-offs are worth the failover simplicity.

Cloud-Native Load Balancers

Cloud providers offer TCP load balancers with health checks:

  • AWS Network Load Balancer
  • Azure Load Balancer
  • GCP Network Load Balancer

Configure the load balancer with backend targets for both database instances. Health checks detect when the primary fails, and traffic shifts to the replica automatically.

Important: This works for stateless or read traffic. For databases with replication, you need additional logic to handle promotion and prevent split-brain scenarios.
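
One common way to add that logic is a small health endpoint on each database node that reports healthy only when that node is the writable primary, so the load balancer's health check follows a promotion automatically. A hedged sketch for PostgreSQL; the port (8080) and local connection settings are assumptions:

from http.server import BaseHTTPRequestHandler, HTTPServer

import psycopg2  # third-party PostgreSQL driver

class PrimaryCheck(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            conn = psycopg2.connect("host=127.0.0.1 dbname=postgres connect_timeout=2")
            try:
                with conn.cursor() as cur:
                    cur.execute("SELECT pg_is_in_recovery()")
                    in_recovery = cur.fetchone()[0]
            finally:
                conn.close()
            status = 503 if in_recovery else 200   # replicas report unhealthy on purpose
        except Exception:
            status = 503                           # unreachable or broken node: unhealthy
        self.send_response(status)
        self.end_headers()

HTTPServer(("0.0.0.0", 8080), PrimaryCheck).serve_forever()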

Orchestrated Failover

For database clusters with proper orchestration, the cluster itself handles failover:

  • SQL Server Always On Availability Groups
  • PostgreSQL with Patroni
  • MySQL with Orchestrator
  • Cloud-managed failover (RDS Multi-AZ, Azure SQL Business Critical)

The cluster maintains a listener or VIP that automatically moves to the new primary. Applications connect to the listener, and failover is handled at the database layer.

This is the gold standard for production databases. The additional complexity of cluster management is repaid with reliable, automated failover.
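
As one concrete example, Patroni runs a REST API on each node (port 8008 by default) whose role endpoints answer 200 or 503 depending on whether the node currently holds that role; this is what proxies and load balancers usually point their health checks at. A rough sketch, with the caveat that the exact endpoint names (/primary, /leader, /master) vary across Patroni versions:

import requests

# Patroni's role endpoints return 200 only on the node that holds the role.
for node in ("10.0.1.100", "10.0.1.200"):
    try:
        code = requests.get(f"http://{node}:8008/primary", timeout=2).status_code
    except requests.RequestException:
        code = None
    print(node, "is the primary" if code == 200 else "is not the primary")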

The Connection Pool Problem

Even with proper failover infrastructure, connection pools need special handling.

Configure Pool Lifetime

Set connection lifetime limits so connections are recycled regularly:

// .NET
"Connection Lifetime=300;"  // Connections recycled after 5 minutes

// Java
hikariConfig.setMaxLifetime(300000);  // 5 minutes

This ensures that even idle connections eventually close and re-resolve, picking up failover changes.

Configure Validation

Configure connection validation so the pool detects dead connections:

// HikariCP
hikariConfig.setConnectionTestQuery("SELECT 1");  // only needed for drivers without JDBC4 isValid()
hikariConfig.setValidationTimeout(3000);          // give up on a validation check after 3 seconds

When the primary fails, validation queries fail, and the pool discards those connections. New connections go to the failover target.

Configure Timeouts

Aggressive timeouts on connection acquisition force quick failure and reconnection:

# SQLAlchemy
from sqlalchemy import create_engine

engine = create_engine(
    DATABASE_URL,
    pool_pre_ping=True,                    # validate a connection before handing it out
    pool_recycle=300,                      # recycle connections after 5 minutes
    connect_args={"connect_timeout": 5},   # fail fast if the host is unreachable
)

Handle Connection Errors Gracefully

Applications should expect connection failures and handle them with retry logic:

import time

for attempt in range(3):
    try:
        result = execute_query(sql)
        break
    except ConnectionError:
        if attempt == 2:
            raise        # out of retries; let the error surface
        time.sleep(1)    # brief pause before retrying

What Should You Use?

For cloud-managed databases:
Use the cloud provider's failover features. RDS Multi-AZ, Azure SQL failover groups, and Cloud SQL HA all handle failover automatically. Your job is to configure your application's connection pool to recover quickly.

For self-managed databases with HA requirements:
Use a database cluster with proper orchestration (Patroni, Orchestrator, Always On) or place a proxy layer in front of your databases.

For legacy systems where you cannot change the application:
A TCP load balancer with health checks is your best option. Configure health checks to detect database failures, and let the load balancer handle traffic routing.

For development and non-critical systems:
DNS-based failover might be acceptable if you can tolerate longer failover times. But do not rely on it for anything that requires fast recovery.

Conclusion

DNS-based failover is attractive because it requires no application changes. But it fails in practice because connection pools, resolver caching, and TCP behavior all conspire to keep connections pointed at the old IP.

Reliable failover requires either making the application aware of multiple endpoints, using a proxy layer to abstract the failover, or using database clustering with built-in listener/VIP support.

The next time someone says "we will just update DNS," ask them how they plan to handle connection pools. The answer to that question determines whether the failover will work.
