Partial AdGuard DNS outage on November 25, 2025

Between approximately 15:30 and 18:00 UTC on November 25, AdGuard DNS experienced a partial outage affecting servers in multiple locations, primarily in Europe. During this time, some AdGuard DNS users observed slow or failed DNS resolution, including timeouts and SERVFAIL responses.

We apologize for the incident and would like to provide an explanation of what happened, how we resolved the issue, and the changes we’re implementing to prevent this from happening again.

TLDR

  • Impact. A significant portion of traffic in several European locations, primarily Amsterdam, Frankfurt, and London, experienced intermittent DNS failures (timeouts and SERVFAIL) for up to 2.5 hours.
  • Root cause. A bug in the user data cache logic caused a large number of full synchronizations with the business‑logic service, which, combined with suboptimal memory and connection‑limit settings, led to resource exhaustion on several DNS clusters.
  • Mitigation. The buggy version was rolled back, corrupted caches were repaired, traffic was temporarily rerouted, and connection/memory limits were tuned; additional safeguards and tests are being implemented to avoid similar failures.

What happened

AdGuard DNS resolvers require users’ filtering settings to be available locally before they can apply filters. At startup, the resolver loads a cached copy of all user data from a local file and then periodically queries the business‑logic service only for incremental changes since that snapshot. If the local cache is stale or corrupted, the resolver has to request a full dataset from the business‑logic service, which is slow and generates significantly higher load on both sides.
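
To make this concrete, here is a minimal sketch of that startup decision. It illustrates the logic described above and is not the actual AdGuard DNS code; the `Snapshot` type and `syncPlan` helper are hypothetical.

```go
// A simplified, hypothetical sketch of the startup decision described above,
// not the actual AdGuard DNS implementation.
package main

import (
	"fmt"
	"time"
)

// Snapshot is the locally cached copy of all user filtering settings plus the
// time of the last successful synchronization with the business-logic service.
type Snapshot struct {
	Profiles map[string]string // user ID -> serialized filtering settings
	SyncTime time.Time
}

// syncPlan decides how a resolver synchronizes at startup: a cheap incremental
// sync when the local snapshot is usable, a slow and expensive full sync
// otherwise.
func syncPlan(snap *Snapshot) string {
	if snap == nil || len(snap.Profiles) == 0 {
		// Missing, corrupted, or effectively empty cache: the resolver must
		// download the complete dataset from the business-logic service.
		return "full sync"
	}
	return "incremental sync since " + snap.SyncTime.Format(time.RFC3339)
}

func main() {
	healthy := &Snapshot{
		Profiles: map[string]string{"user-1": "settings"},
		SyncTime: time.Now().Add(-10 * time.Minute),
	}
	fmt.Println(syncPlan(healthy)) // incremental sync since ...
	fmt.Println(syncPlan(nil))     // full sync
}
```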

On November 25, a new version of AdGuard DNS was deployed with the goal of improving this mechanism. Due to a bug, instead of saving the complete user dataset into the local cache file, the service saved only the recently changed subset, which was often very small or even empty. As a result, many resolvers started up with either no user data or with only a tiny fraction of it, triggering repeated full synchronizations.
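
The bug belongs to a simple class of serialization mistakes: persisting only the last delta instead of the merged dataset. The contrived sketch below, with hypothetical `saveSnapshotBuggy` and `saveSnapshotFixed` helpers rather than the real code, shows the difference:

```go
// A contrived sketch of the class of bug described above, not the actual
// AdGuard DNS code.  The buggy variant writes only the incremental changes to
// the cache, so the next startup sees a nearly empty snapshot and has to fall
// back to a full synchronization.
package main

import "fmt"

type profiles = map[string]string // user ID -> serialized settings

// saveSnapshotBuggy persists only the delta received during the last
// incremental sync, which is often tiny or even empty.
func saveSnapshotBuggy(_ profiles, changed profiles) profiles {
	return changed
}

// saveSnapshotFixed merges the delta into the full in-memory dataset and
// persists the complete result, so the next startup can sync incrementally.
func saveSnapshotFixed(full profiles, changed profiles) profiles {
	merged := make(profiles, len(full))
	for id, p := range full {
		merged[id] = p
	}
	for id, p := range changed {
		merged[id] = p
	}
	return merged
}

func main() {
	full := profiles{"user-1": "a", "user-2": "b", "user-3": "c"}
	changed := profiles{"user-2": "b2"}

	fmt.Println(len(saveSnapshotBuggy(full, changed))) // 1: almost all user data is lost
	fmt.Println(len(saveSnapshotFixed(full, changed))) // 3: the complete dataset survives
}
```

With the buggy variant, the next startup sees a snapshot containing one profile instead of three and has to fall back to a full synchronization.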

The first deployment attempt failed partway through due to a network issue, so about half of the fleet was running the new buggy version while the rest were still on the old one. When the deployment was retried, a large number of resolvers simultaneously discovered that their local caches were effectively empty or invalid and started full‑syncing user data. This caused a spike of traffic to the business‑logic service and a surge in resource usage on the DNS resolvers themselves.

Amsterdam: hardware mismatch and overload

In Amsterdam, the impact of this spike was amplified by a hardware mismatch. The cluster consisted of two groups of machines: some with 32 GiB of RAM and others with 64 GiB, while the AdGuard DNS configuration (connection limits, cache sizes, memory settings) had been tuned for the 64‑GiB profile.

Under the sudden load from full synchronizations and a storm of retrying DNS clients, the 32‑GiB machines ran out of usable memory and hit internal limits much sooner. They degraded sharply and effectively stopped serving traffic, which caused most of the Amsterdam DNS load to shift to the 64‑GiB machines. These remaining nodes then also entered an overloaded state: connection limits were reached, queues grew, and DNS queries started timing out or failing with SERVFAIL, even though upstream DNS resolvers were still healthy and answering quickly.

While working on the rollback of the buggy version and on cache repair, traffic from Amsterdam was partially rerouted to other European locations. This helped stabilize Amsterdam but introduced heavy additional load on Frankfurt and London.

Frankfurt and London: connection limits, pipelining, and latency

In Frankfurt and London, the underlying hardware was uniform, but the misconfigured limits and high concurrency still led to severe degradation. Two effects were especially important:

  1. Connection limits and pipelining
    The resolver uses an internal limit on the number of active connections and supports request pipelining, allowing multiple concurrent queries per connection. With a pipeline depth of 5, each incoming connection could carry up to five in‑flight requests, each handled by its own goroutine and buffers. Under heavy load, this resulted in a large number of concurrent goroutines and live objects in memory, which increased heap size and put pressure on the Go garbage collector.

  2. High latency and SERVFAIL despite healthy upstreams
    As memory pressure grew, garbage collection became more frequent and more expensive, increasing the time each DNS query spent inside the resolver. This led to a sharp rise in request duration and in the number of SERVFAIL responses, because client‑side and internal timeouts were reached before responses could be processed, even though upstream DNS servers were still responding quickly and successfully.

In Frankfurt specifically, CPU usage never appeared fully saturated at the system level, but GC activity and internal contention were enough to significantly increase tail latency. During this period, the cluster hit its connection limit and stayed at it: existing connections were held for longer than usual, while new clients struggled to establish new connections. Clients responded with retries, increasing load even further and keeping the connection limit saturated.

When the request pipeline depth was reduced from 5 to 1 in Frankfurt, the system recovered very quickly. With pipeline=1, each connection carried at most one in‑flight DNS query at a time, dramatically reducing the number of concurrent goroutines and live objects in memory per connection. This in turn reduced heap growth and GC pressure, allowing the resolver to complete requests faster, close connections in a timely manner, and free up connection slots. As a result, the connection limit was no longer constantly saturated, request duration dropped, and the rate of SERVFAIL responses fell back to normal levels.
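
The sketch below shows how pipeline depth bounds per‑connection concurrency. It uses a hypothetical `serveConn` function with a channel‑based semaphore and is not the resolver's actual implementation, but the effect of moving from depth 5 to depth 1 is the same in spirit:

```go
// A simplified, hypothetical sketch of per-connection pipelining, not the
// actual AdGuard DNS resolver code.  The pipeline depth bounds how many
// queries from one connection are processed concurrently, and therefore how
// many goroutines and buffers one connection can keep alive at once.
package main

import (
	"fmt"
	"sync"
	"time"
)

// serveConn reads queries from a connection and processes up to pipelineDepth
// of them concurrently.  With pipelineDepth == 1 the queries are handled
// strictly one at a time.
func serveConn(queries <-chan string, pipelineDepth int, handle func(string)) {
	slots := make(chan struct{}, pipelineDepth) // semaphore: in-flight queries
	var wg sync.WaitGroup

	for q := range queries {
		slots <- struct{}{} // blocks when the pipeline is full
		wg.Add(1)
		go func(q string) {
			defer wg.Done()
			defer func() { <-slots }()
			handle(q) // each in-flight query is a live goroutine plus buffers
		}(q)
	}
	wg.Wait()
}

func main() {
	queries := make(chan string)
	go func() {
		for i := 0; i < 10; i++ {
			queries <- fmt.Sprintf("query-%d", i)
		}
		close(queries)
	}()

	// pipelineDepth = 1 mirrors the setting that stabilized Frankfurt: at
	// most one in-flight query per connection at any time.
	serveConn(queries, 1, func(q string) {
		time.Sleep(10 * time.Millisecond) // stand-in for upstream resolution
		fmt.Println("answered", q)
	})
}
```

Each in‑flight query holds a goroutine and its buffers alive, so lowering the depth directly shrinks the worst‑case per‑connection memory footprint.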

Key problems: Memory, GOMEMLIMIT, and 32 vs 64 GiB nodes

An important factor in how different nodes behaved was the interaction between available RAM, Go’s garbage collector, and the GOMEMLIMIT setting. GOMEMLIMIT is the Go runtime’s soft memory limit: it caps how much memory the runtime lets the managed heap occupy before the garbage collector starts running aggressively to stay under the limit. On high‑RAM nodes, if GOMEMLIMIT is set too high, the runtime allows the heap to grow much larger before GC becomes aggressive.

On the 64‑GiB machines, the effective memory limit was higher, so under heavy concurrency and high pipelining, the resolver process accumulated a large heap with many goroutines and buffers. Once this heap reached the configured limit, the GC had to work very hard to keep memory usage under control, resulting in prolonged periods of high GC CPU utilization and more frequent pauses. This is a known worst‑case scenario for GOMEMLIMIT on high‑memory applications: memory is not exhausted, but CPU time is increasingly spent in GC, leading to higher latency and reduced throughput.
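
As an illustration only (the 75% headroom figure and the helper below are hypothetical, not our production values), the Go soft memory limit can be derived from the node’s actual RAM instead of being hard‑coded for the largest hardware profile:

```go
// A minimal sketch, not AdGuard DNS's actual configuration code: it shows the
// idea of deriving the Go soft memory limit from the node's real RAM instead
// of shipping one value tuned for the largest hardware profile.
package main

import (
	"fmt"
	"runtime/debug"
)

// applyMemoryLimit caps the managed heap at a fraction of physical RAM,
// leaving headroom for OS caches and non-heap memory.  If the limit is set
// close to what the node can actually afford, the GC ends up running almost
// continuously once the heap approaches it -- the GC-thrashing behavior
// described above.
func applyMemoryLimit(totalRAMBytes int64) {
	limit := totalRAMBytes * 3 / 4 // illustrative 75% headroom figure
	debug.SetMemoryLimit(limit)
	fmt.Printf("soft memory limit set to %d bytes\n", limit)
}

func main() {
	const gib = 1 << 30
	applyMemoryLimit(32 * gib) // a 32-GiB node gets a smaller limit than a 64-GiB one
}
```

The same limit can also be supplied through the GOMEMLIMIT environment variable per hardware profile instead of calling debug.SetMemoryLimit from code.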

On the 32‑GiB machines, the same configuration and load caused the process to hit practical memory limits earlier. These nodes started failing sooner because their configuration (cache sizes, connection limits) had not been scaled down to match the smaller RAM footprint, while the 64‑GiB nodes eventually became bottlenecked by GC activity instead. In combination with high pipelining and heavy retry traffic, this created a feedback loop in which GC pressure, connection saturation, and client retries reinforced each other.

Incident timeline

  • 14:30 UTC: The first deployment of the new version begins. Due to a network issue, the deployment fails approximately an hour later, leaving about half of all servers running the new buggy version.

  • 15:30–16:00 UTC: The second deployment attempt starts. Alerts are triggered for increased DNS error rates and server load. The investigation quickly identifies the bug in the caching logic that causes resolvers to operate with empty or incomplete local user data and to perform full synchronizations.

  • 16:00–16:30 UTC: A decision is made to roll back the buggy version on all servers simultaneously. Under the combined load of full synchronizations and client retries, the Amsterdam cluster with mixed 32‑GiB/64‑GiB hardware becomes overloaded; 32‑GiB nodes effectively drop out first, pushing more traffic onto 64‑GiB nodes, which then also start failing. Traffic is partially rerouted from Amsterdam to other European locations, increasing load on Frankfurt and London.

  • 16:30–17:30 UTC: Configuration changes are applied to reduce the effective connection concurrency and pipeline depth in Frankfurt and London, and to adjust limits to the actual hardware capacity. These changes gradually bring clusters back to a stable state. In parallel, corrupted cache files and user data snapshots are repaired to avoid further full synchronizations.

  • 17:30–18:00 UTC: The rollback is completed, caches are restored to a consistent state, traffic distribution is normalized, and error rates return to baseline.

Why queries failed (SERVFAIL and high latency)

During the outage, users primarily saw DNS timeouts and SERVFAIL responses. It is important to stress that upstream DNS providers were operating normally and returning valid answers in a timely manner. The failures were caused by the resolver layer being unable to process and return responses before internal or client‑side timeouts expired.

Several factors contributed to this:

  1. Saturated connection limits and high pipelining depth caused many in‑flight DNS queries to be queued or delayed.
  2. Increased heap size and GC pressure led to longer response processing times and occasional pauses, especially on 64‑GiB nodes.
  3. Clients and recursive resolvers retried failed queries, further increasing load and keeping the system close to its limits.

Follow‑up steps

To prevent similar incidents in the future, the following actions will be taken:

  1. Code and testing improvements
    1. Add automated tests to verify that the full user dataset is correctly serialized into and restored from the local cache file, including detection of empty or truncated caches (a minimal example of such a test is sketched after this list).
    2. Introduce integration tests that simulate resolver startup with an empty or corrupted cache under realistic load, ensuring that the system behaves gracefully and does not overwhelm business‑logic services.
  2. Configuration and limit tuning
    1. Standardize server hardware profiles per location to avoid mixed clusters with significantly different RAM sizes, or explicitly maintain separate configurations per hardware class.
    2. Review and adjust GOMEMLIMIT, cache sizes, and connection limits for each standard node type so that both 32‑GiB and 64‑GiB nodes (or their future equivalents) operate within safe GC and memory envelopes.
    3. Reevaluate and lower pipelining depth where necessary to keep concurrency per connection at a level that does not produce excessive goroutine and heap growth under peak conditions.
    4. Reevaluate our approach to connection and goroutine limiting so that it keeps resource consumption within reasonable bounds while still providing decent throughput, even after connections are dropped and reestablished.
  3. Monitoring and alerting
    1. Extend monitoring to track not only CPU and overall memory usage, but also Go heap size, GC time, GC cycles per second, and per‑cluster request latency distributions (a small monitoring sketch follows after this list).
    2. Add dedicated alerts for sustained increases in SERVFAIL rate and request duration in the presence of healthy upstreams, to detect internal contention and GC‑related issues before they lead to a user‑visible outage.
  4. Operational procedures
    1. Refine deployment and rollback procedures for DNS resolvers to avoid simultaneous cache invalidation across a large fraction of the fleet.
    2. Document and rehearse playbooks for quickly adjusting pipeline depth, connection limits, and GOMEMLIMIT during live incidents, based on the behavior observed in Amsterdam and Frankfurt.
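
For item 1.1 above, a cache round‑trip test might look roughly like the following sketch; the `snapshot` type, the JSON encoding, and the helper names are hypothetical stand‑ins for the real cache format:

```go
// A hypothetical sketch of the kind of round-trip test mentioned in item 1.1,
// not an actual AdGuard DNS test.  It checks that whatever is saved to the
// cache file contains the full dataset, not just the last delta, and that an
// empty snapshot is detected.
package profilecache

import (
	"encoding/json"
	"os"
	"path/filepath"
	"testing"
)

type snapshot struct {
	Profiles map[string]string `json:"profiles"`
}

func saveSnapshot(path string, s snapshot) error {
	data, err := json.Marshal(s)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o600)
}

func loadSnapshot(path string) (snapshot, error) {
	var s snapshot
	data, err := os.ReadFile(path)
	if err != nil {
		return s, err
	}
	err = json.Unmarshal(data, &s)
	return s, err
}

func TestSnapshotRoundTrip(t *testing.T) {
	path := filepath.Join(t.TempDir(), "cache.json")
	want := snapshot{Profiles: map[string]string{"user-1": "a", "user-2": "b"}}

	if err := saveSnapshot(path, want); err != nil {
		t.Fatalf("saving snapshot: %v", err)
	}

	got, err := loadSnapshot(path)
	if err != nil {
		t.Fatalf("loading snapshot: %v", err)
	}

	// The restored snapshot must contain every profile, not just a delta.
	if len(got.Profiles) != len(want.Profiles) {
		t.Fatalf("restored %d profiles, want %d", len(got.Profiles), len(want.Profiles))
	}
	// An empty or truncated cache must be treated as invalid by the caller.
	if len(got.Profiles) == 0 {
		t.Fatal("snapshot is empty; resolver would fall back to a full sync")
	}
}
```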
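
For item 3.1, the relevant runtime signals can be sampled with the standard runtime package; the sketch below is illustrative and simply prints the values instead of exporting them to our metrics system:

```go
// A minimal sketch of the runtime metrics mentioned in item 3.1, not AdGuard
// DNS's actual monitoring code.
package main

import (
	"fmt"
	"runtime"
	"time"
)

// reportGC periodically samples heap size and GC activity -- the signals that
// would have shown the GC pressure on the 64-GiB nodes before SERVFAILs did.
func reportGC(interval time.Duration) {
	var prevNumGC uint32
	for range time.Tick(interval) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)

		cycles := m.NumGC - prevNumGC
		prevNumGC = m.NumGC

		fmt.Printf(
			"heap=%d MiB  gc_cycles_last_interval=%d  gc_cpu_fraction=%.3f\n",
			m.HeapAlloc>>20, cycles, m.GCCPUFraction,
		)
	}
}

func main() {
	go reportGC(10 * time.Second)

	// The resolver would run here; the sketch just waits for a few samples.
	time.Sleep(35 * time.Second)
}
```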