504/522 Incident Review: When the App Was Fast but Users Still Timed Out

Date: 2026-06-14
Primary symptom: pages opened slowly and some requests returned 504 or 522-like timeout errors
Primary cause: relay-to-origin TCP connection pressure, not slow application response

Background

The production path was not a single-hop request from users to the application server. Traffic passed through a relay layer before reaching the origin server:

users -> edge or relay layer -> origin web server -> application server

That detail mattered. A slow browser experience can come from several different layers:

  • The browser or outer proxy waiting on the relay.
  • The relay waiting on the origin.
  • The origin web server waiting on the application server.
  • The application or database generating a slow response.

The first checks showed that the application was responding quickly when tested locally from the origin. That changed the investigation from "why is the app slow?" to "where is the request waiting before it reaches the app?"

What Users Saw

The visible symptoms were simple:

  • Pages opened slowly.
  • Some requests returned 504.
  • With an outer proxy in front, the same failure mode could appear as a 522-style connection timeout.

From the outside, this looked like an application performance problem. The important evidence later showed that many failed requests never reached the application at all.

Finding 1: The Application Was Fast

Local checks on the origin showed fast responses from both the local web server and the application server. Common pages and lightweight endpoints did not show application-level blocking.

That distinction is important: public latency and application latency are not the same measurement. If the relay cannot establish a TCP connection to the origin, the application never gets a chance to be slow or fast.

The origin access logs also did not show a matching wave of application-side 504 errors, which supported the same conclusion: the failed requests were being lost before the application layer.

Finding 2: Relay Logs Had the Real Error

The relay web server showed many 504 responses. The decisive error pattern was:

upstream timed out while connecting to upstream

The phrase "while connecting to upstream" was the key. It means:

  • The relay had not yet sent the HTTP request to the origin application.
  • The relay was still trying to establish the TCP connection.
  • This was not a slow SQL query, slow template, or slow application route.

Many failed requests spent almost the full upstream timeout waiting for connection establishment.
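
The connect-phase versus read-phase distinction can be checked mechanically. The following is a minimal Python sketch, not the tooling used during the incident: it counts upstream timeouts by the phase text that Nginx appends to the error. The sample lines are hypothetical and shortened, and the function and variable names are made up for illustration.

```python
import re
from collections import Counter

# Phase text that Nginx appends to upstream timeout errors.
# "connect": the TCP handshake never completed.
# "read": the request was sent and the relay was waiting for the response.
PHASE_PATTERNS = {
    "connect": re.compile(r"upstream timed out .*while connecting to upstream"),
    "read": re.compile(
        r"upstream timed out .*while reading response header from upstream"
    ),
}


def classify_upstream_timeouts(lines):
    """Count upstream timeout errors by the phase in which they occurred."""
    counts = Counter()
    for line in lines:
        for phase, pattern in PHASE_PATTERNS.items():
            if pattern.search(line):
                counts[phase] += 1
                break
    return counts


# Hypothetical, shortened error-log lines (not taken from the incident).
SAMPLE = [
    "[error] 1#1: *10 upstream timed out (110: Connection timed out) "
    "while connecting to upstream, client: 203.0.113.5",
    "[error] 1#1: *11 upstream timed out (110: Connection timed out) "
    "while connecting to upstream, client: 203.0.113.6",
    "[error] 1#1: *12 upstream timed out (110: Connection timed out) "
    "while reading response header from upstream, client: 203.0.113.7",
    "[error] 1#1: *13 connect() failed (111: Connection refused) "
    "while connecting to upstream, client: 203.0.113.8",
]

if __name__ == "__main__":
    for phase, count in classify_upstream_timeouts(SAMPLE).items():
        print(phase, count)
```

A log dominated by the "connect" bucket points at the network path or the origin's listen queue. A log dominated by "read" points back at the application.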

Finding 3: TCP State Confirmed It

On the relay, many connections to the origin were stuck in SYN-SENT.

SYN-SENT means the relay has sent a TCP SYN packet and is waiting for the connection handshake to complete. When many entries pile up in that state, new relay-to-origin connections are not being established quickly enough.

On the origin, TCP counters showed listen queue pressure. The exact counters vary by kernel version, but on Linux the useful TcpExt signals in /proc/net/netstat are:

ListenOverflows
ListenDrops
TCPReqQFullDoCookies
SyncookiesSent

The incident pattern was:

relay creates many short-lived origin connections
-> origin accept/SYN queues come under pressure
-> relay waits during TCP connect
-> relay web server returns 504 after timeout
-> users see slow page loads or connection errors

Why It Was Confusing

The confusing part was that the application was fast while users still saw slow pages.

That can happen when the slow step is before the application:

user request reaches relay
relay tries to connect to origin
TCP connect is delayed or times out
application never receives that request
relay returns 504

So the correct explanation was not "the application is slow." It was "the relay-to-origin connection path is saturated by short-lived connection churn and origin listen queue pressure."
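
One way to see the connect step in isolation is to time only the TCP handshake, with no HTTP request at all. The Python sketch below assumes it is run from the relay against the origin's address; the function name is hypothetical and the 5-second default simply mirrors the connect timeout used later in this review.

```python
import socket
import time


def time_tcp_connect(host, port, timeout=5.0):
    """Return the seconds taken to complete the TCP handshake, or None on failure."""
    start = time.monotonic()
    try:
        # create_connection returns once the three-way handshake has finished.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers timeouts and refused connections alike.
        return None
```

Calling time_tcp_connect("ORIGIN_HOST", 80) repeatedly from the relay gives a rough picture. Values near the timeout, or None, while the local origin check stays fast, suggest the delay is in connection establishment rather than in the application.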

What Was Changed

The durable fix was made at the relay layer: enable upstream keepalive for relay-to-origin connections and shorten the upstream connection timeout.

Generic Nginx shape:

upstream origin_backend {
    server origin_backend_address;
    keepalive 512;
}

server {
    location / {
        proxy_pass http://origin_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_connect_timeout 5s;
    }
}

The important parts are:

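  • keepalive lets the relay reuse established TCP connections to the origin. The number is the maximum count of idle upstream connections that each worker process keeps cached, not a cap on total connections.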
  • proxy_http_version 1.1 and an empty Connection header allow upstream keepalive reuse.
  • A shorter proxy_connect_timeout (the Nginx default is 60s) prevents failed connection attempts from occupying relay resources for too long.

The origin was also tuned to tolerate larger connection bursts:

net.core.somaxconn = 8192
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_syncookies = 1

The origin web server's shared listen socket backlog was also raised:

listen 80 default_server backlog=8192;
listen [::]:80 default_server backlog=8192;

That backlog setting belongs to the listen socket, not to an individual virtual host. For name-based virtual hosts on the same port, the TCP connection enters the shared listen queue before the web server reads the HTTP Host header. Configure the backlog once on the shared listener and verify the effective socket queue after reload. On Linux, the kernel silently caps the requested backlog at net.core.somaxconn, which is why the sysctl and the listen directive were raised together.

Why the Fix Worked

Before the change, the relay created many fresh TCP connections to the origin. Under load, that caused connection churn and increased pressure on the origin's listen and SYN queues.

Upstream keepalive changed the behavior. The relay could keep a pool of established connections open and reuse them for many requests.

That reduced:

  • TCP handshakes.
  • Short-lived connection churn.
  • Origin listen backlog pressure.
  • Relay requests stuck in SYN-SENT.
  • 504 responses caused by upstream connect timeout.
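
The effect of connection reuse on handshake count can be reproduced on one machine. The Python sketch below is an illustration only: a tiny local HTTP server stands in for the origin, a plain HTTP client stands in for the relay, and the server counts how many TCP connections it accepts. The class and function names are hypothetical.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    """Stand-in origin that counts accepted TCP connections."""

    protocol_version = "HTTP/1.1"  # allow persistent connections
    connections = 0
    lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted TCP connection, not once per request.
        with CountingHandler.lock:
            CountingHandler.connections += 1
        super().setup()

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def count_connections(requests, reuse):
    """Send `requests` GETs and return how many TCP connections the server saw."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        shared = http.client.HTTPConnection(host, port) if reuse else None
        for _ in range(requests):
            conn = shared if reuse else http.client.HTTPConnection(host, port)
            conn.request("GET", "/")
            conn.getresponse().read()
            if not reuse:
                conn.close()  # one handshake per request
        if shared is not None:
            shared.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections


if __name__ == "__main__":
    print("fresh connection per request:", count_connections(20, reuse=False))
    print("one reused connection:", count_connections(20, reuse=True))
```

With 20 requests, the first call should report 20 accepted connections and the second should report 1. In production the ratio depends on traffic shape and pool size, but the direction is the same: fewer handshakes reach the origin's listen queue.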

How to Verify

After making a similar change, verify each layer separately.

Check the origin locally:

curl -sS -o /dev/null --max-time 10 \
  -w 'status=%{http_code} connect=%{time_connect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
  http://127.0.0.1/

Check relay errors:

tail -200 /var/log/nginx/error.log | grep -E 'upstream timed out|connect\(\) failed|while connecting'

Check relay-to-origin TCP states:

ss -ant dst ORIGIN_HOST:80 | awk 'NR>1{state[$1]++} END{for(s in state) print s,state[s]}'

Check origin listen queue pressure:

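awk '
  # /proc/net/netstat holds a TcpExt header line followed by a TcpExt value line.
  /TcpExt:/ {
    if (!seen) {
      n=split($0,h)        # first match: column names
      seen=1
    } else {
      split($0,v)          # second match: column values
      for (i=2;i<=n;i++) {
        if (h[i] ~ /ListenOverflows|ListenDrops|TCPReqQFullDoCookies|SyncookiesSent/) {
          printf "%s=%s ", h[i], v[i]
        }
      }
      print ""
      exit
    }
  }
' /proc/net/netstat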

Check the effective listen backlog:

ss -lntp | awk 'NR==1 || /:80 /'
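
For listening sockets on Linux, ss reports the current accept queue length in Recv-Q and the effective backlog limit in Send-Q. The Python sketch below parses that output so the two values can be compared over time. The sample text is hypothetical and the function name is made up for illustration.

```python
def listen_queues(ss_output, port):
    """Return (local_address, accept_queue_len, backlog_limit) for listeners on `port`."""
    rows = []
    for line in ss_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        recv_q, send_q, local = int(fields[1]), int(fields[2]), fields[3]
        # The port is whatever follows the last colon, also for [::]:80.
        if local.rsplit(":", 1)[-1] == str(port):
            rows.append((local, recv_q, send_q))
    return rows


# Hypothetical `ss -lnt` output after the backlog change.
SAMPLE_SS = """\
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  0       8192    0.0.0.0:80          0.0.0.0:*
LISTEN  0       8192    [::]:80             [::]:*
LISTEN  0       128     0.0.0.0:22          0.0.0.0:*
"""

if __name__ == "__main__":
    for local, queued, limit in listen_queues(SAMPLE_SS, 80):
        print(f"{local}: accept queue {queued} of {limit}")
```

If Send-Q still shows the old value after a reload, the new backlog did not take effect. A Recv-Q that keeps approaching Send-Q suggests the web server is not accepting connections fast enough.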

After the fix, the expected direction is:

  • Relay 504 counts drop.
  • Relay-to-origin SYN-SENT entries disappear or stay low.
  • Origin listen overflow/drop counters stop increasing.
  • Relay-to-origin connection times return to normal.
  • Local origin response time remains fast.
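
The TcpExt counters are cumulative since boot, so a large absolute value says little on its own. What matters is whether the value grows between two samples. The Python sketch below computes those deltas from two snapshots of /proc/net/netstat text. The snapshot contents are hypothetical and shortened to the four counters of interest, and the function names are made up for illustration.

```python
WATCHED = ("ListenOverflows", "ListenDrops", "TCPReqQFullDoCookies", "SyncookiesSent")


def parse_tcpext(text):
    """Parse the TcpExt header/value line pair from /proc/net/netstat text."""
    rows = [line.split() for line in text.splitlines() if line.startswith("TcpExt:")]
    if len(rows) < 2:
        raise ValueError("expected a TcpExt header line and a TcpExt value line")
    names, values = rows[0][1:], rows[1][1:]  # drop the leading "TcpExt:" label
    return {name: int(value) for name, value in zip(names, values)}


def counter_deltas(before, after, watched=WATCHED):
    """Return how much each watched counter grew between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in watched
        if name in before and name in after
    }


# Hypothetical snapshots taken some time apart (real files have many more columns).
BEFORE = (
    "TcpExt: SyncookiesSent ListenOverflows ListenDrops TCPReqQFullDoCookies\n"
    "TcpExt: 120 4500 4600 120\n"
)
AFTER = (
    "TcpExt: SyncookiesSent ListenOverflows ListenDrops TCPReqQFullDoCookies\n"
    "TcpExt: 120 4530 4630 120\n"
)

if __name__ == "__main__":
    for name, delta in counter_deltas(parse_tcpext(BEFORE), parse_tcpext(AFTER)).items():
        print(f"{name}: +{delta}")
```

In this made-up pair, ListenOverflows and ListenDrops each grew by 30, which would mean the listen queue was still overflowing during the sampling window. After a successful fix, all four deltas should stay at or near zero under comparable load.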

A few 504 log entries can still appear shortly after the reload, because requests that were already waiting on a connection attempt keep running until their timeout expires.

Lessons

The main lesson is to test each hop separately.

For a relay-to-origin topology, a practical investigation order is:

  1. Test the origin locally.
  2. Check origin access and error logs.
  3. Check relay access and error logs.
  4. Check relay-to-origin TCP state.
  5. Check origin listen backlog counters.
  6. Only then decide whether the bottleneck is the application, web server, relay, or TCP path.

In this incident, relay logs and TCP state were the decisive evidence. The application was fast, but users still saw slow pages and timeouts because the relay could not reliably establish origin connections under the pressure of many short-lived connections.
