Date: 2026-06-14
Primary symptom: pages opened slowly and some requests returned 504 or 522-like timeout errors
Primary cause: relay-to-origin TCP connection pressure, not slow application response
The production path was not a single-hop request from users to the application server. Traffic passed through a relay layer before reaching the origin server:
users -> edge or relay layer -> origin web server -> application server
That detail mattered. A slow browser experience can originate at any hop on that path: the edge or relay layer, the origin web server, or the application server.
The first checks showed that the application was responding quickly when tested locally from the origin. That changed the investigation from "why is the app slow?" to "where is the request waiting before it reaches the app?"
The visible symptoms were simple: pages opened slowly, and some requests returned 504 or 522-like timeout errors.
From the outside, this looked like an application performance problem. The important evidence later showed that many failed requests never reached the application at all.
Local checks on the origin showed fast responses from both the local web server and the application server. Common pages and lightweight endpoints did not show application-level blocking.
That distinction is important: public latency and application latency are not the same measurement. If the relay cannot establish a TCP connection to the origin, the application never gets a chance to be slow or fast.
The origin access logs also did not show a matching wave of application-side 504 errors, which supported the same conclusion: the failed requests were being lost before the application layer.
The relay web server showed many 504 responses. The decisive error pattern was:
upstream timed out while connecting to upstream
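The phrase "while connecting to upstream" was the key. It means the relay timed out while establishing the TCP connection to the origin, before any request had been sent to it.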
Many failed requests spent almost the full upstream timeout waiting for connection establishment.
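To see how much of the timeout volume belongs to the connect phase, the relay error log can be grouped by the phase named in each message. The following is a minimal sketch, not the tooling used during the incident: the function name `count_timeout_phases` and the sample log lines (addresses, server names, timestamps) are hypothetical and only illustrate the shape of the input.

```shell
#!/bin/sh
# Count nginx "upstream timed out" errors by the phase named in the message.
# Reads error-log lines on stdin and prints "<phase> <count>", sorted by phase.
count_timeout_phases() {
    awk '
        /upstream timed out/ {
            if ($0 ~ /while connecting to upstream/) {
                phase = "connect"          # TCP connect to the origin timed out
            } else if ($0 ~ /while reading response header from upstream/) {
                phase = "read_header"      # connected, but the response was slow
            } else {
                phase = "other"
            }
            count[phase]++
        }
        END {
            for (p in count) print p, count[p]
        }
    ' | sort
}

# Hypothetical sample input, for illustration only.
count_timeout_phases <<'EOF'
2026/06/14 10:00:01 [error] 101#101: *1 upstream timed out (110: Connection timed out) while connecting to upstream, client: 203.0.113.5, server: example.test
2026/06/14 10:00:02 [error] 101#101: *2 upstream timed out (110: Connection timed out) while connecting to upstream, client: 203.0.113.6, server: example.test
2026/06/14 10:00:03 [error] 101#101: *3 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 203.0.113.7, server: example.test
EOF
```

Against a real log, the same function would be fed with something like `count_timeout_phases < /var/log/nginx/error.log`. A result dominated by `connect` points at the relay-to-origin connection path; one dominated by `read_header` points back at the origin or the application.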
On the relay, many connections to the origin were stuck in SYN-SENT.
SYN-SENT means the relay has sent a TCP SYN packet and is waiting for the connection handshake to complete. When many entries pile up in that state, new relay-to-origin connections are not being established quickly enough.
On the origin, TCP counters showed listen queue pressure. The exact counters vary by system, but the useful signals are:
ListenOverflows
ListenDrops
TCPReqQFullDoCookies
SyncookiesSent
The incident pattern was:
relay creates many short-lived origin connections
-> origin accept/SYN queues come under pressure
-> relay waits during TCP connect
-> relay web server returns 504 after timeout
-> users see slow page loads or connection errors
The confusing part was that the application was fast while users still saw slow pages.
That can happen when the slow step is before the application:
user request reaches relay
relay tries to connect to origin
TCP connect is delayed or times out
application never receives that request
relay returns 504
So the correct explanation was not "the application is slow." It was "the relay-to-origin connection path is saturated by short connection churn and origin listen queue pressure."
The durable fix was made at the relay layer: enable upstream keepalive for relay-to-origin connections and shorten the upstream connection timeout.
Generic Nginx shape:
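upstream origin_backend {
    server origin_backend_address;
    # Keep up to 512 idle connections to the origin open per worker process.
    keepalive 512;
}
server {
    location / {
        proxy_pass http://origin_backend;
        # Upstream keepalive needs HTTP/1.1 and a cleared Connection header.
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        # Fail fast when the origin does not accept the connection.
        proxy_connect_timeout 5s;
    }
}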
The important parts are:
keepalive lets the relay reuse established TCP connections to the origin.
proxy_http_version 1.1 and an empty Connection header allow upstream keepalive reuse.
proxy_connect_timeout prevents failed connection attempts from occupying relay resources for too long.
The origin was also tuned to tolerate larger connection bursts:
net.core.somaxconn = 8192
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_syncookies = 1
The origin web server's shared listen socket backlog was also raised:
listen 80 default_server backlog=8192;
listen [::]:80 default_server backlog=8192;
That backlog setting belongs to the listen socket, not to an individual virtual host. For name-based virtual hosts on the same port, the TCP connection enters the shared listen queue before the web server reads the HTTP Host header. Configure the backlog once on the shared listener and verify the effective socket queue after reload.
Before the change, the relay created many fresh TCP connections to the origin. Under load, that caused connection churn and increased pressure on the origin's listen and SYN queues.
Upstream keepalive changed the behavior. The relay could keep a pool of established connections open and reuse them for many requests.
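That reduced connection churn, the pressure on the origin's listen and SYN queues, and the number of relay connections stuck in SYN-SENT.
After making a similar change, verify each layer separately.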
Check the origin locally:
curl -sS -o /dev/null --max-time 10 \
-w 'status=%{http_code} connect=%{time_connect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
http://127.0.0.1/
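Because the argument rests on separating connection time from application time, it can help to split the curl timings explicitly. This is a small sketch under stated assumptions: the helper name `split_curl_timing` is hypothetical, the input line uses made-up timings, and the subtraction is only a fair approximation of server wait time for plain HTTP, where no TLS handshake sits between connect and the request.

```shell
#!/bin/sh
# Read one line in the "key=value ..." format printed by the curl -w template
# above and report connect time separately from the time spent waiting for the
# first response byte once the connection was established.
split_curl_timing() {
    awk '{
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")
            t[kv[1]] = kv[2]
        }
        printf "connect=%.3f wait_after_connect=%.3f\n", t["connect"], t["ttfb"] - t["connect"]
    }'
}

# Hypothetical timings, for illustration only.
echo 'status=200 connect=0.001 ttfb=0.021 total=0.022' | split_curl_timing
```

A small `connect` with a large `wait_after_connect` suggests the server behind that socket is slow. A large `connect` suggests the request is waiting before it reaches any server code, which is the pattern this incident showed on the relay-to-origin hop.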
Check relay errors:
tail -200 /var/log/nginx/error.log | grep -E 'upstream timed out|connect\(\) failed|while connecting'
Check relay-to-origin TCP states:
ss -ant dst ORIGIN_HOST:80 | awk 'NR>1{state[$1]++} END{for(s in state) print s,state[s]}'
Check origin listen queue pressure:
awk '
/TcpExt:/ {
    if (!seen) {
        n=split($0,h)
        seen=1
    } else {
        split($0,v)
        for (i=2;i<=n;i++) {
            if (h[i] ~ /ListenOverflows|ListenDrops|TCPReqQFullDoCookies|SyncookiesSent/) {
                printf "%s=%s ", h[i], v[i]
            }
        }
        print ""
        exit
    }
}
' /proc/net/netstat
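These counters are cumulative since boot, so a single reading says little about current pressure; the increase between two readings is what matters. Below is a minimal sketch of that comparison. The helper name `counter_delta` and the numbers in the example are hypothetical; the input format is assumed to be the `Name=value` line printed by the awk command above.

```shell
#!/bin/sh
# Print how much each counter grew between two snapshot lines of the form
# "Name=value Name=value ...". $1 is the earlier snapshot, $2 the later one.
counter_delta() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                split($i, kv, "=")
                before[kv[1]] = kv[2]
            }
        }
        NR == 2 {
            for (i = 1; i <= NF; i++) {
                split($i, kv, "=")
                printf "%s=+%d\n", kv[1], kv[2] - before[kv[1]]
            }
        }
    '
}

# Hypothetical snapshots taken some seconds apart, for illustration only.
counter_delta "ListenOverflows=100 ListenDrops=120" \
              "ListenOverflows=100 ListenDrops=135"
```

In practice the two arguments would be captured by running the snapshot command twice with a pause in between. Counters that keep growing during normal traffic indicate the listen queue is still overflowing; counters that stay flat after the fix indicate the pressure is gone.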
Check the effective listen backlog:
ss -lntp | awk 'NR==1 || /:80 /'
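Reading that output takes one piece of background: on Linux, for sockets in the LISTEN state, ss reports the current accept-queue depth in Recv-Q and the configured backlog limit in Send-Q. The kernel also caps the requested backlog at net.core.somaxconn, so the value shown may be lower than the one in the web server config. The sketch below extracts both numbers for one port; the function name `listen_queue_usage` and the sample rows are hypothetical.

```shell
#!/bin/sh
# Read `ss -lnt` output on stdin and, for listeners on the given port, print
# the current accept-queue depth (Recv-Q) and the backlog limit (Send-Q).
listen_queue_usage() {
    port="$1"
    awk -v port="$port" '
        $1 == "LISTEN" && $4 ~ (":" port "$") {
            printf "%s queued=%d backlog=%d\n", $4, $2, $3
        }
    '
}

# Hypothetical ss -lnt output, for illustration only.
listen_queue_usage 80 <<'EOF'
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  0       8192    0.0.0.0:80          0.0.0.0:*
LISTEN  0       8192    [::]:80             [::]:*
LISTEN  0       128     0.0.0.0:22          0.0.0.0:*
EOF
```

On a live origin this would be run as `ss -lnt | listen_queue_usage 80`. A backlog that still shows the old value after a reload suggests either the listen directive was not applied to the shared socket or net.core.somaxconn is lower than the requested backlog.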
After the fix, the expected direction is:
SYN-SENT entries disappear or stay low.
Old 504 log entries can still appear briefly after a reload because already-hung requests may continue until their timeout completes.
The main lesson is to test each hop separately.
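For a relay-to-origin topology, a practical investigation order follows the checks above: the origin locally, then the relay error log, then relay-to-origin TCP states, then the origin's listen queue counters and effective backlog.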
In this incident, relay logs and TCP state were the decisive evidence. The application was fast, but users were slow because the relay could not reliably establish origin connections under short-connection pressure.