Re: 6.x gives frequent connection to peer failed - spurious?

Alex Rousskov <rousskov@xxxxxxxxxxxxxxxxxxxxxxx> · Tue, 21 Nov 2023 10:55:13 -0500

On 2023-11-21 08:38, Stephen Borrill wrote:
On 15/11/2023 21:55, Alex Rousskov wrote:
On 2023-11-10 05:46, Stephen Borrill wrote:

With 6.x (currently 6.5) there are very frequent (every 10 seconds or 
so) messages like:
2023/11/10 10:25:43 kid1| ERROR: Connection to 127.0.0.1:8123 failed

why is this logged as a connection failure

The current error wording is too assuming and, in your case, evidently 
misleading. The phrase "Connection to X failed" should be changed to 
something more general like "Cannot contact cache_peer X" or "Cannot 
communicate with cache_peer X".

CachePeer::countFailure() patches welcome.

But the point is that it _can_ communicate with the peer, but the peer 
itself can't service the request. The peer returning 503 shouldn't be 
logged as a connection failure

My bad. I missed the fact that the described DNS error happens at a 
_peer_ Squid. Sorry.

Currently, Squid v6 treats most CONNECT-to-peer errors as a sign of a 
broken peer. In 2022, 4xx errors were excluded from that set[1]. At that 
time, we also proposed to make that decision configurable using a new 
cache_peer_fault directive[2], but the new directive was blocked as an 
"overkill"[3], so we hard-coded 4xx exclusion instead.

Going forward, you have several options, including these two:

1. Convince others that Squid should treat all 503 CONNECT errors from 
peers as it already treats all 4xx errors. Hard-code that new logic.

2. Convince others that cache_peer_fault or a similar directive is a 
good idea rather than an overkill. Resurrect its implementation[2].

[1] 
https://github.com/squid-cache/squid/commit/022dbabd89249f839d1861aa87c1ab9e1a008a47

[2] 
https://github.com/squid-cache/squid/commit/25431f18f2f5e796b8704c85fc51f93b6cc2a73d

[3] https://github.com/squid-cache/squid/pull/1166#issuecomment-1295806530

HTH,

Alex.

 > do I need to worry about it beyond it filing up the logs needlessly?

In short, "yes".

I cannot accurately assess your specific needs, but, in most 
environments, one should indeed worry that their cache_peer server 
names cannot be reliably resolved because failed resolution attempts 
waste Squid resources and increase transaction response time. 
Moreover, if these failures are frequent enough (relative to peer 
usage attempts), the affected cache_peer will be marked as DEAD (as 
you have mentioned):

 > 2023/11/09 08:55:22 kid1| Detected DEAD Parent: 127.0.0.1:8123

Problem seems to be easily reproducible:

1# env https_proxy=http://127.0.0.1:8084 curl https://www.invalid.domain/
curl: (56) CONNECT tunnel failed, response 503
2# grep invalid /usr/local/squid/logs/access.log|tail -1
1700573429.015      4 127.0.0.1:8084 TCP_TUNNEL/503 0 CONNECT 
www.invalid.domain:443 - FIRSTUP_PARENT/127.0.0.1:8123 -
3# date -r 1700573429 '+%Y/%m/%d %H:%M:%S'
2023/11/21 13:30:29
4# grep '2023/11/21 13:30:29' /usr/local/squid/logs/cache.log
2023/11/21 13:30:29 kid1| ERROR: Connection to 127.0.0.1:8123 failed

With 4.x there were no such messages.

By comparing to the peer squid logs, these seems to tally with DNS 
failures:
peer_select.cc(479) resolveSelected: PeerSelector1688 found all 0 
destinations for bugzilla.tucasi.com:443

Full ALL,2 log at the time of the reported connection failure:

2023/11/10 10:25:43.162 kid1| 5,2| TcpAcceptor.cc(214) doAccept: New 
connection on FD 17
2023/11/10 10:25:43.162 kid1| 5,2| TcpAcceptor.cc(316) acceptNext: 
connection on conn3 local=127.0.0.1:8123 remote=[::] FD 17 flags=9
2023/11/10 10:25:43.162 kid1| 11,2| client_side.cc(1332) 
parseHttpRequest: HTTP Client conn13206 local=127.0.0.1:8123 
remote=127.0.0.1:57843 FD 147 flags=1
2023/11/10 10:25:43.162 kid1| 11,2| client_side.cc(1336) 
parseHttpRequest: HTTP Client REQUEST:
2023/11/10 10:25:43.162 kid1| 85,2| client_side_request.cc(707) 
clientAccessCheckDone: The request CONNECT bugzilla.tucasi.com:443 is 
ALLOWED; last ACL checked: localhost
2023/11/10 10:25:43.162 kid1| 85,2| client_side_request.cc(683) 
clientAccessCheck2: No adapted_http_access configuration. default: ALLOW
2023/11/10 10:25:43.162 kid1| 85,2| client_side_request.cc(707) 
clientAccessCheckDone: The request CONNECT bugzilla.tucasi.com:443 is 
ALLOWED; last ACL checked: localhost
2023/11/10 10:25:43.162 kid1| 44,2| peer_select.cc(460) 
resolveSelected: Find IP destination for: bugzilla.tucasi.com:443' 
via bugzilla.tucasi.com
2023/11/10 10:25:43.163 kid1| 44,2| peer_select.cc(479) 
resolveSelected: PeerSelector1526 found all 0 destinations for 
bugzilla.tucasi.com:443
2023/11/10 10:25:43.163 kid1| 44,2| peer_select.cc(480) 
resolveSelected:    always_direct = ALLOWED
2023/11/10 10:25:43.163 kid1| 44,2| peer_select.cc(481) 
resolveSelected:     never_direct = DENIED
2023/11/10 10:25:43.163 kid1| 44,2| peer_select.cc(482) 
resolveSelected:         timedout = 0
2023/11/10 10:25:43.163 kid1| 4,2| errorpage.cc(1397) buildBody: No 
existing error page language negotiated for ERR_DNS_FAIL. Using 
default error file.
2023/11/10 10:25:43.163 kid1| 33,2| client_side.cc(617) swanSong: 
conn13206 local=127.0.0.1:8123 remote=127.0.0.1:57843 flags=1

If my analysis is correct why is this logged as a connection failure 
and do I need to worry about it beyond it filing up the logs needlessly?

My concern is that this could lead to the parent being incorrectly 
declared DEAD thus impacting other traffic:

2023/11/09 08:55:22 kid1| Detected DEAD Parent: 127.0.0.1:8123
     current master transaction: master4581234

_______________________________________________
squid-users mailing list
squid-users@xxxxxxxxxxxxxxxxxxxxx
https://lists.squid-cache.org/listinfo/squid-users