Re: 6.x gives frequent connection to peer failed - spurious?

Thank you for sharing your patch to work around this issue. We deploy
Squid in a configuration very similar to the one you describe, and we
recently hit a problem where an unavailable upstream destination caused
the peer Squid to be marked as dead, even though the peer itself was
fine. The proposed `cache_peer_fault` directive sounds like a perfect
fit for our case, but in its absence we'll be applying a patch similar
to the one you shared.
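
For what it's worth, here is roughly how we would expect to use such a
directive. The syntax below is purely illustrative: the proposed
implementation was never merged, so the real directive name and options
could differ. The port matches the peer discussed later in this thread.

    # Hypothetical squid.conf syntax for the never-merged cache_peer_fault
    # directive: treat a 503 from this peer as the origin's failure, not
    # the peer's, so it does not count toward marking the peer DEAD.
    cache_peer 127.0.0.1 parent 8123 0 no-query default
    cache_peer_fault 127.0.0.1 ignore-status 503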

On Wed, Nov 22, 2023 at 11:15 AM Stephen Borrill <squid@xxxxxxxxxxxxxx> wrote:
>
> On 21/11/2023 15:55, Alex Rousskov wrote:
> > On 2023-11-21 08:38, Stephen Borrill wrote:
> >> On 15/11/2023 21:55, Alex Rousskov wrote:
> >>> On 2023-11-10 05:46, Stephen Borrill wrote:
> >>>
> >>>> With 6.x (currently 6.5) there are very frequent (every 10 seconds
> >>>> or so) messages like:
> >>>> 2023/11/10 10:25:43 kid1| ERROR: Connection to 127.0.0.1:8123 failed
> >>>
> >>>
> >>>> why is this logged as a connection failure
> >>>
> >>> The current error wording assumes too much and is, in your case,
> >>> evidently misleading. The phrase "Connection to X failed" should be
> >>> changed to something more general like "Cannot contact cache_peer X"
> >>> or "Cannot communicate with cache_peer X".
> >>>
> >>> CachePeer::countFailure() patches welcome.
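
A minimal sketch of the wording change invited above, written as a diff
against CachePeer::countFailure(). The debugs() context line here is an
assumption (the exact call may differ across 6.x releases); only the
logged text, which matches the errors quoted in this thread, is certain:

    --- src/CachePeer.cc
    +++ src/CachePeer.cc
    @@ CachePeer::countFailure() @@
    -    debugs(15, DBG_IMPORTANT, "ERROR: Connection to " << *this << " failed");
    +    debugs(15, DBG_IMPORTANT, "ERROR: Cannot contact cache_peer " << *this);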
> >
> >> But the point is that Squid _can_ communicate with the peer; it is
> >> the peer itself that can't service the request. The peer returning
> >> 503 shouldn't be logged as a connection failure.
> >
> >
> > My bad. I missed the fact that the described DNS error happens at a
> > _peer_ Squid. Sorry.
> >
> >
> > Currently, Squid v6 treats most CONNECT-to-peer errors as a sign of a
> > broken peer. In 2022, 4xx errors were excluded from that set[1]. At that
> > time, we also proposed to make that decision configurable using a new
> > cache_peer_fault directive[2], but the new directive was blocked as
> > "overkill"[3], so we hard-coded the 4xx exclusion instead.
> >
> > Going forward, you have several options, including these two:
> >
> > 1. Convince others that Squid should treat all 503 CONNECT errors from
> > peers as it already treats all 4xx errors. Hard-code that new logic.
> >
> > 2. Convince others that cache_peer_fault or a similar directive is a
> > good idea rather than an overkill. Resurrect its implementation[2].
> >
> >
> > [1]
> > https://github.com/squid-cache/squid/commit/022dbabd89249f839d1861aa87c1ab9e1a008a47
> >
> > [2]
> > https://github.com/squid-cache/squid/commit/25431f18f2f5e796b8704c85fc51f93b6cc2a73d
> >
> > [3] https://github.com/squid-cache/squid/pull/1166#issuecomment-1295806530
>
>
> 2) seems sensible, especially in the case where you have a single
> cache_peer and cannot go direct: there is no benefit to marking it as
> dead.
>
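
For context, the single-parent, cannot-go-direct topology described
above looks something like this in squid.conf (values illustrative,
based on the addresses seen in this thread):

    # One parent proxy, and all traffic must go through it:
    cache_peer 127.0.0.1 parent 8123 0 no-query default
    never_direct allow all

With this layout, marking the lone parent DEAD gains nothing: there is
nowhere else for requests to go.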
> However, I'm currently running with 1) as per below, which stops
> non-existent domains counting against the peer (behaviour which surely
> opens it up to a DoS attack):
>
> --- src/CachePeer.cc.orig       2023-11-22 08:30:17.524266325 +0000
> +++ src/CachePeer.cc    2023-11-22 08:31:05.394052184 +0000
> @@ -71,7 +71,7 @@
>   void
>   CachePeer::noteFailure(const Http::StatusCode code)
>   {
> -    if (Http::Is4xx(code))
> +    if (Http::Is4xx(code) || code == Http::scServiceUnavailable)
>           return; // this failure is not our fault
>
>       countFailure();
>
>
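
A note on the patch above: Http::scServiceUnavailable is Squid's enum
name for HTTP 503 (src/http/StatusCode.h), so the change exempts only
that one status code. Exempting every 5xx the peer relays might look
like the sketch below; Squid ships Http::Is4xx(), but I have not
verified that a matching Is5xx() helper exists, so the 5xx range check
is open-coded here:

    void
    CachePeer::noteFailure(const Http::StatusCode code)
    {
        // 4xx: the request's fault; 5xx: the origin's fault as relayed
        // by the peer. Neither proves the peer itself is unhealthy.
        if (Http::Is4xx(code) || (code >= 500 && code <= 599))
            return; // this failure is not our fault

        countFailure();
    }

Whether swallowing all 5xx is wise is debatable (a genuinely broken
peer also emits 5xx), which is exactly the policy question
cache_peer_fault was meant to push into configuration.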
> >>>  > do I need to worry about it beyond it filling up the logs needlessly?
> >>>
> >>> In short, "yes".
> >>>
> >>> I cannot accurately assess your specific needs, but, in most
> >>> environments, one should indeed worry that their cache_peer server
> >>> names cannot be reliably resolved because failed resolution attempts
> >>> waste Squid resources and increase transaction response time.
> >>> Moreover, if these failures are frequent enough (relative to peer
> >>> usage attempts), the affected cache_peer will be marked as DEAD (as
> >>> you have mentioned):
> >>>
> >>>  > 2023/11/09 08:55:22 kid1| Detected DEAD Parent: 127.0.0.1:8123
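
Worth noting alongside this: the cache_peer connect-fail-limit=n option
controls how many failures it takes before a peer is marked as down
(the documented default is 10). Raising it does not fix the miscounting
discussed in this thread, but it can make spurious failures less likely
to actually kill the peer:

    # Tolerate more failures before declaring the parent dead:
    cache_peer 127.0.0.1 parent 8123 0 no-query default connect-fail-limit=50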
> >>
> >> The problem seems to be easily reproducible:
> >>
> >> 1# env https_proxy=http://127.0.0.1:8084 curl https://www.invalid.domain/
> >> curl: (56) CONNECT tunnel failed, response 503
> >> 2# grep invalid /usr/local/squid/logs/access.log|tail -1
> >> 1700573429.015      4 127.0.0.1:8084 TCP_TUNNEL/503 0 CONNECT
> >> www.invalid.domain:443 - FIRSTUP_PARENT/127.0.0.1:8123 -
> >> 3# date -r 1700573429 '+%Y/%m/%d %H:%M:%S'
> >> 2023/11/21 13:30:29
> >> 4# grep '2023/11/21 13:30:29' /usr/local/squid/logs/cache.log
> >> 2023/11/21 13:30:29 kid1| ERROR: Connection to 127.0.0.1:8123 failed
> >>
> >>>> With 4.x there were no such messages.
> >>>>
> >>>> By comparing with the peer Squid's logs, these seem to tally with
> >>>> DNS failures:
> >>>> peer_select.cc(479) resolveSelected: PeerSelector1688 found all 0
> >>>> destinations for bugzilla.tucasi.com:443
> >>>>
> >>>> Full ALL,2 log at the time of the reported connection failure:
> >>>>
> >>>> 2023/11/10 10:25:43.162 kid1| 5,2| TcpAcceptor.cc(214) doAccept: New
> >>>> connection on FD 17
> >>>> 2023/11/10 10:25:43.162 kid1| 5,2| TcpAcceptor.cc(316) acceptNext:
> >>>> connection on conn3 local=127.0.0.1:8123 remote=[::] FD 17 flags=9
> >>>> 2023/11/10 10:25:43.162 kid1| 11,2| client_side.cc(1332)
> >>>> parseHttpRequest: HTTP Client conn13206 local=127.0.0.1:8123
> >>>> remote=127.0.0.1:57843 FD 147 flags=1
> >>>> 2023/11/10 10:25:43.162 kid1| 11,2| client_side.cc(1336)
> >>>> parseHttpRequest: HTTP Client REQUEST:
> >>>> 2023/11/10 10:25:43.162 kid1| 85,2| client_side_request.cc(707)
> >>>> clientAccessCheckDone: The request CONNECT bugzilla.tucasi.com:443
> >>>> is ALLOWED; last ACL checked: localhost
> >>>> 2023/11/10 10:25:43.162 kid1| 85,2| client_side_request.cc(683)
> >>>> clientAccessCheck2: No adapted_http_access configuration. default:
> >>>> ALLOW
> >>>> 2023/11/10 10:25:43.162 kid1| 85,2| client_side_request.cc(707)
> >>>> clientAccessCheckDone: The request CONNECT bugzilla.tucasi.com:443
> >>>> is ALLOWED; last ACL checked: localhost
> >>>> 2023/11/10 10:25:43.162 kid1| 44,2| peer_select.cc(460)
> >>>> resolveSelected: Find IP destination for: bugzilla.tucasi.com:443'
> >>>> via bugzilla.tucasi.com
> >>>> 2023/11/10 10:25:43.163 kid1| 44,2| peer_select.cc(479)
> >>>> resolveSelected: PeerSelector1526 found all 0 destinations for
> >>>> bugzilla.tucasi.com:443
> >>>> 2023/11/10 10:25:43.163 kid1| 44,2| peer_select.cc(480)
> >>>> resolveSelected:    always_direct = ALLOWED
> >>>> 2023/11/10 10:25:43.163 kid1| 44,2| peer_select.cc(481)
> >>>> resolveSelected:     never_direct = DENIED
> >>>> 2023/11/10 10:25:43.163 kid1| 44,2| peer_select.cc(482)
> >>>> resolveSelected:         timedout = 0
> >>>> 2023/11/10 10:25:43.163 kid1| 4,2| errorpage.cc(1397) buildBody: No
> >>>> existing error page language negotiated for ERR_DNS_FAIL. Using
> >>>> default error file.
> >>>> 2023/11/10 10:25:43.163 kid1| 33,2| client_side.cc(617) swanSong:
> >>>> conn13206 local=127.0.0.1:8123 remote=127.0.0.1:57843 flags=1
> >>>>
> >>>> If my analysis is correct, why is this logged as a connection
> >>>> failure, and do I need to worry about it beyond it filling up the
> >>>> logs needlessly?
> >>>>
> >>>> My concern is that this could lead to the parent being incorrectly
> >>>> declared DEAD, thus impacting other traffic:
> >>>>
> >>>> 2023/11/09 08:55:22 kid1| Detected DEAD Parent: 127.0.0.1:8123
> >>>>      current master transaction: master4581234
> >>>>
> >>>
> >
>
_______________________________________________
squid-users mailing list
squid-users@xxxxxxxxxxxxxxxxxxxxx
https://lists.squid-cache.org/listinfo/squid-users



