Re: squid cache takes a break

On 09/09/17 05:39, Vieri wrote:
Hi,

Sorry for the title, but I really don't know how to describe what just happened today. It's really odd.

I previously posted a few similar issues which were all fixed if I increased certain parameters (ulimits, children-{max,startup,idle}, TTL, etc.).

This time, however, after several trouble-free days I got another show-stopper. The local squid cache stopped serving for almost half an hour. After that, it all magically started working again. I had the chance to log into the server over ssh and try a few things:

- In the cache log I could see these messages:
Starting new bllookup helpers...
helperOpenServers: Starting 10/80 'squid_url_lookup.pl' processes
WARNING: Cannot run '/opt/custom/scripts/run/scripts/firewall/squid_url_lookup.pl' process.
WARNING: Cannot run '/opt/custom/scripts/run/scripts/firewall/squid_url_lookup.pl' process.
WARNING: Cannot run '/opt/custom/scripts/run/scripts/firewall/squid_url_lookup.pl' process.

It doesn't say much as to why it "cannot run" the external program.


Looking at the code, that message only seems to get logged when a TCP/UDP connection is involved and it is having packet errors (which can happen for many reasons).



This is how the program is defined in squid.conf:
external_acl_type bllookup ttl=86400 negative_ttl=86400 children-max=80 children-startup=40 children-idle=10 %URI /opt/custom/scripts/run/scripts/firewall/squid_url_lookup.pl [...]


... so far that matches: external ACL helpers use a private/localhost TCP connection between each helper and Squid.

Other than that, the log is pretty quiet.

The HTTP clients do not get served at all. They keep waiting for a reply.


Squid is waiting for a reply from a helper about whether the request is allowed or not.



top - 13:52:38 up 9 days,  6:19,  2 users,  load average: 2.04, 1.82, 1.65
Tasks: 405 total,   1 running, 404 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni, 97.4 id,  0.0 wa,  0.0 hi,  2.6 si,  0.0 st
%Cpu1  :  0.0 us,  0.3 sy,  0.0 ni, 97.0 id,  0.0 wa,  0.0 hi,  2.6 si,  0.0 st
%Cpu2  :  0.3 us,  0.3 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu4  :  0.0 us,  0.0 sy,  0.0 ni, 96.7 id,  0.0 wa,  0.0 hi,  3.3 si,  0.0 st
%Cpu5  :  0.0 us,  0.3 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu6  :  0.3 us,  0.0 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu7  :  0.3 us,  0.0 sy,  0.0 ni, 95.7 id,  0.0 wa,  0.0 hi,  4.0 si,  0.0 st
KiB Mem : 32865056 total, 12324092 free, 16396808 used,  4144156 buff/cache
KiB Swap: 37036988 total, 35197252 free,  1839736 used. 15977208 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29123 squid     20   0  293460 206776   8480 S   0.7  0.6  10:31.16 squid

Starting each helper requires about 206,776 KB of RAM (the squid worker's resident size above) to be allocated in swap/virtual memory. Copying 10x that data (+10 helpers per occurrence) may take a while.

We normally work around that by using concurrency. Helpers that support high levels of concurrency can handle far more load than blocking helpers can.

Note that concurrency is just a way of pipelining requests to a helper; it does not require multi-threading, though helpers using MT naturally benefit even more from concurrency.
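
For illustration, here is a minimal sketch of the helper side of that protocol (hypothetical code, not the real squid_url_lookup.pl; the actual lookup logic would go where the comment is). With concurrency=N on the external_acl_type line, Squid prefixes each request with a channel ID that the helper must echo back, so replies may arrive out of order:

#!/usr/bin/perl
# Minimal concurrency-aware external ACL helper (sketch).
# Squid sends one request per line as: <channel-ID> <URI>
# The reply must echo the channel-ID back: <channel-ID> OK|ERR
use strict;
use warnings;

$| = 1;    # helpers must write unbuffered, or Squid waits forever

while (my $line = <STDIN>) {
    chomp $line;
    my ($channel, $uri) = split / /, $line, 2;
    # ... perform the blocklist lookup on $uri here ...
    my $verdict = "OK";    # "ERR" when the URL should be denied
    print "$channel $verdict\n";
}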


# ps aux | grep squid | grep '\-n'
squid      805  0.0  6.4 3769624 2127988 ?     S    13:53   0:00 (squid-1) -YC -f /etc/squid/squid.conf -n squid
squid     1452  0.0  6.4 3769624 2127988 ?     S    13:55   0:00 (squid-1) -YC -f /etc/squid/squid.conf -n squid
root      3043  0.0  0.0  84856  1728 ?        Ss   Aug31   0:00 /usr/sbin/squid -YC -f /etc/squid/squid.http.conf -n squidhttp
squid     3046  0.0  0.0 128232 31052 ?        S    Aug31   0:35 (squid-1) -YC -f /etc/squid/squid.http.conf -n squidhttp
root      3538  0.0  0.0  86912  1740 ?        Ss   Aug31   0:00 /usr/sbin/squid -YC -f /etc/squid/squid.https.conf -n squidhttps
squid     3540  0.0  0.1 134940 36244 ?        S    Aug31   1:09 (squid-1) -YC -f /etc/squid/squid.https.conf -n squidhttps
root     28492  0.0  0.0  86908  4220 ?        Ss   Sep05   0:00 /usr/sbin/squid -YC -f /etc/squid/squid.owa.conf -n squidowa
squid    28495  0.0  0.0 103616 18364 ?        S    Sep05   0:12 (squid-1) -YC -f /etc/squid/squid.owa.conf -n squidowa
root     29120  0.0  0.0  86908  4220 ?        Ss   Sep05   0:00 /usr/sbin/squid -YC -f /etc/squid/squid.owa2.conf -n squidowa2
squid    29123  0.2  0.6 293460 206776 ?       S    Sep05  10:31 (squid-1) -YC -f /etc/squid/squid.owa2.conf -n squidowa2
squid    30330  0.0  6.4 3769624 2127928 ?     S    13:42   0:00 (squid-1) -YC -f /etc/squid/squid.conf -n squid
squid    30866  0.0  6.4 3769624 2127928 ?     S    13:44   0:00 (squid-1) -YC -f /etc/squid/squid.conf -n squid
squid    31507  0.0  6.4 3769624 2127928 ?     S    13:47   0:00 (squid-1) -YC -f /etc/squid/squid.conf -n squid
squid    32055  0.0  6.4 3769624 2127928 ?     S    13:49   0:00 (squid-1) -YC -f /etc/squid/squid.conf -n squid
squid    32659  0.0  6.4 3769624 2127928 ?     S    13:51   0:00 (squid-1) -YC -f /etc/squid/squid.conf -n squid

I changed debug_options to ALL,9 and tried to restart or "reconfigure" Squid.
I had trouble with that and decided to:
- stop all squid instances (the other reverse proxies, e.g. squidhttp, squidhttps, squidowa, squidowa2, were working fine)
- manually kill all squid processes related to the failing local cache (including ssl_crtd)
- make sure there were no more squid processes and that debug_options was ALL,9
- start the local squid cache only (only 1 daemon, no reverse proxies) -- roughly the commands sketched below
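
For reference, those steps map onto commands roughly like these (a sketch only: the -n service names are taken from the ps listing above, and the pkill patterns are assumptions about how this instance's processes can be matched):

# stop the failing local cache instance; the reverse proxies keep running
squid -k shutdown -f /etc/squid/squid.conf -n squid

# kill leftover workers of that instance (helpers such as ssl_crtd may
# need matching by their own process names, e.g. pkill -u squid ssl_crtd)
pkill -u squid -f '/etc/squid/squid.conf'

# verify nothing squid-related remains and the debug level is in place
pgrep -a -f squid
grep debug_options /etc/squid/squid.conf

# start the local cache daemon only
squid -YC -f /etc/squid/squid.conf -n squid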

Surprisingly, it still didn't work...
Clients could not browse. The squid log only showed this set of messages once in a while (a very quiet log):

2017/09/08 14:18:10.322 kid1| 54,3| ipc.cc(180) ipcCreate: ipcCreate: prfd FD 94
2017/09/08 14:18:10.322 kid1| 54,3| ipc.cc(181) ipcCreate: ipcCreate: pwfd FD 94
2017/09/08 14:18:10.322 kid1| 54,3| ipc.cc(182) ipcCreate: ipcCreate: crfd FD 93
2017/09/08 14:18:10.322 kid1| 54,3| ipc.cc(183) ipcCreate: ipcCreate: cwfd FD 93
2017/09/08 14:18:10.322 kid1| 54,3| ipc.cc(196) ipcCreate: ipcCreate: FD 94 sockaddr [::1]:40905
2017/09/08 14:18:10.322 kid1| 54,3| ipc.cc(212) ipcCreate: ipcCreate: FD 93 sockaddr [::1]:50647
2017/09/08 14:18:10.322 kid1| 54,3| ipc.cc(222) ipcCreate: ipcCreate: FD 93 listening...
2017/09/08 14:18:10.322 kid1| 5,3| comm.cc(868) _comm_close: comm_close: start closing FD 93
2017/09/08 14:18:10.322 kid1| 5,3| comm.cc(540) commUnsetFdTimeout: Remove timeout for FD 93
2017/09/08 14:18:10.322 kid1| 5,5| comm.cc(721) commCallCloseHandlers: commCallCloseHandlers: FD 93
2017/09/08 14:18:10.322 kid1| 5,4| AsyncCall.cc(26) AsyncCall: The AsyncCall comm_close_complete constructed, this=0xeb7c10 [call123]
2017/09/08 14:18:10.322 kid1| 5,4| AsyncCall.cc(93) ScheduleCall: comm.cc(941) will call comm_close_complete(FD 93) [call123]
2017/09/08 14:18:10.322 kid1| 5,9| comm.cc(602) comm_connect_addr: connecting socket FD 94 to [::1]:50647 (want family: 10)
2017/09/08 14:18:10.322 kid1| 21,3| tools.cc(543) leave_suid: leave_suid: PID 8303 called
2017/09/08 14:18:10.323 kid1| 21,3| tools.cc(636) no_suid: no_suid: PID 8303 giving up root priveleges forever
2017/09/08 14:18:10.323 kid1| 54,3| ipc.cc(304) ipcCreate: ipcCreate: calling accept on FD 93

Whatever helper was being started on FD 93/94, it started successfully.
We know it is a helper because those FDs are used by ipc.cc (the helper I/O channels), but that is all the trace tells us.


[...]
2017/09/08 14:24:49.682 kid1| 5,5| comm.cc(644) comm_connect_addr: sock=98, addrinfo( flags=4, family=10, socktype=1, protocol=6, &addr=0xeb79a0, addrlen=28 )
2017/09/08 14:24:49.682 kid1| 5,9| comm.cc(645) comm_connect_addr: connect FD 98: (-1) (110) Connection timed out
2017/09/08 14:24:49.682 kid1| 14,9| comm.cc(646) comm_connect_addr: connecting to: [::1]:60557
2017/09/08 14:24:49.682 kid1| 5,3| comm.cc(868) _comm_close: comm_close: start closing FD 98
2017/09/08 14:24:49.682 kid1| 5,3| comm.cc(540) commUnsetFdTimeout: Remove timeout for FD 98
2017/09/08 14:24:49.682 kid1| 5,5| comm.cc(721) commCallCloseHandlers: commCallCloseHandlers: FD 98
2017/09/08 14:24:49.682 kid1| 5,4| AsyncCall.cc(26) AsyncCall: The AsyncCall comm_close_complete constructed, this=0xeb7e90 [call128]
2017/09/08 14:24:49.682 kid1| 5,4| AsyncCall.cc(93) ScheduleCall: comm.cc(941) will call comm_close_complete(FD 98) [call128]
2017/09/08 14:24:49.682 kid1| WARNING: Cannot run '/usr/libexec/squid/ext_wbinfo_group_acl' process.

The comm TCP timeout is for FD 98. It may or may not be related to the wbinfo helper; the trace shown does not include any ipc.cc or fd.cc output indicating what that FD is used for. See the trace below.


This trace is more complete:

2017/09/08 14:24:49.682 kid1| 50,3| comm.cc(347) comm_openex: comm_openex: Attempt open socket for: [::1]
2017/09/08 14:24:49.682 kid1| 50,3| comm.cc(388) comm_openex: comm_openex: Opened socket local=[::1] remote=[::] FD 99 flags=1 : family=10, type=1, protocol=0
2017/09/08 14:24:49.682 kid1| 5,5| comm.cc(420) comm_init_opened: local=[::1] remote=[::] FD 99 flags=1 is a new socket
2017/09/08 14:24:49.682 kid1| 51,3| fd.cc(198) fd_open: fd_open() FD 99 ext_wbinfo_group_acl
2017/09/08 14:24:49.682 kid1| 50,6| comm.cc(209) commBind: commBind: bind socket FD 99 to [::1]
2017/09/08 14:24:49.682 kid1| 50,3| comm.cc(347) comm_openex: comm_openex: Attempt open socket for: [::1]
2017/09/08 14:24:49.682 kid1| 50,3| comm.cc(388) comm_openex: comm_openex: Opened socket local=[::1] remote=[::] FD 100 flags=1 : family=10, type=1, protocol=0
2017/09/08 14:24:49.682 kid1| 5,5| comm.cc(420) comm_init_opened: local=[::1] remote=[::] FD 100 flags=1 is a new socket
2017/09/08 14:24:49.682 kid1| 51,3| fd.cc(198) fd_open: fd_open() FD 100 ext_wbinfo_group_acl
2017/09/08 14:24:49.682 kid1| 50,6| comm.cc(209) commBind: commBind: bind socket FD 100 to [::1]
2017/09/08 14:24:49.682 kid1| 54,3| ipc.cc(180) ipcCreate: ipcCreate: prfd FD 100
2017/09/08 14:24:49.682 kid1| 54,3| ipc.cc(181) ipcCreate: ipcCreate: pwfd FD 100
2017/09/08 14:24:49.682 kid1| 54,3| ipc.cc(182) ipcCreate: ipcCreate: crfd FD 99
2017/09/08 14:24:49.682 kid1| 54,3| ipc.cc(183) ipcCreate: ipcCreate: cwfd FD 99
2017/09/08 14:24:49.682 kid1| 54,3| ipc.cc(196) ipcCreate: ipcCreate: FD 100 sockaddr [::1]:56367
2017/09/08 14:24:49.682 kid1| 54,3| ipc.cc(212) ipcCreate: ipcCreate: FD 99 sockaddr [::1]:35643
2017/09/08 14:24:49.682 kid1| 54,3| ipc.cc(222) ipcCreate: ipcCreate: FD 99 listening...
2017/09/08 14:24:49.683 kid1| 5,3| comm.cc(868) _comm_close: comm_close: start closing FD 99
2017/09/08 14:24:49.683 kid1| 5,3| comm.cc(540) commUnsetFdTimeout: Remove timeout for FD 99
2017/09/08 14:24:49.683 kid1| 5,5| comm.cc(721) commCallCloseHandlers: commCallCloseHandlers: FD 99
2017/09/08 14:24:49.683 kid1| 5,4| AsyncCall.cc(26) AsyncCall: The AsyncCall comm_close_complete constructed, this=0xeb7f10 [call129]
2017/09/08 14:24:49.683 kid1| 5,4| AsyncCall.cc(93) ScheduleCall: comm.cc(941) will call comm_close_complete(FD 99) [call129]
2017/09/08 14:24:49.683 kid1| 5,9| comm.cc(602) comm_connect_addr: connecting socket FD 100 to [::1]:35643 (want family: 10)
2017/09/08 14:24:49.683 kid1| 21,3| tools.cc(543) leave_suid: leave_suid: PID 10064 called
2017/09/08 14:24:49.683 kid1| 21,3| tools.cc(636) no_suid: no_suid: PID 10064 giving up root priveleges forever
2017/09/08 14:24:49.683 kid1| 54,3| ipc.cc(304) ipcCreate: ipcCreate: calling accept on FD 99

This ext_wbinfo_group_acl helper on FD 99/100 was started successfully, AFAIK.



I also tried restarting Winbind, the c-icap server, and clamd, but nothing changed. They all seemed to be working properly anyway.

Finally, at some point (after half an hour), right after killing all squid processes yet again, I restarted the squid services one last time (reverting to debug_options ALL,1) before giving up and deciding to let users bypass the proxy. Well, it all started working again...

So now I'd really like to know what I can do the next time it stops working like this.

A couple of workarounds:

a) start fewer helpers at a time.

b) reduce cache_mem.

c) add concurrency support to the helpers.

All of the above are aimed at reducing the amount of background administrative work Squid has to do for helpers, or the amount of memory consumed, and thus reducing the duration and effects of these pauses, even if (a) makes them slightly more frequent. A sketch of config changes along these lines follows below.
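
As an illustrative sketch only (the numbers are assumptions to be tuned, not recommendations), (a)-(c) translate into squid.conf changes like:

# (b) reduce the memory cache, shrinking the process image each fork copies
cache_mem 256 MB

# (a) fewer helpers, started in smaller batches, plus (c) concurrency,
# which requires the helper to speak the channel-ID protocol sketched earlier
external_acl_type bllookup ttl=86400 negative_ttl=86400 children-max=40 children-startup=10 children-idle=5 concurrency=50 %URI /opt/custom/scripts/run/scripts/firewall/squid_url_lookup.pl [...]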


I'm considering setting debug_options to ALL,6 while I still can, and waiting to see if it fails again. When it does, I might have more information.


The log lines above come from sections 5, 50, 51 and 54, which log the helper startup details. You can reduce debug_options to just those sections (at the levels "5,9 50,6 51,3 54,9") for smaller logs, to see whether the TCP connect timeout is affecting helper startup.
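
In squid.conf that would look something like this (ALL,1 as the baseline is an assumption; debug_options accepts a space-separated list of section,level pairs):

debug_options ALL,1 5,9 50,6 51,3 54,9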


Amos