Re: Squid losing connectivity for 30 seconds

Amos Jeffries <squid3@xxxxxxxxxxxxx> · Fri, 02 Dec 2011 16:15:46 +1300

On 2/12/2011 3:16 a.m., Elie Merhej wrote:

 Hi,

I am facing a problem that I could not find a solution for on
the mailing list or on the internet.
My Squid dies for 30 seconds every hour at exactly the same
time, although the squid process is still running.
I lose my WCCP connectivity, the cache peers detect the Squid
as a dead sibling, and the Squid cannot serve any requests.
The network connectivity of the server is not affected (a ping
to the Squid's IP does not time out).

The problem doesn't start immediately after Squid is
installed on the server (the server is dedicated to Squid).
It starts when the cache directories start to fill up.
I started my setup with 10 cache directories; Squid starts
having the problem when the cache directories are more than
50% full.
When I change the number of cache directories (9, 8, ...) Squid
works for a while and then the same problem returns:
cache_dir aufs /cache1/squid 90000 140 256
cache_dir aufs /cache2/squid 90000 140 256
cache_dir aufs /cache3/squid 90000 140 256
cache_dir aufs /cache4/squid 90000 140 256
cache_dir aufs /cache5/squid 90000 140 256
cache_dir aufs /cache6/squid 90000 140 256
cache_dir aufs /cache7/squid 90000 140 256
cache_dir aufs /cache8/squid 90000 140 256
cache_dir aufs /cache9/squid 90000 140 256
cache_dir aufs /cache10/squid 80000 140 256

I have 1 terabyte of storage
Finally I created two cache directories (one on each HDD), but
the problem persisted.

You have 2 HDD?  but, but, you have 10 cache_dir.
 We repeatedly say "one cache_dir per disk" or similar. In
particular, one cache_dir per physical drive spindle (for "disks"
made up of multiple physical spindles) wherever possible, with
physical drives/spindles mounted separately to ensure the
pairing. Squid performs a very unusual pattern of disk I/O which
stresses the disks down to the hardware controller level and
makes this kind of detail critical for anything like good speed.
Avoiding cache_dir object limitations by adding more UFS-based
dirs to one disk does not improve the situation.

That is a problem which will be affecting your Squid all the 
time though, possibly making the source of the pause worse.

From the description I believe it is garbage collection on the
cache directories. The pauses can be visible when garbage
collecting any cache over a few dozen GB. The Squid default
"swap_high" and "swap_low" values are 5 apart, and the smallest
gap you can configure is 0. These are whole percentage points of
the total cache size, being erased from disk in a somewhat
random-access style across the cache area. I did mention uncommon
disk I/O patterns, right?
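
As a rough illustration of the scale involved, the sketch below works
out how much data sits between the two water marks for the ten
cache_dir sizes quoted earlier, with the marks 5 points apart. The
function name is hypothetical, and treating the percentages as applying
to the summed cache_dir size is a simplifying assumption, not a
statement of how Squid accounts for it internally.

```python
def purge_window_mb(cache_dir_mb, swap_low_pct, swap_high_pct):
    """Return the amount of cache (in MB) between the low and high water marks.

    Assumption: the percentages are applied to the sum of all cache_dir sizes.
    """
    total_mb = sum(cache_dir_mb)
    return total_mb * (swap_high_pct - swap_low_pct) / 100


# cache_dir sizes from the configuration quoted above:
# nine directories of 90000 MB and one of 80000 MB.
cache_dirs = [90000] * 9 + [80000]

window = purge_window_mb(cache_dirs, swap_low_pct=90, swap_high_pct=95)
print(window)  # 44500.0
```

Under those assumptions a 5-point gap on this setup corresponds to
about 44.5 GB of objects (taking 1 GB as 1000 MB), while a 1-point gap
would be about 8.9 GB.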

To be sure what it is, you can attach the "strace" tool to the
squid worker process (the second PID in current stable Squids)
and see what is running. But given the hourly regularity and
past experience with others on similar cache sizes, I'm almost
certain it's the garbage collection.

Amos

Hi Amos,

Thank you for your fast reply.
I have 2 HDDs (450GB and 600GB).
df -h shows that I have 357GB and 505GB available.
In my last test, my cache dir settings were:
cache_swap_low 90
cache_swap_high 95

This is not ideal. For anything more than 10-20 GB I recommend
setting them no more than 1 apart, possibly to the same value if
that works.
Squid has a light but CPU-intensive and possibly long garbage
removal cycle above cache_swap_low, and a much more aggressive but
faster and less CPU-intensive removal above cache_swap_high. On
large caches it is better in terms of downtime to go straight to
the aggressive removal and clear disk space fast, despite the
bandwidth cost of replacing any items the light removal would have
left.

Amos

Hi Amos,

I have changed swap_high to 90 and swap_low to 90, with two cache
dirs (one for each HDD). I still have the same problem.
I did an strace (when the problem occurred):
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 23.06    0.004769           0     85681        96 write
 21.07    0.004359           0     24658         5 futex
 19.34    0.004001         800         5           open
  6.54    0.001352           0      5101      5101 connect
  6.46    0.001337           3       491           epoll_wait
  5.34    0.001104           0     51938      9453 read
  3.90    0.000806           0     39727           close
  3.54    0.000733           0     86400           epoll_ctl
  3.54    0.000732           0     32357           sendto
  2.02    0.000417           0     56721           recvmsg
  1.84    0.000381           0     24064           socket
  0.96    0.000199           0     56264           fcntl
  0.77    0.000159           0      6366       329 accept
  0.53    0.000109           0     24033           bind
  0.52    0.000108           0     30085           getsockname
  0.21    0.000044           0     11200           stat
  0.21    0.000044           0      6998       359 recvfrom
  0.09    0.000019           0      5085           getsockopt
  0.06    0.000012           0      2887           lseek
  0.00    0.000000           0        98           brk
  0.00    0.000000           0        16           dup2
  0.00    0.000000           0     10314           setsockopt
  0.00    0.000000           0         4           getdents
  0.00    0.000000           0         3           getrusage
------ ----------- ----------- --------- --------- ----------------
100.00    0.020685                560496     15343 total

This is the strace of Squid when it is working normally:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 24.88    0.015887           0    455793       169 write
 13.72    0.008764           0    112185           epoll_wait
 11.67    0.007454           0    256256     27158 read
  8.47    0.005408           0    169133           sendto
  6.94    0.004430           0    159596           close
  6.85    0.004373           0    387359           epoll_ctl
  6.42    0.004102           0     19651     19651 connect
  5.54    0.003538           0    290289           recvmsg
  3.81    0.002431           0    116515           socket
  3.53    0.002254           0    164750           futex
  1.68    0.001075           0    207688           fcntl
  1.53    0.000974           0     95228     23139 recvfrom
  1.29    0.000821           0     33408     12259 accept
  1.14    0.000726           0     46582           stat
  1.11    0.000707           0    110826           bind
  0.85    0.000544           0    137574           getsockname
  0.32    0.000204           0     21642           getsockopt
  0.26    0.000165           0     39502           setsockopt
  0.01    0.000007           0      8092           lseek
  0.00    0.000000           0       248           open
  0.00    0.000000           0         4           brk
  0.00    0.000000           0        88           dup2
  0.00    0.000000           0        14           getdents
  0.00    0.000000           0         6           getrusage
------ ----------- ----------- --------- --------- ----------------
100.00    0.063864               2832429     82376 total
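
One way to compare the two summaries is to recompute the average time
per call for each syscall. The sketch below parses a few rows copied
from the tables above; the function names are hypothetical, and it
assumes the usual `strace -c` column order (% time, seconds,
usecs/call, calls, optional errors, syscall). Note that `strace -c`
normally reports CPU time spent in the kernel rather than wall-clock
time, so a call that merely blocks may not stand out here.

```python
def parse_strace_summary(text):
    """Parse `strace -c` rows into {syscall: (seconds, calls)}."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 5 fields, or 6 when the errors column is filled in.
        if len(fields) not in (5, 6) or fields[-1] == "total":
            continue
        try:
            seconds, calls = float(fields[1]), int(fields[3])
        except ValueError:
            continue  # separator line, not a data row
        rows[fields[-1]] = (seconds, calls)
    return rows


def usec_per_call(rows, syscall):
    """Average microseconds per call for one syscall."""
    seconds, calls = rows[syscall]
    return seconds / calls * 1e6


# Rows copied from the two summaries above.
frozen = """
 23.06    0.004769           0     85681        96 write
 19.34    0.004001         800         5           open
"""
normal = """
 24.88    0.015887           0    455793       169 write
  0.00    0.000000           0       248           open
"""

for label, text in (("frozen", frozen), ("normal", normal)):
    rows = parse_strace_summary(text)
    print(label, round(usec_per_call(rows, "open")), "usec per open()")
```

On these rows, open() averages about 800 microseconds per call during
the freeze against effectively zero in normal operation. That is an
observation from the numbers, not a diagnosis.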

Do you have any suggestions to solve the issue? Can I run the
garbage collector more frequently? Is it better to change the
cache_dir type from aufs to something else?
Do you see the problem in the strace?

Thank you,
Elie

Hi,

Please note that Squid has the same problem even when there is
no activity and no clients are connected to it.

Regards
Elie

Hi,

Here is the strace result:
----------------------------------------------------------------------------------------------------- 

<snip looks perfectly normal traffic, file opening and closing data 
reading, DNS lookups and other network read/writes>
read(165, "!", 256)                     = 1
<snip bunch of other normal traffic>

read(165, "!", 256)                     = 1
---------------------------------------------------------------------------------------------------------------------------------------------------- 

Squid is freezing at this point

The 1-byte read on FD #165 seems odd. Particularly suspicious being just 
before a pause and only having a constant 256 byte buffer space 
available. No ideas what it is yet though.

Here are my compilation options:
-------------------------------------------------------------------------------------------------------------------------------------------------------- 

./configure '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' 
'--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' 
'--includedir=/usr/include' '--libdir=/usr/lib64' 
'--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' 
'--mandir=/usr/share/man' '--infodir=/usr/share/info' 
'--exec_prefix=/usr' '--libexecdir=/usr/lib64/squid' 
'--localstatedir=/var' '--datadir=/usr/share/squid' 
'--sysconfdir=/etc/squid' '--with-logdir=/var/log/squid' 
'--with-pidfile=/var/run/squid.pid' '--disable-dependency-tracking' 
'--enable-arp-acl' '--enable-follow-x-forwarded-for' 
'--enable-auth=basic,digest,negotiate' 
'--enable-external-acl-helpers=ip_user,unix_group,wbinfo_group' 
'--enable-cache-digests' '--enable-cachemgr-hostname=localhost' 
'--enable-delay-pools' '--enable-epoll' '--enable-icap-client' 
'--enable-ident-lookups' '--enable-linux-netfilter' 
'--enable-referer-log' '--enable-removal-policies=lru' '--enable-snmp' 
'--enable-ssl' '--enable-storeio=aufs,ufs' '--enable-wccpv2' 
'--enable-esi' '--with-aio' '--with-default-user=proxy' 
'--with-filedescriptors=65536' '--with-dl' '--with-pthreads' 
'--with-libcap' '--with-netfilter-conntrack' '--with-openssl' 
'--enable-inline' '--enable-uselect' '--enable-disk-io' 
'--disable-htcp' '--with-gnu-ld' '--with-build-environment=default' 
'--enable-carp' '--enable-async-io=26' --with-squid=/home/squid-3.1.15 
--enable-ltdl-convenience
--------------------------------------------------------------------------------------------------------------------------------------------------------- 

Here is my squid.conf:

Can't help myself, I digress into a config audit... completely off-topic.

--------------------------------------------------------------------------------------------------------------------------------------------------------- 

acl manager proto cache_object
acl localhost src 127.0.0.1/32
acl to_localhost dst 127.0.0.0/8 0.0.0.0/32
acl localnet src 10.0.0.0/8     # RFC1918 possible internal network
acl localnet src 172.16.0.0/12  # RFC1918 possible internal network
acl localnet src 192.168.0.0/16 # RFC1918 possible internal network
acl SSL_ports port 443
acl Safe_ports port 80          # http
acl Safe_ports port 21          # ftp
acl Safe_ports port 443         # https
acl Safe_ports port 70          # gopher
acl Safe_ports port 210         # wais
acl Safe_ports port 1025-65535  # unregistered ports
acl Safe_ports port 280         # http-mgmt
acl Safe_ports port 488         # gss-http
acl Safe_ports port 591         # filemaker
acl Safe_ports port 777         # multiling http
acl CONNECT method CONNECT
acl clients src x.x.x.x
#icp acl
acl squidFarm src x.x.x.x
acl self src x.x.x.x

#ICAP acl
acl icap_port1 myportname 3144
acl icap_port2 myportname 3145
acl icap_port3 myportname 3146
acl icap_port5 myportname 3148

http_access allow manager localhost
http_access deny manager
http_access deny !Safe_ports
http_access deny CONNECT !SSL_ports
http_access allow localnet
http_access allow localhost
#prevent digest loop
http_access deny self
http_access allow clients
http_access deny all
http_reply_access allow all
#icp_access allow all
icp_port 3130
icp_access allow squidFarm
icp_access deny all

http_port 3129 tproxy
http_port 3128 transparent
http_port 3144 tproxy
http_port 3146 tproxy
http_port 3145 tproxy
http_port 3148 tproxy

forwarded_for off
via off
visible_hostname x.x.x.x
hierarchy_stoplist cgi-bin ?
coredump_dir /var/spool/squid
# Image files
refresh_pattern -i \.png$                10080   90%     43200
refresh_pattern -i \.gif$                10080   90%     43200
refresh_pattern -i \.jpg$                10080   90%     43200
refresh_pattern -i \.jpeg$               10080   90%     43200
refresh_pattern -i \.bmp$                10080   90%     43200
refresh_pattern -i \.tif$                10080   90%     43200
refresh_pattern -i \.tiff$               10080   90%     43200

This is *the* most inefficient way to do this.  The refresh_pattern set
is tested for every single cached object load. On top of that, each
pattern line is an individual regex pattern, which is almost the slowest
match type Squid can perform.
You will gain proxy performance by collapsing these regex patterns
down into one line. Like so:

   refresh_pattern -i \.(png|gif|jpe?g|bmp|tiff?)$  10080 90% 43200

same for the others...
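
Before swapping the lines over, it can be worth checking that the
collapsed pattern accepts and rejects the same URLs as the original
seven. The sketch below does that with Python's re module as a
stand-in; Squid uses its own regex library, so this is only a sanity
check of the pattern logic, and the sample URLs are made up.

```python
import re

# The seven image patterns from the original squid.conf.
individual = [r"\.png$", r"\.gif$", r"\.jpg$", r"\.jpeg$",
              r"\.bmp$", r"\.tif$", r"\.tiff$"]

# The single collapsed pattern suggested above.
combined = r"\.(png|gif|jpe?g|bmp|tiff?)$"


def matches_any(patterns, url):
    """True if any pattern matches, case-insensitively (like -i)."""
    return any(re.search(p, url, re.IGNORECASE) for p in patterns)


def same_verdict(url):
    """True if the collapsed pattern agrees with the individual ones."""
    collapsed_hit = bool(re.search(combined, url, re.IGNORECASE))
    return matches_any(individual, url) == collapsed_hit


# Hypothetical sample URLs, including some that should not match.
samples = [
    "http://example.com/a.png",
    "http://example.com/a.JPEG",
    "http://example.com/a.jpg",
    "http://example.com/a.tif",
    "http://example.com/a.tiff",
    "http://example.com/a.png?x=1",
    "http://example.com/a.jpgx",
    "http://example.com/index.html",
]

for url in samples:
    print(url, same_verdict(url))
```

The same check can be repeated for the compressed, binary, multimedia
and document groups once their collapsed patterns are written.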

# Compressed files
refresh_pattern -i \.zip$                10080   90%     43200
refresh_pattern -i \.rar$                10080   90%     43200
refresh_pattern -i \.tar$                10080   90%     43200
refresh_pattern -i \.gz$                 10080   90%     43200
refresh_pattern -i \.tgz$                10080   90%     43200
refresh_pattern -i \.z$                  10080   90%     43200
refresh_pattern -i \.arj$                10080   90%     43200
refresh_pattern -i \.lha$                10080   90%     43200
refresh_pattern -i \.lzh$                10080   90%     43200

# Binary files
refresh_pattern -i \.exe$                10080   90%     43200
refresh_pattern -i \.msi$                10080   90%     43200

# Multimedia files
refresh_pattern -i \.mp3$                10080   90%     43200
refresh_pattern -i \.wav$                10080   90%     43200
refresh_pattern -i \.mid$                10080   90%     43200
refresh_pattern -i \.midi$               10080   90%     43200
refresh_pattern -i \.ram$                10080   90%     43200
refresh_pattern -i \.ra$                 10080   90%     43200
refresh_pattern -i \.mov$                10080   90%     43200
refresh_pattern -i \.avi$                10080   90%     43200
refresh_pattern -i \.wmv$                10080   90%     43200
refresh_pattern -i \.mpg$                10080   90%     43200
refresh_pattern -i \.mpg$                10080   90%     43200
refresh_pattern -i \.mpg$                10080   90%     43200
refresh_pattern -i \.mpeg$               10080   90%     43200
refresh_pattern -i \.swf$                10080   90%     43200

# Document files
refresh_pattern -i \.pdf$                10080   90%     43200
refresh_pattern -i \.ps$                 10080   90%     43200
refresh_pattern -i \.doc$                10080   90%     43200
refresh_pattern -i \.ppt$                10080   90%     43200
refresh_pattern -i \.pps$                10080   90%     43200
#windows update refresh patterns
refresh_pattern windowsupdate.com/.*\.(cab|exe|psf) 4320 100% 43200 reload-into-ims
refresh_pattern download.microsoft.com/.*\.(cab|exe|psf) 4320 100% 43200 reload-into-ims
refresh_pattern armdl.adobe.com/.*\.(cab|msp|msi) 4320 100% 43200 reload-into-ims
#default refresh patterns
refresh_pattern ^ftp:           1440    20%     10080
refresh_pattern ^gopher:        1440    0%      1440
refresh_pattern -i (/cgi-bin/|\?) 0     0%      0
refresh_pattern .               0       20%     4320

client_db on
cache_mem 256 MB
cache_swap_low 90
cache_swap_high 90
maximum_object_size 512 MB
maximum_object_size_in_memory 20 KB
cache_dir aufs /cache1/squid 320000 480 256
cache_dir aufs /cache2/squid 480000 700 256

#logformat modified %tl Request time:%tr Status:%Ss/%03>Hs Client:%>a  URL:%ru  Server:%<A Type:%mt
logformat modified %tl %>a %ru %<A
access_log /var/log/squid/access.log modified
cache_log /var/log/squid/cache.log
cache_store_log none
coredump_dir /var/spool/squid
half_closed_clients off

snmp_port x
acl cacti src x.x.x.x
acl snmpcommunity snmp_community xxxx
snmp_access allow snmpcommunity xxxx
snmp_access allow snmpcommunity localhost
snmp_access deny all

wccp2_router x.x.x.x
wccp2_forwarding_method l2
wccp2_return_method l2
wccp2_service dynamic x
wccp2_service_info x protocol=tcp flags=src_ip_hash priority=240 ports=80
wccp2_service dynamic x
wccp2_service_info x protocol=tcp flags=dst_ip_hash,ports_source priority=240 ports=80
wccp2_assignment_method mask

#icp configuration
maximum_icp_query_timeout 30
cache_peer x.x.x.x sibling 3128 3130 proxy-only no-tproxy
cache_peer x.x.x.x sibling 3128 3130 proxy-only no-tproxy
cache_peer x.x.x.x sibling 3128 3130 proxy-only no-tproxy
log_icp_queries off
miss_access allow squidFarm
miss_access deny all

So if I understand this right: you have a layer of proxies defined as
"squidFarm" which client traffic MUST pass through *first* before it
is allowed to fetch MISS requests from this proxy.  Yet you are
receiving WCCP traffic directly at this proxy with both NAT and TPROXY?

This miss_access policy seems decidedly odd. Perhaps you can enlighten me?

# ICAP configuration
icap_enable on
icap_send_client_ip on
icap_send_client_username on
icap_client_username_encode off
icap_client_username_header X-Client-Username
icap_preview_enable on
icap_preview_size 1024

logformat squid %tl %icap::tt %icap::tr %>a %icap::rm %icap::ru
icap_log /var/log/squid/icap.log squid
icap_service service_req reqmod_precache bypass=1 icap://x.x.x.x:x/reqmod
adaptation_access service_req allow icap_port1
icap_service service_req_2 reqmod_precache bypass=1 icap://x.x.x.x:x/reqmod
adaptation_access service_req_2 allow icap_port2
icap_service service_req_3 reqmod_precache bypass=1 icap://x.x.x.x:x/reqmod
adaptation_access service_req_3 allow icap_port3
icap_service service_req_5 reqmod_precache bypass=1 icap://x.x.x.x:x/reqmod
adaptation_access service_req_5 allow icap_port5
--------------------------------------------------------------------------------------------------------------------------------------------------------- 

Please advise,
Thank you
Elie