On 2/12/2011 3:16 a.m., Elie Merhej wrote:
Hi,
I am currently facing a problem that I wasn't able to find a
solution for in the mailing list or on the internet.
My squid is dying for 30 seconds every hour at the exact same
time. The squid process is still running, but
I lose my wccp connectivity, the cache peers detect the squid
as a dead sibling, and the squid cannot serve any requests.
The network connectivity of the server is not affected (a ping
to the squid's ip doesn't time out).
The problem doesn't start immediately when the squid is
installed on the server (the server is dedicated to squid).
It starts when the cache directories start to fill up.
I started my setup with 10 cache directories; the squid
starts having the problem when the cache directories are more
than 50% full.
When I change the number of cache directories (9, 8, ...) the squid
works for a while, then the same problem returns.
cache_dir aufs /cache1/squid 90000 140 256
cache_dir aufs /cache2/squid 90000 140 256
cache_dir aufs /cache3/squid 90000 140 256
cache_dir aufs /cache4/squid 90000 140 256
cache_dir aufs /cache5/squid 90000 140 256
cache_dir aufs /cache6/squid 90000 140 256
cache_dir aufs /cache7/squid 90000 140 256
cache_dir aufs /cache8/squid 90000 140 256
cache_dir aufs /cache9/squid 90000 140 256
cache_dir aufs /cache10/squid 80000 140 256
I have 1 terabyte of storage.
Finally I created two cache directories (one on each HDD) but
the problem persisted.
You have 2 HDD? But, but, you have 10 cache_dir.
We repeatedly say "one cache_dir per disk" or similar. In
particular, one cache_dir per physical drive spindle (for "disks"
made up of multiple physical spindles) wherever possible, with
physical drives/spindles mounted separately to ensure the
pairing. Squid performs a very unusual pattern of disk I/O which
stresses disks down to the hardware controller level and makes this
kind of detail critical for anything like good speed. Avoiding
cache_dir object limitations by adding more UFS-based dirs to
one disk does not improve the situation.
That problem will be affecting your Squid all the time though, so
it is not the source of the hourly pause, although it may be making
the pause worse.
From the description I believe it is garbage collection on the
cache directories. The pauses can be visible when garbage
collecting any cache over a few dozen GB. The Squid default
"cache_swap_low" and "cache_swap_high" values are 5 apart, and the
minimum is 0 apart. These are whole % points of the total
cache size, being erased from disk in a somewhat random-access
style across the cache area. I did mention uncommon disk I/O
patterns, right?
To be sure what it is, you can attach the "strace" tool to the
squid worker process (the second PID in current stable Squids)
and see what is running. But given the hourly regularity and
past experience with others on similar cache sizes, I'm almost
certain it's the garbage collection.
Amos
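To put the watermark gap in concrete terms, here is a rough
back-of-the-envelope sketch in Python. It assumes the ten cache_dir
sizes from the original post and a gap of 5 percentage points between
the low and high watermarks; the helper name gap_mb is made up for
illustration and is not anything Squid provides.

```python
# Sizes in MB, copied from the ten cache_dir lines in the original post.
cache_dir_sizes_mb = [90000] * 9 + [80000]


def gap_mb(total_mb, swap_low, swap_high):
    """Return the MB of cache lying between the low and high watermarks.

    Hypothetical helper: both watermarks are whole percentage points
    of the total configured cache size.
    """
    return total_mb * (swap_high - swap_low) // 100


total = sum(cache_dir_sizes_mb)
default_gap = gap_mb(total, 90, 95)  # watermarks 5 points apart
zero_gap = gap_mb(total, 90, 90)     # watermarks at the same value

print(f"total cache:  {total} MB")
print(f"5-point gap:  {default_gap} MB")
print(f"0-point gap:  {zero_gap} MB")
```

On these numbers a 5-point gap covers roughly 44 GB of cached objects,
which gives a sense of how much data sits between the two watermarks.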
Hi Amos,
Thank you for your fast reply.
I have 2 HDD (450GB and 600GB).
df -h shows that I have 357GB and 505GB available.
In my last test, my cache settings were:
cache_swap_low 90
cache_swap_high 95
This is not good. For anything more than 10-20 GB I recommend setting
them no more than 1 apart, possibly to the same value if that works.
Squid has a light but CPU-intensive and possibly long garbage
removal cycle above cache_swap_low, and a much more aggressive but
faster and less CPU-intensive removal above cache_swap_high. On
large caches it is better in terms of downtime to go straight to
the aggressive removal and clear disk space fast, despite the
bandwidth cost of replacing any items the light removal would have left.
Amos
Hi Amos,
I have changed to cache_swap_high 90 and cache_swap_low 90 with two
cache_dir (one for each HDD), but I still have the same problem.
I did an strace (when the problem occurred):
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 23.06    0.004769           0     85681        96 write
 21.07    0.004359           0     24658         5 futex
 19.34    0.004001         800         5           open
  6.54    0.001352           0      5101      5101 connect
  6.46    0.001337           3       491           epoll_wait
  5.34    0.001104           0     51938      9453 read
  3.90    0.000806           0     39727           close
  3.54    0.000733           0     86400           epoll_ctl
  3.54    0.000732           0     32357           sendto
  2.02    0.000417           0     56721           recvmsg
  1.84    0.000381           0     24064           socket
  0.96    0.000199           0     56264           fcntl
  0.77    0.000159           0      6366       329 accept
  0.53    0.000109           0     24033           bind
  0.52    0.000108           0     30085           getsockname
  0.21    0.000044           0     11200           stat
  0.21    0.000044           0      6998       359 recvfrom
  0.09    0.000019           0      5085           getsockopt
  0.06    0.000012           0      2887           lseek
  0.00    0.000000           0        98           brk
  0.00    0.000000           0        16           dup2
  0.00    0.000000           0     10314           setsockopt
  0.00    0.000000           0         4           getdents
  0.00    0.000000           0         3           getrusage
------ ----------- ----------- --------- --------- ----------------
100.00    0.020685                560496     15343 total
This is the strace of squid when it is working normally:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 24.88    0.015887           0    455793       169 write
 13.72    0.008764           0    112185           epoll_wait
 11.67    0.007454           0    256256     27158 read
  8.47    0.005408           0    169133           sendto
  6.94    0.004430           0    159596           close
  6.85    0.004373           0    387359           epoll_ctl
  6.42    0.004102           0     19651     19651 connect
  5.54    0.003538           0    290289           recvmsg
  3.81    0.002431           0    116515           socket
  3.53    0.002254           0    164750           futex
  1.68    0.001075           0    207688           fcntl
  1.53    0.000974           0     95228     23139 recvfrom
  1.29    0.000821           0     33408     12259 accept
  1.14    0.000726           0     46582           stat
  1.11    0.000707           0    110826           bind
  0.85    0.000544           0    137574           getsockname
  0.32    0.000204           0     21642           getsockopt
  0.26    0.000165           0     39502           setsockopt
  0.01    0.000007           0      8092           lseek
  0.00    0.000000           0       248           open
  0.00    0.000000           0         4           brk
  0.00    0.000000           0        88           dup2
  0.00    0.000000           0        14           getdents
  0.00    0.000000           0         6           getrusage
------ ----------- ----------- --------- --------- ----------------
100.00    0.063864               2832429     82376 total
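One possible way to compare the two summaries is to compute the average
time per call for each syscall and line the two captures up side by
side. The Python sketch below does that for a handful of rows copied
from the tables above. The function names are made up for illustration,
and the parser assumes the usual "strace -c" column order (% time,
seconds, usecs/call, calls, optional errors, syscall).

```python
def parse_strace_summary(text):
    """Parse `strace -c` summary rows into {syscall: (seconds, calls, errors)}."""
    rows = {}
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip separators, the header row, and the "total" row.
        if len(fields) < 5 or fields[-1] == "total" or not fields[0][0].isdigit():
            continue
        seconds = float(fields[1])
        calls = int(fields[3])
        # The errors column is blank when a syscall never failed.
        errors = int(fields[4]) if len(fields) == 6 else 0
        rows[fields[-1]] = (seconds, calls, errors)
    return rows


def usecs_per_call(rows):
    """Average microseconds per call for every syscall with at least one call."""
    return {name: sec * 1e6 / calls
            for name, (sec, calls, _errors) in rows.items() if calls}


# A few rows copied from the capture taken during the pause.
during_pause = """
23.06 0.004769 0 85681 96 write
21.07 0.004359 0 24658 5 futex
19.34 0.004001 800 5 open
6.46 0.001337 3 491 epoll_wait
"""

# The same syscalls from the capture taken during normal operation.
normal_run = """
24.88 0.015887 0 455793 169 write
13.72 0.008764 0 112185 epoll_wait
3.53 0.002254 0 164750 futex
0.00 0.000000 0 248 open
"""

pause = usecs_per_call(parse_strace_summary(during_pause))
normal = usecs_per_call(parse_strace_summary(normal_run))

for name in sorted(pause, key=pause.get, reverse=True):
    print(f"{name:<12} pause={pause[name]:8.2f} us/call  "
          f"normal={normal.get(name, 0.0):8.2f} us/call")
```

On these rows the open call stands out: 5 calls averaging about 800
microseconds each during the pause, against 248 calls that took no
measurable time in the normal capture. That is only an observation from
the posted numbers, not a diagnosis.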
Do you have any suggestions to solve the issue? Can I run the
garbage collector more frequently? Is it better to change the
cache_dir type from aufs to something else?
Do you see the problem in the strace?
Thank you,
Elie
Hi,
Please note that squid is facing the same problem even when there is
no activity and no clients are connected to it.
Regards
Elie
Hi,
here is the strace result
-----------------------------------------------------------------------------------------------------
<snip looks perfectly normal traffic, file opening and closing data
reading, DNS lookups and other network read/writes>
read(165, "!", 256) = 1
<snip bunch of other normal traffic>
read(165, "!", 256) = 1
----------------------------------------------------------------------------------------------------------------------------------------------------
Squid is freezing at this point
The 1-byte read on FD #165 seems odd. It is particularly suspicious
coming just before a pause and having only a constant 256-byte buffer
available. No idea what it is yet though.
Here is my compilation options
--------------------------------------------------------------------------------------------------------------------------------------------------------
./configure '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin'
'--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share'
'--includedir=/usr/include' '--libdir=/usr/lib64'
'--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib'
'--mandir=/usr/share/man' '--infodir=/usr/share/info'
'--exec_prefix=/usr' '--libexecdir=/usr/lib64/squid'
'--localstatedir=/var' '--datadir=/usr/share/squid'
'--sysconfdir=/etc/squid' '--with-logdir=/var/log/squid'
'--with-pidfile=/var/run/squid.pid' '--disable-dependency-tracking'
'--enable-arp-acl' '--enable-follow-x-forwarded-for'
'--enable-auth=basic,digest,negotiate'
'--enable-external-acl-helpers=ip_user,unix_group,wbinfo_group'
'--enable-cache-digests' '--enable-cachemgr-hostname=localhost'
'--enable-delay-pools' '--enable-epoll' '--enable-icap-client'
'--enable-ident-lookups' '--enable-linux-netfilter'
'--enable-referer-log' '--enable-removal-policies=lru' '--enable-snmp'
'--enable-ssl' '--enable-storeio=aufs,ufs' '--enable-wccpv2'
'--enable-esi' '--with-aio' '--with-default-user=proxy'
'--with-filedescriptors=65536' '--with-dl' '--with-pthreads'
'--with-libcap' '--with-netfilter-conntrack' '--with-openssl'
'--enable-inline' '--enable-uselect' '--enable-disk-io'
'--disable-htcp' '--with-gnu-ld' '--with-build-environment=default'
'--enable-carp' '--enable-async-io=26' --with-squid=/home/squid-3.1.15
--enable-ltdl-convenience
---------------------------------------------------------------------------------------------------------------------------------------------------------
Here is my squid.conf:
Can't help myself, I digress into a config audit... completely off-topic.
---------------------------------------------------------------------------------------------------------------------------------------------------------
acl manager proto cache_object
acl localhost src 127.0.0.1/32
acl to_localhost dst 127.0.0.0/8 0.0.0.0/32
acl localnet src 10.0.0.0/8 # RFC1918 possible internal network
acl localnet src 172.16.0.0/12 # RFC1918 possible internal network
acl localnet src 192.168.0.0/16 # RFC1918 possible internal network
acl SSL_ports port 443
acl Safe_ports port 80 # http
acl Safe_ports port 21 # ftp
acl Safe_ports port 443 # https
acl Safe_ports port 70 # gopher
acl Safe_ports port 210 # wais
acl Safe_ports port 1025-65535 # unregistered ports
acl Safe_ports port 280 # http-mgmt
acl Safe_ports port 488 # gss-http
acl Safe_ports port 591 # filemaker
acl Safe_ports port 777 # multiling http
acl CONNECT method CONNECT
acl clients src x.x.x.x
#icp acl
acl squidFarm src x.x.x.x
acl self src x.x.x.x
#ICAP acl
acl icap_port1 myportname 3144
acl icap_port2 myportname 3145
acl icap_port3 myportname 3146
acl icap_port5 myportname 3148
http_access allow manager localhost
http_access deny manager
http_access deny !Safe_ports
http_access deny CONNECT !SSL_ports
http_access allow localnet
http_access allow localhost
#prevent digest loop
http_access deny self
http_access allow clients
http_access deny all
http_reply_access allow all
#icp_access allow all
icp_port 3130
icp_access allow squidFarm
icp_access deny all
http_port 3129 tproxy
http_port 3128 transparent
http_port 3144 tproxy
http_port 3146 tproxy
http_port 3145 tproxy
http_port 3148 tproxy
forwarded_for off
via off
visible_hostname x.x.x.x
hierarchy_stoplist cgi-bin ?
coredump_dir /var/spool/squid
# Image files
refresh_pattern -i \.png$ 10080 90% 43200
refresh_pattern -i \.gif$ 10080 90% 43200
refresh_pattern -i \.jpg$ 10080 90% 43200
refresh_pattern -i \.jpeg$ 10080 90% 43200
refresh_pattern -i \.bmp$ 10080 90% 43200
refresh_pattern -i \.tif$ 10080 90% 43200
refresh_pattern -i \.tiff$ 10080 90% 43200
This is *the* most inefficient way to do this. The refresh_pattern set
is tested for every single cached object load. On top of that, each
pattern line is an individual regex pattern, which is almost the slowest
match type Squid can perform.
You will gain proxy performance by collapsing these regex patterns
down into one line. Like so:
refresh_pattern -i \.(png|gif|jpe?g|bmp|tiff?)$ 10080 90% 43200
same for the others...
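As a sketch of what "the others" could look like, the Python snippet
below builds one collapsed refresh_pattern line per group from the
extension lists in the posted config, then checks on a few sample URLs
that the combined image pattern agrees with the seven individual ones.
The collapse helper and the sample URLs are invented for illustration,
and Python's re module is not the regex engine Squid uses, so treat the
check as a sanity test rather than proof.

```python
import re


def collapse(extensions):
    """Return one regex matching a URL that ends in any of the extensions."""
    alternatives = "|".join(re.escape(ext) for ext in extensions)
    return rf"\.({alternatives})$"


# Extension lists copied from the posted squid.conf (duplicates dropped).
groups = {
    "Compressed files": ["zip", "rar", "tar", "gz", "tgz", "z", "arj", "lha", "lzh"],
    "Binary files": ["exe", "msi"],
    "Multimedia files": ["mp3", "wav", "mid", "midi", "ram", "ra", "mov",
                         "avi", "wmv", "mpg", "mpeg", "swf"],
    "Document files": ["pdf", "ps", "doc", "ppt", "pps"],
}

for label, extensions in groups.items():
    print(f"# {label}")
    print(f"refresh_pattern -i {collapse(extensions)} 10080 90% 43200")

# Sanity check: the combined image pattern from the reply should agree
# with the seven individual patterns on a few made-up sample URLs.
image_exts = ["png", "gif", "jpg", "jpeg", "bmp", "tif", "tiff"]
individual = [rf"\.{ext}$" for ext in image_exts]
combined = r"\.(png|gif|jpe?g|bmp|tiff?)$"
samples = [
    "http://example.com/a.PNG",
    "http://example.com/b.jpeg",
    "http://example.com/c.tiff",
    "http://example.com/d.html",
    "http://example.com/e.jpg?x=1",
]
for url in samples:
    one_by_one = any(re.search(p, url, re.IGNORECASE) for p in individual)
    together = bool(re.search(combined, url, re.IGNORECASE))
    assert one_by_one == together, url
```

Each printed line replaces a whole block of single-extension patterns,
so Squid has far fewer regexes to walk through per object.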
# Compressed files
refresh_pattern -i \.zip$ 10080 90% 43200
refresh_pattern -i \.rar$ 10080 90% 43200
refresh_pattern -i \.tar$ 10080 90% 43200
refresh_pattern -i \.gz$ 10080 90% 43200
refresh_pattern -i \.tgz$ 10080 90% 43200
refresh_pattern -i \.z$ 10080 90% 43200
refresh_pattern -i \.arj$ 10080 90% 43200
refresh_pattern -i \.lha$ 10080 90% 43200
refresh_pattern -i \.lzh$ 10080 90% 43200
# Binary files
refresh_pattern -i \.exe$ 10080 90% 43200
refresh_pattern -i \.msi$ 10080 90% 43200
# Multimedia files
refresh_pattern -i \.mp3$ 10080 90% 43200
refresh_pattern -i \.wav$ 10080 90% 43200
refresh_pattern -i \.mid$ 10080 90% 43200
refresh_pattern -i \.midi$ 10080 90% 43200
refresh_pattern -i \.ram$ 10080 90% 43200
refresh_pattern -i \.ra$ 10080 90% 43200
refresh_pattern -i \.mov$ 10080 90% 43200
refresh_pattern -i \.avi$ 10080 90% 43200
refresh_pattern -i \.wmv$ 10080 90% 43200
refresh_pattern -i \.mpg$ 10080 90% 43200
refresh_pattern -i \.mpg$ 10080 90% 43200
refresh_pattern -i \.mpg$ 10080 90% 43200
refresh_pattern -i \.mpeg$ 10080 90% 43200
refresh_pattern -i \.swf$ 10080 90% 43200
# Document files
refresh_pattern -i \.pdf$ 10080 90% 43200
refresh_pattern -i \.ps$ 10080 90% 43200
refresh_pattern -i \.doc$ 10080 90% 43200
refresh_pattern -i \.ppt$ 10080 90% 43200
refresh_pattern -i \.pps$ 10080 90% 43200
#windows update refresh paterns
refresh_pattern windowsupdate.com/.*\.(cab|exe|psf) 4320 100% 43200 reload-into-ims
refresh_pattern download.microsoft.com/.*\.(cab|exe|psf) 4320 100% 43200 reload-into-ims
refresh_pattern armdl.adobe.com/.*\.(cab|msp|msi) 4320 100% 43200 reload-into-ims
#default refresh paterns
refresh_pattern ^ftp: 1440 20% 10080
refresh_pattern ^gopher: 1440 0% 1440
refresh_pattern -i (/cgi-bin/|\?) 0 0% 0
refresh_pattern . 0 20% 4320
client_db on
cache_mem 256 MB
cache_swap_low 90
cache_swap_high 90
maximum_object_size 512 MB
maximum_object_size_in_memory 20 KB
cache_dir aufs /cache1/squid 320000 480 256
cache_dir aufs /cache2/squid 480000 700 256
#logformat modified %tl Request time:%tr Status:%Ss/%03>Hs Client:%>a URL:%ru Server:%<A Type:%mt
logformat modified %tl %>a %ru %<A
access_log /var/log/squid/access.log modified
cache_log /var/log/squid/cache.log
cache_store_log none
coredump_dir /var/spool/squid
half_closed_clients off
snmp_port x
acl cacti src x.x.x.x
acl snmpcommunity snmp_community xxxx
snmp_access allow snmpcommunity xxxx
snmp_access allow snmpcommunity localhost
snmp_access deny all
wccp2_router x.x.x.x
wccp2_forwarding_method l2
wccp2_return_method l2
wccp2_service dynamic x
wccp2_service_info x protocol=tcp flags=src_ip_hash priority=240 ports=80
wccp2_service dynamic x
wccp2_service_info x protocol=tcp flags=dst_ip_hash,ports_source priority=240 ports=80
wccp2_assignment_method mask
#icp configuration
maximum_icp_query_timeout 30
cache_peer x.x.x.x sibling 3128 3130 proxy-only no-tproxy
cache_peer x.x.x.x sibling 3128 3130 proxy-only no-tproxy
cache_peer x.x.x.x sibling 3128 3130 proxy-only no-tproxy
log_icp_queries off
miss_access allow squidFarm
miss_access deny all
So if I understand this right, you have a layer of proxies defined as
"squidFarm" which client traffic MUST pass through *first* before it
is allowed to fetch MISS requests from this proxy. Yet you are
receiving WCCP traffic directly at this proxy with both NAT and TPROXY?
This miss_access policy seems decidedly odd. Perhaps you can enlighten me?
# ICAP configuration
icap_enable on
icap_send_client_ip on
icap_send_client_username on
icap_client_username_encode off
icap_client_username_header X-Client-Username
icap_preview_enable on
icap_preview_size 1024
logformat squid %tl %icap::tt %icap::tr %>a %icap::rm %icap::ru
icap_log /var/log/squid/icap.log squid
icap_service service_req reqmod_precache bypass=1 icap://x.x.x.x:x/reqmod
adaptation_access service_req allow icap_port1
icap_service service_req_2 reqmod_precache bypass=1 icap://x.x.x.x:x/reqmod
adaptation_access service_req_2 allow icap_port2
icap_service service_req_3 reqmod_precache bypass=1 icap://x.x.x.x:x/reqmod
adaptation_access service_req_3 allow icap_port3
icap_service service_req_5 reqmod_precache bypass=1 icap://x.x.x.x:x/reqmod
adaptation_access service_req_5 allow icap_port5
---------------------------------------------------------------------------------------------------------------------------------------------------------
Please advise,
Thank you
Elie