Hi,
I am currently facing a problem that I wasn't able to find a
solution for in the mailing list or on the internet,
My squid is dying for 30 seconds every one hour at the same exact
time, squid process will still be running,
I lose my wccp connectivity, the cache peers detect the squid as
a dead sibling, and the squid cannot server any requests
The network connectivity of the sever is not affected (a ping to
the squid's ip doesn't timeout)
The problem doesn't start immediately when the squid is installed
on the server (The server is dedicated as a squid)
It starts when the cache directories starts to fill up,
I have started my setup with 10 cache directors, the squid will
start having the problem when the cache directories are above 50%
filled
when i change the number of cache directory (9,8,...) the squid
works for a while then the same problem
cache_dir aufs /cache1/squid 90000 140 256
cache_dir aufs /cache2/squid 90000 140 256
cache_dir aufs /cache3/squid 90000 140 256
cache_dir aufs /cache4/squid 90000 140 256
cache_dir aufs /cache5/squid 90000 140 256
cache_dir aufs /cache6/squid 90000 140 256
cache_dir aufs /cache7/squid 90000 140 256
cache_dir aufs /cache8/squid 90000 140 256
cache_dir aufs /cache9/squid 90000 140 256
cache_dir aufs /cache10/squid 80000 140 256
I have 1 terabyte of storage
Finally I created two cache dircetories (One on each HDD) but the
problem persisted
You have 2 HDD? but, but, you have 10 cache_dir.
We repeatedly say "one cache_dir per disk" or similar. In
particular one cache_dir per physical drive spindle (for "disks"
made up of multiple physical spindles) wherever possible with
physical drives/spindles mounting separately to ensure the
pairing. Squid performs a very unusual pattern of disk I/O which
stress them down to the hardware controller level and make this
kind of detail critical for anything like good speed. Avoiding
cache_dir object limitations by adding more UFS-based dirs to one
disk does not improve the situation.
That is a problem which will be affecting your Squid all the time
though, possibly making the source of the pause worse.
From teh description I believe it is garbage collection on the
cache directories. The pauses can be visible when garbage
collecting any caches over a few dozen GB. The squid default
"swap_high" and "swap_low" values are "5" apart, with at minimum
being a value of 0 apart. These are whole % points of the total
cache size, being erased from disk in a somewhat random-access
style across the cache area. I did mention uncommon disk I/O
patterns, right?
To be sure what it is, you can use the "strace" tool to the squid
worker process (the second PID in current stable Squids) and see
what is running. But given the hourly regularity and past
experience with others on similar cache sizes, I'm almost certain
its the garbage collection.
Amos
Hi Amos,
Thank you for your fast reply,
I have 2 HDD (450GB and 600GB)
df -h displays that i have 357Gb and 505GB available
In my last test, my cache dir where:
cache_swap_low 90
cache_swap_high 95
This is not. For anything more than 10-20 GB I recommend setting it
to no more than 1 apart, possibly the same value if that works.
Squid has a light but CPU-intensive and possibly long garbage
removal cycle above cache_swap_low, and a much more aggressive but
faster and less CPU intensive removal above cache_swap_high. On
large caches it is better in terms of downtime going straight to the
aggressive removal and clearing disk space fast, despite the
bandwidth cost replacing any items the light removal would have left.
Amos
Hi Amos,
I have changed the swap_high 90 and swap_low 90 with two cache dir
(one for each HDD), i still have the same problem,
I did an strace (when the problem occured)
------ ----------- ----------- --------- --------- ----------------
23.06 0.004769 0 85681 96 write
21.07 0.004359 0 24658 5 futex
19.34 0.004001 800 5 open
6.54 0.001352 0 5101 5101 connect
6.46 0.001337 3 491 epoll_wait
5.34 0.001104 0 51938 9453 read
3.90 0.000806 0 39727 close
3.54 0.000733 0 86400 epoll_ctl
3.54 0.000732 0 32357 sendto
2.02 0.000417 0 56721 recvmsg
1.84 0.000381 0 24064 socket
0.96 0.000199 0 56264 fcntl
0.77 0.000159 0 6366 329 accept
0.53 0.000109 0 24033 bind
0.52 0.000108 0 30085 getsockname
0.21 0.000044 0 11200 stat
0.21 0.000044 0 6998 359 recvfrom
0.09 0.000019 0 5085 getsockopt
0.06 0.000012 0 2887 lseek
0.00 0.000000 0 98 brk
0.00 0.000000 0 16 dup2
0.00 0.000000 0 10314 setsockopt
0.00 0.000000 0 4 getdents
0.00 0.000000 0 3 getrusage
------ ----------- ----------- --------- --------- ----------------
100.00 0.020685 560496 15343 total
this is the strace of squid when it is working normally:
------ ----------- ----------- --------- --------- ----------------
24.88 0.015887 0 455793 169 write
13.72 0.008764 0 112185 epoll_wait
11.67 0.007454 0 256256 27158 read
8.47 0.005408 0 169133 sendto
6.94 0.004430 0 159596 close
6.85 0.004373 0 387359 epoll_ctl
6.42 0.004102 0 19651 19651 connect
5.54 0.003538 0 290289 recvmsg
3.81 0.002431 0 116515 socket
3.53 0.002254 0 164750 futex
1.68 0.001075 0 207688 fcntl
1.53 0.000974 0 95228 23139 recvfrom
1.29 0.000821 0 33408 12259 accept
1.14 0.000726 0 46582 stat
1.11 0.000707 0 110826 bind
0.85 0.000544 0 137574 getsockname
0.32 0.000204 0 21642 getsockopt
0.26 0.000165 0 39502 setsockopt
0.01 0.000007 0 8092 lseek
0.00 0.000000 0 248 open
0.00 0.000000 0 4 brk
0.00 0.000000 0 88 dup2
0.00 0.000000 0 14 getdents
0.00 0.000000 0 6 getrusage
------ ----------- ----------- --------- --------- ----------------
100.00 0.063864 2832429 82376 total
Do you have any suggestions to solve the issue, can I run the garbage
collector more frequently, is it better to change the cache_dir type
from aufs to something else?
Do you see the problem in the strace?
Thank you,
Elie
Hi,
Please note that squid is facing the same problem even when their is
no activity or any clients connected to it
Regards
Elie
Hi,
here is the strace result
-----------------------------------------------------------------------------------------------------
connect(7595, {sa_family=AF_INET, sin_port=htons(80),
sin_addr=inet_addr("2.22.50.75")}, 16) = -1 EINPROGRESS (Operation now
in progress)
epoll_ctl(6, EPOLL_CTL_ADD, 7595, {EPOLLOUT|EPOLLERR|EPOLLHUP,
{u32=7595, u64=13133095884688465323}}) = 0
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\375\254\21\0002r\316N\0\0\0\0002r\316N\0\0\0\0\262\245\257P\0\0\0\0"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\306\254\21\0000r\316N\0\0\0\0000r\316N\0\0\0\0\377\377\377\377\377\377\377\377"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\262\251\21\0\360\376\315N\0\0\0\0002r\316N\0\0\0\0\377\377\377\377\377\377\377\377"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\2\255\21\0002r\316N\0\0\0\0003r\316N\0\0\0\0\321\376\316N\0\0\0\0"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\3\255\21\0002r\316N\0\0\0\0003r\316N\0\0\0\0002\347\340N\0\0\0\0"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\371\254\21\0002r\316N\0\0\0\0002r\316N\0\0\0\0\377\377\377\377\377\377\377\377"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\n\255\21\0003r\316N\0\0\0\0003r\316N\0\0\0\0\263\245\257P\0\0\0\0"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\v\255\21\0003r\316N\0\0\0\0003r\316N\0\0\0\0\323\370\317N\0\0\0\0"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\f\255\21\0003r\316N\0\0\0\0003r\316N\0\0\0\0003\377\365N\0\0\0\0"...,
72) = 72
futex(0x9f65ec, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9f65e8, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9f65c0, FUTEX_WAKE_PRIVATE, 1) = 1
write(35,
"\2\0\0\0\334\251\21\0@\\\316N\0\0\0\0003r\316N\0\0\0\0\377\377\377\377\377\377\377\377"...,
72) = 72
read(165, "!", 256) = 1
close(379) = 0
epoll_ctl(6, EPOLL_CTL_DEL, 9525, {0, {u32=9525,
u64=25160128873375029}}) = 0
epoll_ctl(6, EPOLL_CTL_ADD, 9525, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=9525,
u64=13133094175291483445}}) = 0
epoll_ctl(6, EPOLL_CTL_MOD, 9525, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP,
{u32=9525, u64=23686740342482229}}) = 0
epoll_ctl(6, EPOLL_CTL_MOD, 4554, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP,
{u32=4554, u64=549755818442}}) = 0
close(68) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 68
fcntl(68, F_GETFD) = 0
fcntl(68, F_SETFD, FD_CLOEXEC) = 0
setsockopt(68, SOL_IP, 0x13 /* IP_??? */, [1], 4) = 0
bind(68, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("95.141.62.17")}, 16) = 0
fcntl(68, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(68, F_SETFL, O_RDWR|O_NONBLOCK) = 0
setsockopt(68, SOL_TCP, TCP_NODELAY, [1], 4) = 0
getsockname(68, {sa_family=AF_INET, sin_port=htons(60303),
sin_addr=inet_addr("95.141.62.17")}, [16]) = 0
socket(PF_NETLINK, SOCK_RAW, 0) = 379
bind(379, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
getsockname(379, {sa_family=AF_NETLINK, pid=11594, groups=00000000},
[12]) = 0
sendto(379, "\24\0\0\0\26\0\1\3t\232\324N\0\0\0\0\0\0\0\0", 20, 0,
{sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
recvmsg(379, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000000},
msg_iov(1)=[{"0\0\0\0\24\0\2\0t\232\324NJ-\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"...,
4096}], msg_controllen=0, msg_flags=0}, 0) = 108
recvmsg(379, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000000},
msg_iov(1)=[{"@\0\0\0\24\0\2\0t\232\324NJ-\0\0\n\200\200\376\1\0\0\0\24\0\1\0\0\0\0\0"...,
4096}], msg_controllen=0, msg_flags=0}, 0) = 128
recvmsg(379, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000000},
msg_iov(1)=[{"\24\0\0\0\3\0\2\0t\232\324NJ-\0\0\0\0\0\0\1\0\0\0\24\0\1\0\0\0\0\0"...,
4096}], msg_controllen=0, msg_flags=0}, 0) = 20
close(379) = 0
connect(68, {sa_family=AF_INET, sin_port=htons(80),
sin_addr=inet_addr("2.22.50.65")}, 16) = -1 EINPROGRESS (Operation now
in progress)
epoll_ctl(6, EPOLL_CTL_ADD, 68, {EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=68,
u64=13133092173836714052}}) = 0
epoll_ctl(6, EPOLL_CTL_MOD, 3658, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP,
{u32=3658, u64=549755817546}}) = 0
sendto(10, "\33\221\1\0\0\1\0\0\0\0\0\0\nliveupdate\vgocyberl"..., 44,
0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 44
sendto(10, "\275\251\1\0\0\1\0\0\0\0\0\0\2tc\2v3\6cache3\1c\4pack"...,
48, 0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 48
sendto(10, "\267\355\1\0\0\1\0\0\0\0\0\0\3crl\6thawte\3com\0\0\34\0\1",
32, 0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 32
sendto(10, "\312\237\1\0\0\1\0\0\0\0\0\0\2p4\rbhzfrwddqtb66\20cg"...,
79, 0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 79
sendto(10, "\261\216\1\0\0\1\0\0\0\0\0\0\4apps\10facebook\3com\0\0"...,
35, 0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 35
sendto(10, "\237\10\1\0\0\1\0\0\0\0\0\0\6sn121w\6snt121\4mail\4"..., 45,
0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 45
sendto(10, "FR\1\0\0\1\0\0\0\0\0\0\3api\6js-kit\3com\0\0\1\0\1", 32, 0,
{sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 32
sendto(10, "\207$\1\0\0\1\0\0\0\0\0\0\1l\naddthiscdn\3com\0\0\1"..., 34,
0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 34
sendto(10, "\265#\1\0\0\1\0\0\0\0\0\0\5usage\7hosting\7toolb"..., 60, 0,
{sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 60
sendto(10, "hZ\1\0\0\1\0\0\0\0\0\0\3www\ntargetgulf\3com\0"..., 36, 0,
{sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 36
sendto(10, "\355\345\1\0\0\1\0\0\0\0\0\0\10services\4apps\7condu"...,
43, 0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 43
sendto(10, "O\254\1\0\0\1\0\0\0\0\0\0\0060-jw-w\7channel\10face"..., 45,
0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 45
sendto(10, "\330\310\1\0\0\1\0\0\0\0\0\0\2t1\7gstatic\3com\0\0\1\0\1",
32, 0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 32
sendto(10, "\16Z\1\0\0\1\0\0\0\0\0\0\4mail\3aol\3com\0\0\1\0\1", 30, 0,
{sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 30
sendto(10, "\327\216\1\0\0\1\0\0\0\0\0\0\3www\10bisara7a\3com\0\0\1"...,
34, 0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 34
sendto(10, "o\201\1\0\0\1\0\0\0\0\0\0\003317\7channel\10faceboo"..., 42,
0, {sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 42
sendto(10, "\344?\1\0\0\1\0\0\0\0\0\0\7email03\fsecureserve"..., 42, 0,
{sa_family=AF_INET, sin_port=htons(53),
sin_addr=inet_addr("10.249.209.102")}, 16) = 42
sendto(21,
"\0\0\0\n\2\0\0|\0\0\0\4\0\0\0\0\0\1\0\30\0017\360\6\0\0\0002\0P\0\0"..., 132,
0, NULL, 0) = 132
sendto(21,
"\0\0\0\n\2\0\0|\0\0\0\4\0\0\0\0\0\1\0\30\1<\360\6\0\0\0\21\0P\0\0"...,
132, 0, NULL, 0) = 132
open("/cache2/pool6/squid/13/AB",
O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 379
getdents(379, /* 254 entries */, 32768) = 8112
getdents(379, /* 0 entries */, 32768) = 0
close(379) = 0
read(165, "!", 256) = 1
----------------------------------------------------------------------------------------------------------------------------------------------------
Squid is freezing at this point
Please advise
Thank you
Elie