On 20.11.2012 09:33, Frantisek Hanzlik wrote:
Amos Jeffries wrote:
On 12.11.2012 03:25, Frantisek Hanzlik wrote:
With this squid configuration:
acl localnet src 172.16.0.0/12
acl localnet src 192.168.0.0/16
acl SSL_ports port 443
acl SSL_ports port 85
acl SSL_ports port 81
acl SSL_ports port 5443
acl Safe_ports port 80
acl Safe_ports port 21
acl Safe_ports port 443
acl Safe_ports port 70
acl Safe_ports port 210
acl Safe_ports port 1025-65535
acl Safe_ports port 280
acl Safe_ports port 488
acl Safe_ports port 591
acl Safe_ports port 777
acl Safe_ports port 5443
acl Safe_ports port 85
acl Safe_ports port 81
acl CONNECT method CONNECT
http_access allow manager localhost
http_access deny manager
http_access deny !Safe_ports
http_access deny CONNECT !SSL_ports
http_access allow localnet
http_access allow localhost
http_access deny all
http_port 3128
hierarchy_stoplist cgi-bin ?
cache_dir ufs /var/spool/squid 1000 16 256 max-size=999000
cache_mem 512 MB
maximum_object_size 4096 KB
memory_pools off
cache_swap_low 90
cache_swap_high 95
dns_nameservers 172.16.1.1
client_db off
half_closed_clients off
max_filedesc 4096
coredump_dir /var/spool/squid
refresh_pattern ^ftp: 1440 20% 10080
refresh_pattern ^gopher: 1440 0% 1440
refresh_pattern -i (/cgi-bin/|\?) 0 0% 0
refresh_pattern . 0 20% 4320
acl users src 172.31.0.0/16
delay_pools 1
delay_class 1 2
delay_parameters 1 5000000/10000000 5000000/10000000
delay_access 1 allow users
Squid very often loads the CPU to near 100%, with approx. 200 users and 4000
connections (~2000 to users, 2000 to the internet). Removing the delay pool
configuration has no big effect.
Users and connections are meaningless measures in HTTP. Requests per second
flowing over those connections is what counts.
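A rough way to measure that from the access log (a sketch, assuming the
default native log format with a UNIX timestamp in field 1; 'squidclient
mgr:info' also reports request averages):

  # count requests logged during the last 60 seconds
  awk -v now=$(date +%s) '$1 > now-60' /var/log/squid/access.log | wc -l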
The proxy might have all 4000 links idle (low CPU; zero bandwidth; zero disk
I/O), be downloading video (or .iso) images simultaneously (low CPU; maxed
out bandwidth; high disk I/O), or parsing and processing header-only requests
(100% CPU; moderate or low bandwidth; no disk I/O).
NP: 3.2 uses HTTP/1.1. A lot of the protocol performance features in HTTP/1.1
are done by removing object bodies and reducing the transactions to
header-only requests and responses.
HW configuration: Dual core E8500@3.16GHz CPU, 4GB RAM, 2x SATA 7k2 Raid
Edition disks in SW RAID1 for the squid cache (disk performance doesn't seem
to be the problem; IOWAIT is small), gigabit ethernet cards to the internet
(~800 Mbps line) and to the LAN.
It is squid-3.2.3.20121106.r11695-1.fc14.i686 on Fedora 14 i686
(I tested it with an older squid 3.1 version and the same configuration too,
but the results were the same, or rather worse).
Is this CPU load normal, or can some performance tuning be done for it?
In order of likelihood:
Experiment #1 is to remove that SW RAID and test again.
Sure, iowait is not bad (to the *master* disk it is the same as accessing it
without RAID at all), however iowait is only half the story with SW RAID.
Being software, every I/O op sucks away some CPU. Disk I/O CPU loading in
particular can be doubled depending on the implementation's buffering
efficiency.
If you need RAID at all, use HW RAID instead of SW. The only benefit you get
from RAID under Squid is some advance notice of disks failing before the
proxy crashes (or starts TCP_SWAPFAIL_MISS'ing - but the UFS cache type still
crashes, so maybe you do need RAID).
By using SW RAID in particular you are taking CPU cycles away from Squid,
which would otherwise be using them to process a higher req/sec peak
capacity. If your peak traffic is low req/sec this is not a problem, but for
several thousand users I expect your peak capacity needs are high.
The choice is yours, but I seriously advise moving away from SW RAID.
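A rough way to see what the SW RAID itself costs in CPU (a sketch; the MD
kernel thread names such as md0_raid1 vary by kernel and array name):

  # CPU consumed by the MD/RAID kernel threads during peak traffic
  ps -eo comm,pcpu | grep '^md'

Anything more than a few percent here at peak is CPU Squid could have used.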
Experiment #2 is to set "memory_pools on".
Up to 90% of malloc/free calls are for very short strings and small objects.
Squid can save on a lot of system allocator cycles and shrink the overall
system RAM requirements a little bit by allocating these in
batches/blocks/pools. This will help speed up req/sec capacity in the traffic
which consists of mostly HTTP headers.
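In squid.conf that is simply (a minimal sketch; memory_pools_limit is
optional and the 64 MB cap is an illustrative value, not a recommendation):

  memory_pools on
  # optionally cap the idle pooled memory Squid keeps around for reuse
  memory_pools_limit 64 MB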
Experiment #3 is to raise max_filedesc
I suspect this setting is the only thing limiting your proxy to 2000 user
connections and 2000 internet connections. Notice that makes 4000, and with
100 connections held in reserve for unexpected disk access it would seem that
you are not using disks at all for many of these connections (TCP_MISS and
TCP_MEM_HIT being most of your load?).
When there are not enough FDs to service all incoming requests, Squid starts
limiting them and spends extra CPU cycles on management of the waiting client
connection queue.
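A sketch of the change (65536 is an illustrative ceiling; the OS per-process
descriptor limit must be raised to match, or max_filedesc is capped below it):

  # squid.conf
  max_filedesc 65536

  # in the startup script, before launching squid:
  ulimit -HSn 65536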
Experiment #4 is to set "cache_swap_low 97" and "cache_swap_high 98".
Periodic garbage collection happens when the cache fills up. With a 1GB disk
cache the default settings mean about 50-100MB of objects get processed in
the cache and erased from disk. Most objects will be only a few KB - see your
avg object size stats.
NOTE: this may only marginally appear in iowait, but shows up better in the
related erase/unlink operation stats.
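Against the config above that is just (values as suggested; with the 1000 MB
cache_dir this narrows each garbage-collection pass from roughly 50MB down to
roughly 10MB of objects):

  cache_swap_low 97
  cache_swap_high 98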
Amos
Thank you for the recommendations. I think #3 perhaps isn't the problem here,
as squid's cache.log says there are 16384 file descriptors (I do not
understand why it is this number; in the startup script I'm setting
'ulimit -HSn 65536' just before starting squid - then why is the number of
FDs only 16384? Some other system limitation?):
2012/11/17 08:40:29 kid1| Starting Squid Cache version 3.2.3-20121106-r11695 for i686-pc-linux-gnu...
2012/11/17 08:40:29 kid1| Process ID 2497
2012/11/17 08:40:29 kid1| Process Roles: worker
2012/11/17 08:40:29 kid1| With 16384 file descriptors available
2012/11/17 08:40:29 kid1| Initializing IP Cache...
2012/11/17 08:40:29 kid1| DNS Socket created at 0.0.0.0, FD 9
2012/11/17 08:40:29 kid1| Adding nameserver 172.16.1.1 from squid.conf
2012/11/17 08:40:29 kid1| Logfile: opening log daemon:/var/log/squid/access.log
2012/11/17 08:40:29 kid1| Logfile Daemon: opening log /var/log/squid/access.log
2012/11/17 08:40:29 kid1| Unlinkd pipe opened on FD 15
2012/11/17 08:40:29 kid1| Local cache digest enabled; rebuild/rewrite every 3600/3600 sec
2012/11/17 08:40:29 kid1| Store logging disabled
2012/11/17 08:40:29 kid1| Swap maxSize 1024000 + 524288 KB, estimated 119099 objects
2012/11/17 08:40:29 kid1| Target number of buckets: 5954
2012/11/17 08:40:29 kid1| Using 8192 Store buckets
2012/11/17 08:40:29 kid1| Max Mem size: 524288 KB
2012/11/17 08:40:29 kid1| Max Swap size: 1024000 KB
2012/11/17 08:40:29 kid1| Rebuilding storage in /var/spool/squid (dirty log)
2012/11/17 08:40:29 kid1| Using Least Load store dir selection
2012/11/17 08:40:29 kid1| Set Current Directory to /var/spool/squid
2012/11/17 08:40:30 kid1| Loaded Icons.
2012/11/17 08:40:30 kid1| HTCP Disabled.
2012/11/17 08:40:30 kid1| Squid plugin modules loaded: 0
2012/11/17 08:40:30 kid1| Adaptation support is off.
2012/11/17 08:40:30 kid1| Accepting HTTP Socket connections at local=0.0.0.0:3128 remote=[::] FD 18 flags=9
2012/11/17 08:40:30 kid1| Store rebuilding is 0.11% complete
2012/11/17 08:40:45 kid1| Store rebuilding is 40.44% complete
2012/11/17 08:41:00 kid1| Store rebuilding is 83.15% complete
2012/11/17 08:41:05 kid1| Done reading /var/spool/squid swaplog (3699238 entries)
2012/11/17 08:41:05 kid1| Finished rebuilding storage from disk.
2012/11/17 08:41:05 kid1| 1877119 Entries scanned
2012/11/17 08:41:05 kid1| 6 Invalid entries.
2012/11/17 08:41:05 kid1| 0 With invalid flags.
2012/11/17 08:41:05 kid1| 54901 Objects loaded.
2012/11/17 08:41:05 kid1| 0 Objects expired.
2012/11/17 08:41:05 kid1| 1821036 Objects cancelled.
2012/11/17 08:41:05 kid1| 53 Duplicate URLs purged.
2012/11/17 08:41:05 kid1| 1129 Swapfile clashes avoided.
2012/11/17 08:41:05 kid1| Took 35.95 seconds (1527.34 objects/sec).
2012/11/17 08:41:05 kid1| Beginning Validation Procedure
2012/11/17 08:41:06 kid1| Completed Validation Procedure
2012/11/17 08:41:06 kid1| Validated 54895 Entries
2012/11/17 08:41:06 kid1| store_swap_size = 922068.00 KB
2012/11/17 08:41:06 kid1| storeLateRelease: released 6 objects
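(For what it's worth, the limit the worker actually inherited can be
double-checked from /proc - a sketch, assuming Linux and using the worker
PID 2497 printed in the log above:)

  # the per-process limits the running worker really got
  grep 'open files' /proc/2497/limits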
Thus I started with recommendation #2: setting 'memory_pools on'.
At first this seemed to help greatly: CPU load dropped to approx. 40-50% at
the time of highest load (which is only between 18 and 21 hours our time
/UTC+1/; at all other times squid's load is much smaller and the CPU load was
reasonable). But then I noticed that the '(squid-1)' process (PID 14667
below, child of 11769) changes its PID, as if it died and the parent started
a new copy:
1 11769 11769 11769 ? -1 Ss 0 0:00 squid -f /etc/squid/squid.conf
11769 14667 11769 11769 ? -1 R 23 0:11 (squid-1) -f /etc/squid/squid.conf
14667 14670 11769 11769 ? -1 S 23 0:00 (logfile-daemon) /var/log/squid/access.log
And then I found these lines in /var/log/messages
(/var/log/squid/cache.log contained nothing suspicious(*)):
Nov 19 16:53:40 172 squid[2834]: Squid Parent: (squid-1) process 32702 exited due to signal 6 with status 0
Ouch. That needs to be found and fixed.
Are you able to assist with that by running Squid under a debugger and
getting a backtrace?
(http://wiki.squid-cache.org/SquidFaq/BugReporting#Using_gdb_debugger_on_a_live_proxy_.28with_minimal_downtime.29)
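The rough shape of that procedure (a sketch; the binary path and worker PID
are illustrative, and the binary needs debug symbols for a useful trace):

  # attach to the running worker without restarting it
  gdb /usr/sbin/squid 14667
  (gdb) continue
  ... wait for the signal 6 abort ...
  (gdb) backtrace

The backtrace output is what goes into the bug report.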
<snip repeats>
Nov 19 19:14:19 172 squid[2834]: Squid Parent: (squid-1) process 8804 will not be restarted due to repeated, frequent failures
Nov 19 19:14:19 172 squid[2834]: Exiting due to repeated, frequent failures
At this time the parent squid process exited as well and it was necessary to
manually restart squid. Note this complete exit occurred after seven days of
working. And before these 7 days squid crashed with this message in
"/var/log/messages":
Nov 12 20:14:38 172 kernel: [9002060.957453] squid[2494]: segfault at e51 ip 082d4917 sp bfc7dbc0 error 6 in squid[8048000+460000]
Nov 12 20:14:38 172 squid[32322]: Squid Parent: (squid-1) process 2494 exited due to signal 11 with status 0
Signal 6 is SIGABRT; how can it appear here?
(*) /var/log/squid/cache.log contains several messages like this; I do not
know if they are important:
<snip>
2012/11/19 20:55:52 kid1| Failed to select source for 'http://u1353354961016s928369347.r6.td.ipv6test.semnicneposilejte.cz/1x1.png?u1353354961016s928369347.r6.td'
2012/11/19 20:55:52 kid1| always_direct = 0
2012/11/19 20:55:52 kid1| never_direct = 0
2012/11/19 20:55:52 kid1| timedout = 0
--> IPv6-only domain
2012/11/19 20:55:52 kid1| Failed to select source for 'http://pages.etology.com/crossdomain.xml'
2012/11/19 20:55:52 kid1| always_direct = 0
2012/11/19 20:55:52 kid1| never_direct = 0
2012/11/19 20:55:52 kid1| timedout = 0
--> NXDOMAIN
2012/11/19 20:55:53 kid1| ipcacheParse: No Address records in response to 'ipv6.msftncsi.com'
--> IPv6-only domain
You can ignore *most* of these for now; we have a bug open and being worked
on to clean up the unnecessary logging from normal DNS failures and make any
important ones stand out better.
What now?
Should I try a newer version than the just-tested 3.2.3.20121106.r11695?
Or do I have some wrong settings?
I'm not using an IPv6 address on any interface and the ipv6.ko kernel module
isn't loaded (it is Fedora 14 i686, which still has ipv6 as a module; later
Fedora versions have ipv6 compiled into the kernel). Is that OK?
(squid does not have any command-line switches for disabling IPv6, right?)
Squid auto-detects IPv6 on startup. That seems to be working fine. It has
already detected the lack of it in your kernel, and the DNS results above are
what happens when IPv6-only domains are asked for by your clients - they get
an error page back instead of the object requested.
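(One way to confirm that - a sketch using the standard 'host' tool on a name
from the log above; an IPv6-only name returns AAAA records but no A records:)

  host -t A ipv6.msftncsi.com
  host -t AAAA ipv6.msftncsi.com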
Amos