Re: stable12 aufs performance (was: stable12 segfault)

Mike Rambo <mrambo@xxxxxxxxxxxxx> · Wed, 22 Mar 2006 14:20:22 -0500

Henrik Nordstrom wrote:
Wed 2006-03-01 at 07:44 -0500, Mike Rambo wrote:

Bummer. No core file. We're running Linux (CentOS 4.2), but with diskd
instead of aufs for the cache_dir. Debugging symbols are there, so I guess
it's because squid didn't have write permission to its 'current
directory' as listed in the logs at startup. I've corrected that. If
this happens again I'll follow the instructions and file the report.

I would recommend aufs.

http://www.squid-cache.org/bugs/show_bug.cgi?id=761
http://www.squid-cache.org/bugs/show_bug.cgi?id=1500

Regards
Henrik

Switched over to aufs as recommended. We've gone three weeks without
a crash of that type, so it would appear that the segfault was indeed
diskd related. Unfortunately, switching to aufs doesn't help get diskd fixed.

Another negative consequence of switching to aufs is that we're now
seeing noticeable slowness at peak usage with many of the following 
(line wrapped) warnings in the logs.

Dozens upon dozens of these:
2006/03/22 10:23:22| squidaio_queue_request: WARNING - Disk I/O overloading
2006/03/22 10:23:22| squidaio_queue_request: Queue Length: current=387,
high=399, low=261, duration=20
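
To put a number on "dozens upon dozens", the warnings can be counted per
minute. This is a minimal sketch in Python, not anything that ships with
Squid. The function name is hypothetical, and the sample lines are the two
log lines above with the wrapped one rejoined.

```python
import re
from collections import Counter

# Timestamp prefix as it appears in the log lines above,
# e.g. "2006/03/22 10:23:22| message". Group 1 keeps date + hour:minute.
LINE_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}):\d{2}\| (.*)$")

def overloads_per_minute(lines):
    """Count 'Disk I/O overloading' warnings, keyed by minute (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and "Disk I/O overloading" in m.group(2):
            counts[m.group(1)] += 1
    return counts

sample = [
    "2006/03/22 10:23:22| squidaio_queue_request: WARNING - Disk I/O overloading",
    "2006/03/22 10:23:22| squidaio_queue_request: Queue Length: current=387, "
    "high=399, low=261, duration=20",
]
for minute, n in sorted(overloads_per_minute(sample).items()):
    print(minute, n)
```

In practice the lines would come from the cache log file instead of the
sample list.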

There are also fewer but consistent requests for more redirectors (I've
already doubled them and it wants several multiples more).

The server is quite busy:
[root@squid1 ~]# uptime
09:02:38 up 42 days, 21:38,  3 users,  load average: 29.51, 30.02, 28.22

An 'ultimate guide' site recommended looking at median service times:
Median Service Times (seconds)   5 min    60 min
HTTP Requests (All):             0.24524  0.30459
Cache Misses:                    0.12783  0.13498
Cache Hits:                      1.24267  2.37608
Near Hits:                       1.81376  3.28534
Not-Modified Replies:            1.11539  2.25116
DNS Lookups:                     0.00464  0.00669
ICP Queries:                     0.00000  0.00000

From cachemgr:
client_http.requests = 129.502997/sec
client_http.hits = 54.996938/sec
client_http.errors = 0.000000/sec
client_http.kbytes_in = 81.994487/sec
client_http.kbytes_out = 1164.754176/sec
client_http.all_median_svc_time = 0.304593 seconds
client_http.miss_median_svc_time = 0.134979 seconds
client_http.nm_median_svc_time = 2.132801 seconds
client_http.nh_median_svc_time = 3.285335 seconds
client_http.hit_median_svc_time = 2.251157 seconds

(pause)

While I was gathering information and putting together this email we did 
have another crash - this time with aufs. There is no mention of a 
segfault in the logs and no core file but there are other interesting 
log items.

Quite a number of these:
2006/03/22 12:03:57| storeAufsOpenDone: (2) No such file or directory
2006/03/22 12:03:57|    /mnt/cache1/09/3E/00193E5A

And:
2006/03/22 12:34:06| WARNING: All redirector processes are busy.
2006/03/22 12:34:06| WARNING: up to 237 pending requests queued
2006/03/22 12:34:06| Consider increasing the number of redirector
processes to at least 317 in your config file.
2006/03/22 12:34:36| WARNING: All redirector processes are busy.
2006/03/22 12:34:36| WARNING: up to 577 pending requests queued
2006/03/22 12:34:36| storeDirWriteCleanLogs: Starting...

(pause)

Back up and running on diskd, we're using about half of the 75
redirect_children configured, instead of running out of them and piling
up requests. Total traffic is up and response times are way down too.

client_http.requests = 149.641784/sec
client_http.hits = 31.388008/sec
client_http.errors = 0.000000/sec
client_http.kbytes_in = 95.100682/sec
client_http.kbytes_out = 1310.018099/sec
client_http.all_median_svc_time = 0.092188 seconds
client_http.miss_median_svc_time = 0.114648 seconds
client_http.nm_median_svc_time = 0.006779 seconds
client_http.nh_median_svc_time = 0.082651 seconds
client_http.hit_median_svc_time = 0.007665 seconds
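
For a side-by-side view of the two sets of cachemgr figures, here is a
minimal sketch in Python. The parser name is hypothetical and it only
handles the "name = value unit" lines quoted in this mail. The numbers are
copied from the aufs and diskd outputs above.

```python
def parse_cachemgr(text):
    """Parse 'name = value unit' lines into a dict of floats (hypothetical helper)."""
    stats = {}
    for line in text.strip().splitlines():
        name, _, rest = line.partition(" = ")
        # The value is the leading token; drop a trailing '/sec' if present.
        stats[name.strip()] = float(rest.split()[0].split("/")[0])
    return stats

# Figures taken while running aufs (earlier in this mail).
aufs = parse_cachemgr("""
client_http.requests = 129.502997/sec
client_http.all_median_svc_time = 0.304593 seconds
client_http.hit_median_svc_time = 2.251157 seconds
""")

# Figures taken after switching back to diskd.
diskd = parse_cachemgr("""
client_http.requests = 149.641784/sec
client_http.all_median_svc_time = 0.092188 seconds
client_http.hit_median_svc_time = 0.007665 seconds
""")

for key in aufs:
    ratio = aufs[key] / diskd[key]
    print(f"{key}: aufs={aufs[key]} diskd={diskd[key]} aufs/diskd={ratio:.1f}")
```

The two samples were taken at different times of day and the noatime
remount happened in between, so the ratios are only a rough indication.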

Load is way down from before - more what I was accustomed to.
[root@squid1 squid]# uptime
13:20:35 up 43 days,  1:56,  2 users,  load average: 1.21, 1.45, 2.60

One thing is different, however: because of some messages I saw, I have
remounted the cache_dirs with noatime, which wasn't the case before.

I was going to include build and runtime configuration information, but
this mail is already huge. I'll hold the build information until someone
indicates they need it. The runtime config is huge too, so instead of
including it in the email I've put it at an accessible location if it'll
help.

http://scnc.lsd.k12.mi.us/~mrambo/runtime.txt

The bottom-line question here is: what can I do to get the best stability
and performance? I'll do what I can to help figure out some of these
problems too.

Should this have been submitted as a bug report?

--
Mike Rambo
mrambo@xxxxxxxxxxxxx

"They that can give up essential liberty to obtain a little
temporary security, deserve neither liberty or security."
        -- Benjamin Franklin