Re: Slow lmtpd

> Can values way above 100% be trusted? If so, it's pretty bad (this is
> from a situation where there are 200 lmtp processes, which is the
> current limit I set):

I've never seen over 100%, and it doesn't seem to make sense, so I'm guessing it's a bogus value.

> avg-cpu:  %user   %nice %system %iowait   %idle
>           2.53    0.00    5.26   89.98    2.23

However, this shows that the system is mainly waiting on IO, as we expected.

> Device:        rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s   rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm     %util
> etherd/e0.0      0.00    0.00   5.87  235.02  225.10  2513.77  112.55  1256.88    11.37     0.00  750.32 750.32  18074.51

Ugg, if you line those up, await = 750.32

await - The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.

So it's taking 0.75 seconds on average to service an IO request; that's really bad.
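
As a rough back-of-the-envelope (Little's law: requests in flight = arrival rate x average time in the system, and assuming the r/s and w/s columns can be trusted even if %util can't):

  (5.87 + 235.02) req/s  x  0.750 s average await  ~=  180 requests outstanding

which is about what you'd expect to see with ~200 lmtpds all blocked on that one device.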

> Load average tends to get really high. It starts increasing really fast
> after the number of lmtpd processes reaches the limit set in cyrus.conf,
> and can easily get to 150 or 200. One of the moments where the problem

Makes sense. There are 200 lmtpd processes waiting on IO, and on Linux at least, the load average is basically the number of processes that aren't sleeping, i.e. processes that are runnable or stuck in uninterruptible (IO) wait.
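
If you want to confirm that's what is driving the load average, something quick like this shows it (stock procps tools, nothing Cyrus-specific):

  # the load average as the kernel reports it
  cat /proc/loadavg

  # processes in D state (uninterruptible sleep, almost always waiting on IO);
  # with ~200 stuck lmtpds you'd expect roughly that many lmtpd lines here
  ps -eo state,pid,comm | awk '$1 == "D"'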

Really, you never want that many lmtpd processes; if they're all in use, it's clear you've got an IO problem. Limiting it to 10 or so is probably a reasonable number to avoid complete IO saturation and long IO service delays.
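
For reference, that limit is the maxchild setting on the lmtp service entry in cyrus.conf; a sketch (the socket path is just the usual example, keep whatever you have now):

  SERVICES {
    # ...
    lmtpunix  cmd="lmtpd" listen="/var/imap/socket/lmtp" prefork=0 maxchild=10
  }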

> - The ones that don't have the problem use local disks instead of AoE
> - The ones that don't have the problem are limited to 2000 domains
> (around 8000 accounts), while the one using the AoE storage serves 4000
> domains (around 20000 accounts).

> Anyone running cyrus with that many accounts?

Yes, no problem, though using local disks.

I think the problem is probably the latency that AoE introduces into the disk path. A couple of questions:

1. How many disks in the AoE array?
2. Are they all one RAID array, or multiple RAID arrays? What type?
3. Are they one volume, or multiple volumes?

Because of the latency of system <-> drive IO, what you want to do is allow the OS to keep more outstanding requests in flight in parallel. The problem is I don't know where in the FS <-> RAID <-> AoE path the serialising bits are, so I'm not sure what the best way to increase parallelism is, but the usual things to try are more RAID arrays with fewer drives per array, and more volumes per RAID array. That gives more places for parallelism to occur, assuming there isn't something holding an internal lock somewhere.

Some of our machines have 4 RAID arrays divided up into 40 separate filesystems/volumes.
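
On the Cyrus side, spreading the spool over multiple volumes is just a matter of defining one partition per filesystem in imapd.conf; roughly like this (partition names and mount points made up for illustration):

  defaultpartition: p01
  partition-p01: /mail/vol01
  partition-p02: /mail/vol02
  partition-p03: /mail/vol03
  # ... one partition-* entry per volume

and then creating/placing mailboxes on the different partitions.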

Rob

----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
