On Feb 28, 2008, at 4:38 PM, Jeff Fookson wrote:

> is about 200GB. There are typically about 200 'imapd'
> processes at a given time and a hugely varying number of 'lmtpds'
> (from about 6 to many hundreds during times of greatest pathology).
> System load is correspondingly in the 2-15 range, but can spike to
> 50-70!

Typically, when a deadlock frees up you get load spikes, because work can now progress. That implies one thing was holding the lock for a long time, and that thing itself was probably being impeded by something else. If it were just high activity from many things hitting the lock, you wouldn't expect to see spikes; the system might even look idle, as everything is just waiting for the lock.

> waits of upwards of 1-2 minutes to get a write lock as shown by the
> example below (this is from a trace of an 'lmtpd')
>
> [strace -f -p 9817 -T]
> 9817 fcntl(10, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0,
> len=0}) = 0 <84.998159>
[...]
> Can anyone suggest what we might do next to debug the problem further?

Good job with the strace. Now figure out what fd 10 is, either with lsof or from earlier in the strace output (look for "= 10", which should show the call that opened it). Then install lslk and figure out who is holding the lock on that file, and for how long. Then look at that process to see what it's doing for so long (strace again).

-nik
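
If lsof or lslk are not handy, the same two lookups (what is fd 10, and who holds a lock on that file) can be roughly sketched in Python. This is only an illustration and rests on assumptions that are not in the original post: a Linux box with /proc mounted, enough privilege to read /proc/<pid>/fd, and the usual /proc/locks layout. The function names are made up for this example. It matches on inode number alone, which is only unique within one filesystem, so treat any hit as a candidate to confirm with lsof.

```python
import os


def fd_inode(pid, fd):
    """Return (path, inode) for an open descriptor of another process.

    Assumes Linux: /proc/<pid>/fd/<fd> is a symlink to the open file.
    Usually needs root or the same user as the target process.
    """
    link = f"/proc/{pid}/fd/{fd}"
    return os.readlink(link), os.stat(link).st_ino


def parse_locks(text):
    """Parse the text of /proc/locks into one dict per lock entry."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Processes blocked waiting on a lock carry a "->" marker after the id.
        blocked = len(fields) > 1 and fields[1] == "->"
        if blocked:
            del fields[1]
        if len(fields) < 8:
            continue  # not a line in the expected layout
        entries.append({
            "kind": fields[1],    # POSIX, FLOCK, ...
            "mode": fields[3],    # READ or WRITE
            "pid": int(fields[4]),
            # fields[5] looks like "major:minor:inode"; keep the inode only.
            "inode": int(fields[5].rsplit(":", 1)[1]),
            "blocked": blocked,
        })
    return entries


def lock_holders(text, inode):
    """Return entries that currently hold (not wait for) a lock on inode."""
    return [e for e in parse_locks(text)
            if e["inode"] == inode and not e["blocked"]]
```

For the trace above, one would call `fd_inode(9817, 10)` to get the path and inode behind fd 10, then pass the contents of /proc/locks and that inode to `lock_holders` to get candidate PIDs to strace. Note that /proc/locks does not say how long a lock has been held; for that you still have to watch the holder.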