Kadlecsik Jozsef wrote:
On Thu, 10 Apr 2008, Kadlecsik Jozsef wrote:
But this is a good clue to what might bite us most! Our GFS cluster is an
almost mail-only cluster for users with Maildir. When the users experience
temporary hangups for several seconds (even when writing a new mail), it
might be due to concurrent scanning for new mail on one node by the
MUA and delivery to the Maildir on another node by the MTA.
I personally don't know much about mail servers. But if anyone can
explain more about what these two processes (?) do, say, how that
"MTA" delivers its mail (by the "rename" system call?) and/or how mails
are moved from which node to where, we may have a better chance to figure
this puzzle out.
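
For readers unfamiliar with Maildir, here is a minimal sketch of the usual delivery pattern, written in Python purely for illustration. It assumes the common convention of writing the message under tmp/ and then renaming it into new/ (some delivery agents use link plus unlink instead), while a mail client polls by listing new/. The function names and the unique-name scheme are hypothetical and not taken from any particular MTA or MUA.

```python
import os
import socket
import tempfile
import time


def make_maildir(root):
    """Create the three standard Maildir subdirectories under root."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    return root


def deliver(maildir, message, seq=0):
    """Deliver one message: write it under tmp/, then rename it into new/."""
    # Hypothetical unique name built from time, pid, a counter and hostname.
    unique = "%d.%d_%d.%s" % (time.time(), os.getpid(), seq,
                              socket.gethostname())
    tmp_path = os.path.join(maildir, "tmp", unique)
    new_path = os.path.join(maildir, "new", unique)
    with open(tmp_path, "wb") as f:
        f.write(message)
        f.flush()
        os.fsync(f.fileno())
    # This rename involves two directories (tmp/ and new/) plus the file.
    os.rename(tmp_path, new_path)
    return new_path


def scan_new(maildir):
    """Poll for new mail the way a client would: list the new/ directory."""
    return sorted(os.listdir(os.path.join(maildir, "new")))


if __name__ == "__main__":
    root = make_maildir(tempfile.mkdtemp())
    print(deliver(root, b"Subject: test\n\nhello\n"))
    print(scan_new(root))
```

If the delivery runs on one node while the scan runs on another, both touch the same new/ directory, which is the kind of cross-node access suspected above.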
Note that the "rename" system call is normally very expensive. A minimum of
4 exclusive locks is required (two directory locks, one file lock for
unlink, one file lock for link), plus a resource group lock if block
allocation is required. There are numerous chances for deadlocks if this
is not handled carefully. The issue is further worsened by the way GFS1
does its lock ordering - it obtains multiple locks in lock name order.
Most of the lock names are taken from inode numbers, so their sequence is
always quite random. As soon as lock contention occurs, lock requests are
serialized to avoid deadlocks. So this may be a cause of these spikes,
where "rename"(s) are struggling to get the lock order straight. But I
don't know for sure unless someone explains how the email server does its
work.
BTW, GFS2 has relaxed this lock order issue so it should work better.
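
The general idea of taking several locks in a fixed name order can be sketched as follows. This is only an illustration of the technique in Python with in-process locks, not GFS code; the NamedLock class, the helper name and the inode-style numbers are all made up for the example.

```python
import threading
from contextlib import ExitStack


class NamedLock:
    """An in-process lock tagged with a sortable name (e.g. an inode number)."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()

    def __enter__(self):
        self._lock.acquire()
        return self

    def __exit__(self, *exc):
        self._lock.release()
        return False


def acquire_in_name_order(locks, log=None):
    """Acquire every lock in ascending name order; return a releasing stack."""
    stack = ExitStack()
    try:
        for lock in sorted(locks, key=lambda item: item.name):
            stack.enter_context(lock)
            if log is not None:
                log.append(lock.name)
    except BaseException:
        stack.close()  # release whatever was already taken
        raise
    return stack


if __name__ == "__main__":
    # Hypothetical lock names for one rename: two directories and two files.
    src_dir, dst_dir = NamedLock(40213), NamedLock(17)
    old_file, new_file = NamedLock(90001), NamedLock(523)
    order = []
    with acquire_in_name_order([src_dir, dst_dir, old_file, new_file], order):
        print("acquired in order:", order)
```

Because every caller sorts by the same key, two concurrent renames can never hold locks in opposite orders, which is what prevents deadlock. The cost is that the acquisition order has nothing to do with the order in which the operation needs the locks.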
I'm going on a trip (away from the internet), but I'm interested to know
how this story ends... Maybe by the time I get back to my laptop, someone
will have figured this out. But please do share the story :) ...
-- Wendy
What is really strange (and disturbing) is that such "hangups" can take
10-20 seconds, which is just too much for the users.
Yesterday we started to monitor the number of locks/held locks on two of
the machines. The results from the first day can be found at
http://www.kfki.hu/~kadlec/gfs/.
It looks as if Maildir is definitely the wrong choice for GFS and we should
consider converting to mailbox format: at least I cannot explain the
spikes in any other way.
In order to look at the possible tuning options and the side effects, I
list what I have learned so far:
- Increasing glock_purge (percent, default 0) helps to trim back the
unused glocks by gfs_scand itself. Otherwise glocks can accumulate and
gfs_scand spends more and more time scanning an ever larger
table of glocks.
- gfs_scand wakes up every scand_secs (default 5s) to scan the glocks,
looking for work to do. By increasing scand_secs one can lessen the load
produced by gfs_scand, but it'll hurt because flushing data can be
delayed.
- Decreasing demote_secs (seconds, default 300) helps to flush cached data
more often by moving write locks into less restricted states. Flushing
often helps to avoid burstiness *and* to avoid prolonging other nodes'
lock access. Question is, what are the side effects of small
demote_secs values? (Probably there is not much point in choosing a
demote_secs value smaller than scand_secs.)
Currently we are running with 'glock_purge = 20' and 'demote_secs = 30'.
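
For reference, a small sketch of how these settings could be turned into gfs_tool settune invocations. It assumes the GFS1 form "gfs_tool settune <mountpoint> <parameter> <value>"; the mount point /mnt/gfs and the helper name are placeholders, and the script only builds the command lines without running them.

```python
def settune_commands(mountpoint, tunables, scand_secs=5):
    """Build gfs_tool settune command lines and collect sanity warnings."""
    commands = []
    warnings = []
    for name, value in tunables.items():
        if name == "glock_purge" and not 0 <= value <= 100:
            raise ValueError("glock_purge is a percentage (0-100)")
        commands.append(
            ["gfs_tool", "settune", mountpoint, name, str(value)])
    # Effective scand_secs: the requested value, else the assumed default.
    effective_scand = tunables.get("scand_secs", scand_secs)
    demote = tunables.get("demote_secs")
    if demote is not None and demote < effective_scand:
        warnings.append(
            "demote_secs (%d) is smaller than scand_secs (%d)"
            % (demote, effective_scand))
    return commands, warnings


if __name__ == "__main__":
    # /mnt/gfs is a placeholder mount point, not taken from the thread.
    cmds, warns = settune_commands(
        "/mnt/gfs", {"glock_purge": 20, "demote_secs": 30})
    for cmd in cmds:
        print(" ".join(cmd))
    for warning in warns:
        print("warning:", warning)
```

The warning simply encodes the guess above that a demote_secs value below scand_secs is probably pointless; it is not a documented constraint.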
Best regards,
Jozsef
--
E-mail : kadlec@xxxxxxxxxxxx, kadlec@xxxxxxxxxxxxxxxxx
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
H-1525 Budapest 114, POB. 49, Hungary
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster