Re: dlm and IO speed problem <er, might wanna get a coffee first ; )>

Wendy Cheng <s.wendy.cheng@xxxxxxxxx> · Fri, 11 Apr 2008 23:16:52 -0500

Kadlecsik Jozsef wrote:
On Thu, 10 Apr 2008, Kadlecsik Jozsef wrote:

But this is a good clue to what might bite us most! Our GFS cluster is an 
almost mail-only cluster for users with Maildir. When the users experience 
temporary hangups for several seconds (even when writing a new mail), it 
might be due to the concurrent scanning for new mail on one node by the 
MUA and the delivery to the Maildir on another node by the MTA.

I personally don't know much about mail servers. But if anyone can 
explain more about what these two processes (?) do, say, how that 
"MTA" delivers its mail (by the "rename" system call?) and/or how mails 
are moved from which node to where, we may have a better chance to figure 
this puzzle out.

Note that the "rename" system call is normally very expensive. A minimum 
of 4 exclusive locks is required (two directory locks, one file lock for 
unlink, one file lock for link), plus a resource group lock if block 
allocation is required. There are numerous chances for deadlocks if not 
handled carefully. The issue is further worsened by the way GFS1 does its 
lock ordering: it obtains multiple locks based on lock name order. Most 
of the lock names are taken from inode numbers, so their sequence is 
always quite random. As soon as lock contention occurs, lock requests are 
serialized to avoid deadlocks. So this may be a cause of these spikes, 
where "rename"(s) are struggling to get the lock order straight. But I 
don't know for sure unless someone explains how an email server does its 
things. BTW, GFS2 has relaxed this lock order issue so it should work 
better.
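
The lock-name ordering described above can be illustrated with a small
user-space sketch. This is not GFS code: NamedLock, rename_lock_set and
locked_rename are hypothetical names invented for the illustration, and
plain thread locks stand in for cluster locks. It only shows the idea
that every caller sorts the locks it needs by name before taking them,
so two renames that touch the same directories in opposite roles cannot
deadlock, at the price of waiting in line behind each other.

```python
import threading


class NamedLock:
    """A lock identified by a number, standing in for an inode-derived lock name."""

    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()


def rename_lock_set(src_dir, dst_dir, src_file, dst_file):
    """Return the locks a rename needs, de-duplicated and sorted by lock name."""
    unique = {l.name: l for l in (src_dir, dst_dir, src_file, dst_file)}
    return [unique[name] for name in sorted(unique)]


def locked_rename(locks, action):
    """Take all locks in name order, run action(), then release in reverse order."""
    ordered = rename_lock_set(*locks)
    for l in ordered:
        l.lock.acquire()
    try:
        return action()
    finally:
        for l in reversed(ordered):
            l.lock.release()


if __name__ == "__main__":
    # Two "renames" that use the same directories in opposite roles.
    d1, d2 = NamedLock(907), NamedLock(12)
    f1, f2 = NamedLock(455), NamedLock(31)
    t1 = threading.Thread(target=locked_rename, args=((d1, d2, f1, f2), lambda: None))
    t2 = threading.Thread(target=locked_rename, args=((d2, d1, f2, f1), lambda: None))
    t1.start(); t2.start()
    t1.join(); t2.join()
```

Because the names come from inode numbers, the sorted order has nothing
to do with the order in which the operation logically needs the locks,
which is one way to picture why contended renames end up serialized.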

I'm going on a trip (away from the internet) but I'm interested to know 
this story... Maybe by the time I get back on my laptop, someone has 
figured this out. But please do share the story :) ...

-- Wendy

What is really strange (and disturbing) is that such "hangups" can take 
10-20 seconds, which is just too much for the users.

Yesterday we started to monitor the number of locks/held locks on two of 
the machines. The results from the first day can be found at 
http://www.kfki.hu/~kadlec/gfs/.

It looks as if Maildir is definitely a wrong choice for GFS and we should 
consider converting to mailbox format: at least I cannot explain the 
spikes in another way.

In order to look at the possible tuning options and the side effects, I 
list what I have learned so far:

- Increasing glock_purge (percent, default 0) helps to trim back the 
  unused glocks by gfs_scand itself. Otherwise glocks can accumulate and 
  gfs_scand spends more and more time scanning the larger and 
  larger table of glocks.
- gfs_scand wakes up every scand_secs (default 5s) to scan the glocks,  
  looking for work to do. By increasing scand_secs one can lessen the load 
  produced by gfs_scand, but it'll hurt because flushing data can be 
  delayed.
- Decreasing demote_secs (seconds, default 300) helps to flush cached data
  more often by moving write locks into less restricted states. Flushing 
  often helps to avoid burstiness *and* to prolong other nodes' 
  lock access. Question is, what are the side effects of small
  demote_secs values? (Probably there is not much point in choosing
  a demote_secs value smaller than scand_secs.)

Currently we are running with 'glock_purge = 20' and 'demote_secs = 30'.
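
For reference, a minimal sketch of how the two settings above could be
turned into command lines for GFS1's "gfs_tool settune". The helper name
settune_commands and the mount point /mnt/gfs are assumptions made for
the illustration, not taken from this thread; the sketch only builds and
prints the commands, it does not run them. As far as I know, settune
values do not survive a remount, so they would need to be reapplied
after each mount.

```python
import shlex


def settune_commands(mount_point, tunables):
    """Build one 'gfs_tool settune' command line per tunable (nothing is executed)."""
    return [
        "gfs_tool settune {} {} {}".format(shlex.quote(mount_point), name, int(value))
        for name, value in tunables.items()
    ]


if __name__ == "__main__":
    # /mnt/gfs is a hypothetical mount point; the values are the ones quoted above.
    for cmd in settune_commands("/mnt/gfs", {"glock_purge": 20, "demote_secs": 30}):
        print(cmd)
```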

Best regards,
Jozsef
--
E-mail : kadlec@xxxxxxxxxxxx, kadlec@xxxxxxxxxxxxxxxxx
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
