On Thursday 21 October 2004 08:06, David Teigland wrote:
> When gfs is holding over DROP_LOCKS_COUNT locks (locally), lock_dlm
> tells gfs to "drop locks". When gfs drops locks, it invalidates the
> cached data they protect. du in the linux src tree requires gfs to
> acquire some 16,000 locks. Since this exceeded 10,000, lock_dlm was
> having gfs toss the cached data from the previous du. If we raise
> the limit to 100,000, there's no "drop locks" callback and everything
> remains cached.
>
> This "drop locks" callback is a way for the lock manager to throttle
> things when it begins reaching its own limitations. 10,000 was
> picked pretty arbitrarily because there's no good way for the dlm to
> know when it's reaching its limitations. This is because the main
> limitation is free memory on remote nodes.
>
> The dlm can get into a real problem if gfs holds "too many" locks.
> If a gfs node fails, it's likely that some of the locks the dlm
> mastered on that node need to be remastered on remaining nodes.
> Those remaining nodes may not have enough memory to remaster all the
> locks -- the dlm recovery process eats up all the memory and hangs.

You need to maintain a memory pool for locks on each node and expand the
pool as the number of locks increases. Global balancing is needed to
accommodate remastering, e.g., enforce that the sum of the free lock pool
across the cluster is always enough to remaster the locks of at least the
N heaviest lock users (sketched below). With this approach, the hard
limit on the number of locks can be a large fraction of total installed
memory. If we teach the VM how to shrink the lock pool, then no hard
limit is needed at all, as for most kernel resources.

You are also going to hit PF_MEMALLOC problems, because the VM doesn't
know anything about the kernel gdlm daemons, so they never run in
PF_MEMALLOC mode. No matter how much pool a daemon keeps for its own
kmallocs, the kernel subsystems it calls (including networking) don't
know about your pools and can't use them, so every allocation they do is
a deadlock risk.

All by way of saying that userspace isn't the only victim of memory
inversion; a definitive solution is needed for both kernel and userspace.

> Part of a solution would be to have gfs free a bunch of locks at this
> point, but that's not a near-term option. So, we're left with the
> tradeoff: favoring performance and increasing risk of too little
> memory for recovery or v.v.

How about increasing performance and reducing risk at the same time?

Regards,

Daniel
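
P.S. Here is a minimal userspace C sketch of the balancing check
described above. It is only an illustration, not dlm code: the node
table, the per-lock memory cost (LOCK_COST) and the number of tolerated
heavy users (N_HEAVIEST) are made-up parameters. The point is just that
the free lock-pool memory summed over the surviving nodes must cover
remastering the locks of the N heaviest lock users.

/*
 * Illustrative check of the global balancing invariant: the free
 * lock-pool memory summed over the remaining nodes must be enough to
 * remaster the locks held by the N heaviest lock users, so dlm
 * recovery cannot run a surviving node out of memory.
 *
 * Standalone userspace sketch; all numbers are invented.
 */
#include <stdio.h>
#include <stdlib.h>

#define LOCK_COST	256	/* assumed bytes to remaster one lock */
#define N_HEAVIEST	1	/* tolerate losing the N heaviest users */

struct node {
	const char *name;
	unsigned long locks_held;	/* locks this node masters */
	unsigned long pool_free;	/* bytes free in its lock pool */
};

static int cmp_locks_desc(const void *a, const void *b)
{
	const struct node *x = a, *y = b;

	if (x->locks_held != y->locks_held)
		return x->locks_held > y->locks_held ? -1 : 1;
	return 0;
}

/*
 * Return 1 if the cluster can lose its N heaviest lock users and still
 * remaster their locks out of the remaining nodes' free pools.
 */
static int balance_ok(struct node *nodes, int count, int n_heaviest)
{
	unsigned long need = 0, have = 0;
	int i;

	qsort(nodes, count, sizeof(*nodes), cmp_locks_desc);

	for (i = 0; i < count; i++) {
		if (i < n_heaviest)
			need += nodes[i].locks_held * LOCK_COST;
		else
			have += nodes[i].pool_free;
	}
	return have >= need;
}

int main(void)
{
	/* made-up cluster state */
	struct node nodes[] = {
		{ "node1", 100000, 8 << 20 },
		{ "node2",  16000, 4 << 20 },
		{ "node3",   2000, 2 << 20 },
	};
	int count = sizeof(nodes) / sizeof(nodes[0]);

	if (balance_ok(nodes, count, N_HEAVIEST))
		printf("balanced: recovery of the heaviest user(s) fits\n");
	else
		printf("unbalanced: grow free pools or drop locks\n");
	return 0;
}

If the check fails, the options are the ones discussed above: grow the
free pools on the lightly loaded nodes, or have the lock manager throttle
gfs with the "drop locks" callback before the imbalance gets that large.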