Just gave a talk at SCaLE 14x today where I mentioned our new lock revocation feature, which has had a significant impact on our GFS cluster reliability. I wanted to share the patch with the community, so here's the bugzilla report:
Mis-behaving brick clients (gNFSd, FUSE, gfAPI) can cause cluster instability and eventual complete unavailability due to failures in releasing entry/inode locks in a timely manner.
Classic symptoms of this are increased brick (and/or gNFSd) memory usage due to the high number of (lock request) frames piling up in the processes. The failure mode results in bricks eventually slowing to a crawl due to swapping, or OOMing due to complete memory exhaustion; during this period the entire cluster can begin to fail. End-users will experience this as hangs on the filesystem, first in a specific region of the filesystem and ultimately across the entire filesystem as the offending brick turns into a zombie (i.e. not quite dead, but not quite alive either).
Currently, these situations must be handled by an administrator detecting & intervening via the "clear-locks" CLI command. Unfortunately, this doesn't scale for large numbers of clusters, and it depends on the correct (external) detection of the locks piling up (for which there is little signal other than state dumps).
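For reference, the manual intervention looks roughly like this (a sketch; check the exact clear-locks syntax against your Gluster release): take a state dump to spot the piled-up locks, then clear them by hand on the affected path.

gluster volume statedump <VOLNAME>
gluster volume clear-locks <VOLNAME> <path> kind {blocked|granted|all} {inode [range] | entry [basename] | posix [range]}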
This patch introduces two features to remedy this situation:
1. Monkey-unlocking - This is a feature targeted at developers (only!) to help track down crashes due to stale locks, and prove the utility of the lock revocation feature. It does this by silently dropping 1% of unlock requests, simulating bugs or mis-behaving clients.
The feature is activated via:
features.locks-monkey-unlocking <on/off>
You'll see the message "[<timestamp>] W [inodelk.c:653:pl_inode_setlk] 0-groot-locks: MONKEY LOCKING (forcing stuck lock)!" in the logs, indicating a request has been dropped.
2. Lock revocation - Once enabled, this feature will revoke a *contended* lock (i.e. if nobody else asks for the lock, we will not revoke it) based either on the amount of time the lock has been held, on how many other lock requests are waiting on the lock to be freed, or on some combination of both. Clients which are losing their locks will be notified by receiving EAGAIN (sent back to their callback function).
The feature is activated via these options:
features.locks-revocation-secs <integer; 0 to disable>
features.locks-revocation-clear-all <on/off>
features.locks-revocation-max-blocked <integer>
Recommended settings are 1800 seconds for a time-based timeout (give clients the benefit of the doubt). Choosing a max-blocked value requires some experimentation depending on your workload, but generally values of hundreds to low thousands work well (it's normal for many tens of locks to be taken out when files are being written at high throughput).
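Putting that together, enabling revocation with the recommended time-based timeout on a volume named groot (an illustrative name; the max-blocked value is just a starting point from the range above) looks like:

gluster volume set groot features.locks-revocation-secs 1800
gluster volume set groot features.locks-revocation-max-blocked 1000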