I just gave a talk at SCaLE 14x today and mentioned our new
lock revocation feature, which has had a significant impact on
our GFS cluster reliability. I wanted to share the
patch with the community, so here's the Bugzilla report:
Mis-behaving brick clients (gNFSd, FUSE, gfAPI) can cause
cluster instability and eventual complete unavailability due
to failures in releasing entry/inode locks in a timely
manner.
Classic symptoms of this are increased brick (and/or
gNFSd) memory usage due to the high number of (lock request)
frames piling up in the processes. The failure-mode results
in bricks eventually slowing down to a crawl due to
swapping, or OOMing due to complete memory exhaustion;
during this period the entire cluster can begin to fail.
End-users will experience this as hangs on the filesystem,
first in a specific region of the filesystem and ultimately
the entire filesystem as the offending brick begins to turn
into a zombie (i.e. not quite dead, but not quite alive
either).
Currently, these situations must be handled by an
administrator detecting & intervening via the
"clear-locks" CLI command. Unfortunately this doesn't scale
for large numbers of clusters, and it depends on the correct
(external) detection of the locks piling up (for which there
is little signal other than state dumps).
This patch introduces two features to remedy this
situation:
1. Monkey-unlocking - This is a feature targeted at
developers (only!) to help track down crashes due to stale
locks, and to prove the utility of the lock revocation feature.
It does this by silently dropping 1% of unlock requests,
simulating bugs or misbehaving clients.
The feature is activated via:
features.locks-monkey-unlocking <on/off>
You'll see the message
"[<timestamp>] W [inodelk.c:653:pl_inode_setlk]
0-groot-locks: MONKEY LOCKING (forcing stuck lock)!" ... in
the logs indicating a request has been dropped.
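To make the monkey-unlocking idea concrete, here is a minimal Python sketch of a toy lock table that silently drops roughly 1% of unlock requests. It is not the GlusterFS implementation: the class `MonkeyLockTable`, the constant `DROP_PROBABILITY`, and the helper `count_stale_locks` are hypothetical names made up for illustration, and only the 1% figure comes from the description above.

```python
import random

# The 1% rate comes from the patch description; everything else is illustrative.
DROP_PROBABILITY = 0.01


class MonkeyLockTable:
    """Toy in-memory lock table (hypothetical, not GlusterFS code)."""

    def __init__(self, monkey_unlocking=False, rng=None):
        self.monkey_unlocking = monkey_unlocking
        self.rng = rng if rng is not None else random.Random()
        self.held = set()

    def lock(self, name):
        """Take the lock if it is free; return True on success."""
        if name in self.held:
            return False
        self.held.add(name)
        return True

    def unlock(self, name):
        """Release the lock, unless the monkey silently drops the request."""
        if self.monkey_unlocking and self.rng.random() < DROP_PROBABILITY:
            return False  # request dropped: the lock stays held (stale)
        self.held.discard(name)
        return True


def count_stale_locks(requests, seed=0):
    """Lock and unlock `requests` distinct names; return how many stayed held."""
    table = MonkeyLockTable(monkey_unlocking=True, rng=random.Random(seed))
    for i in range(requests):
        name = f"inode-{i}"
        table.lock(name)
        table.unlock(name)
    return len(table.held)
```

Calling `count_stale_locks(100_000)` leaves on the order of a thousand locks stuck, which is the kind of slow pile-up the feature is meant to reproduce on demand.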
2. Lock revocation - Once enabled, this feature will
revoke a *contended* lock (i.e. if nobody else asks
for the lock, we will not revoke it) based on the amount of time the
lock has been held, how many other lock requests are
waiting on the lock to be freed, or some combination of
both. Clients which are losing their locks will be
notified by receiving EAGAIN (sent back to their callback
function).
The feature is activated via these options:
features.locks-revocation-secs <integer; 0 to disable>
features.locks-revocation-clear-all <on/off>
features.locks-revocation-max-blocked <integer>
Recommended settings are: 1800 seconds for a time-based
timeout (give clients the benefit of the doubt). Choosing a
max-blocked value requires some experimentation depending on your
workload, but generally values in the hundreds to low thousands
work (it's normal for many tens of locks to be taken out when
files are being written at high throughput).
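The revocation rule described above (only contended locks, triggered by hold time, by the number of blocked requests, or both) can be sketched in a few lines of Python. This is an illustration under assumptions, not the patch's actual logic: `RevocationPolicy`, `should_revoke`, and `revoke_if_needed` are hypothetical names, the `>=` comparisons and the "0 disables max-blocked" convention are guesses, and the clear-all option is not modeled at all.

```python
import errno
from dataclasses import dataclass


@dataclass
class RevocationPolicy:
    """Hypothetical mirror of the two numeric options."""
    revocation_secs: int = 1800  # 0 disables the time-based check
    max_blocked: int = 0         # assumption: 0 disables the blocked-count check


def should_revoke(policy, held_secs, blocked_requests):
    """Decide whether a held lock should be revoked."""
    if blocked_requests == 0:
        return False  # uncontended: nobody else wants it, so leave it alone
    if policy.revocation_secs > 0 and held_secs >= policy.revocation_secs:
        return True   # held too long while others are waiting
    if policy.max_blocked > 0 and blocked_requests >= policy.max_blocked:
        return True   # too many requests queued behind this lock
    return False


def revoke_if_needed(policy, held_secs, blocked_requests, on_revoked):
    """Notify the holder's callback with EAGAIN when the lock is revoked."""
    if should_revoke(policy, held_secs, blocked_requests):
        on_revoked(errno.EAGAIN)
        return True
    return False
```

With `RevocationPolicy(revocation_secs=1800, max_blocked=1000)`, a lock held for an hour with no waiters is left alone, while the same lock with even one waiter is revoked and its holder's callback receives `errno.EAGAIN`.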