On Tue, Sep 10, 2024 at 8:00 PM NeilBrown <neilb@xxxxxxx> wrote:
>
> On Mon, 09 Sep 2024, Jeff Layton wrote:
> > On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote:
> > > The pair of bloom filters used by delegation_blocked() was intended to
> > > block delegations on given filehandles for between 30 and 60 seconds. A
> > > new filehandle would be recorded in the "new" bit set. That would then
> > > be switched to the "old" bit set between 0 and 30 seconds later, and it
> > > would remain in the "old" bit set for 30 seconds.
> >
> > Since we're on the subject...
> >
> > 60s seems like an awfully long time to block delegations on an inode.
> > Recalls generally don't take more than a few seconds when things are
> > functioning properly.
> >
> > Should we swap the bloom filters more often?
>
> Or should we take a completely different approach? Or both?
> I'm bothered by the fact that this bug in a heuristic caused rename to
> misbehave this way. So I did some exploration.
>
> try_break_deleg / break_deleg_wait are called from:
>
>  - notify_change() - chmod_common, chown_common, and vfs_utimes wait
>  - vfs_unlink() - do_unlinkat waits for the delegation to be broken
>  - vfs_link() - do_linkat waits
>  - vfs_rename() - do_renameat2 waits
>  - vfs_set_acl() - waits itself
>  - vfs_remove_acl() - ditto
>  - __vfs_setxattr_locked() - vfs_setxattr waits
>  - __vfs_removexattr_locked() - vfs_removexattr waits
>
> I would argue that *none* of these justify delaying further delegations
> once the operation itself has completed. They aren't the sort of things
> that suggest on-going cross-client "sharing", which is what delegations
> interfere with.

I wouldn't discount these operations (at least not rename) as ones that
cannot represent "sharing" of files. An example workload is one where a
file gets created and written/read over NFS, but is then transferred
locally to another filesystem. I can imagine a pipeline where the file
gets filled up, the generated data is moved to be worked on elsewhere,
and the same file gets filled up again.

I think this bug was discovered because of a setup with heavy use of
these operations (on various files), where some got blocked and caused
problems. For such a workload, if we are not going to block giving out
delegations, do we cause too many CB_RECALLs?

> I imagine try_break_lease would increment some counter in ->i_flctx
> which prevents further leases being taken, and some new call near where
> break_deleg_wait() is called will decrement that counter. If waiting
> for a lease to break, call break_deleg_wait() and retry, else
> break_deleg_done().
>
> Delegations are also broken by calls to break_lease(), which come from
> nfsd_open_break_lease() for conflicting nfsd opens. I think these *are*
> indicative of sharing and *should* result in the filehandle being
> recorded in the bloom filter.
> Maybe if break_lease() in nfsd_open_break_lease() returns -EWOULDBLOCK,
> then the filehandle could be added to the bloom filter at that point.

What about vfs_truncate(), which also calls break_lease()?

> do_dentry_open also calls break_lease() but we still want the filehandle
> to go in the bloom-filter in that case .... maybe the lm_break callback
> could check i_writecount and i_readcount and use those to determine if
> delaying delegations is appropriate.
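To make the -EWOULDBLOCK suggestion above concrete, here is a rough,
untested sketch of how nfsd_open_break_lease() could feed the bloom
filter. It assumes block_delegations() (currently static in
fs/nfsd/nfs4state.c) is made callable from the open path; the extra
filehandle argument is my addition:

static int nfsd_open_break_lease(struct inode *inode, int may_flags,
				 struct knfsd_fh *fh)
{
	unsigned int mode;
	int err;

	if (may_flags & NFSD_MAY_NOT_BREAK_LEASE)
		return 0;
	mode = (may_flags & NFSD_MAY_WRITE) ? O_WRONLY : O_RDONLY;
	err = break_lease(inode, mode | O_NONBLOCK);
	if (err == -EWOULDBLOCK)
		/* A conflicting open had to wait for a recall, so
		 * record the filehandle to delay handing out new
		 * delegations on it. */
		block_delegations(fh);
	return err;
}

The only behavioural change is the block_delegations() call; opens that
don't conflict with an existing lease never touch the filter.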
> wrt the time to block delegations for, I'm not sure that it matters
> much.
> Once we focus the delay on files that are concurrently opened from
> multiple clients (not both read-only) I think there are some files
> (most) that are never shared, and some files that are often shared. We
> probably don't want delegations for often-shared files at all. So I'd
> be comfortable blocking delegations on often-shared files for quite a
> while.

So perhaps rename might be an exception among the operations that
trigger delegation recalls, though I think unlink, and perhaps truncate,
might be similar. Otherwise, optimizing the other operations would
perhaps be useful. However, I would like to ask: does the added
complexity justify the benefits? What kind of workload would be heavily
penalized? If we imagine the other operations are low-occurrence (i.e.
not representing a sharing pattern), then the penalty is just an
infrequent 60s block.

> I wouldn't advocate for *longer* than 30 seconds because I don't want
> the bloom filters to fill up - that significantly reduces their
> usefulness. So maybe we could also swap filters when the 'new' one
> becomes 50% full....

I was suggesting changing that value to 15 because it would mean the
wait would be at most 30s - the same as before the fix - while imposing
at least a 15s block, and it keeps the filter at the same capacity as
before. But of course, under heavy conflicting operations there would be
less blocking and more recalls. This is where I was hoping a
configurable value might be handy.

Or, I would instead argue that we implement a heuristic that detects
when the server is in a phase of frequent delegation recalls and then
automatically adjusts how long delegations are blocked. Perhaps we keep
a counter in the bloom filter, incremented whenever a block is set, and
when a threshold is reached we increase the interval before we swap the
filters (yes, at the risk of a full bloom filter, but isn't it better to
be cautious about giving out delegations when there are frequent
recalls?). Then, with perhaps another timer, we adjust the block time
back down.... Wouldn't that cover both different NFS clients having
conflicting opens and workloads with NFS/local filesystem conflicts?
(See the sketch below.)

>
> NeilBrown
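PS: a rough sketch of the adaptive idea, in terms of nfsd's existing
blocked_delegations bloom pair. The block_secs field and both constants
are hypothetical additions of mine, and the decay path (a timer, or the
swap itself, reducing block_secs again) is omitted:

/* hypothetical tuning knobs, not in fs/nfsd/nfs4state.c today */
#define DELEG_BLOCK_THRESHOLD	64
#define DELEG_BLOCK_MAX_SECS	60

static void block_delegations(struct knfsd_fh *fh)
{
	u32 hash;
	struct bloom_pair *bd = &blocked_delegations;

	hash = jhash(&fh->fh_raw, fh->fh_size, 0);

	spin_lock(&blocked_delegations_lock);
	/* record the filehandle in the "new" bit set, as now */
	__set_bit(hash & 255, bd->set[bd->new]);
	__set_bit((hash >> 8) & 255, bd->set[bd->new]);
	__set_bit((hash >> 16) & 255, bd->set[bd->new]);
	if (bd->entries == 0)
		bd->swap_time = ktime_get_seconds();
	bd->entries += 1;
	/* Many blocks in one window suggests a recall storm, so
	 * stretch the interval before the next filter swap. */
	if (bd->entries > DELEG_BLOCK_THRESHOLD &&
	    bd->block_secs < DELEG_BLOCK_MAX_SECS)
		bd->block_secs += 15;
	spin_unlock(&blocked_delegations_lock);
}

delegation_blocked() would then compare ktime_get_seconds() -
bd->swap_time against bd->block_secs, instead of the hard-coded 30,
when deciding whether to swap the "new" and "old" bit sets.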