Re: [PATCH] nfsd: fix delegation_blocked() to block correctly for at least 30 seconds

On Thu, 12 Sep 2024, Olga Kornievskaia wrote:
> On Tue, Sep 10, 2024 at 8:00 PM NeilBrown <neilb@xxxxxxx> wrote:
> >
> > On Mon, 09 Sep 2024, Jeff Layton wrote:
> > > On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote:
> > > > The pair of bloom filters used by delegation_blocked() was intended to
> > > > block delegations on given filehandles for between 30 and 60 seconds.  A
> > > > new filehandle would be recorded in the "new" bit set.  That would then
> > > > be switched to the "old" bit set between 0 and 30 seconds later, and it
> > > > would remain in the "old" bit set for 30 seconds.
> > > >
> > >
> > > Since we're on the subject...
> > >
> > > 60s seems like an awfully long time to block delegations on an inode.
> > > Recalls generally don't take more than a few seconds when things are
> > > functioning properly.
> > >
> > > Should we swap the bloom filters more often?
> >
> > Or should we take a completely different approach?  Or both?
> > I'm bothered by the fact that this bug in a heuristic caused rename to
> > misbehave this way.  So I did some exploration.
> >
> > try_break_deleg / break_deleg_wait are called:
> >
> >  - notify_change() - chmod_common, chown_common, and vfs_utimes  waits
> >  - vfs_unlink() - do_unlinkat waits for the delegation to be broken
> >  - vfs_link() - do_linkat waits
> >  - vfs_rename() - do_renameat2 waits
> >  - vfs_set_acl() - waits itself
> >  - vfs_remove_acl() - ditto
> >  - __vfs_setxattr_locked() - vfs_setxattr waits
> >  - __vfs_removexattr_locked() - vfs_removexattr waits
> >
> > I would argue that *none* of these justify delaying further delegations
> > once the operation itself has completed.  They aren't the sort of things
> > that suggest on-going cross-client "sharing" which delegations interfere
> > with.
> 
> I wouldn't discount these operations (at least not rename) as ones
> that can't represent "sharing" of files. An example workload is one
> where a file gets generated, created, and written/read over NFS, but
> is then transferred locally to another filesystem. I can imagine a
> pipeline where the file gets filled up, the generated data is moved to
> be worked on elsewhere, and the same file gets filled up again. I
> think this bug was discovered because of a setup where there was heavy
> use of these operations (on various files) and some got blocked,
> causing problems. For such a workload, if we are not going to block
> giving out a delegation, do we cause too many cb_recalls?

A pipeline as you describe seems to be a case of serial sharing.
Different applications use the same file, but only at different times.
This sort of sharing isn't hurt by delegations.

The sort of sharing that might trigger excessive cb_recalls if
delegations weren't blocked would almost certainly involve file locking
and an expectation that two separate applications would sometimes access
the file concurrently.  When this is happening, neither should get a
delegation.

The problem you saw was really caused by a delegation being given out
while the rename was still happening.
i.e.:
  - the rename starts
  - the delegation is detected and broken
  - the cb_recall is sent.
  - the client opens the file prior to returning the delegation
  - the client gets a new delegation as part of this open
  - the client returns the original delegation
  - the rename loops around and finds a new delegation which it needs
    to break.

This should only loop once unless the recall takes more than 30 seconds.
So I'm a bit perplexed that it blocked long enough to be noticed.  So
maybe there is more going on here than I can see.  Or maybe the recall
is really slow.
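
That retry loop (from do_renameat2) can be sketched in userspace roughly
as follows.  Everything here - struct and function names included - is a
simplified stand-in for illustration, not the real kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the inode state we care about. */
struct fake_inode { int deleg_count; };

/* Pretend vfs_rename(): if the target holds a delegation, report it
 * via *delegated and fail so the caller can wait and retry. */
static int fake_vfs_rename(struct fake_inode *target,
                           struct fake_inode **delegated)
{
	if (target->deleg_count > 0) {
		*delegated = target;
		return -11; /* would block on an outstanding delegation */
	}
	return 0;
}

/* Pretend break_deleg_wait(): the recall completes, the delegation is
 * returned, and the caller may retry. */
static int fake_break_deleg_wait(struct fake_inode **delegated)
{
	(*delegated)->deleg_count--;
	*delegated = NULL;
	return 0;
}

/* The shape of the loop: retry the rename each time a delegation is
 * detected.  If the client grabs a fresh delegation on open before
 * returning the old one, the loop runs an extra time. */
static int do_rename_with_retry(struct fake_inode *target, int *loops)
{
	struct fake_inode *delegated = NULL;
	int err;

retry_deleg:
	(*loops)++;
	err = fake_vfs_rename(target, &delegated);
	if (delegated) {
		err = fake_break_deleg_wait(&delegated);
		if (!err)
			goto retry_deleg;
	}
	return err;
}
```

With one outstanding delegation this makes two passes (detect and break,
then succeed); each re-granted delegation adds another pass.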

In either case I think we want to firmly block delegations while a rename
(unlink etc) is happening, but do not need to continue to block them
after the rename etc completes.
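
As a rough sketch of what that scoping could look like (all names here
are hypothetical; the real change would live in the lease-breaking paths
in fs/locks.c):

```c
#include <assert.h>

/* Hypothetical per-inode lease context with a "delegations blocked
 * while an operation is in flight" counter. */
struct flctx_sketch { int deleg_block_count; };

/* try_break_deleg() would bump the counter so no new delegation can be
 * handed out while the rename/unlink/etc. is still running. */
static void sketch_try_break_deleg(struct flctx_sketch *ctx)
{
	ctx->deleg_block_count++;
}

/* Called once the operation completes (whether or not it had to wait
 * in break_deleg_wait()), re-enabling delegations. */
static void sketch_break_deleg_done(struct flctx_sketch *ctx)
{
	ctx->deleg_block_count--;
}

/* The grant path would check the counter rather than relying on a
 * time-based bloom filter for this case. */
static int sketch_may_grant_deleg(const struct flctx_sketch *ctx)
{
	return ctx->deleg_block_count == 0;
}
```

The point is that blocking is scoped exactly to the operation: no grant
while the counter is non-zero, and no lingering 30-60 second penalty
afterwards.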

> 
> > I imagine try_break_lease would increment some counter in ->i_flctx
> > which prevents further leases being taken, and some new call near where
> > break_deleg_wait() is called will decrement that counter.  If waiting
> > for a lease to break, call break_deleg_wait() and retry, else
> > break_deleg_done().
> >
> > Delegations are also broken by calls break_lease() which comes from
> > nfsd_open_break_lease() for conflicting nfsd opens.  I think these *are*
> > indicative of sharing and *should* result in the filehandle being recorded
> > in the bloom filter.
> > Maybe if break_lease() in nfsd_open_break_lease() returns -EWOULDBLOCK,
> > then the filehandle could be added to the bloom filter at that point.
> 
> What about vfs_truncate() which also calls break_lease?

vfs_truncate() is quite different, though that isn't a good excuse for
me to leave it out.  It uses break_lease() like open does - not
try_break_deleg() like rename.
It can do this because it has called get_write_access() on the inode
which has incremented i_writecount which will prevent leases from being
granted.  But that DOESN'T prevent delegations from being granted.  I
think it should.
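
Roughly what I mean, as a userspace sketch with stand-in names (the
delegation check below is the *proposed* behaviour, not what the kernel
does today):

```c
#include <assert.h>

/* Simplified stand-in for the relevant inode state. */
struct trunc_inode { int i_writecount; };

/* Pretend get_write_access()/put_write_access(), as vfs_truncate()
 * calls around its work. */
static void fake_get_write_access(struct trunc_inode *ino)
{
	ino->i_writecount++;
}

static void fake_put_write_access(struct trunc_inode *ino)
{
	ino->i_writecount--;
}

/* Today: a raised i_writecount stops new *leases* being granted... */
static int fake_may_grant_lease(const struct trunc_inode *ino)
{
	return ino->i_writecount == 0;
}

/* ...but delegations are not refused on that basis.  The suggestion is
 * that the delegation grant path should make the same check, so a
 * truncate in progress blocks new grants. */
static int fake_may_grant_deleg(const struct trunc_inode *ino)
{
	return ino->i_writecount == 0; /* proposed, not current behaviour */
}
```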

> 
> > do_dentry_open also calls break_lease() but we still want the filehandle
> > to go in the bloom-filter in that case ....  maybe the lm_break callback
> > could check i_writecount and i_readcount and use those to determine if
> > delaying delegations is appropriate.
> >
> > wrt the time to block delegation for, I'm not sure that it matters much.
> > Once we focus the delay on files that are concurrently opened from
> > multiple clients (not both read-only) I think there are some files
> > (most) that are never shared, and some files that are often shared.  We
> > probably don't want delegations for often shared files at all.  So I'd
> > be comfortable blocking delegations on often-shared files for quite a
> > while.
> 
> So perhaps rename might be an exception among the operations that are
> triggering delegation recalls. Though I think unlink and truncate
> might be similar. Otherwise, perhaps optimizing other operations would
> be useful. However, I would like to ask: does the added complexity
> justify the benefits? What kind of workload would be greatly
> penalized? If we imagine the other operations are low occurrence (i.e.
> not representing a sharing example) then the penalty is just an
> infrequent 60s block.
> 

My concern is that the current design is conceptually wrong.
We need to block delegation while rename etc is happening.  We currently
block for 30-60 seconds.  As rename never takes that long, this works.
But I don't like it.

Some day someone might find a workload for which the current delay is
too long.  If we just reduced the size of the delay, we might make it
take less time than a rename, and something unexpected would break.

So I want to fix the design so that it makes sense.  Then I won't really
care how long the delay is.


> > I wouldn't advocate for *longer* than 30 seconds because I don't want
> > the bloom filters to fill up - that significantly reduces their
> > usefulness.  So maybe we could also swap filters when the 'new' one
> > becomes 50% full....
> 
> I was suggesting changing that value to 15 because it would mean that
> at most the wait would be 30s, which is the same as what it was before
> the fix, but we are imposing at least a 15s block and it keeps the
> filter 'at the same capacity as before'. But of course in the case of heavy
> conflicting operations there will be less blockage and more recalls.
> This is where I was hoping a configurable value might be handy. Or, I
> would instead argue that we implement a heuristic of detecting that
> the server is in a state of frequent delegation recall phase and then
> automatically adjust the value for how long the delegations are
> blocked. Perhaps we keep a counter in the bloom_filter when a block
> is set and when a threshold is reached we increase the value before we
> swap the filters (yes at the risk of having a full bloom filter but
> because it's better to be cautious in giving out delegations when
> there are frequent recalls?). Then, with perhaps another timer we
> adjust the block time down.... Wouldn't that cover either having
> different NFS clients having conflicting opens or having some workload
> that has NFS/local file system conflicts.
> 

I agree that as we have been using "0-30 seconds" for 10 years with no
complaints, changing to "15-30 seconds" is safer than changing to "30-60
seconds".
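
For anyone following along, the two-filter scheme being tuned here
behaves roughly like this userspace sketch (a fake clock and a tiny
256-bit filter; simplified from delegation_blocked() in
fs/nfsd/nfs4state.c, with made-up names):

```c
#include <assert.h>
#include <string.h>

#define BLOOM_BITS    256
#define SWAP_INTERVAL 30  /* seconds: the value under discussion */

/* A pair of bloom filters rotated every SWAP_INTERVAL seconds.  The
 * "new" set collects fresh blocks; on each swap it becomes "old" and a
 * cleared filter takes its place, so an entry stays blocked for
 * between one and two intervals. */
struct bloom_pair_sketch {
	unsigned char bits[2][BLOOM_BITS / 8];
	int new_idx;         /* which of bits[] is the "new" set */
	long long swap_time; /* fake clock value of the last swap */
};

static void bloom_set(unsigned char *set, unsigned int hash)
{
	hash %= BLOOM_BITS;
	set[hash / 8] |= 1u << (hash % 8);
}

static int bloom_test(const unsigned char *set, unsigned int hash)
{
	hash %= BLOOM_BITS;
	return (set[hash / 8] >> (hash % 8)) & 1;
}

/* Rotate if the interval has passed: the old set is discarded and the
 * new set becomes old. */
static void bloom_maybe_swap(struct bloom_pair_sketch *bp, long long now)
{
	if (now - bp->swap_time >= SWAP_INTERVAL) {
		bp->new_idx = 1 - bp->new_idx;
		memset(bp->bits[bp->new_idx], 0, sizeof(bp->bits[0]));
		bp->swap_time = now;
	}
}

static void block_delegation(struct bloom_pair_sketch *bp,
			     unsigned int fh_hash, long long now)
{
	bloom_maybe_swap(bp, now);
	bloom_set(bp->bits[bp->new_idx], fh_hash);
}

/* Blocked if the filehandle hash is in either set. */
static int delegation_blocked_sketch(struct bloom_pair_sketch *bp,
				     unsigned int fh_hash, long long now)
{
	bloom_maybe_swap(bp, now);
	return bloom_test(bp->bits[0], fh_hash) ||
	       bloom_test(bp->bits[1], fh_hash);
}
```

An entry survives exactly one swap, so how long it blocks depends on
where in the interval it lands: anywhere from one interval to two.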

I think that if we want to fine-tune further, there are other steps we
could take which might more clearly be helpful.
Currently we only block delegation when a recall happens.  This means
the 30-60 second count is started when a client opens a file and sharing
is detected.  I think we should also start that timer when a client
CLOSES a file which some other client has open.  So if it is ever the
case that two clients have the file open at the same time, we block
delegations until 30-60 seconds after the file has been closed by all
clients. 

i.e. I don't really want to detect "frequent delegation recall".  I want
to detect "ongoing sharing".  In that case I don't want any delegations
or any recall.

Thanks,
NeilBrown




