Re: [PATCH] nfsd: fix delegation_blocked() to block correctly for at least 30 seconds

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




> On Sep 10, 2024, at 8:32 AM, Tom Talpey <tom@xxxxxxxxxx> wrote:
> 
> On 9/9/2024 11:02 AM, Olga Kornievskaia wrote:
>> On Mon, Sep 9, 2024 at 10:38 AM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>> 
>>> On Mon, Sep 09, 2024 at 10:17:42AM -0400, Olga Kornievskaia wrote:
>>>> On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>>> 
>>>>> On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote:
>>>>>> The pair of bloom filtered used by delegation_blocked() was intended to
>>>>>> block delegations on given filehandles for between 30 and 60 seconds.  A
>>>>>> new filehandle would be recorded in the "new" bit set.  That would then
>>>>>> be switch to the "old" bit set between 0 and 30 seconds later, and it
>>>>>> would remain as the "old" bit set for 30 seconds.
>>>>>> 
>>>>> 
>>>>> Since we're on the subject...
>>>>> 
>>>>> 60s seems like an awfully long time to block delegations on an inode.
>>>>> Recalls generally don't take more than a few seconds when things are
>>>>> functioning properly.
>>>>> 
>>>>> Should we swap the bloom filters more often?
>>>> 
>>>> I was also thinking that perhaps we can do 15-30s perhaps? Another
>>>> thought was should this be a configurable value (with some
>>>> non-negotiable min and max)?
>>> 
>>> If it needs to be configurable, then we haven't done our jobs as
>>> system architects. Temporarily blocking delegation ought to be
>>> effective without human intervention IMHO.
>>> 
>>> At least let's get some specific usage scenarios that demonstrate a
>>> palpable need for enabling an admin to adjust this behavior (ie, a
>>> need that is not simply an implementation bug), then design for
>>> those specific cases.
>>> 
>>> Does NFSD have metrics in this area, for example?
>>> 
>>> Generally speaking, though, I'm not opposed to finessing the behavior
>>> of the Bloom filter. I'd like to apply the patch below for v6.12,
>> 100% agreed that we need this patch to go in now. The configuration
>> was just a thought for after which I should have stated explicitly. I
>> guess I'm not a big fan of hard coded numbers in the code and naively
>> thinking that it's always better to have a config over a hardcoded
>> value.
> 
> No constant is ever correct in networking, especially timeouts.

But a constant can be "good enough". No one has yet clearly
demonstrated, for this mechanism, that one timeout value
is worse than another.


> So yes,
> it should be adjustable. But even then, choosing a number here is
> fundamentally difficult.

I'm not saying it should never be adjustable. We need some
data showing what range of values is useful, and we need to
know where the control needs to be placed: export option?
sysctl? /sys or /proc? Can this knob be abused? Can the setting
be automated? Maybe the control should be "delay a few seconds
vs. put that client in the doghouse until server reboots"?

Indeed, it might be that any delay greater than zero has some
benefit. In which case, the flexibility of changing the filter
reset time starts to look unnecessary.

Point is: we don't know yet. I would like to see a plan for
how to gather more data, first. As I said yesterday, there
is a cost to adding knobs like this. I want to know what
we're getting for the money.

(To be clear, I'm not making a hard objection. I'm just
asking to wait for more information).


> Delegations can block for perfectly valid long periods, right? Say it
> takes a long time to flush a write delegation, or if the network is
> partitioned to the (other) client being recalled. 30 seconds to data
> corruption is quite the guillotine.

The Bloom filter is not about correctness, so data corruption
is not a factor. The purpose of this mechanism is to give a
fair forward progress guarantee in certain corner cases.


> Tom.
> 
>>> preserving the Cc: stable, but handle the behavioral finessing via
>>> a subsequent patch targeting v6.13 so that can be appropriately
>>> reviewed and tested. Ja?
>>> 
>>> BTW, nice catch!
>>> 
>>> 
>>>>>> Unfortunately the code intended to clear the old bit set once it reached
>>>>>> 30 seconds old, preparing it to be the next new bit set, instead cleared
>>>>>> the *new* bit set before switching it to be the old bit set.  This means
>>>>>> that the "old" bit set is always empty and delegations are blocked
>>>>>> between 0 and 30 seconds.
>>>>>> 
>>>>>> This patch updates bd->new before clearing the set with that index,
>>>>>> instead of afterwards.
>>>>>> 
>>>>>> Reported-by: Olga Kornievskaia <okorniev@xxxxxxxxxx>
>>>>>> Cc: stable@xxxxxxxxxxxxxxx
>>>>>> Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
>>>>>> Signed-off-by: NeilBrown <neilb@xxxxxxx>
>>>>>> ---
>>>>>>  fs/nfsd/nfs4state.c | 5 +++--
>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>> 
>>>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>>>>>> index 4313addbe756..6f18c1a7af2e 100644
>>>>>> --- a/fs/nfsd/nfs4state.c
>>>>>> +++ b/fs/nfsd/nfs4state.c
>>>>>> @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid)
>>>>>>   * When a delegation is recalled, the filehandle is stored in the "new"
>>>>>>   * filter.
>>>>>>   * Every 30 seconds we swap the filters and clear the "new" one,
>>>>>> - * unless both are empty of course.
>>>>>> + * unless both are empty of course.  This results in delegations for a
>>>>>> + * given filehandle being blocked for between 30 and 60 seconds.
>>>>>>   *
>>>>>>   * Each filter is 256 bits.  We hash the filehandle to 32bit and use the
>>>>>>   * low 3 bytes as hash-table indices.
>>>>>> @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh)
>>>>>>               if (ktime_get_seconds() - bd->swap_time > 30) {
>>>>>>                       bd->entries -= bd->old_entries;
>>>>>>                       bd->old_entries = bd->entries;
>>>>>> +                     bd->new = 1-bd->new;
>>>>>>                       memset(bd->set[bd->new], 0,
>>>>>>                              sizeof(bd->set[0]));
>>>>>> -                     bd->new = 1-bd->new;
>>>>>>                       bd->swap_time = ktime_get_seconds();
>>>>>>               }
>>>>>>               spin_unlock(&blocked_delegations_lock);
>>>>> 
>>>>> --
>>>>> Jeff Layton <jlayton@xxxxxxxxxx>
>>>>> 
>>>> 
>>> 
>>> --
>>> Chuck Lever


--
Chuck Lever






[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux