Re: not picking a delegation stateid for IO when delegation stateid is being returned

Olga Kornievskaia <aglo@xxxxxxxxx> · Thu, 4 Dec 2014 15:57:37 -0500

On Thu, Dec 4, 2014 at 3:49 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
> On Thu, Dec 4, 2014 at 2:08 PM, Trond Myklebust
> <trond.myklebust@xxxxxxxxxxxxxxx> wrote:
>> On Thu, Dec 4, 2014 at 1:58 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>> On Thu, Dec 4, 2014 at 1:23 PM, Trond Myklebust
>>> <trond.myklebust@xxxxxxxxxxxxxxx> wrote:
>>>> On Thu, Dec 4, 2014 at 12:36 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>> On Wed, Dec 3, 2014 at 6:50 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>> On Wed, Dec 3, 2014 at 6:34 PM, Trond Myklebust
>>>>>> <trond.myklebust@xxxxxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> On Dec 3, 2014 6:21 PM, "Olga Kornievskaia" <aglo@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On Wed, Dec 3, 2014 at 6:13 PM, Trond Myklebust
>>>>>>>> <trond.myklebust@xxxxxxxxxxxxxxx> wrote:
>>>>>>>> > On Wed, Dec 3, 2014 at 5:59 PM, Olga Kornievskaia <aglo@xxxxxxxxx>
>>>>>>>> > wrote:
>>>>>>>> >> On Wed, Dec 3, 2014 at 3:59 PM, Trond Myklebust
>>>>>>>> >> <trond.myklebust@xxxxxxxxxxxxxxx> wrote:
>>>>>>>> >>> Hi Olga,
>>>>>>>> >>>
>>>>>>>> >>> On Wed, Dec 3, 2014 at 2:54 PM, Olga Kornievskaia <aglo@xxxxxxxxx>
>>>>>>>> >>> wrote:
>>>>>>>> >>>> Hi folks,
>>>>>>>> >>>>
>>>>>>>> >>>> I would like an opinion about changing code in such as way that we
>>>>>>>> >>>> don't select a delegation stateid for an IO operation when this
>>>>>>>> >>>> particular delegation is being returned.
>>>>>>>> >>>>
>>>>>>>> >>>> The reason it's some what problematic is that we send out a
>>>>>>>> >>>> DELEG_RETURN and then we don't remove the stateid until we receive a
>>>>>>>> >>>> reply. In the mean while, an IO operation can be happening and in
>>>>>>>> >>>> nfs4_select_rw_stateid() it sees a delegation stateid and uses it.
>>>>>>>> >>>> Well, at the server, it finishes processing DELEG_RETURN before
>>>>>>>> >>>> getting an IO op and by that time the server is finished with the
>>>>>>>> >>>> stateid and can error an IO operation with BAD_STATEID.
>>>>>>>> >>>>
>>>>>>>> >>>> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
>>>>>>>> >>>> index 7f3f606..4f6f6c9 100644
>>>>>>>> >>>> --- a/fs/nfs/delegation.c
>>>>>>>> >>>> +++ b/fs/nfs/delegation.c
>>>>>>>> >>>> @@ -854,7 +854,8 @@ bool nfs4_copy_delegation_stateid(nfs4_stateid
>>>>>>>> >>>> *dst, struct inode *in
>>>>>>>> >>>>         flags &= FMODE_READ|FMODE_WRITE;
>>>>>>>> >>>>         rcu_read_lock();
>>>>>>>> >>>>         delegation = rcu_dereference(nfsi->delegation);
>>>>>>>> >>>> -       ret = (delegation != NULL && (delegation->type & flags) ==
>>>>>>>> >>>> flags);
>>>>>>>> >>>> +       ret = (delegation != NULL && (delegation->type & flags) ==
>>>>>>>> >>>> flags &&
>>>>>>>> >>>> +               !test_bit(NFS_DELEGATION_RETURNING,
>>>>>>>> >>>> &delegation->flags));
>>>>>>>> >>>>         if (ret) {
>>>>>>>> >>>>                 nfs4_stateid_copy(dst, &delegation->stateid);
>>>>>>>> >>>>                 nfs_mark_delegation_referenced(delegation);
>>>>>>>> >>>
>>>>>>>> >>> The above cannot eliminate the possibility that we won't use a
>>>>>>>> >>> delegation while it is being returned. It will at best just reduce the
>>>>>>>> >>> window of opportunity.
>>>>>>>> >>>
>>>>>>>> >>
>>>>>>>> >> You are right this are still problems. Actually, we might set the bit
>>>>>>>> >> but not yet get the open stateid from the open with deleg_cur and
>>>>>>>> >> that's not good. It would be good to know we got the open stateid and
>>>>>>>> >> then pick that.
>>>>>>>> >>
>>>>>>>> >>> So, why is this being considered to be a problem in the first place?
>>>>>>>> >>> Are you seeing a measurable performance impact on a real life workload
>>>>>>>> >>> (as opposed to some 1-in-a-billion occurrence from a QA test :-))?
>>>>>>>> >>
>>>>>>>> >> Unfortunately, this problem is quite common and I hit it all the time
>>>>>>>> >> on my setup. This leads to client seizing IO on that file and
>>>>>>>> >> returning EIO. It's an unrecoverable error. I'm trying to figure out
>>>>>>>> >> how to eliminate getting to that state.
>>>>>>>> >>
>>>>>>>> >
>>>>>>>> > It definitely isn't intended to be an irrecoverable error. The client
>>>>>>>> > is supposed to just replay the write after updating the stateid.
>>>>>>>>
>>>>>>>> open(deleg_cur) call / reply
>>>>>>>> lock() call/reply
>>>>>>>> deleg_return() call
>>>>>>>> write(with deluge_stateid) call gets BAD_STATEID
>>>>>>>> state recovery code marks lock state lost -> EIO.
>>>>>>>
>>>>>>> Why is it marking the lock as lost? If the recovery succeeded, it should
>>>>>>> notice that the stateid has changed and instead retry.
>>>>>>
>>>>>> I'll get you a better explanation tomorrow besides saying "that's what
>>>>>> I see when I run the code".
>>>>>
>>>>> nfs4_async_handle_error() initiates state recovery
>>>>> nfs4_reclaim_open_state() eventually calls  nfs4_reclaim_locks() which
>>>>> marks the lock LOST. state is delegated so the kernel logs "lock
>>>>> reclaim failed".
>>>>> write retries and in nfs4_copy_ lock_stateid() the lock is marked LOST
>>>>> and the nfs4_select_rw_stateid() fails with EIO.
>>>>>
>>>>>>
>>>>>>> What kernel is this?
>>>>>>
>>>>>> This is upstream.
>>>>
>>>> So why isn't nfs4_write_stateid_changed() catching the change before
>>>> we even get to nfs4_async_handle_error()? That's where this race is
>>>> supposed to get resolved.
>>>
>>> Probably because nfs4_stateid_is_current() returns true because
>>> nfs4_set_rw_stateid() calls the nfs4_select_rw_stateid() and finds the
>>> lock lost?
>>
>> I don't understand why the lock would already be marked as lost.
>> Remember that we're calling from inside nfs4_write_done() and before
>> calling nfs4_async_handle_error().
>
> Sigh. I'm so confused...
>
> On November 30th I posted "is receiving BAD_STATEID during IO on
> delegated stateid unrecoverable?" and you've agreed that it would lead
> to the EIO.
>
> There are a couple of cases i'm juggling and I'm just getting confused
> if they are different or same. 1 is using delegation stateid for the
> IO while we are returning the stateid and thus getting BAD_STATEID. 2
> is getting a BAD_STATEID on a delegated stateid on IO in without a
> deleg_return. It seems like 2nd case can also lead to the EIO. I have
> lossy traces from QA so it's possible that it's the 1st case and known
> to get the EIO.

got my cases mixed up. 2nd case was the november's message and agreed
to be unrecoverable. 1st case is what i'm also looking into (maybe).

>
>
>>
>> --
>> Trond Myklebust
>>
>> Linux NFS client maintainer, PrimaryData
>>
>> trond.myklebust@xxxxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html