On Thu, Dec 4, 2014 at 1:58 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
> On Thu, Dec 4, 2014 at 1:23 PM, Trond Myklebust
> <trond.myklebust@xxxxxxxxxxxxxxx> wrote:
>> On Thu, Dec 4, 2014 at 12:36 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>> On Wed, Dec 3, 2014 at 6:50 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>> On Wed, Dec 3, 2014 at 6:34 PM, Trond Myklebust
>>>> <trond.myklebust@xxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> On Dec 3, 2014 6:21 PM, "Olga Kornievskaia" <aglo@xxxxxxxxx> wrote:
>>>>>>
>>>>>> On Wed, Dec 3, 2014 at 6:13 PM, Trond Myklebust
>>>>>> <trond.myklebust@xxxxxxxxxxxxxxx> wrote:
>>>>>> > On Wed, Dec 3, 2014 at 5:59 PM, Olga Kornievskaia <aglo@xxxxxxxxx>
>>>>>> > wrote:
>>>>>> >> On Wed, Dec 3, 2014 at 3:59 PM, Trond Myklebust
>>>>>> >> <trond.myklebust@xxxxxxxxxxxxxxx> wrote:
>>>>>> >>> Hi Olga,
>>>>>> >>>
>>>>>> >>> On Wed, Dec 3, 2014 at 2:54 PM, Olga Kornievskaia <aglo@xxxxxxxxx>
>>>>>> >>> wrote:
>>>>>> >>>> Hi folks,
>>>>>> >>>>
>>>>>> >>>> I would like an opinion about changing the code in such a way that
>>>>>> >>>> we don't select a delegation stateid for an IO operation when that
>>>>>> >>>> particular delegation is being returned.
>>>>>> >>>>
>>>>>> >>>> The reason it's somewhat problematic is that we send out a
>>>>>> >>>> DELEG_RETURN and then we don't remove the stateid until we receive
>>>>>> >>>> a reply. Meanwhile, an IO operation can be happening, and in
>>>>>> >>>> nfs4_select_rw_stateid() it sees a delegation stateid and uses it.
>>>>>> >>>> The server, however, finishes processing DELEG_RETURN before it
>>>>>> >>>> gets the IO op; by that time it is done with the stateid and can
>>>>>> >>>> fail the IO operation with BAD_STATEID.
>>>>>> >>>>
>>>>>> >>>> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
>>>>>> >>>> index 7f3f606..4f6f6c9 100644
>>>>>> >>>> --- a/fs/nfs/delegation.c
>>>>>> >>>> +++ b/fs/nfs/delegation.c
>>>>>> >>>> @@ -854,7 +854,8 @@ bool nfs4_copy_delegation_stateid(nfs4_stateid *dst, struct inode *in
>>>>>> >>>>  	flags &= FMODE_READ|FMODE_WRITE;
>>>>>> >>>>  	rcu_read_lock();
>>>>>> >>>>  	delegation = rcu_dereference(nfsi->delegation);
>>>>>> >>>> -	ret = (delegation != NULL && (delegation->type & flags) == flags);
>>>>>> >>>> +	ret = (delegation != NULL && (delegation->type & flags) == flags &&
>>>>>> >>>> +		!test_bit(NFS_DELEGATION_RETURNING, &delegation->flags));
>>>>>> >>>>  	if (ret) {
>>>>>> >>>>  		nfs4_stateid_copy(dst, &delegation->stateid);
>>>>>> >>>>  		nfs_mark_delegation_referenced(delegation);
>>>>>> >>>
>>>>>> >>> The above cannot eliminate the possibility that we won't use a
>>>>>> >>> delegation while it is being returned. It will at best just reduce
>>>>>> >>> the window of opportunity.
>>>>>> >>>
>>>>>> >>
>>>>>> >> You are right, there are still problems. Actually, we might set the
>>>>>> >> bit but not yet have the open stateid from the open with deleg_cur,
>>>>>> >> and that's not good. It would be better to know we got the open
>>>>>> >> stateid and then pick that.
>>>>>> >>
>>>>>> >>> So, why is this being considered to be a problem in the first place?
>>>>>> >>> Are you seeing a measurable performance impact on a real life
>>>>>> >>> workload (as opposed to some 1-in-a-billion occurrence from a QA
>>>>>> >>> test :-))?
>>>>>> >>
>>>>>> >> Unfortunately, this problem is quite common and I hit it all the time
>>>>>> >> on my setup. It leads to the client ceasing IO on that file and
>>>>>> >> returning EIO. It's an unrecoverable error. I'm trying to figure out
>>>>>> >> how to eliminate getting into that state.
>>>>>> >>
>>>>>> >
>>>>>> > It definitely isn't intended to be an irrecoverable error. The client
>>>>>> > is supposed to just replay the write after updating the stateid.
>>>>>>
>>>>>> open(deleg_cur) call / reply
>>>>>> lock() call/reply
>>>>>> deleg_return() call
>>>>>> write(with deleg_stateid) call gets BAD_STATEID
>>>>>> state recovery code marks the lock state lost -> EIO.
>>>>>
>>>>> Why is it marking the lock as lost? If the recovery succeeded, it should
>>>>> notice that the stateid has changed and instead retry.
>>>>
>>>> I'll get you a better explanation tomorrow, beyond saying "that's what
>>>> I see when I run the code".
>>>
>>> nfs4_async_handle_error() initiates state recovery.
>>> nfs4_reclaim_open_state() eventually calls nfs4_reclaim_locks(), which
>>> marks the lock LOST. The state is delegated, so the kernel logs "lock
>>> reclaim failed".
>>> The write retries, and in nfs4_copy_lock_stateid() the lock is found to
>>> be LOST, so nfs4_select_rw_stateid() fails with EIO.
>>>
>>>>
>>>>> What kernel is this?
>>>>
>>>> This is upstream.
>>
>> So why isn't nfs4_write_stateid_changed() catching the change before
>> we even get to nfs4_async_handle_error()? That's where this race is
>> supposed to get resolved.
>
> Probably because nfs4_stateid_is_current() returns true, because
> nfs4_set_rw_stateid() calls nfs4_select_rw_stateid() and finds the
> lock lost?
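
For reference, the check you are referring to lives in fs/nfs/nfs4proc.c, and
the relevant logic looks roughly like this (paraphrased from memory rather
than copied from the tree, so treat the details as a sketch of the logic, not
the exact upstream code):

/*
 * Rough sketch of the path under discussion:
 *
 *   nfs4_write_done() -> nfs4_write_stateid_changed()
 *                     -> nfs4_stateid_is_current()
 *                     -> nfs4_set_rw_stateid() -> nfs4_select_rw_stateid()
 */
static bool nfs4_stateid_is_current(nfs4_stateid *stateid,
		const struct nfs_open_context *ctx,
		const struct nfs_lock_context *l_ctx,
		fmode_t fmode)
{
	nfs4_stateid current_stateid;

	/*
	 * If nfs4_set_rw_stateid() cannot produce a stateid to compare
	 * against (for instance it returns -EIO because the lock state was
	 * marked lost), the stateid is reported as current, and
	 * nfs4_write_stateid_changed() will not restart the write.
	 */
	if (nfs4_set_rw_stateid(&current_stateid, ctx, l_ctx, fmode) != 0)
		return true;
	return nfs4_stateid_match(stateid, &current_stateid);
}
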
I don't understand why the lock would already be marked as lost. Remember
that we're calling from inside nfs4_write_done() and before calling
nfs4_async_handle_error().

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx