On Thu, Dec 4, 2014 at 3:49 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote: > On Thu, Dec 4, 2014 at 2:08 PM, Trond Myklebust > <trond.myklebust@xxxxxxxxxxxxxxx> wrote: >> On Thu, Dec 4, 2014 at 1:58 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote: >>> On Thu, Dec 4, 2014 at 1:23 PM, Trond Myklebust >>> <trond.myklebust@xxxxxxxxxxxxxxx> wrote: >>>> On Thu, Dec 4, 2014 at 12:36 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote: >>>>> On Wed, Dec 3, 2014 at 6:50 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote: >>>>>> On Wed, Dec 3, 2014 at 6:34 PM, Trond Myklebust >>>>>> <trond.myklebust@xxxxxxxxxxxxxxx> wrote: >>>>>>> >>>>>>> On Dec 3, 2014 6:21 PM, "Olga Kornievskaia" <aglo@xxxxxxxxx> wrote: >>>>>>>> >>>>>>>> On Wed, Dec 3, 2014 at 6:13 PM, Trond Myklebust >>>>>>>> <trond.myklebust@xxxxxxxxxxxxxxx> wrote: >>>>>>>> > On Wed, Dec 3, 2014 at 5:59 PM, Olga Kornievskaia <aglo@xxxxxxxxx> >>>>>>>> > wrote: >>>>>>>> >> On Wed, Dec 3, 2014 at 3:59 PM, Trond Myklebust >>>>>>>> >> <trond.myklebust@xxxxxxxxxxxxxxx> wrote: >>>>>>>> >>> Hi Olga, >>>>>>>> >>> >>>>>>>> >>> On Wed, Dec 3, 2014 at 2:54 PM, Olga Kornievskaia <aglo@xxxxxxxxx> >>>>>>>> >>> wrote: >>>>>>>> >>>> Hi folks, >>>>>>>> >>>> >>>>>>>> >>>> I would like an opinion about changing code in such as way that we >>>>>>>> >>>> don't select a delegation stateid for an IO operation when this >>>>>>>> >>>> particular delegation is being returned. >>>>>>>> >>>> >>>>>>>> >>>> The reason it's some what problematic is that we send out a >>>>>>>> >>>> DELEG_RETURN and then we don't remove the stateid until we receive a >>>>>>>> >>>> reply. In the mean while, an IO operation can be happening and in >>>>>>>> >>>> nfs4_select_rw_stateid() it sees a delegation stateid and uses it. >>>>>>>> >>>> Well, at the server, it finishes processing DELEG_RETURN before >>>>>>>> >>>> getting an IO op and by that time the server is finished with the >>>>>>>> >>>> stateid and can error an IO operation with BAD_STATEID. >>>>>>>> >>>> >>>>>>>> >>>> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c >>>>>>>> >>>> index 7f3f606..4f6f6c9 100644 >>>>>>>> >>>> --- a/fs/nfs/delegation.c >>>>>>>> >>>> +++ b/fs/nfs/delegation.c >>>>>>>> >>>> @@ -854,7 +854,8 @@ bool nfs4_copy_delegation_stateid(nfs4_stateid >>>>>>>> >>>> *dst, struct inode *in >>>>>>>> >>>> flags &= FMODE_READ|FMODE_WRITE; >>>>>>>> >>>> rcu_read_lock(); >>>>>>>> >>>> delegation = rcu_dereference(nfsi->delegation); >>>>>>>> >>>> - ret = (delegation != NULL && (delegation->type & flags) == >>>>>>>> >>>> flags); >>>>>>>> >>>> + ret = (delegation != NULL && (delegation->type & flags) == >>>>>>>> >>>> flags && >>>>>>>> >>>> + !test_bit(NFS_DELEGATION_RETURNING, >>>>>>>> >>>> &delegation->flags)); >>>>>>>> >>>> if (ret) { >>>>>>>> >>>> nfs4_stateid_copy(dst, &delegation->stateid); >>>>>>>> >>>> nfs_mark_delegation_referenced(delegation); >>>>>>>> >>> >>>>>>>> >>> The above cannot eliminate the possibility that we won't use a >>>>>>>> >>> delegation while it is being returned. It will at best just reduce the >>>>>>>> >>> window of opportunity. >>>>>>>> >>> >>>>>>>> >> >>>>>>>> >> You are right this are still problems. Actually, we might set the bit >>>>>>>> >> but not yet get the open stateid from the open with deleg_cur and >>>>>>>> >> that's not good. It would be good to know we got the open stateid and >>>>>>>> >> then pick that. >>>>>>>> >> >>>>>>>> >>> So, why is this being considered to be a problem in the first place? >>>>>>>> >>> Are you seeing a measurable performance impact on a real life workload >>>>>>>> >>> (as opposed to some 1-in-a-billion occurrence from a QA test :-))? >>>>>>>> >> >>>>>>>> >> Unfortunately, this problem is quite common and I hit it all the time >>>>>>>> >> on my setup. This leads to client seizing IO on that file and >>>>>>>> >> returning EIO. It's an unrecoverable error. I'm trying to figure out >>>>>>>> >> how to eliminate getting to that state. >>>>>>>> >> >>>>>>>> > >>>>>>>> > It definitely isn't intended to be an irrecoverable error. The client >>>>>>>> > is supposed to just replay the write after updating the stateid. >>>>>>>> >>>>>>>> open(deleg_cur) call / reply >>>>>>>> lock() call/reply >>>>>>>> deleg_return() call >>>>>>>> write(with deluge_stateid) call gets BAD_STATEID >>>>>>>> state recovery code marks lock state lost -> EIO. >>>>>>> >>>>>>> Why is it marking the lock as lost? If the recovery succeeded, it should >>>>>>> notice that the stateid has changed and instead retry. >>>>>> >>>>>> I'll get you a better explanation tomorrow besides saying "that's what >>>>>> I see when I run the code". >>>>> >>>>> nfs4_async_handle_error() initiates state recovery >>>>> nfs4_reclaim_open_state() eventually calls nfs4_reclaim_locks() which >>>>> marks the lock LOST. state is delegated so the kernel logs "lock >>>>> reclaim failed". >>>>> write retries and in nfs4_copy_ lock_stateid() the lock is marked LOST >>>>> and the nfs4_select_rw_stateid() fails with EIO. >>>>> >>>>>> >>>>>>> What kernel is this? >>>>>> >>>>>> This is upstream. >>>> >>>> So why isn't nfs4_write_stateid_changed() catching the change before >>>> we even get to nfs4_async_handle_error()? That's where this race is >>>> supposed to get resolved. >>> >>> Probably because nfs4_stateid_is_current() returns true because >>> nfs4_set_rw_stateid() calls the nfs4_select_rw_stateid() and finds the >>> lock lost? >> >> I don't understand why the lock would already be marked as lost. >> Remember that we're calling from inside nfs4_write_done() and before >> calling nfs4_async_handle_error(). > > Sigh. I'm so confused... > > On November 30th I posted "is receiving BAD_STATEID during IO on > delegated stateid unrecoverable?" and you've agreed that it would lead > to the EIO. > > There are a couple of cases i'm juggling and I'm just getting confused > if they are different or same. 1 is using delegation stateid for the > IO while we are returning the stateid and thus getting BAD_STATEID. 2 > is getting a BAD_STATEID on a delegated stateid on IO in without a > deleg_return. It seems like 2nd case can also lead to the EIO. I have > lossy traces from QA so it's possible that it's the 1st case and known > to get the EIO. got my cases mixed up. 2nd case was the november's message and agreed to be unrecoverable. 1st case is what i'm also looking into (maybe). > > >> >> -- >> Trond Myklebust >> >> Linux NFS client maintainer, PrimaryData >> >> trond.myklebust@xxxxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html