On Mon, Mar 11, 2019 at 11:12 AM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Mon, 2019-03-11 at 14:30 +0000, Trond Myklebust wrote:
> > Hi Olga,
> >
> > On Sun, 2019-03-10 at 18:20 -0400, Olga Kornievskaia wrote:
> > > There are a bunch of cases where multiple operations are using
> > > the same seqid and slot.
> > >
> > > An example of such weirdness is (nfs.seqid == 0x000002f4) &&
> > > (nfs.slotid == 0), and it is the one leading to the hang.
> > >
> > > In frame 415870, there is an OPEN using that seqid and slot for
> > > the first time (but this slot will be re-used a bunch of times
> > > before it gets a reply in frame 415908 with the open stateid
> > > seq=40). (Also in this packet there is an example of re-use of
> > > slot=1+seqid=0x000128f7 by both TEST_STATEID and OPEN, but let's
> > > set that aside.)
> > >
> > > In frame 415874 (in the same packet), the client sends 5 OPENs
> > > on the SAME seqid and slot (all have distinct xids). In a way
> > > that ends up being all right, since the opens are for the same
> > > file and are therefore answered out of the reply cache, and the
> > > reply is ERR_DELAY. But in frame 415876, the client again uses
> > > the same seqid and slot, and in this case it is used by 3 OPENs
> > > and a TEST_STATEID.
>
> This should result in exactly 1 bump of the stateid seqid.
>
> > > In all this mess the client never processes the open stateid
> > > seq=40 and keeps on resending CLOSE with seq=37. (Note also that
> > > the client "missed" processing seqid=38 and 39 as well: 39
> > > probably because it was a reply on the same kind of "reused"
> > > slot=1 and seq=0x000128f7; I haven't tracked down 38, but I'm
> > > assuming it's the same.) I don't know how many times, but after
> > > 5 minutes I see a TEST_STATEID that again uses the same
> > > seqid+slot (which gets a reply from the cache matching the
> > > OPEN). Also an OPEN + CLOSE (still with seq=37); the OPEN is
> > > replied to, but after this the client goes into a soft lockup.
> > > The logs have "nfs4_schedule_state_manager: kthread_run: -4"
> > > over and over again, and then a soft lockup.
> > >
> > > Looking back on slot 0: nfs.seqid=0x000002f3 was used in frame
> > > 415866 by the TEST_STATEID. This is replied to in frame 415877
> > > (with an ERR_DELAY). But before the client got that reply, it
> > > had used the slot and the seqid in frame 415874. TEST_STATEID is
> > > a synchronous and interruptible operation. I suspect that
> > > somehow it was interrupted, and that's how the slot was able to
> > > be re-used by frame 415874. But how the several OPENs were able
> > > to get the same slot, I don't know.
> >
> > Is this still true with the current linux-next? I would expect
> > this patch
> > http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commitdiff;h=3453d5708b33efe76f40eca1c0ed60923094b971
> > to change the Linux client behaviour in the above regard.
>
> Note also that what you describe would appear to indicate a serious
> server bug. If the client is reusing a slot+seqid, then the correct
> server behaviour is to either return one of the errors
> NFS4ERR_DELAY, NFS4ERR_SEQ_FALSE_RETRY, NFS4ERR_RETRY_UNCACHED_REP
> or NFS4ERR_SEQ_MISORDERED, or else it must replay the exact same
> reply that it had cached for the original request.
>
> It is a protocol violation for the server to execute new requests
> on a slot+seqid combination that has already been used.
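Just to check that I'm reading RFC 5661 the same way you are, the
slot handling you describe would look roughly like this (an
illustration of my understanding only, not actual knfsd code; the
names and the request-digest comparison are made up for the sketch):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum nfsstat4 {
	NFS4_OK,
	NFS4ERR_SEQ_MISORDERED,     /* seqid neither new nor the last one */
	NFS4ERR_SEQ_FALSE_RETRY,    /* same slot+seqid, different request */
	NFS4ERR_RETRY_UNCACHED_REP, /* a retry, but no cached reply */
};

struct slot {
	uint32_t seqid;           /* highest seqid seen on this slot */
	uint64_t req_digest;      /* identifies the request we executed */
	const void *cached_reply; /* NULL if the reply was not cached */
};

/* Called for each SEQUENCE op; the server may also return
 * NFS4ERR_DELAY before any of this if it cannot process the request
 * right now. */
enum nfsstat4 check_slot_seqid(struct slot *slot, uint32_t seqid,
			       uint64_t req_digest, const void **replay)
{
	*replay = NULL;

	if (seqid == slot->seqid + 1) {
		/* A new request: execute it and cache the reply. */
		slot->seqid = seqid;
		slot->req_digest = req_digest;
		return NFS4_OK;
	}
	if (seqid != slot->seqid)
		return NFS4ERR_SEQ_MISORDERED;

	/* seqid matches the highest seen, so this must be a retry... */
	if (req_digest != slot->req_digest)
		/* ...but it is not the request we executed. */
		return NFS4ERR_SEQ_FALSE_RETRY;
	if (slot->cached_reply == NULL)
		return NFS4ERR_RETRY_UNCACHED_REP;

	/* A genuine replay: hand back the cached reply bit for bit,
	 * without executing the compound again. */
	*replay = slot->cached_reply;
	return NFS4_OK;
}

The point being that only the seqid == slot->seqid + 1 path is ever
allowed to execute anything new.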
> For that reason, it is also impossible for a replay to cause
> further state changes on the server; a replay means that the server
> belts out the exact same reply that it attempted to send earlier,
> with no changes (with stateids that match bit for bit what would
> have been sent earlier).

But it is the same request, because all of them are opens on the same
file, same everything.

> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
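P.S. For completeness, my (possibly wrong) mental model of why the
client stays stuck at seq=37 is the sketch below; it is how I
understand the client-side stateid seqid rule, not the actual
fs/nfs/nfs4proc.c code. If the replies carrying seq=38..40 are never
processed, nothing ever advances the stateid:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct nfs4_stateid {
	uint32_t seqid;
	char other[12];
};

/* Serial-number comparison, since seqids wrap around at 2^32. */
static bool seqid_is_newer(uint32_t new, uint32_t old)
{
	return (int32_t)(new - old) > 0;
}

static void maybe_update_open_stateid(struct nfs4_stateid *cur,
				      const struct nfs4_stateid *rcvd)
{
	if (memcmp(cur->other, rcvd->other, sizeof(cur->other)) != 0)
		return; /* a different stateid entirely, not ours */
	if (seqid_is_newer(rcvd->seqid, cur->seqid))
		cur->seqid = rcvd->seqid;
	/* else: a stale or duplicate reply; CLOSE keeps using the
	 * current (old) seqid, e.g. the seq=37 seen on the wire */
}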