Re: [bug report] task hang while testing xfstests generic/323

On Wed, Mar 6, 2019 at 11:09 AM bfields@xxxxxxxxxxxx
<bfields@xxxxxxxxxxxx> wrote:
>
> On Wed, Mar 06, 2019 at 05:59:39PM +0800, Jiufei Xue wrote:
> > Hi Trond,
> >
> > >
> > > No, that's quite weird... Were you able to catch this behaviour on
> > > wireshark? It would be interesting to see if the server is actually
> > > sending 2 SEQUENCE ops or if that is a corruption that occurs on the
> > > client.
> > >
> >
> > Many packets are dropped by the kernel when I use tcpdump, so I am not
> > sure whether the server is sending 2 SEQUENCE ops.
>
> For what it's worth, tcpdump with -s200 or so should be enough to get
> this.  Or maybe capturing from a different machine could help?
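
For reference, an invocation along these lines should keep the snapshot
length small enough to avoid the drops Jiufei saw (the interface name and
output file are placeholders; 2049 is the standard NFS port):

    tcpdump -i eth0 -s 200 -w nfs.pcap port 2049
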
>
> > However, according to the tracepoint nfsd_compound_status, the server
> > is processing the SEQUENCE op only once. I think the blame more likely
> > lies with the client.
>
> If there was a bug in nfsd's xdr decoding, that'd happen after that
> tracepoint.  That'd be weird corruption, but perhaps all the
> possibilities are weird at this point.

This is not a bug in the Linux nfsd. I was able to reproduce it
against an Ontap server. I ran with the RHEL7.6 client (but the
problem exists upstream as well, since Jiufei hit it there). (I wanted
to attach a trace, but it looks like it's too big, as my first attempt
at posting this was unsuccessful. I can send the trace to a personal
email if anyone is interested.)

There are a bunch of cases where multiple operations are using the
same seqid and slot.

An example of such weirdness is (nfs.seqid == 0x000002f4) &&
(nfs.slotid == 0), which is the one leading to the hang.
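
If you have the trace handy, a tshark pass along these lines pulls out
every frame touching that seqid+slot pair (the capture file name is a
placeholder):

    tshark -r nfs.pcap -Y '(nfs.seqid == 0x000002f4) && (nfs.slotid == 0)' \
           -T fields -e frame.number -e rpc.xid

Seeing several distinct rpc.xid values against a single seqid+slot pair
is exactly the reuse walked through below.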

In frame 415870, there is an OPEN using that seqid and slot for the
first time (but this slot will be re-used a bunch of times before it
gets a reply in frame 415908 carrying the open stateid seq=40). (Also
in this packet there is an example of the slot=1 + seqid=0x000128f7
pair being reused by both a TEST_STATEID and an OPEN, but let's set
that aside.)

In frame 415874 (in the same packet), the client sends 5 OPENs on the
SAME seqid and slot (all with distinct xids). In a way that ends up
being all right, since the OPENs are for the same file and are thus
answered out of the reply cache, and the reply is ERR_DELAY. But in
frame 415876, the client uses the same seqid and slot yet again, and
in this case it's shared by 3 OPENs and a TEST_STATEID.
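
To spell out why the duplicates in frame 415874 only "work" by
accident: per RFC 5661, the server keys its reply cache on the slot
and seqid, so a repeated seqid on a slot simply gets the previously
cached reply back. A minimal sketch of that server-side check
(conceptual only, not nfsd's actual code):

    #include <stdint.h>

    enum seq_status { SEQ_NEW, SEQ_REPLAY, SEQ_MISORDERED };

    struct slot {
            uint32_t seq_nr;   /* highest seqid seen on this slot; the
                                  cached reply for seq_nr lives here too */
    };

    static enum seq_status check_slot(const struct slot *s, uint32_t seqid)
    {
            if (seqid == s->seq_nr + 1)
                    return SEQ_NEW;       /* new request; cache its reply */
            if (seqid == s->seq_nr)
                    return SEQ_REPLAY;    /* retransmit: replay cached reply */
            return SEQ_MISORDERED;        /* anything else is a violation */
    }

So the extra OPENs are treated as retransmissions and get the cached
ERR_DELAY back; had they been for different files, the client would
have received replies to the wrong requests.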

In all this mess the client never processes the open stateid seq=40
and keeps on resending CLOSE with seq=37. (Also of note: the client
"missed" processing seq=38 and 39 as well; 39 probably because it was
a reply on the same kind of "reused" slot=1 + seqid=0x000128f7 pair. I
haven't tracked down 38, but I'm assuming it's the same story.) I
don't know how many times, but after 5 minutes I see a TEST_STATEID
that again uses the same seqid+slot (which gets a reply from the cache
matching the OPEN). Also an OPEN + CLOSE (the CLOSE still with
seq=37); the OPEN is replied to, but after this the client logs
"nfs4_schedule_state_manager: kthread_run: -4" (i.e. -EINTR) over and
over again, and then hits a soft lockup.
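
Why a missed seq=40 leaves CLOSE stuck at seq=37: the client only
advances its copy of an open stateid when it sees a newer seqid for
the same stateid. A sketch of that comparison, assuming the usual
wraparound-safe rule from RFC 5661 (not the actual client code):

    #include <stdint.h>
    #include <stdbool.h>

    /* True when "candidate" is newer than "current". Stateid seqids
     * increment on every state-changing op and may wrap, hence the
     * signed difference. */
    static bool stateid_seqid_is_newer(uint32_t current, uint32_t candidate)
    {
            return (int32_t)(candidate - current) > 0;
    }

With the replies carrying seq=38..40 all lost to the slot confusion,
the client's copy never advances past 37, so every retried CLOSE
presents a stale stateid.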

Looking back at slot 0: nfs.seqid=0x000002f3 was used in frame 415866
by the TEST_STATEID. This is replied to in frame 415877 (with an
ERR_DELAY). But before the client got that reply, the slot and the
seqid had already been reused by frame 415874. TEST_STATEID is a
synchronous and interruptible operation; I suspect that it was somehow
interrupted, and that's how the slot was able to be re-used by frame
415874. But how the several OPENs were all able to get the same slot,
I don't know..
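
If that theory is right, the hazard looks roughly like the sketch
below (assumed behaviour, not the actual Linux slot-table code): an
interrupted call returns its slot to the free list without bumping the
sequence number, while the original request is still on the wire.

    #include <stdint.h>
    #include <stdbool.h>

    struct session_slot {
            uint32_t slot_id;
            uint32_t seq_nr;   /* seqid to use on the next SEQUENCE */
            bool     in_use;
    };

    /* The interrupted caller gives the slot back without bumping
     * seq_nr... */
    static void free_slot_on_interrupt(struct session_slot *s)
    {
            s->in_use = false;
    }

    /* ...so the next caller transmits the same (slot_id, seq_nr) pair
     * that the still-in-flight TEST_STATEID already used, matching
     * what frames 415866 and 415874 show. Bumping seq_nr when an
     * interrupted slot is recycled would make the reuse a new request
     * rather than a false replay. */
    static void acquire_slot(struct session_slot *s)
    {
            s->in_use = true;
    }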

Regardless of how we got into the soft lockup, the client shouldn't be
using the same slot+seqid for multiple distinct operations.
>
> --b.
>
> > Is there any other way to catch the behaviour?
> >
> > Thanks,
> > Jiufei
> >


