On Tue, Apr 13, 2021 at 3:08 AM Rick Macklem <rmacklem@xxxxxxxxxxx> wrote:
>
> Hi,
>
> During testing of a Fedora Core 30 (5.2.10 kernel) against a FreeBSD
> server (4.1 mount), I have been simulating a network partitioning
> for a few minutes (until the TCP connection goes to SYN_SENT on
> the Linux client).
>
> Sometimes, after the network partition heals, the FreeBSD server
> replies NFS4ERR_SEQ_MISORDERED.
> Looking at the packet trace, the seqid for the slot has advanced by
> 2 instead of 1. An RPC request for old-seqid + 1 never seems to get
> sent.
> --> Since sending an RPC with "seqid + 2" but never sending one
>       that is "seqid + 1" for a slot seems harmless, I have added an optional
>       hack (can be turned off), to allow this case instead of replying
>       NFS4ERR_SEQ_MISORDERED for it. The code will still reply
>       NFS4ERR_SEQ_MISORDERED if an RPC for the slot with
>       "old seqid + 1" in it.
>       --> Yes, doing this hack is a violation of RFC5661, but I've
>             done it anyhow.
>
> If you are interested in a packet capture with this in it:
> fetch https://people.freebsd.org/~rmacklem/linuxtofreenfs.pcap
> - then look at packet #1945 and #2072
> --> You'll see that slot #1 seqid goes from 4 to 6. There is no
>       slot#1 seqid 5 RPC sent on the wire.
>       (This packet capture was taken on the Linux client using
>        tcpdump.)
> --> Btw, the "RST battle" you'll see in the above trace between
>       #2005 and #2068 that goes on until the FreeBSD
>       krpc/NFS times out the connection after 6min. seems to be a recent
>       FreeBSD TCP bug.
>       I have reproduced this seqid advances by 2 on an older system
>       that does not "RST battle" and allows the reconnect right away,
>       once the network partition is healed, so it does seem to be
>       relevant to this bug.
>
> Someday, I will get around to upgrading to a more recent Linux
> kernel and will test to see if I can still reproduce this bug.
> On 5.2.10, it is intermittent and does not occur every time I
> do the network partitioning test.
>
> Mostly just fyi, rick

Hi Rick,

I think this is happening because slotid=1 had a request queued up
using seqid=5, and that request was interrupted when the connection
was RSTed. For the interrupted slot, the client would send a solo
SEQUENCE with the seqid incremented by 1.
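
To make the server-side check under discussion concrete, here is a rough
Python sketch of a per-slot seqid test: the strict behaviour (accept
"last seqid + 1", treat "last seqid" as a replay, otherwise reply
NFS4ERR_SEQ_MISORDERED) plus the optional "allow seqid + 2" relaxation
described in the quoted message. All names here (Slot, allow_skip,
check_seqid) are made up for illustration, this is not the FreeBSD
code, and 32-bit seqid wraparound is ignored for brevity.

```python
from dataclasses import dataclass

# Illustrative result codes; only NFS4ERR_SEQ_MISORDERED is a real NFSv4.1 error name.
NFS4_OK = "NFS4_OK"
REPLAY = "REPLAY_FROM_CACHE"
NFS4ERR_SEQ_MISORDERED = "NFS4ERR_SEQ_MISORDERED"


@dataclass
class Slot:
    last_seqid: int = 0        # seqid of the last request executed on this slot
    allow_skip: bool = False   # hypothetical knob standing in for the optional hack


def check_seqid(slot: Slot, seqid: int) -> str:
    """Classify an incoming SEQUENCE seqid for one slot (wraparound ignored)."""
    if seqid == slot.last_seqid:
        # Same seqid as the last request: a retransmission, answer from the reply cache.
        return REPLAY
    if seqid == slot.last_seqid + 1:
        # The normal case: the next request on this slot.
        slot.last_seqid = seqid
        return NFS4_OK
    if slot.allow_skip and seqid == slot.last_seqid + 2:
        # Relaxed case: the client skipped one seqid that never went on the wire.
        slot.last_seqid = seqid
        return NFS4_OK
    return NFS4ERR_SEQ_MISORDERED


# Slot #1 in the trace: last seqid seen is 4, next request arrives with 6.
strict = Slot(last_seqid=4)
relaxed = Slot(last_seqid=4, allow_skip=True)
print(check_seqid(strict, 6))    # strict check rejects the jump
print(check_seqid(relaxed, 6))   # relaxed check accepts it
print(check_seqid(relaxed, 5))   # a late "old seqid + 1" is still rejected
```

With the relaxation on, the slot moves straight from 4 to 6, and a
request carrying seqid 5 that shows up afterwards is still refused,
which matches the behaviour described in the quoted message.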