Hi folks,

The BAD_SEQID error could be caused by a known bug in 7.4:
https://access.redhat.com/solutions/3354251
It's been fixed in kernel-3.10.0-693.23.1.el7. Can you try updating to
see if that helps? The bug was that the client was sending a double
CLOSE, which threw off the seqid use.

On Wed, Sep 18, 2019 at 9:07 AM Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>
> On 17 Sep 2019, at 22:20, Leon Kyneur wrote:
>
> > On Tue, Sep 17, 2019 at 7:28 PM Benjamin Coddington
> > <bcodding@xxxxxxxxxx> wrote:
> >>
> >> On 12 Sep 2019, at 4:27, Leon Kyneur wrote:
> >>
> >>> Hi
> >>>
> >>> I'm experiencing an issue on NFS 4.0 + 4.1 where we cannot call fcntl
> >>> locks on any file on the share. The problem goes away if the share is
> >>> umount && mount (mount -o remount does not resolve the issue)
> >>>
> >>> Client:
> >>> EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64
> >>>
> >>> Server:
> >>> EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64
> >>>
> >>> I can't figure this out but the client reports bad-sequence-id in
> >>> duplicate in the logs:
> >>> Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff881c52286220!
> >>> Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff881c52286220!
> >>> Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff8810889cb020!
> >>> Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff8810889cb020!
> >>> Sep 12 02:17:44 client kernel: NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff881b414b2620!
> >>>
> >>> wireshark capture shows only 1 BAD_SEQID reply from the server:
> >>> $ tshark -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid -z proto,colinfo,nfs.seqid,nfs.seqid -R 'rpc.xid == 0x9990c61d'
> >>> tshark: -R without -2 is deprecated. For single-pass filtering use -Y.
> >>> 141 93 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH: 0x80589398 Offset: 0 Length: <End of File> nfs.seqid == 0x0000004e nfs.seqid == 0x00000002 rpc.xid == 0x9990c61d
> >>> 142 93 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call In 141) LOCK Status: NFS4ERR_BAD_SEQID rpc.xid == 0x9990c61d
> >>>
> >>> The system call I have identified as triggering it is:
> >>> fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824, len=1}) = -1 EIO (Input/output error)
> >>
> >> Can you simplify the trigger into something repeatable? Can you
> >> determine if the client or the server has lost track of the sequence?
> >>
> >
> > I have tried. I wrote some code to perform the fcntl RDLCK the same
> > way and ran it across thousands of machines without any success. I am
> > quite sure this is a symptom of something, not the cause.
> >
> > Is there a better way of tracking sequences other than monitoring the
> > network traffic?
>
> I think that's the best way, right now. We do have tracepoints for
> nfs4 open and close that show the sequence numbers on the client, but
> I'm not sure about how to get that from the server side. I don't think
> we have seqid for locks in tracepoints. I could be missing something.
> Not only that, but you might not get tracepoint output showing the
> sequence numbers if you're in an error-handling path.
>
> If you have a wire capture of the event, you should be able to go
> backwards from the error and figure out what the sequence number on the
> state should be for the operation that received BAD_SEQID by finding
> the last sequence-mutating (OPEN,CLOSE,LOCK) operation for that stateid
> that did not return an error.
>
> Ben
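
To make Ben's "go backwards from the error" suggestion concrete, here is
a rough Python sketch of that walk. It is only an illustration under
several assumptions: the capture has already been decoded into simple
records (the Op class, its field names and the state_key identifier are
hypothetical, not tshark output), and the expected seqid is taken to be
the last successful one plus one, ignoring wraparound. Which errors do
or do not advance the seqid is simplified here to "any error does not".

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Operations named in the thread as sequence-mutating.
SEQ_MUTATING = {"OPEN", "CLOSE", "LOCK"}


@dataclass
class Op:
    """One decoded call/reply pair (hypothetical record layout)."""
    frame: int       # frame number of the call in the capture
    name: str        # operation name, e.g. "LOCK"
    state_key: str   # assumed identifier for the state the seqid belongs to
    seqid: int       # seqid sent by the client in the call
    status: str      # reply status, "NFS4_OK" on success


def expected_seqid(ops: Iterable[Op], failing_frame: int) -> Optional[int]:
    """Walk backwards from the failing op to the last successful
    sequence-mutating op on the same state; return its seqid + 1.

    Returns None if the failing frame or a prior successful op
    cannot be found in the records.
    """
    ordered = sorted(ops, key=lambda o: o.frame)
    failing = next((o for o in ordered if o.frame == failing_frame), None)
    if failing is None:
        return None
    for op in reversed(ordered):
        if op.frame >= failing.frame:
            continue  # only look at traffic before the failure
        if op.state_key != failing.state_key:
            continue  # different state, different sequence
        if op.name not in SEQ_MUTATING or op.status != "NFS4_OK":
            continue  # simplification: errors are treated as not advancing
        return op.seqid + 1  # assumption: plain increment, no wraparound
    return None
```

Comparing the returned value with the seqid the client actually sent in
the failing call would hint at whether the client skipped ahead or fell
behind; it does not by itself say which side lost track.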