On Tue, Sep 17, 2019 at 7:28 PM Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>
> On 12 Sep 2019, at 4:27, Leon Kyneur wrote:
>
> > Hi
> >
> > I'm experiencing an issue on NFS 4.0 + 4.1 where we cannot call fcntl
> > locks on any file on the share. The problem goes away if the share is
> > umount && mount (mount -o remount does not resolve the issue)
> >
> > Client:
> > EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64
> >
> > Server:
> > EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64
> >
> > I can't figure this out but the client reports bad-sequence-id in
> > duplicate in the logs:
> > Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
> > sequence-id error on an unconfirmed sequence ffff881c52286220!
> > Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
> > sequence-id error on an unconfirmed sequence ffff881c52286220!
> > Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
> > sequence-id error on an unconfirmed sequence ffff8810889cb020!
> > Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
> > sequence-id error on an unconfirmed sequence ffff8810889cb020!
> > Sep 12 02:17:44 client kernel: NFS: v4 server returned a bad
> > sequence-id error on an unconfirmed sequence ffff881b414b2620!
> >
> > wireshark capture shows only 1 BAD_SEQID reply from the server:
> > $ tshark -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid -z
> > proto,colinfo,nfs.seqid,nfs.seqid -R 'rpc.xid == 0x9990c61d'
> > tshark: -R without -2 is deprecated. For single-pass filtering use -Y.
> > 141 93 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
> > 0x80589398 Offset: 0 Length: <End of File> nfs.seqid == 0x0000004e
> > nfs.seqid == 0x00000002 rpc.xid == 0x9990c61d
> > 142 93 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call
> > In 141) LOCK Status: NFS4ERR_BAD_SEQID rpc.xid == 0x9990c61d
> >
> > system call I have identified as triggering it is:
> > fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824,
> > len=1}) = -1 EIO (Input/output error)
>
> Can you simplify the trigger into something repeatable? Can you determine
> if the client or the server has lost track of the sequence?
>

I have tried: I wrote some code to perform the fcntl F_RDLCK the same way
and ran it across thousands of machines, but could not reproduce the
failure. I am quite sure this is a symptom of something, not the cause.

Is there a better way of tracking sequences than monitoring the network
traffic?

> > The server filesystem is ZFS though NFS sharing is turned off via ZFS
> > options and it's exported using /etc/exports / nfsd...
> >
> > The BAD_SEQID error seems to be fairly random, we have over 2000
> > machines connected to the share and it's experienced frequently but
> > randomly across our clients.
> >
> > It's worth mentioning that the majority of the clients are mounting
> > 4.0 we did try 4.1 everywhere but hit this
> > https://access.redhat.com/solutions/3146191
>
> This was fixed in kernel-3.10.0-735.el7, FWIW..
>
> Ben

Thanks, good to know. I am planning an update soon but have been stuck on
3.10.0-693 for other reasons.

Leon
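
For anyone who wants to try the same thing, here is a minimal sketch of a reproducer for the fcntl call quoted above. It is not the code that was actually run on the clients. The helper name try_read_lock and the command-line handling are made up for illustration, and the path has to point at a file on the affected NFS mount. The offset and length are the ones from the strace line.

```python
import errno
import fcntl
import os
import sys

# Offset and length taken from the strace line quoted in the thread.
LOCK_START = 1073741824
LOCK_LEN = 1


def try_read_lock(path, start=LOCK_START, length=LOCK_LEN):
    """Try a non-blocking read lock; return None on success, else the errno."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # LOCK_SH | LOCK_NB is issued by CPython as
        # fcntl(fd, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, ...}).
        fcntl.lockf(fd, fcntl.LOCK_SH | fcntl.LOCK_NB, length, start, os.SEEK_SET)
        return None
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)  # closing the descriptor also releases the lock


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: pass a file on the NFS share as the only argument.
    err = try_read_lock(sys.argv[1])
    if err is None:
        print("lock acquired and released")
    else:
        print("lock failed:", errno.errorcode.get(err, err), os.strerror(err))
```

On a healthy mount this should report that the lock was acquired. On a client in the broken state I would expect it to report EIO, matching the strace output, though that is an assumption until it is run on an affected machine.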