Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID

"Benjamin Coddington" <bcodding@xxxxxxxxxx> · Tue, 17 Sep 2019 07:28:43 -0400

On 12 Sep 2019, at 4:27, Leon Kyneur wrote:

> Hi
>
> I'm experiencing an issue on NFS 4.0 + 4.1 where we cannot call fcntl
> locks on any file on the share. The problem goes away if the share is
> umount && mount (mount -o remount does not resolve the issue)
>
> Client:
> EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64
>
> Server:
> EL 7.4 3.10.0-693.5.2.el7.x86_64  nfs-utils-1.3.0-0.48.el7_4.x86_64
>
> I can't figure this out but the client reports bad-sequence-id in
> duplicate in the logs:
> Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
> sequence-id error on an unconfirmed sequence ffff881c52286220!
> Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
> sequence-id error on an unconfirmed sequence ffff881c52286220!
> Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
> sequence-id error on an unconfirmed sequence ffff8810889cb020!
> Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
> sequence-id error on an unconfirmed sequence ffff8810889cb020!
> Sep 12 02:17:44 client kernel: NFS: v4 server returned a bad
> sequence-id error on an unconfirmed sequence ffff881b414b2620!
>
> wireshark capture shows only 1 BAD_SEQID reply from the server:
> $ tshark -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid -z
> proto,colinfo,nfs.seqid,nfs.seqid -R 'rpc.xid == 0x9990c61d'
> tshark: -R without -2 is deprecated. For single-pass filtering use -Y.
> 141         93 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
> 0x80589398 Offset: 0 Length: <End of File>  nfs.seqid == 0x0000004e
> nfs.seqid == 0x00000002  rpc.xid == 0x9990c61d
> 142         93 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call
> In 141) LOCK Status: NFS4ERR_BAD_SEQID  rpc.xid == 0x9990c61d
>
> system call I have identified as triggering it is:
> fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824,
> len=1}) = -1 EIO (Input/output error)

Can you simplify the trigger into something repeatable?  Can you determine
if the client or the server has lost track of the sequence?
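
A minimal sketch of what such a reproducer might look like, assuming the
failing call is the byte-range read lock from the fcntl() line quoted above.
The helper name try_read_lock and the command-line path arguments are
hypothetical; only the offset (1073741824) and length (1) come from that
line. On Linux, CPython's fcntl.lockf() with LOCK_SH | LOCK_NB should issue
fcntl(fd, F_SETLK, {l_type=F_RDLCK, ...}), which matches the reported call.

```python
import errno
import fcntl
import os
import sys

# Offset and length taken from the fcntl() call quoted in the report.
LOCK_START = 1073741824
LOCK_LEN = 1


def try_read_lock(path, start=LOCK_START, length=LOCK_LEN):
    """Try a non-blocking shared byte-range lock on `path`.

    Returns 0 if the lock was taken (and released again), otherwise the
    errno of the failing call. Hypothetical helper, for illustration only.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # Shared + non-blocking: a read lock requested via F_SETLK.
        fcntl.lockf(fd, fcntl.LOCK_SH | fcntl.LOCK_NB, length, start, os.SEEK_SET)
        # Release the range so repeated runs start from a clean state.
        fcntl.lockf(fd, fcntl.LOCK_UN, length, start, os.SEEK_SET)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Usage (assumed): python3 locktest.py /path/on/nfs/file [more files...]
    for path in sys.argv[1:]:
        err = try_read_lock(path)
        status = "ok" if err == 0 else errno.errorcode.get(err, str(err))
        print(f"{path}: {status}")
```

Run against a file on an affected mount, a result of EIO would match the
report; running it in a loop while capturing traffic may help show which
side's seqid drifts first.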

> The server filesystem is ZFS though NFS sharing is turned off via ZFS
> options and it's exported using /etc/exports / nfsd...
>
> The BAD_SEQID error seems to be fairly random, we have over 2000
> machines connected to the share and it's experienced frequently but
> randomly across our clients.
>
> It's worth mentioning that the majority of the clients are mounting
> 4.0; we did try 4.1 everywhere but hit this:
> https://access.redhat.com/solutions/3146191

This was fixed in kernel-3.10.0-735.el7, FWIW.

Ben