We have been examining NFSv3 read IO errors that exist in Ubuntu Azure 18.04, 20.04, 21.04 releases as well as RHEL 8.2. The simplest reproduction involves reads of two files immediately after a mount. The server is set to **not** return post op attributes for file1.dat. The server is set to return post op attributes for file2.dat. ```````````````````````````````````````````````````````````````````````````````` [root@client-linux-1 ~]# date ; sudo mount -o vers=3,proto=tcp,mountproto=tcp ${SERVER_IP}:/1_1_1_0 /tmp/mnt Fri Oct 29 17:41:36 UTC 2021 [root@client-linux-1 ~]# md5sum /tmp/mnt/file1.dat 0829f71740aab1ab98b33eae21dee122 /tmp/mnt/file1.dat [root@client-linux-1 ~]# date ; md5sum /tmp/mnt/file2.dat Fri Oct 29 17:42:12 UTC 2021 md5sum: /tmp/mnt/file2.dat: Input/output error ```````````````````````````````````````````````````````````````````````````````` This path seems to set a per mount buffer offset based on the size of the first read response after mount. The offset based on a read with no post op attributes does not permit subsequent read replies with post op attributes to be decoded. ```````````````````````````````````````````````````````````````````````````````` fs/nfs/nfs3proc.c static int nfs3_read_done(struct rpc_task *task, struct nfs_pgio_header *hdr) { struct inode *inode = hdr->inode; struct nfs_server *server = NFS_SERVER(inode); if (hdr->pgio_done_cb != NULL) return hdr->pgio_done_cb(task, hdr); if (nfs3_async_handle_jukebox(task, inode)) return -EAGAIN; if (task->tk_status >= 0 && !server->read_hdrsize) cmpxchg(&server->read_hdrsize, 0, hdr->res.replen); <--- nfs_invalidate_atime(inode); nfs_refresh_inode(inode, &hdr->fattr); return 0; } ```````````````````````````````````````````````````````````````````````````````` Is this a known issue? Detailed reproduction: Azure Ubunutu 21.04 / Linux 5.11.0 client. ```````````````````````````````````````````````````````````````````````````````` [root@client-linux-1 ~]# uname -a Linux client-linux-1 5.11.0-1020-azure #21-Ubuntu SMP Mon Oct 11 18:05:46 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux ```````````````````````````````````````````````````````````````````````````````` Force NFSv3 read responses to not return post op attributes. Start Linux NFS client trace. ```````````````````````````````````````````````````````````````````````````````` [root@client-linux-1 ~]# date ; trace-cmd record -e nfs -e sunrpc Fri Oct 29 17:40:47 UTC 2021 Hit Ctrl^C to stop recording ```````````````````````````````````````````````````````````````````````````````` Start packet capture ```````````````````````````````````````````````````````````````````````````````` [root@client-linux-1 ~]# date ; tshark -n -w /tmp/read_eio.pcap -f "host ${SERVER_IP} and tcp port 2049" Fri Oct 29 17:41:20 UTC 2021 Capturing on 'eth0' ```````````````````````````````````````````````````````````````````````````````` Mount NFSv3 server ```````````````````````````````````````````````````````````````````````````````` [root@client-linux-1 ~]# date ; sudo mount -o vers=3,proto=tcp,mountproto=tcp ${SERVER_IP}:/1_1_1_0 /tmp/mnt Fri Oct 29 17:41:36 UTC 2021 ```````````````````````````````````````````````````````````````````````````````` Trigger the first NFSv3 read for the mount point. ```````````````````````````````````````````````````````````````````````````````` [root@client-linux-1 ~]# md5sum /tmp/mnt/file1.dat 0829f71740aab1ab98b33eae21dee122 /tmp/mnt/file1.dat ```````````````````````````````````````````````````````````````````````````````` Allow NFSv3 read responses to return post op attributes. Read second file. ```````````````````````````````````````````````````````````````````````````````` [root@client-linux-1 ~]# date ; md5sum /tmp/mnt/file2.dat Fri Oct 29 17:42:12 UTC 2021 md5sum: /tmp/mnt/file2.dat: Input/output error ```````````````````````````````````````````````````````````````````````````````` Stop Linux NFS client trace ```````````````````````````````````````````````````````````````````````````````` [root@client-linux-1 ~]# date ; trace-cmd record -e nfs -e sunrpc Fri Oct 29 17:40:47 UTC 2021 Hit Ctrl^C to stop recording ^CCPU0 data recorded at offset=0x866000 24576 bytes in size CPU1 data recorded at offset=0x86c000 40960 bytes in size ```````````````````````````````````````````````````````````````````````````````` Stop packet capture ```````````````````````````````````````````````````````````````````````````````` [root@client-linux-1 ~]# date ; tshark -n -w /tmp/read_eio.pcap -f "host ${SERVER_IP} and tcp port 2049" Fri Oct 29 17:41:20 UTC 2021 Capturing on 'eth0' 55 ^C ```````````````````````````````````````````````````````````````````````````````` The read response for the read of file1.dat is frame 42. The read response for the read of file2.dat is frames 51 and 54. ```````````````````````````````````````````````````````````````````````````````` [root@client-linux-1 ~]# tshark -n -r /tmp/read_eio.pcap -Y 'nfs and (rpc.procedure==3 or rpc.procedure==6)' | sed -Ee "s/${CLIENT_IP}/CLIENT/g" | sed -Ee "s/${SERVER_IP}/SERVER/g" 34 13.536263935 CLIENT → SERVER NFS 202 V3 LOOKUP Call, DH: 0x184e15f0/file1.dat 36 13.536864241 SERVER → CLIENT NFS 310 V3 LOOKUP Reply (Call In 34), FH: 0x7b50f703 41 13.537300844 CLIENT → SERVER NFS 206 V3 READ Call, FH: 0x7b50f703 Offset: 0 Len: 8192 42 13.537913850 SERVER → CLIENT NFS 8306 V3 READ Reply (Call In 41) Len: 8192 44 35.735882048 CLIENT → SERVER NFS 202 V3 LOOKUP Call, DH: 0x184e15f0/file2.dat 45 35.736980657 SERVER → CLIENT NFS 310 V3 LOOKUP Reply (Call In 44), FH: 0xea75aa56 50 35.737550062 CLIENT → SERVER NFS 206 V3 READ Call, FH: 0xea75aa56 Offset: 0 Len: 8192 51 35.737796364 SERVER → CLIENT NFS 8390 V3 READ Reply (Call In 50) Len: 8192 53 35.737986566 CLIENT → SERVER NFS 206 V3 READ Call, FH: 0xea75aa56 Offset: 0 Len: 4096 54 35.738280668 SERVER → CLIENT NFS 4294 V3 READ Reply (Call In 53) Len: 4096 ```````````````````````````````````````````````````````````````````````````````` The first NFSv3 read response after the mount for file1.dat does not contain post op attributes. The subsequent NFSv3 read responses for file2.dat do contain post op attributes, however, the NFS client returns EIO to the application. ```````````````````````````````````````````````````````````````````````````````` [root@client-linux-1 ~]# tshark -n -r /tmp/read_eio.pcap -Y 'rpc.procedure==6 and rpc.msgtyp==1' -T fields -e frame.number -e nfs.attributes_follow 42 0 51 1 54 1 [root@client-linux-1 ~]# tshark -n -r /tmp/read_eio.pcap -Y 'rpc.procedure==6 and rpc.msgtyp==1' -V -O nfs | grep attributes_follow attributes_follow: no value (0) attributes_follow: value follows (1) attributes_follow: value follows (1) ```````````````````````````````````````````````````````````````````````````````` Determine the XID values for the NFSv3 read operations. ```````````````````````````````````````````````````````````````````````````````` [root@client-linux-1 ~]# tshark -n -r /tmp/read_eio.pcap -Y 'rpc.procedure==6 and rpc.msgtyp==1' -T fields -e frame.number -e rpc.xid 42 0x6cba5607 51 0x6fba5607 54 0x70ba5607 ```````````````````````````````````````````````````````````````````````````````` The client trace for the read of file1.dat indicates a total size of the data size plus 44 bytes. The client trace for the read of file2.dat indicates a total size of the data size plus 128 bytes. ```````````````````````````````````````````````````````````````````````````````` [root@client-linux-1 ~]# XIDS=$(tshark -n -r /tmp/read_eio.pcap -Y 'rpc.procedure==6 and rpc.msgtyp==1' -T fields -e rpc.xid) [root@client-linux-1 ~]# for XID in ${XIDS} ; do trace-cmd report | grep xs_stream_read_request | grep -Ee ${XID} ; done | sed -Ee "s/${CLIENT_IP}/CLIENT/g" | sed -Ee "s/${SERVER_IP}/SERVER/g" kworker/u5:0-302937 [001] 256207.056646: xs_stream_read_request: peer=[SERVER]:2049 xid=0x6cba5607 copied=8236 reclen=8236 offset=8236 kworker/u5:0-302937 [001] 256229.256612: xs_stream_read_request: peer=[SERVER]:2049 xid=0x6fba5607 copied=8320 reclen=8320 offset=8320 kworker/u5:0-302937 [001] 256229.257024: xs_stream_read_request: peer=[SERVER]:2049 xid=0x70ba5607 copied=4224 reclen=4224 offset=4224 [root@client-linux-1 ~]# python3 -c "print(8236 - 8192)" 44 [root@client-linux-1 ~]# python3 -c "print(8320 - 8192)" 128 [root@client-linux-1 ~]# python3 -c "print(4224 - 4096)" 128 ```````````````````````````````````````````````````````````````````````````````` The number of bytes prior to the data is 128 when post op attributes are present. ```````````````````````````````````````````````````````````````````````````````` 1 rpc.xid 2 rpc.msgtyp 3 rpc.replystat 4 rpc.auth.flavor 5 rpc.auth.length 6 rpc.state_accept 7 nfs.status 8 nfs.attributes_follow 9 nfs.fattr3.type 10 nfs.mode3 11 nfs.fattr3.nlink 12 nfs.fattr3.uid 13 nfs.fattr3.gid 14 nfs.fattr3.size 15 nfs.fattr3.size 16 nfs.fattr3.used 17 nfs.fattr3.used 18 nfs.specdata1 19 nfs.specdata2 20 nfs.fattr3.fsid 21 nfs.fattr3.fsid 22 nfs.fattr3.fileid 23 nfs.fattr3.fileid 24 nfs.atime.sec 25 nfs.atime.nsec 26 nfs.mtime.sec 27 nfs.mtime.nsec 28 nfs.ctime.sec 29 nfs.ctime.nsec 30 nfs.count3 31 nfs.read.eof 32 count ```````````````````````````````````````````````````````````````````````````````` The number of bytes prior to the data is 44 when post op attributes are not present. ```````````````````````````````````````````````````````````````````````````````` 1 rpc.xid 2 rpc.msgtyp 3 rpc.replystat 4 rpc.auth.flavor 5 rpc.auth.length 6 rpc.state_accept 7 nfs.status 8 nfs.attributes_follow 9 nfs.count3 10 nfs.read.eof 11 count ```````````````````````````````````````````````````````````````````````````````` The rpc_xdr_overflow trace lines occur for the two read responses that contain post op attributes. ```````````````````````````````````````````````````````````````````````````````` [root@client-linux-1 ~]# trace-cmd report | grep -Ee '(rpc_xdr_overflow|xs_stream_read_request)' | grep -B 1 -Ee 'rpc_xdr_overflow' | sed -Ee "s/${CLIENT_IP}/CLIENT/g" | sed -Ee "s/${SERVER_IP}/SERVER/g" kworker/u5:0-302937 [001] 256229.256612: xs_stream_read_request: peer=[SERVER]:2049 xid=0x6fba5607 copied=8320 reclen=8320 offset=8320 kworker/u4:1-304067 [001] 256229.256620: rpc_xdr_overflow: task:13314@8 nfsv3 READ requested=84 p=0xffff8fc0492339b8 end=0xffff8fc0492339c4 xdr=[0xffff8fc049233998,44]/8192/[0xffff8fc0492339c4,88]/8320 kworker/u5:0-302937 [001] 256229.257024: xs_stream_read_request: peer=[SERVER]:2049 xid=0x70ba5607 copied=4224 reclen=4224 offset=4224 kworker/u4:1-304067 [001] 256229.257031: rpc_xdr_overflow: task:13315@8 nfsv3 READ requested=84 p=0xffff8fc0492339b8 end=0xffff8fc0492339c4 xdr=[0xffff8fc049233998,44]/4096/[0xffff8fc0492339c4,88]/4224 ```````````````````````````````````````````````````````````````````````````````` The buffer offset is initialized to 44 instead of 128 for all reads after the first read. ```````````````````````````````````````````````````````````````````````````````` [root@client-linux-1 ~]# trace-cmd report | grep -Ee 'rpc_xdr_reply_pages' kworker/u4:3-305402 [001] 256207.055989: rpc_xdr_reply_pages: task:13309@8 head=[0xffff8fc049236198,128] page=8192 tail=[0xffff8fc049236218,4] len=0 kworker/u4:1-304067 [001] 256229.256233: rpc_xdr_reply_pages: task:13314@8 head=[0xffff8fc049233998,44] page=8192 tail=[0xffff8fc0492339c4,88] len=0 kworker/u4:1-304067 [001] 256229.256673: rpc_xdr_reply_pages: task:13315@8 head=[0xffff8fc049233998,44] page=4096 tail=[0xffff8fc0492339c4,88] len=0 [root@client-linux-1 ~]# python3 -c "print(128+4)" 132 [root@client-linux-1 ~]# python3 -c "print(44+88)" 132 ```````````````````````````````````````````````````````````````````````````````` The rpc_xdr_overflow trace lines indicate a size of 84 which is the size of an fattr3 structure. ```````````````````````````````````````````````````````````````````````````````` [root@client-linux-1 ~]# trace-cmd report | grep -Ee rpc_xdr_overflow kworker/u4:1-304067 [001] 256229.256620: rpc_xdr_overflow: task:13314@8 nfsv3 READ requested=84 p=0xffff8fc0492339b8 end=0xffff8fc0492339c4 xdr=[0xffff8fc049233998,44]/8192/[0xffff8fc0492339c4,88]/8320 kworker/u4:1-304067 [001] 256229.257031: rpc_xdr_overflow: task:13315@8 nfsv3 READ requested=84 p=0xffff8fc0492339b8 end=0xffff8fc0492339c4 xdr=[0xffff8fc049233998,44]/4096/[0xffff8fc0492339c4,88]/4224 ```````````````````````````````````````````````````````````````````````````````` The head and tail buffers are contiguous and p is 12 bytes from the end of head, so the entire fattr3 is not present in head. ```````````````````````````````````````````````````````````````````````````````` >>> end=0xffff8fc0492339c4 >>> p=0xffff8fc0492339b8 >>> head_iov_base=0xffff8fc049233998 >>> head_iov_len=44 >>> tail_iov_base=0xffff8fc0492339c4 >>> end - p 12 >>> end == head_iov_base + head_iov_len True >>> end == tail_iov_base True >>> head_iov_base + head_iov_len == tail_iov_base True >>> p - head_iov_base 32 ```````````````````````````````````````````````````````````````````````````````` The head offset of 44 causes xdr_copy_to_scratch call rpc_xdr_overflow. ```````````````````````````````````````````````````````````````````````````````` net/sunrpc/xdr.c __be32 * xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) { __be32 *p; if (unlikely(nbytes == 0)) return xdr->p; if (xdr->p == xdr->end && !xdr_set_next_buffer(xdr)) <--- in this case xdr->p != xdr->end goto out_overflow; p = __xdr_inline_decode(xdr, nbytes); if (p != NULL) return p; return xdr_copy_to_scratch(xdr, nbytes); out_overflow: trace_rpc_xdr_overflow(xdr, nbytes); return NULL; } net/sunrpc/xdr.c static __be32 * __xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes) { unsigned int nwords = XDR_QUADLEN(nbytes); __be32 *p = xdr->p; __be32 *q = p + nwords; if (unlikely(nwords > xdr->nwords || q > xdr->end || q < p)) <--- nwords==21 and xdr->p is 3 words less than xdr->end so "q > xdr->end" is false return NULL; xdr->p = q; xdr->nwords -= nwords; return p; } net/sunrpc/xdr.c static __be32 *xdr_copy_to_scratch(struct xdr_stream *xdr, size_t nbytes) { __be32 *p; char *cpdest = xdr->scratch.iov_base; size_t cplen = (char *)xdr->end - (char *)xdr->p; if (nbytes > xdr->scratch.iov_len) <--- nbytes is 12 and scratch.iov_len is 0 for decode goto out_overflow; <--- this path is taken ... return xdr->scratch.iov_base; out_overflow: trace_rpc_xdr_overflow(xdr, nbytes); return NULL; } ```````````````````````````````````````````````````````````````````````````````` nfs3_xdr_enc_read3args uses `NFS3_readres_sz - NFS3_pagepad_sz` for the buffer offset which is sufficiently large for an AUTH_SYS read response with post op attributes. ```````````````````````````````````````````````````````````````````````````````` fs/nfs/nfs3xdr.c static void nfs3_xdr_enc_read3args(struct rpc_rqst *req, struct xdr_stream *xdr, const void *data) { const struct nfs_pgio_args *args = data; unsigned int replen = args->replen ? args->replen : <--- used for reads after first read NFS3_readres_sz - NFS3_pagepad_sz; <--- used for first read encode_read3args(xdr, args); rpc_prepare_reply_pages(req, args->pages, args->pgbase, args->count, replen); req->rq_rcv_buf.flags |= XDRBUF_READ; } ```````````````````````````````````````````````````````````````````````````````` The replen member is set in nfs3_proc_read_setup() based on a per mount point structure. ```````````````````````````````````````````````````````````````````````````````` fs/nfs/nfs3proc.c static void nfs3_proc_read_setup(struct nfs_pgio_header *hdr, struct rpc_message *msg) { msg->rpc_proc = &nfs3_procedures[NFS3PROC_READ]; hdr->args.replen = NFS_SERVER(hdr->inode)->read_hdrsize; } ```````````````````````````````````````````````````````````````````````````````` The read_hdrsize for the mount point is set by the first read response received from the NFSv3 server. If the first read response does not contain post op attributes then this value is 44. ```````````````````````````````````````````````````````````````````````````````` fs/nfs/nfs3proc.c static int nfs3_read_done(struct rpc_task *task, struct nfs_pgio_header *hdr) { struct inode *inode = hdr->inode; struct nfs_server *server = NFS_SERVER(inode); if (hdr->pgio_done_cb != NULL) return hdr->pgio_done_cb(task, hdr); if (nfs3_async_handle_jukebox(task, inode)) return -EAGAIN; if (task->tk_status >= 0 && !server->read_hdrsize) cmpxchg(&server->read_hdrsize, 0, hdr->res.replen); <--- nfs_invalidate_atime(inode); nfs_refresh_inode(inode, &hdr->fattr); return 0; } ```````````````````````````````````````````````````````````````````````````````` Forcing the read responses to not have post op attributes allows the read to succeed. ```````````````````````````````````````````````````````````````````````````````` [root@client-linux-1 ~]# date ; md5sum /tmp/mnt/file2.dat Fri Oct 29 18:55:41 UTC 2021 md5sum: /tmp/mnt/file2.dat: Input/output error # change server configuration [root@client-linux-1 ~]# date ; md5sum /tmp/mnt/file2.dat Fri Oct 29 18:56:06 UTC 2021 0829f71740aab1ab98b33eae21dee122 /tmp/mnt/file2.dat ````````````````````````````````````````````````````````````````````````````````