NFSv3 client returns EIO for valid read responses with post op attributes when first NFSv3 read response for the mount does not have post op attributes

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



We have been examining NFSv3 read IO errors that exist in Ubuntu Azure 18.04, 20.04, 21.04 releases as well as RHEL 8.2.

The simplest reproduction involves reads of two files immediately after a mount.
The server is set to **not** return post op attributes for file1.dat.
The server is set to return post op attributes for file2.dat.

````````````````````````````````````````````````````````````````````````````````
[root@client-linux-1 ~]# date ; sudo mount -o vers=3,proto=tcp,mountproto=tcp ${SERVER_IP}:/1_1_1_0 /tmp/mnt
Fri Oct 29 17:41:36 UTC 2021

[root@client-linux-1 ~]# md5sum /tmp/mnt/file1.dat
0829f71740aab1ab98b33eae21dee122  /tmp/mnt/file1.dat

[root@client-linux-1 ~]# date ; md5sum /tmp/mnt/file2.dat
Fri Oct 29 17:42:12 UTC 2021
md5sum: /tmp/mnt/file2.dat: Input/output error
````````````````````````````````````````````````````````````````````````````````

This path seems to set a per mount buffer offset based on the size of the first read response after mount.
The offset based on a read with no post op attributes does not permit subsequent read replies with post op attributes to be decoded.

````````````````````````````````````````````````````````````````````````````````
fs/nfs/nfs3proc.c

static int nfs3_read_done(struct rpc_task *task, struct nfs_pgio_header *hdr)
{
        struct inode *inode = hdr->inode;
        struct nfs_server *server = NFS_SERVER(inode);

        if (hdr->pgio_done_cb != NULL)
                return hdr->pgio_done_cb(task, hdr);

        if (nfs3_async_handle_jukebox(task, inode))
                return -EAGAIN;

        if (task->tk_status >= 0 && !server->read_hdrsize)
                cmpxchg(&server->read_hdrsize, 0, hdr->res.replen);   <---

        nfs_invalidate_atime(inode);
        nfs_refresh_inode(inode, &hdr->fattr);
        return 0;
}
````````````````````````````````````````````````````````````````````````````````

Is this a known issue?

Detailed reproduction:

Azure Ubunutu 21.04 / Linux 5.11.0 client.

````````````````````````````````````````````````````````````````````````````````
[root@client-linux-1 ~]# uname -a
Linux client-linux-1 5.11.0-1020-azure #21-Ubuntu SMP Mon Oct 11 18:05:46 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
````````````````````````````````````````````````````````````````````````````````

Force NFSv3 read responses to not return post op attributes.

Start Linux NFS client trace.

````````````````````````````````````````````````````````````````````````````````
[root@client-linux-1 ~]# date ; trace-cmd record -e nfs -e sunrpc
Fri Oct 29 17:40:47 UTC 2021
Hit Ctrl^C to stop recording
````````````````````````````````````````````````````````````````````````````````

Start packet capture

````````````````````````````````````````````````````````````````````````````````
[root@client-linux-1 ~]# date ; tshark -n -w /tmp/read_eio.pcap -f "host ${SERVER_IP} and tcp port 2049"
Fri Oct 29 17:41:20 UTC 2021
Capturing on 'eth0'
````````````````````````````````````````````````````````````````````````````````

Mount NFSv3 server

````````````````````````````````````````````````````````````````````````````````
[root@client-linux-1 ~]# date ; sudo mount -o vers=3,proto=tcp,mountproto=tcp ${SERVER_IP}:/1_1_1_0 /tmp/mnt
Fri Oct 29 17:41:36 UTC 2021
````````````````````````````````````````````````````````````````````````````````

Trigger the first NFSv3 read for the mount point.

````````````````````````````````````````````````````````````````````````````````
[root@client-linux-1 ~]# md5sum /tmp/mnt/file1.dat
0829f71740aab1ab98b33eae21dee122  /tmp/mnt/file1.dat
````````````````````````````````````````````````````````````````````````````````

Allow NFSv3 read responses to return post op attributes.

Read second file.

````````````````````````````````````````````````````````````````````````````````
[root@client-linux-1 ~]# date ; md5sum /tmp/mnt/file2.dat
Fri Oct 29 17:42:12 UTC 2021
md5sum: /tmp/mnt/file2.dat: Input/output error
````````````````````````````````````````````````````````````````````````````````

Stop Linux NFS client trace

````````````````````````````````````````````````````````````````````````````````
[root@client-linux-1 ~]# date ; trace-cmd record -e nfs -e sunrpc
Fri Oct 29 17:40:47 UTC 2021
Hit Ctrl^C to stop recording
^CCPU0 data recorded at offset=0x866000
    24576 bytes in size
CPU1 data recorded at offset=0x86c000
    40960 bytes in size
````````````````````````````````````````````````````````````````````````````````

Stop packet capture

````````````````````````````````````````````````````````````````````````````````
[root@client-linux-1 ~]# date ; tshark -n -w /tmp/read_eio.pcap -f "host ${SERVER_IP} and tcp port 2049"
Fri Oct 29 17:41:20 UTC 2021
Capturing on 'eth0'
55 ^C
````````````````````````````````````````````````````````````````````````````````

The read response for the read of file1.dat is frame 42.
The read response for the read of file2.dat is frames 51 and 54.

````````````````````````````````````````````````````````````````````````````````
[root@client-linux-1 ~]# tshark -n -r /tmp/read_eio.pcap -Y 'nfs and (rpc.procedure==3 or rpc.procedure==6)' | sed -Ee "s/${CLIENT_IP}/CLIENT/g" | sed -Ee "s/${SERVER_IP}/SERVER/g"
   34 13.536263935 CLIENT → SERVER NFS 202 V3 LOOKUP Call, DH: 0x184e15f0/file1.dat
   36 13.536864241 SERVER → CLIENT NFS 310 V3 LOOKUP Reply (Call In 34), FH: 0x7b50f703
   41 13.537300844 CLIENT → SERVER NFS 206 V3 READ Call, FH: 0x7b50f703 Offset: 0 Len: 8192
   42 13.537913850 SERVER → CLIENT NFS 8306 V3 READ Reply (Call In 41) Len: 8192
   44 35.735882048 CLIENT → SERVER NFS 202 V3 LOOKUP Call, DH: 0x184e15f0/file2.dat
   45 35.736980657 SERVER → CLIENT NFS 310 V3 LOOKUP Reply (Call In 44), FH: 0xea75aa56
   50 35.737550062 CLIENT → SERVER NFS 206 V3 READ Call, FH: 0xea75aa56 Offset: 0 Len: 8192
   51 35.737796364 SERVER → CLIENT NFS 8390 V3 READ Reply (Call In 50) Len: 8192
   53 35.737986566 CLIENT → SERVER NFS 206 V3 READ Call, FH: 0xea75aa56 Offset: 0 Len: 4096
   54 35.738280668 SERVER → CLIENT NFS 4294 V3 READ Reply (Call In 53) Len: 4096
````````````````````````````````````````````````````````````````````````````````

The first NFSv3 read response after the mount for file1.dat does not contain post op attributes.
The subsequent NFSv3 read responses for file2.dat do contain post op attributes, however, the NFS client returns EIO to the application.

````````````````````````````````````````````````````````````````````````````````
[root@client-linux-1 ~]# tshark -n -r /tmp/read_eio.pcap -Y 'rpc.procedure==6 and rpc.msgtyp==1' -T fields -e frame.number -e nfs.attributes_follow
42      0
51      1
54      1

[root@client-linux-1 ~]# tshark -n -r /tmp/read_eio.pcap -Y 'rpc.procedure==6 and rpc.msgtyp==1' -V -O nfs | grep attributes_follow
        attributes_follow: no value (0)
        attributes_follow: value follows (1)
        attributes_follow: value follows (1)
````````````````````````````````````````````````````````````````````````````````

Determine the XID values for the NFSv3 read operations.

````````````````````````````````````````````````````````````````````````````````
[root@client-linux-1 ~]# tshark -n -r /tmp/read_eio.pcap -Y 'rpc.procedure==6 and rpc.msgtyp==1' -T fields -e frame.number -e rpc.xid
42      0x6cba5607
51      0x6fba5607
54      0x70ba5607
````````````````````````````````````````````````````````````````````````````````

The client trace for the read of file1.dat indicates a total size of the data size plus 44 bytes.
The client trace for the read of file2.dat indicates a total size of the data size plus 128 bytes.

````````````````````````````````````````````````````````````````````````````````
[root@client-linux-1 ~]# XIDS=$(tshark -n -r /tmp/read_eio.pcap -Y 'rpc.procedure==6 and rpc.msgtyp==1' -T fields -e rpc.xid)

[root@client-linux-1 ~]# for XID in ${XIDS} ; do trace-cmd report | grep xs_stream_read_request | grep -Ee ${XID} ; done |  sed -Ee "s/${CLIENT_IP}/CLIENT/g" | sed -Ee "s/${SERVER_IP}/SERVER/g"
    kworker/u5:0-302937 [001] 256207.056646: xs_stream_read_request: peer=[SERVER]:2049 xid=0x6cba5607 copied=8236 reclen=8236 offset=8236
    kworker/u5:0-302937 [001] 256229.256612: xs_stream_read_request: peer=[SERVER]:2049 xid=0x6fba5607 copied=8320 reclen=8320 offset=8320
    kworker/u5:0-302937 [001] 256229.257024: xs_stream_read_request: peer=[SERVER]:2049 xid=0x70ba5607 copied=4224 reclen=4224 offset=4224

[root@client-linux-1 ~]# python3 -c "print(8236 - 8192)"
44

[root@client-linux-1 ~]# python3 -c "print(8320 - 8192)"
128

[root@client-linux-1 ~]# python3 -c "print(4224 - 4096)"
128
````````````````````````````````````````````````````````````````````````````````

The number of bytes prior to the data is 128 when post op attributes are present.

````````````````````````````````````````````````````````````````````````````````
  1 rpc.xid
  2 rpc.msgtyp
  3 rpc.replystat
  4 rpc.auth.flavor
  5 rpc.auth.length
  6 rpc.state_accept
  7 nfs.status
  8 nfs.attributes_follow
  9 nfs.fattr3.type
 10 nfs.mode3
 11 nfs.fattr3.nlink
 12 nfs.fattr3.uid
 13 nfs.fattr3.gid
 14 nfs.fattr3.size
 15 nfs.fattr3.size
 16 nfs.fattr3.used
 17 nfs.fattr3.used
 18 nfs.specdata1
 19 nfs.specdata2
 20 nfs.fattr3.fsid
 21 nfs.fattr3.fsid
 22 nfs.fattr3.fileid
 23 nfs.fattr3.fileid
 24 nfs.atime.sec
 25 nfs.atime.nsec
 26 nfs.mtime.sec
 27 nfs.mtime.nsec
 28 nfs.ctime.sec
 29 nfs.ctime.nsec
 30 nfs.count3
 31 nfs.read.eof
 32 count
````````````````````````````````````````````````````````````````````````````````

The number of bytes prior to the data is 44 when post op attributes are not present.

````````````````````````````````````````````````````````````````````````````````
  1 rpc.xid
  2 rpc.msgtyp
  3 rpc.replystat
  4 rpc.auth.flavor
  5 rpc.auth.length
  6 rpc.state_accept
  7 nfs.status
  8 nfs.attributes_follow
  9 nfs.count3
 10 nfs.read.eof
 11 count
````````````````````````````````````````````````````````````````````````````````

The rpc_xdr_overflow trace lines occur for the two read responses that contain post op attributes.

````````````````````````````````````````````````````````````````````````````````
[root@client-linux-1 ~]# trace-cmd report | grep -Ee '(rpc_xdr_overflow|xs_stream_read_request)' | grep -B 1 -Ee 'rpc_xdr_overflow' | sed -Ee "s/${CLIENT_IP}/CLIENT/g" | sed -Ee "s/${SERVER_IP}/SERVER/g"
    kworker/u5:0-302937 [001] 256229.256612: xs_stream_read_request: peer=[SERVER]:2049 xid=0x6fba5607 copied=8320 reclen=8320 offset=8320
    kworker/u4:1-304067 [001] 256229.256620: rpc_xdr_overflow:     task:13314@8 nfsv3 READ requested=84 p=0xffff8fc0492339b8 end=0xffff8fc0492339c4 xdr=[0xffff8fc049233998,44]/8192/[0xffff8fc0492339c4,88]/8320
    kworker/u5:0-302937 [001] 256229.257024: xs_stream_read_request: peer=[SERVER]:2049 xid=0x70ba5607 copied=4224 reclen=4224 offset=4224
    kworker/u4:1-304067 [001] 256229.257031: rpc_xdr_overflow:     task:13315@8 nfsv3 READ requested=84 p=0xffff8fc0492339b8 end=0xffff8fc0492339c4 xdr=[0xffff8fc049233998,44]/4096/[0xffff8fc0492339c4,88]/4224
````````````````````````````````````````````````````````````````````````````````

The buffer offset is initialized to 44 instead of 128 for all reads after the first read.

````````````````````````````````````````````````````````````````````````````````
[root@client-linux-1 ~]# trace-cmd report | grep -Ee 'rpc_xdr_reply_pages'
    kworker/u4:3-305402 [001] 256207.055989: rpc_xdr_reply_pages:  task:13309@8 head=[0xffff8fc049236198,128] page=8192 tail=[0xffff8fc049236218,4] len=0
    kworker/u4:1-304067 [001] 256229.256233: rpc_xdr_reply_pages:  task:13314@8 head=[0xffff8fc049233998,44] page=8192 tail=[0xffff8fc0492339c4,88] len=0
    kworker/u4:1-304067 [001] 256229.256673: rpc_xdr_reply_pages:  task:13315@8 head=[0xffff8fc049233998,44] page=4096 tail=[0xffff8fc0492339c4,88] len=0

[root@client-linux-1 ~]# python3 -c "print(128+4)"
132

[root@client-linux-1 ~]# python3 -c "print(44+88)"
132
````````````````````````````````````````````````````````````````````````````````

The rpc_xdr_overflow trace lines indicate a size of 84 which is the size of an fattr3 structure.

````````````````````````````````````````````````````````````````````````````````
[root@client-linux-1 ~]# trace-cmd report | grep -Ee rpc_xdr_overflow
    kworker/u4:1-304067 [001] 256229.256620: rpc_xdr_overflow:     task:13314@8 nfsv3 READ requested=84 p=0xffff8fc0492339b8 end=0xffff8fc0492339c4 xdr=[0xffff8fc049233998,44]/8192/[0xffff8fc0492339c4,88]/8320
    kworker/u4:1-304067 [001] 256229.257031: rpc_xdr_overflow:     task:13315@8 nfsv3 READ requested=84 p=0xffff8fc0492339b8 end=0xffff8fc0492339c4 xdr=[0xffff8fc049233998,44]/4096/[0xffff8fc0492339c4,88]/4224
````````````````````````````````````````````````````````````````````````````````

The head and tail buffers are contiguous and p is 12 bytes from the end of head, so the entire fattr3 is not present in head.

````````````````````````````````````````````````````````````````````````````````
>>> end=0xffff8fc0492339c4
>>> p=0xffff8fc0492339b8
>>> head_iov_base=0xffff8fc049233998
>>> head_iov_len=44
>>> tail_iov_base=0xffff8fc0492339c4

>>> end - p
12

>>> end == head_iov_base + head_iov_len
True

>>> end == tail_iov_base
True

>>> head_iov_base + head_iov_len == tail_iov_base
True

>>> p - head_iov_base
32
````````````````````````````````````````````````````````````````````````````````

The head offset of 44 causes xdr_copy_to_scratch call rpc_xdr_overflow.

````````````````````````````````````````````````````````````````````````````````
net/sunrpc/xdr.c

__be32 * xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes)
{
        __be32 *p;

        if (unlikely(nbytes == 0))
                return xdr->p;
        if (xdr->p == xdr->end && !xdr_set_next_buffer(xdr))       <--- in this case xdr->p != xdr->end
                goto out_overflow;
        p = __xdr_inline_decode(xdr, nbytes);
        if (p != NULL)
                return p;
        return xdr_copy_to_scratch(xdr, nbytes);
out_overflow:
        trace_rpc_xdr_overflow(xdr, nbytes);
        return NULL;
}

net/sunrpc/xdr.c

static __be32 * __xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes)
{
        unsigned int nwords = XDR_QUADLEN(nbytes);
        __be32 *p = xdr->p;
        __be32 *q = p + nwords;

        if (unlikely(nwords > xdr->nwords || q > xdr->end || q < p))     <--- nwords==21 and xdr->p is 3 words less than xdr->end so "q > xdr->end" is false
                return NULL;
        xdr->p = q;
        xdr->nwords -= nwords;
        return p;
}

net/sunrpc/xdr.c

static __be32 *xdr_copy_to_scratch(struct xdr_stream *xdr, size_t nbytes)
{
        __be32 *p;
        char *cpdest = xdr->scratch.iov_base;
        size_t cplen = (char *)xdr->end - (char *)xdr->p;

        if (nbytes > xdr->scratch.iov_len)     <--- nbytes is 12 and scratch.iov_len is 0 for decode
                goto out_overflow;             <--- this path is taken
        ...
        return xdr->scratch.iov_base;
out_overflow:
        trace_rpc_xdr_overflow(xdr, nbytes);
        return NULL;
}
````````````````````````````````````````````````````````````````````````````````

nfs3_xdr_enc_read3args uses `NFS3_readres_sz - NFS3_pagepad_sz` for the buffer offset which is sufficiently large for an AUTH_SYS read response with post op attributes.

````````````````````````````````````````````````````````````````````````````````
fs/nfs/nfs3xdr.c

static void nfs3_xdr_enc_read3args(struct rpc_rqst *req,
                                   struct xdr_stream *xdr,
                                   const void *data)
{
        const struct nfs_pgio_args *args = data;
        unsigned int replen = args->replen ? args->replen :                       <--- used for reads after first read
                                             NFS3_readres_sz - NFS3_pagepad_sz;   <--- used for first read

        encode_read3args(xdr, args);
        rpc_prepare_reply_pages(req, args->pages, args->pgbase,
                                args->count, replen);
        req->rq_rcv_buf.flags |= XDRBUF_READ;
}


````````````````````````````````````````````````````````````````````````````````

The replen member is set in nfs3_proc_read_setup() based on a per mount point structure.

````````````````````````````````````````````````````````````````````````````````
fs/nfs/nfs3proc.c

static void nfs3_proc_read_setup(struct nfs_pgio_header *hdr,
                                 struct rpc_message *msg)
{
        msg->rpc_proc = &nfs3_procedures[NFS3PROC_READ];
        hdr->args.replen = NFS_SERVER(hdr->inode)->read_hdrsize;
}
````````````````````````````````````````````````````````````````````````````````

The read_hdrsize for the mount point is set by the first read response received from the NFSv3 server.
If the first read response does not contain post op attributes then this value is 44.

````````````````````````````````````````````````````````````````````````````````
fs/nfs/nfs3proc.c

static int nfs3_read_done(struct rpc_task *task, struct nfs_pgio_header *hdr)
{
        struct inode *inode = hdr->inode;
        struct nfs_server *server = NFS_SERVER(inode);

        if (hdr->pgio_done_cb != NULL)
                return hdr->pgio_done_cb(task, hdr);

        if (nfs3_async_handle_jukebox(task, inode))
                return -EAGAIN;

        if (task->tk_status >= 0 && !server->read_hdrsize)
                cmpxchg(&server->read_hdrsize, 0, hdr->res.replen);   <---

        nfs_invalidate_atime(inode);
        nfs_refresh_inode(inode, &hdr->fattr);
        return 0;
}
````````````````````````````````````````````````````````````````````````````````

Forcing the read responses to not have post op attributes allows the read to succeed.

````````````````````````````````````````````````````````````````````````````````
[root@client-linux-1 ~]# date ; md5sum /tmp/mnt/file2.dat
Fri Oct 29 18:55:41 UTC 2021
md5sum: /tmp/mnt/file2.dat: Input/output error

# change server configuration

[root@client-linux-1 ~]# date ; md5sum /tmp/mnt/file2.dat
Fri Oct 29 18:56:06 UTC 2021
0829f71740aab1ab98b33eae21dee122  /tmp/mnt/file2.dat
````````````````````````````````````````````````````````````````````````````````




[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux