Re: Why is NFSv4.2 READ_PLUS off for RDMA transport?

Chuck Lever <chuck.lever@xxxxxxxxxx> · Thu, 2 Jan 2025 10:38:18 -0500

On 1/2/25 9:32 AM, Cedric Blancher wrote:
> Good afternoon!
>
> Why is NFSv4.2 READ_PLUS off for RDMA transport?
>
> fs/nfs/nfs4client.c has this code:
>
>   void nfs4_server_set_init_caps(struct nfs_server *server)
>   {
>           /* Set the basic capabilities */
>   ...
>           if (server->nfs_client->cl_proto == XPRT_TRANSPORT_RDMA)
>                   server->caps &= ~NFS_CAP_READ_PLUS;
>
> Why?

It's complicated.

First, I refer you to Section 6 of RFC 8267:

   https://www.rfc-editor.org/rfc/rfc8267.html#section-6

That section specifies the data items that are permitted to use explicit RDMA
operations (RDMA Read and Write). Note that none of the fields in an
NFSv4.2 READ_PLUS result are permitted to use an explicit RDMA Write.
Therefore, the upper layer binding does not permit READ_PLUS to use
offloaded data transfer of specific data fields.

Why?

The way an NFS READ-like operation works on RDMA transports is that the
client registers the READ buffers with its NIC, then it tells the server
how to write into them (via an Rkey and offset).
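As a rough illustration of that handshake, here is a minimal sketch of what
the client advertises for a plain READ. The segment fields (handle, length,
offset) follow the RPC-over-RDMA version 1 XDR in RFC 8166. The struct name,
register_read_buffer(), and the fake Rkey are hypothetical stand-ins for
illustration, not the kernel's xprtrdma code.

```c
#include <stddef.h>
#include <stdint.h>

/* One RDMA segment: the Rkey ("handle"), the length of the
 * registered region, and the offset where it starts. */
struct rdma_segment {
	uint32_t handle;	/* Rkey the server uses for RDMA Write */
	uint32_t length;	/* number of bytes the server may write */
	uint64_t offset;	/* start of the registered region */
};

/* Hypothetical stand-in for registering a READ buffer with the NIC.
 * A real client obtains the Rkey from its RDMA provider; here the
 * caller simply passes one in. */
static struct rdma_segment
register_read_buffer(void *buf, uint32_t count, uint32_t fake_rkey)
{
	struct rdma_segment seg = {
		.handle = fake_rkey,
		.length = count,
		.offset = (uint64_t)(uintptr_t)buf,
	};
	return seg;
}
```

The point to notice is that for READ the client knows the requested count up
front, so it can describe the whole payload before it sends the request.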

Recall that for READ_PLUS, servers are allowed to return any number and
any mix of hole or data content segments. The client doesn't know in 
advance how the server will lay out the reply.

For NFS4_CONTENT_DATA, the client would have to know in advance that
the server planned to return data content segments (which might be
suitable for offloaded transfer). It would also have to know how many
of these segments there are and how large each one might be, so that
it can register a Write chunk for each returned data content segment.

We don't have NFS-over-telepathy yet, so the client can't know any of
this information in advance.
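To make the problem concrete, here is a small sketch that counts how many
Write chunks a given READ_PLUS reply would have needed. The enum values match
data_content4 in RFC 7862; struct content_seg and write_chunks_needed() are
simplified, hypothetical names, not kernel code.

```c
#include <stddef.h>
#include <stdint.h>

enum data_content4 {
	NFS4_CONTENT_DATA = 0,
	NFS4_CONTENT_HOLE = 1,
};

/* Simplified view of one READ_PLUS content segment. */
struct content_seg {
	enum data_content4 type;
	uint64_t offset;
	uint64_t length;
};

/* Number of Write chunks the client would have had to register
 * before sending the request: one per data content segment. */
static size_t
write_chunks_needed(const struct content_seg *segs, size_t n)
{
	size_t i, chunks = 0;

	for (i = 0; i < n; i++)
		if (segs[i].type == NFS4_CONTENT_DATA)
			chunks++;
	return chunks;
}
```

For the same requested byte range, a reply of one data segment needs one
chunk, a data/hole/data reply needs two, and an all-hole reply needs none.
The answer depends entirely on the reply, which the client only sees after
the chunks would already have had to be registered.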

For NFS4_CONTENT_HOLE, of course, offload doesn't make sense. There's
no data to transfer to the client.

Not only that, the client is responsible for expanding hole segments
itself. It has to zero its own memory.

With NFS/RDMA, we really want the server to do that work -- in other
words, it should write the zeroes into the client's memory so that the
client's host CPU doesn't have to bother with it. Getting the server
to fill in client memory is the point of using NFS/RDMA.
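A minimal sketch of the client-side work that implies, assuming a simplified
segment whose offset is relative to the read buffer rather than the file.
struct rp_segment and rp_expand() are hypothetical names used only for
illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified segment: a data segment carries bytes,
 * a hole segment carries only a length. */
struct rp_segment {
	int is_hole;
	size_t offset;		/* offset into the client's read buffer */
	size_t length;
	const void *data;	/* NULL for a hole */
};

/* Copy or zero-fill one segment into the client's read buffer.
 * Returns 0 on success, -1 if the segment does not fit. */
static int
rp_expand(unsigned char *buf, size_t buflen, const struct rp_segment *seg)
{
	if (seg->offset > buflen || seg->length > buflen - seg->offset)
		return -1;
	if (seg->is_hole)
		memset(buf + seg->offset, 0, seg->length);	/* client CPU zeroes */
	else
		memcpy(buf + seg->offset, seg->data, seg->length); /* client CPU copies */
	return 0;
}
```

Either branch has the client's host CPU touching every byte of the range,
which is exactly the work an offloaded RDMA Write would have avoided.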

----

So, based on this thinking, it was decided that for NFSv4.2 on RDMA
transports, the Linux NFS client won't emit READ_PLUS at all. This is
an implementation choice, not a spec requirement: of course READ_PLUS
can go over RDMA. But it won't be terribly efficient, because the CPUs
on both the client and the server will have to touch or move the file
content, which is what using RDMA tries to avoid.

HTH.

--
Chuck Lever