> On Feb 9, 2021, at 5:01 PM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Mon, Feb 08, 2021 at 05:50:51PM +0000, Chuck Lever wrote:
>>> We've been discussing how NFSD can more efficiently refill its
>>> receive buffers (currently alloc_page() in a loop; see
>>> net/sunrpc/svc_xprt.c::svc_alloc_arg()).
>
> I'm not familiar with the sunrpc architecture, but this feels like you're
> trying to optimise something that shouldn't exist. Ideally a write
> would ask the page cache for the pages that correspond to the portion
> of the file which is being written to. I appreciate that doesn't work
> well for, eg, NFS-over-TCP, but for NFS over any kind of RDMA, that
> should be possible, right?

(Note there is room for improvement for both transport types.)

Since you asked ;-) there are four broad categories of NFSD I/O:

1. Receive an ingress RPC message (typically a Call)
2. Read from a file
3. Write to a file
4. Send an egress RPC message (typically a Reply)

A server RPC transaction is some combination of these, usually 1, 2,
and 4; or 1, 3, and 4.

To do 1, the server allocates a set of order-0 pages to form a receive
buffer and a set of order-0 pages to form a send buffer. We want to
handle this with bulk allocation. The Call is then received into the
receive buffer pages.

The receive buffer pages typically stay with the nfsd thread for its
lifetime, but the send buffer pages do not. We want a buffer page size
that matches the page cache page size (see below) and is also small
enough that allocation is very unlikely to fail. The largest
transactions (NFS READ and WRITE) use up to 1MB worth of pages.

Category 2 can be done by copying the file's pages into the send
buffer pages, or it can be done via a splice. When a splice is done,
the send buffer pages allocated above are released first, before being
replaced in the buffer with the file's pages.

Category 3 is currently done only by copying receive buffer pages to
file pages. Pages are neither allocated nor released by this category
of I/O. There are various reasons for this, but it's an area that
could stand some attention.

Sending (category 4) passes the send buffer pages to
kernel_sendpage(), which bumps the page count on them. When
sendpage() returns, the server does a put_page() on all of those
pages, then goes back to category 1 to replace the consumed send
buffer pages. When the network layer is finished with the pages, it
releases them. There are two reasons I can see for handling the send
buffer this way:

1. A network send isn't complete until the server gets an ACK from the
   client. This can take a while. I'm not aware of a TCP-provided
   mechanism to indicate when the ACK has arrived, so the server can't
   safely re-use those pages. (RDMA has an affirmative send completion
   event that we can use to reduce send buffer churn.)

2. If a splice was done, the send buffer pages that are also file
   pages can't be re-used for the next RPC send buffer, because
   overwriting their content would corrupt the file. Thus they must
   also be released and replaced.

--
Chuck Lever
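
To make the category 1 refill a little more concrete, here is a rough
sketch of what replacing the alloc_page() loop with a bulk refill
could look like. It assumes an array-based bulk allocator along the
lines of what has been discussed (spelled alloc_pages_bulk_array()
here, filling only the NULL slots of the array and returning the
number of populated slots); the helper name, error handling, and exact
semantics are illustrative only, not the actual svc_alloc_arg() code.

#include <linux/gfp.h>          /* GFP_KERNEL, bulk allocator */
#include <linux/sunrpc/svc.h>   /* struct svc_rqst, rq_pages */

/*
 * Illustrative sketch only, not the real svc_alloc_arg(): refill
 * rqstp->rq_pages with one bulk call instead of calling alloc_page()
 * once per empty slot.
 */
static int svc_refill_pages(struct svc_rqst *rqstp, unsigned int needed)
{
        unsigned long filled;

        /*
         * Assumed semantics: only NULL entries in rq_pages are
         * filled, and the return value is the number of slots that
         * are populated afterwards.
         */
        filled = alloc_pages_bulk_array(GFP_KERNEL, needed,
                                        rqstp->rq_pages);
        if (filled < needed) {
                /*
                 * Partial refill: the caller would sleep and retry
                 * (as the current alloc_page() loop does) rather than
                 * fail the RPC outright.
                 */
                return -ENOMEM;
        }
        return 0;
}

A single call like this amortizes the per-page allocator overhead for
the common case, which on a 4KB-page machine is up to roughly 256
order-0 pages for a 1MB READ or WRITE.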
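
And for the category 4 page lifetime, the pattern is roughly the
following. Again, this is a simplified illustration of the reference
counting described above, not the actual transport send path, and the
helper name is made up.

#include <linux/mm.h>       /* put_page() */
#include <linux/net.h>      /* struct socket, kernel_sendpage() */
#include <linux/socket.h>   /* MSG_MORE */

/*
 * Simplified illustration of the send-side page lifetime: the
 * transport hands a send buffer page to kernel_sendpage(), which
 * takes its own reference, so the server can drop its reference as
 * soon as the call returns and go allocate a fresh page for the
 * next Reply.
 */
static int svc_send_one_page(struct socket *sock, struct page *page,
                             int offset, size_t len)
{
        int ret;

        /* The network layer holds its own reference until it is
         * finished with the data. */
        ret = kernel_sendpage(sock, page, offset, len, MSG_MORE);

        /* Drop the server's reference; the page itself is freed only
         * once the network layer releases its reference. */
        put_page(page);

        return ret;
}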