> On Feb 9, 2021, at 5:01 PM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Mon, Feb 08, 2021 at 05:50:51PM +0000, Chuck Lever wrote:
>>> We've been discussing how NFSD can more efficiently refill its
>>> receive buffers (currently alloc_page() in a loop; see
>>> net/sunrpc/svc_xprt.c::svc_alloc_arg()).
>
> I'm not familiar with the sunrpc architecture, but this feels like you're
> trying to optimise something that shouldn't exist. Ideally a write
> would ask the page cache for the pages that correspond to the portion
> of the file which is being written to. I appreciate that doesn't work
> well for, eg, NFS-over-TCP, but for NFS over any kind of RDMA, that
> should be possible, right?

(Note there is room for improvement for both transport types.)

Since you asked ;-) there are four broad categories of NFSD I/O:

1. Receive an ingress RPC message (typically a Call)
2. Read from a file
3. Write to a file
4. Send an egress RPC message (typically a Reply)

A server RPC transaction is some combination of these, usually 1, 2,
and 4; or 1, 3, and 4.

To do 1, the server allocates a set of order-0 pages to form a receive
buffer and a set of order-0 pages to form a send buffer. We want to
handle this with bulk allocation. The Call is then received into the
receive buffer pages.

The receive buffer pages typically stay with the nfsd thread for its
lifetime, but the send buffer pages do not. We want a buffer page size
that matches the page cache page size (see below) and is also small
enough that allocation is very unlikely to fail. The largest
transactions (NFS READ and WRITE) use up to 1MB worth of pages.

Category 2 can be done by copying the file's pages into the send
buffer pages, or it can be done via a splice. When a splice is done,
the send buffer pages allocated above are released first, before being
replaced in the buffer with the file's pages.

Category 3 is currently done only by copying receive buffer pages to
file pages. Pages are neither allocated nor released by this category
of I/O. There are various reasons for this, but it's an area that
could stand some attention.

Sending (category 4) passes the send buffer pages to
kernel_sendpage(), which bumps the page count on them. When
sendpage() returns, the server does a put_page() on all of those
pages, then goes back to category 1 to replace the consumed send
buffer pages. When the network layer is finished with the pages, it
releases them. There are two reasons I can see for handling the send
buffer this way:

1. A network send isn't complete until the server gets an ACK from the
   client. This can take a while. I'm not aware of a TCP-provided
   mechanism to indicate when the ACK has arrived, so the server can't
   safely re-use those pages. (RDMA has an affirmative send completion
   event that we can use to reduce send buffer churn.)

2. If a splice was done, the send buffer pages that are also file
   pages can't be re-used for the next RPC send buffer, because
   overwriting their content would corrupt the file. Thus they must
   also be released and replaced.

--
Chuck Lever
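
To make the category 1 refill a little more concrete, here is a rough
sketch of what replacing the alloc_page() loop with a bulk refill
could look like. It assumes an array-based bulk allocator along the
lines of what has been discussed (spelled alloc_pages_bulk_array()
here, filling only the NULL slots of the array and returning the
number of populated slots); the helper name, error handling, and exact
semantics are illustrative only, not the actual svc_alloc_arg() code.

#include <linux/gfp.h>          /* GFP_KERNEL, bulk allocator */
#include <linux/sunrpc/svc.h>   /* struct svc_rqst, rq_pages */

/*
 * Illustrative sketch only, not the real svc_alloc_arg(): refill
 * rqstp->rq_pages with one bulk call instead of calling alloc_page()
 * once per empty slot.
 */
static int svc_refill_pages(struct svc_rqst *rqstp, unsigned int needed)
{
        unsigned long filled;

        /*
         * Assumed semantics: only NULL entries in rq_pages are
         * filled, and the return value is the number of slots that
         * are populated afterwards.
         */
        filled = alloc_pages_bulk_array(GFP_KERNEL, needed,
                                        rqstp->rq_pages);
        if (filled < needed) {
                /*
                 * Partial refill: the caller would sleep and retry
                 * (as the current alloc_page() loop does) rather than
                 * fail the RPC outright.
                 */
                return -ENOMEM;
        }
        return 0;
}

A single call like this amortizes the per-page allocator overhead for
the common case, which on a 4KB-page machine is up to roughly 256
order-0 pages for a 1MB READ or WRITE.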
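
And for the category 4 page lifetime, the pattern is roughly the
following. Again, this is a simplified illustration of the reference
counting described above, not the actual transport send path, and the
helper name is made up.

#include <linux/mm.h>       /* put_page() */
#include <linux/net.h>      /* struct socket, kernel_sendpage() */
#include <linux/socket.h>   /* MSG_MORE */

/*
 * Simplified illustration of the send-side page lifetime: the
 * transport hands a send buffer page to kernel_sendpage(), which
 * takes its own reference, so the server can drop its reference as
 * soon as the call returns and go allocate a fresh page for the
 * next Reply.
 */
static int svc_send_one_page(struct socket *sock, struct page *page,
                             int offset, size_t len)
{
        int ret;

        /* The network layer holds its own reference until it is
         * finished with the data. */
        ret = kernel_sendpage(sock, page, offset, len, MSG_MORE);

        /* Drop the server's reference; the page itself is freed only
         * once the network layer releases its reference. */
        put_page(page);

        return ret;
}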