On Fri, Jan 29 2021, Chuck Lever wrote:

>> On Jan 29, 2021, at 5:43 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>
>> On Fri, Jan 29 2021, Chuck Lever wrote:
>>
>>> Hi Neil-
>>>
>>> I'd like to reduce the amount of page allocation that NFSD does,
>>> and was wondering about the release and reset of pages in
>>> svc_xprt_release(). This logic was added when the socket transport
>>> was converted to use kernel_sendpage() back in 2002. Do you
>>> remember why releasing the result pages is necessary?
>>>
>>
>> Hi Chuck,
>> as I recall, kernel_sendpage() (or sock->ops->sendpage() as it was
>> then) takes a reference to the page and will hold that reference
>> until the content has been sent and ACKed. nfsd has no way to know
>> when the ACK comes, so it cannot know when the page can be re-used,
>> and must instead release the page and allocate a new one.
>>
>> This is the price we pay for zero-copy, and I acknowledge that it is
>> a real price. I wouldn't be surprised if the trade-offs between
>> zero-copy and single-copy change over time, and between different
>> hardware.
>
> Very interesting, thanks for the history! Two observations:
>
> - I thought without MSG_DONTWAIT, the sendpage operation would be
> totally synchronous -- when the network layer was done with
> retransmissions, it would unblock the caller. But that's likely a
> mistaken assumption on my part. That could be why sendmsg is so much
> slower than sendpage in this particular application.
>

On the "send" side, I think MSG_DONTWAIT is primarily about memory
allocation. sendmsg() can only return when the message is queued. If
it needs to allocate memory (or wait for space in a restricted queue),
then MSG_DONTWAIT says "fail instead". It certainly doesn't wait for a
successful xmit and ack.
On the "recv" side it is quite different, of course.

> - IIUC, nfsd_splice_read() replaces anonymous pages in rq_pages with
> actual page cache pages. Those of course cannot be used to construct
> subsequent RPC Replies, so that introduces a second release
> requirement.

Yep. I wonder if those pages are protected against concurrent updates
... so that a computed checksum will remain accurate.

>
> So I have a way to make the first case unnecessary for RPC/RDMA. It
> has a reliable Send completion mechanism. Sounds like releasing is
> still necessary for TCP, though; maybe that could be done in the
> xpo_release_rqst callback.

It isn't clear to me what particular cost you are trying to reduce.
Is handing a page back from RDMA to nfsd cheaper than nfsd calling
alloc_page(), or do you hope to keep batches of pages together to
avoid multi-page overheads, or is this about cache-hot pages, or ???

>
> As far as nfsd_splice_read(), I had thought of moving those pages to
> a separate array which would always be released. That would need to
> deal with the transport requirements above.
>
> If nothing else, I would like to add mention of these requirements
> somewhere in the code too.

Strongly agree with that.

>
> What's your opinion?

To form a coherent opinion, I would need to know what the problem is.
I certainly accept that there could be performance problems in
releasing and re-allocating pages, which might be resolved by
batching, or by copying, or by better tracking. But without knowing
what hot-spot you want to cool down, I cannot think about how that
fits into the big picture.

So: what exactly is the problem that you see?

Thanks,
NeilBrown

>
>
> --
> Chuck Lever
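
For illustration, here is a minimal sketch of the release-and-
reallocate pattern described above. It is not the actual sunrpc code;
the function name send_reply_page() and the single-page "slot" are
simplified assumptions, but kernel_sendpage(), put_page(), and
alloc_page() are the real kernel APIs being discussed:

    #include <linux/net.h>
    #include <linux/mm.h>

    /*
     * Illustrative sketch only. Zero-copy send means the TCP stack
     * takes its own reference on the page and may hold it until the
     * data is ACKed, so the caller must not reuse the page for the
     * next reply. Instead it drops its reference and allocates a
     * fresh page.
     */
    static int send_reply_page(struct socket *sock, struct page **slot,
                               size_t len)
    {
            int ret;

            /* Zero-copy: the stack pins *slot with its own ref. */
            ret = kernel_sendpage(sock, *slot, 0, len, MSG_MORE);
            if (ret < 0)
                    return ret;

            /* We may not write to the old page again; drop our ref. */
            put_page(*slot);

            /* Replace it so the next request gets a writable page. */
            *slot = alloc_page(GFP_KERNEL);
            if (!*slot)
                    return -ENOMEM;
            return 0;
    }

A copying transport would instead call sendmsg() and keep the page,
which is exactly the zero-copy vs. single-copy trade-off the thread
is weighing.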