Re: releasing result pages in svc_xprt_release()

> On Jan 31, 2021, at 6:45 PM, NeilBrown <neilb@xxxxxxx> wrote:
> 
> On Fri, Jan 29 2021, Chuck Lever wrote:
> 
>>> On Jan 29, 2021, at 5:43 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>> 
>>> On Fri, Jan 29 2021, Chuck Lever wrote:
>>> 
>>>> Hi Neil-
>>>> 
>>>> I'd like to reduce the amount of page allocation that NFSD does,
>>>> and was wondering about the release and reset of pages in
>>>> svc_xprt_release(). This logic was added when the socket transport
>>>> was converted to use kernel_sendpage() back in 2002. Do you
>>>> remember why releasing the result pages is necessary?
>>>> 
>>> 
>>> Hi Chuck,
>>> as I recall, kernel_sendpage() (or sock->ops->sendpage() as it was
>>> then) takes a reference to the page and will hold that reference until
>>> the content has been sent and ACKed.  nfsd has no way to know when the
>>> ACK comes, so cannot know when the page can be re-used, so it must
>>> release the page and allocate a new one.
>>> 
>>> This is the price we pay for zero-copy, and I acknowledge that it is a
>>> real price.  I wouldn't be surprised if the trade-offs between
>>> zero-copy and single-copy change over time, and between different
>>> hardware.
>> 
>> Very interesting, thanks for the history! Two observations:
>> 
>> - I thought that without MSG_DONTWAIT, the sendpage operation would be
>> totally synchronous -- when the network layer was done with retransmissions,
>> it would unblock the caller. But that's likely a mistaken assumption
>> on my part. That could be why sendmsg is so much slower than sendpage
>> in this particular application.
>> 
> 
> On the "send" side, I think MSG_DONTWAIT is primarily about memory
> allocation.  sendmsg() can only return when the message is queued.  If
> it needs to allocate memory (or wait for space in a restricted queue),
> then MSG_DONTWAIT says "fail instead".  It certainly doesn't wait for
> successful xmit and ack.

Fair enough.


> On the "recv" side it is quite different of course.
> 
>> - IIUC, nfsd_splice_read() replaces anonymous pages in rq_pages with
>> actual page cache pages. Those of course cannot be used to construct
>> subsequent RPC Replies, so that introduces a second release requirement.
> 
> Yep.  I wonder if those pages are protected against concurrent updates
> ... so that a computed checksum will remain accurate.

That thought has been lingering in the back of my mind too. But the
server has used sendpage() for many years without a reported issue
(since RQ_SPLICE_OK was added).


>> So I have a way to make the first case unnecessary for RPC/RDMA. It
>> has a reliable Send completion mechanism. Sounds like releasing is
>> still necessary for TCP, though; maybe that could be done in the
>> xpo_release_rqst callback.
> 
> It isn't clear to me what particular cost you are trying to reduce.  Is
> handing a page back from RDMA to nfsd cheaper than nfsd calling
> alloc_page(), or do you hope to keep batches of pages together to avoid
> multi-page overheads, or is this about cache-hot pages, or ???

RDMA gives consumers a reliable indication that the NIC is done with
each page. There's really no need to cycle the page at all (except
for the splice case).
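
Roughly, the Send completion is the natural hook. A purely illustrative
sketch (reply_ctxt and the release/recycle helpers are made-up names;
only the ib_cqe/ib_wc plumbing is the standard RDMA core pattern):

#include <rdma/ib_verbs.h>
#include <linux/sunrpc/svc.h>

/* Hypothetical per-Send context; only the ib_cqe/ib_wc hookup comes
 * from the RDMA core, everything else is invented for illustration. */
struct reply_ctxt {
	struct ib_cqe	 rc_cqe;	/* send_wr.wr_cqe points here */
	struct page	*rc_pages[RPCSVC_MAXPAGES];
	unsigned int	 rc_page_count;
};

static void release_reply_pages(struct reply_ctxt *ctxt);	/* put_page() each page */
static void recycle_reply_pages(struct reply_ctxt *ctxt);	/* park pages for the next RPC */

/* Send completion: the NIC is provably finished with the reply pages,
 * so they can be reused directly instead of going back to the page
 * allocator. */
static void reply_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct reply_ctxt *ctxt =
		container_of(wc->wr_cqe, struct reply_ctxt, rc_cqe);

	if (wc->status != IB_WC_SUCCESS) {
		release_reply_pages(ctxt);
		return;
	}
	recycle_reply_pages(ctxt);
}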

I outline the savings below.


>> As far as nfsd_splice_read(), I had thought of moving those pages to
>> a separate array which would always be released. That would need to
>> deal with the transport requirements above.
>> 
>> If nothing else, I would like to add mention of these requirements
>> somewhere in the code too.
> 
> Strongly agree with that.
> 
>> 
>> What's your opinion?
> 
> To form a coherent opinion, I would need to know what that problem is.
> I certainly accept that there could be performance problems in releasing
> and re-allocating pages which might be resolved by batching, or by copying,
> or by better tracking.  But without knowing what hot-spot you want to
> cool down, I cannot think about how that fits into the big picture.
> So: what exactly is the problem that you see?

The problem is that each 1MB NFS READ, for example, hands 257 pages
back to the page allocator, then allocates another 257 pages. One page
is not terrible, but 510+ allocator calls for every RPC begin to get
costly.
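
For reference, the per-RPC cycle today looks roughly like this (a
simplification of svc_xprt_release() and svc_alloc_arg(); error and
signal handling are omitted, and the function name is made up):

#include <linux/mm.h>
#include <linux/sunrpc/svc.h>

static void per_rpc_page_churn(struct svc_rqst *rqstp, unsigned int pages)
{
	unsigned int i;

	/* After the reply is sent: hand every result page back. */
	while (rqstp->rq_next_page != rqstp->rq_respages) {
		struct page **pp = --rqstp->rq_next_page;

		if (*pp) {
			put_page(*pp);
			*pp = NULL;
		}
	}

	/* Before the next request: refill the array, one call per slot. */
	for (i = 0; i < pages; i++)
		if (!rqstp->rq_pages[i])
			rqstp->rq_pages[i] = alloc_page(GFP_KERNEL);
}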

Also, remember that both allocating and freeing a page involve an irqsave
spin lock -- that serializes all other page allocations, including
allocations done by other nfsd threads.

So I'd like to lower page allocator contention and the rate at which
IRQs are disabled and enabled when the NFS server becomes busy, as it
might with several 25 GbE NICs, for instance.

Holding onto the same pages means we can make better use of TLB
entries -- fewer TLB flushes are always a good thing.

I know that the network folks at Red Hat have been staring hard at
reducing memory allocation in the stack for several years. I recall
that Matthew Wilcox recently made similar improvements to the block
layer.

With the advent of 100GbE and Optane-like durable storage, the balance
of memory allocation cost to device latency has shifted so that
superfluous memory allocation is noticeable.


At first I thought of creating a page allocator API that could grab
or free an array of pages while taking the allocator locks only once.
But now I wonder if there are opportunities to avoid much of that page
allocator traffic altogether -- for instance, by holding onto pages the
transport is provably done with.
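
The batched idea would look something like this (purely hypothetical
interfaces -- the names are invented, nothing like this exists today):

#include <linux/mm.h>
#include <linux/sunrpc/svc.h>

/* Hypothetical batched calls: each would take the allocator locks once
 * for the whole array rather than once per page.  The alloc side fills
 * only the NULL slots and returns the number of slots now populated. */
unsigned int svc_alloc_pages_batch(gfp_t gfp, unsigned int nr,
				   struct page **array);
void svc_release_pages_batch(struct page **array, unsigned int nr);

/* Refilling rq_pages would then be one call per RPC instead of ~257: */
static bool svc_refill_arg_pages(struct svc_rqst *rqstp, unsigned int pages)
{
	return svc_alloc_pages_batch(GFP_KERNEL, pages,
				     rqstp->rq_pages) == pages;
}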


--
Chuck Lever






