> On Feb 1, 2021, at 6:27 PM, NeilBrown <neilb@xxxxxxx> wrote:
> 
> On Mon, Feb 01 2021, Chuck Lever wrote:
> 
>>> On Jan 31, 2021, at 6:45 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>> 
>>> On Fri, Jan 29 2021, Chuck Lever wrote:
>>>> 
>>>> What's your opinion?
>>> 
>>> To form a coherent opinion, I would need to know what that problem is.
>>> I certainly accept that there could be performance problems in releasing
>>> and re-allocating pages which might be resolved by batching, or by copying,
>>> or by better tracking. But without knowing what hot-spot you want to
>>> cool down, I cannot think about how that fits into the big picture.
>>> So: what exactly is the problem that you see?
>> 
>> The problem is that each 1MB NFS READ, for example, hands 257 pages
>> back to the page allocator, then allocates another 257 pages. One page
>> is not terrible, but 510+ allocator calls for every RPC begins to get
>> costly.
>> 
>> Also, remember that both allocating and freeing a page means an irqsave
>> spin lock -- that will serialize all other page allocations, including
>> allocation done by other nfsd threads.
>> 
>> So I'd like to lower page allocator contention and the rate at which
>> IRQs are disabled and enabled when the NFS server becomes busy, as it
>> might with several 25 GbE NICs, for instance.
>> 
>> Holding onto the same pages means we can make better use of TLB
>> entries -- fewer TLB flushes is always a good thing.
>> 
>> I know that the network folks at Red Hat have been staring hard at
>> reducing memory allocation in the stack for several years. I recall
>> that Matthew Wilcox recently made similar improvements to the block
>> layer.
>> 
>> With the advent of 100GbE and Optane-like durable storage, the balance
>> of memory allocation cost to device latency has shifted so that
>> superfluous memory allocation is noticeable.
>> 
>> 
>> At first I thought of creating a page allocator API that could grab
>> or free an array of pages while taking the allocator locks once. But
>> now I wonder if there are opportunities to reduce the amount of page
>> allocator traffic.
> 
> Thanks. This helps me a lot.
> 
> I wonder if there is some low-hanging fruit here.
> 
> If I read the code correctly (which is not certain, but what I see does
> seem to agree with vague memories of how it all works), we currently do
> a lot of wasted alloc/frees for zero-copy reads.
> 
> We allocate lots of pages and store the pointers in ->rq_respages
> (i.e. ->rq_pages)
> Then nfsd_splice_actor frees many of those pages and
> replaces the pointers with pointers to page-cache pages. Then we release
> those page-cache pages.
> 
> We need to have allocated them, but we don't need to free them.
> We can add some new array for storing them, have nfsd_splice_actor move
> them to that array, and have svc_alloc_arg() move pages back from the
> store rather than re-allocating them.
> 
> Or maybe something even more sophisticated where we only move them out
> of the store when we actually need them.
> 
> Having the RDMA layer return pages when they are finished with might
> help. You might even be able to use atomics (cmpxchg) to handle the
> contention. But I'm not convinced it would be worth it.
> 
> I *really* like your idea of a batch-API for page-alloc and page-free.
> This would likely be useful for other users, and it would be worth
> writing more code to get peak performance - things such as per-cpu
> queues of returned pages and so-forth (which presumably already exist).

Baby steps. Because I'm perverse, I started with bulk page freeing.

In the course of trying to invent a new API, I discovered there already
is a batch free_page() function called release_pages(). It seems to work
as advertised for pages that are truly no longer in use (i.e. RPC/RDMA
pages), but not for pages that are released while still in flight
(i.e. TCP pages).

release_pages() chains the pages in the passed-in array onto a list by
their page->lru fields. That seems to be a problem if a page is still
in use.
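
(For concreteness, the call pattern I have in mind looks roughly like
the untested sketch below. The svc_rqst fields are the real ones; the
helper name is just illustrative, and for the page->lru reason above
this is only safe when the pages really are finished with -- the
RPC/RDMA case, not the TCP case.)

    #include <linux/mm.h>           /* put_page() */
    #include <linux/pagemap.h>      /* release_pages() */
    #include <linux/sunrpc/svc.h>   /* struct svc_rqst */

    /*
     * Illustrative only: hand all of an svc_rqst's response pages back
     * in a single release_pages() call instead of one put_page() per
     * page, so the allocator locks can be taken once for the whole
     * array rather than 257 times for a 1MB READ.
     */
    static void svc_release_respages_sketch(struct svc_rqst *rqstp)
    {
            int nr = rqstp->rq_next_page - rqstp->rq_respages;

            if (nr <= 0)
                    return;

            /*
             * Roughly the per-page loop this would replace:
             *
             *      while (nr--)
             *              put_page(rqstp->rq_respages[nr]);
             */
            release_pages(rqstp->rq_respages, nr);
    }
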
> I cannot be sure that the batch-API would be better than a focused API
> just for RDMA -> NFSD. But my guess is that it would be at least nearly
> as good, and would likely get a lot more eyes on the code.
> 
> NeilBrown

--
Chuck Lever