Re: releasing result pages in svc_xprt_release()

On Mon, Feb 01 2021, Chuck Lever wrote:

>> On Jan 31, 2021, at 6:45 PM, NeilBrown <neilb@xxxxxxx> wrote:
>> 
>> On Fri, Jan 29 2021, Chuck Lever wrote:
>>> 
>>> What's your opinion?
>> 
>> To form a coherent opinion, I would need to know what that problem is.
>> I certainly accept that there could be performance problems in releasing
>> and re-allocating pages which might be resolved by batching, or by copying,
>> or by better tracking.  But without knowing what hot-spot you want to
>> cool down, I cannot think about how that fits into the big picture.
>> So: what exactly is the problem that you see?
>
> The problem is that each 1MB NFS READ, for example, hands 257 pages
> back to the page allocator, then allocates another 257 pages. One page
> is not terrible, but 510+ allocator calls for every RPC begins to get
> costly.
>
> Also, remember that both allocating and freeing a page means an irqsave
> spin lock -- that will serialize all other page allocations, including
> allocation done by other nfsd threads.
>
> So I'd like to lower page allocator contention and the rate at which
> IRQs are disabled and enabled when the NFS server becomes busy, as it
> might with several 25 GbE NICs, for instance.
>
> Holding onto the same pages means we can make better use of TLB
> entries -- fewer TLB flushes is always a good thing.
>
> I know that the network folks at Red Hat have been staring hard at
> reducing memory allocation in the stack for several years. I recall
> that Matthew Wilcox recently made similar improvements to the block
> layer.
>
> With the advent of 100GbE and Optane-like durable storage, the balance
> of memory allocation cost to device latency has shifted so that
> superfluous memory allocation is noticeable.
>
>
> At first I thought of creating a page allocator API that could grab
> or free an array of pages while taking the allocator locks once. But
> now I wonder if there are opportunities to reduce the amount of page
> allocator traffic.

Thanks.  This helps me a lot.

I wonder if there is some low-hanging fruit here.

If I read the code correctly (which is not certain, but what I see does
seem to agree with vague memories of how it all works), we currently do
a lot of wasted alloc/frees for zero-copy reads.

We allocate lots of pages and store the pointers in ->rq_respages
(i.e. ->rq_pages).
Then nfsd_splice_actor frees many of those pages and
replaces the pointers with pointers to page-cache pages.  Then we release
those page-cache pages.

We need to have allocated them, but we don't need to free them.
We can add some new array for storing them, have nfsd_splice_actor move
them to that array, and have svc_alloc_arg() move pages back from the
store rather than re-allocating them.
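
Something like this minimal sketch, where every name (svc_page_stash,
svc_stash_page(), svc_stash_get()) is made up purely for illustration:

#include <linux/gfp.h>
#include <linux/mm.h>

#define SVC_STASH_MAX 260       /* roughly one max-sized reply's worth */

struct svc_page_stash {
        struct page     *pages[SVC_STASH_MAX];
        unsigned int    count;
};

/*
 * Called where nfsd_splice_actor currently frees the preallocated
 * page it is about to replace with a page-cache page.
 */
static void svc_stash_page(struct svc_page_stash *stash, struct page *page)
{
        if (stash->count < SVC_STASH_MAX)
                stash->pages[stash->count++] = page;
        else
                put_page(page);         /* stash full: free as we do today */
}

/*
 * Called from svc_alloc_arg() before falling back to the page
 * allocator.
 */
static struct page *svc_stash_get(struct svc_page_stash *stash)
{
        if (stash->count)
                return stash->pages[--stash->count];
        return alloc_page(GFP_KERNEL);
}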

Or maybe something even more sophisticated where we only move them out
of the store when we actually need them.

Having the RDMA layer return pages once it has finished with them might
help.  You might even be able to use atomics (cmpxchg) to handle the
contention.  But I'm not convinced it would be worth it.
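
If anyone did go that way, one shape it could take is a lock-free
push/drain pair -- every name below is hypothetical, and
<linux/llist.h> already implements this pattern more generally:

#include <linux/atomic.h>
#include <linux/mm.h>

/* Hypothetical stack of pages handed back by the transport. */
static struct page *svc_returned_pages;

/* Producer side (e.g. an RDMA completion): push one page. */
static void svc_return_page(struct page *page)
{
        struct page *head;

        do {
                head = READ_ONCE(svc_returned_pages);
                set_page_private(page, (unsigned long)head);
        } while (cmpxchg(&svc_returned_pages, head, page) != head);
}

/*
 * Consumer side (e.g. svc_alloc_arg()): take the whole chain in one
 * go.  Draining with xchg() avoids the ABA problem a per-page pop
 * would have; the caller walks the chain via page_private().
 */
static struct page *svc_take_returned_pages(void)
{
        return xchg(&svc_returned_pages, NULL);
}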

I *really* like your idea of a batch-API for page-alloc and page-free.
This would likely be useful for other users, and it would be worth
writing more code to get peak performance -- things such as per-cpu
queues of returned pages and so forth (which presumably already exist).
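
Just to show the shape I imagine -- these prototypes don't exist
anywhere, though release_pages() already gives us the free side of an
array-based interface:

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Hypothetical: allocate up to @count pages in one trip into the
 * allocator, returning how many were actually obtained.
 */
unsigned int alloc_pages_batch(gfp_t gfp, unsigned int count,
                               struct page **pages);

/* How svc_alloc_arg() might top up rqstp->rq_pages with it: */
static int svc_fill_pages(struct page **pages, unsigned int needed)
{
        unsigned int got = alloc_pages_batch(GFP_KERNEL, needed, pages);

        /* Fall back to single-page allocation for any shortfall. */
        while (got < needed) {
                pages[got] = alloc_page(GFP_KERNEL);
                if (!pages[got])
                        return -ENOMEM;
                got++;
        }
        return 0;
}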

I cannot be sure that the batch-API would be better than a focused API
just for RDMA -> NFSD.  But my guess is that it would be at least nearly
as good, and would likely get a lot more eyes on the code.

NeilBrown
