> On Feb 1, 2021, at 6:27 PM, NeilBrown <neilb@xxxxxxx> wrote:
> 
> On Mon, Feb 01 2021, Chuck Lever wrote:
> 
>>> On Jan 31, 2021, at 6:45 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>> 
>>> On Fri, Jan 29 2021, Chuck Lever wrote:
>>>> 
>>>> What's your opinion?
>>> 
>>> To form a coherent opinion, I would need to know what that problem is.
>>> I certainly accept that there could be performance problems in releasing
>>> and re-allocating pages which might be resolved by batching, or by copying,
>>> or by better tracking. But without knowing what hot-spot you want to
>>> cool down, I cannot think about how that fits into the big picture.
>>> So: what exactly is the problem that you see?
>> 
>> The problem is that each 1MB NFS READ, for example, hands 257 pages
>> back to the page allocator, then allocates another 257 pages. One page
>> is not terrible, but 510+ allocator calls for every RPC begins to get
>> costly.
>> 
>> Also, remember that both allocating and freeing a page means an irqsave
>> spin lock -- that will serialize all other page allocations, including
>> allocation done by other nfsd threads.
>> 
>> So I'd like to lower page allocator contention and the rate at which
>> IRQs are disabled and enabled when the NFS server becomes busy, as it
>> might with several 25 GbE NICs, for instance.
>> 
>> Holding onto the same pages means we can make better use of TLB
>> entries -- fewer TLB flushes is always a good thing.
>> 
>> I know that the network folks at Red Hat have been staring hard at
>> reducing memory allocation in the stack for several years. I recall
>> that Matthew Wilcox recently made similar improvements to the block
>> layer.
>> 
>> With the advent of 100GbE and Optane-like durable storage, the balance
>> of memory allocation cost to device latency has shifted so that
>> superfluous memory allocation is noticeable.
>> 
>> 
>> At first I thought of creating a page allocator API that could grab
>> or free an array of pages while taking the allocator locks once. But
>> now I wonder if there are opportunities to reduce the amount of page
>> allocator traffic.
> 
> Thanks. This helps me a lot.
> 
> I wonder if there is some low-hanging fruit here.
> 
> If I read the code correctly (which is not certain, but what I see does
> seem to agree with vague memories of how it all works), we currently do
> a lot of wasted alloc/frees for zero-copy reads.
> 
> We allocate lots of pages and store the pointers in ->rq_respages
> (i.e. ->rq_pages)
> Then nfsd_splice_actor frees many of those pages and
> replaces the pointers with pointers to page-cache pages. Then we release
> those page-cache pages.
> 
> We need to have allocated them, but we don't need to free them.
> We can add some new array for storing them, have nfsd_splice_actor move
> them to that array, and have svc_alloc_arg() move pages back from the
> store rather than re-allocating them.
> 
> Or maybe something even more sophisticated where we only move them out
> of the store when we actually need them.
> 
> Having the RDMA layer return pages when they are finished with might
> help. You might even be able to use atomics (cmpxchg) to handle the
> contention. But I'm not convinced it would be worth it.
> 
> I *really* like your idea of a batch-API for page-alloc and page-free.
> This would likely be useful for other users, and it would be worth
> writing more code to get peak performance - things such as per-cpu
> queues of returned pages and so-forth (which presumably already exist).

Baby steps. Because I'm perverse, I started with bulk page freeing.

In the course of trying to invent a new API, I discovered there already
is a batch free_page() function called release_pages(). It seems to work
as advertised for pages that are truly no longer in use (i.e. RPC/RDMA
pages), but not for pages that are released while still in flight
(i.e. TCP pages).

release_pages() chains the pages in the passed-in array onto a list by
their page->lru fields. That seems to be a problem if a page is still
in use.
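
(For concreteness, the call pattern I have in mind looks roughly like
the untested sketch below. The svc_rqst fields are the real ones; the
helper name is just illustrative, and for the page->lru reason above
this is only safe when the pages really are finished with -- the
RPC/RDMA case, not the TCP case.)

    #include <linux/mm.h>           /* put_page() */
    #include <linux/pagemap.h>      /* release_pages() */
    #include <linux/sunrpc/svc.h>   /* struct svc_rqst */

    /*
     * Illustrative only: hand all of an svc_rqst's response pages back
     * in a single release_pages() call instead of one put_page() per
     * page, so the allocator locks can be taken once for the whole
     * array rather than 257 times for a 1MB READ.
     */
    static void svc_release_respages_sketch(struct svc_rqst *rqstp)
    {
            int nr = rqstp->rq_next_page - rqstp->rq_respages;

            if (nr <= 0)
                    return;

            /*
             * Roughly the per-page loop this would replace:
             *
             *      while (nr--)
             *              put_page(rqstp->rq_respages[nr]);
             */
            release_pages(rqstp->rq_respages, nr);
    }
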
> I cannot be sure that the batch-API would be better than a focused API
> just for RDMA -> NFSD. But my guess is that it would be at least nearly
> as good, and would likely get a lot more eyes on the code.
> 
> NeilBrown

--
Chuck Lever