On Thu, Dec 12, 2019 at 10:29:02AM -0800, Linus Torvalds wrote:
> On Thu, Dec 12, 2019 at 9:52 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > 1. We could semi-sort the pages on the LRU list. If we know we're
> > going to remove a bunch of pages, we could take a batch of them off
> > the list, sort them and remove them in-order. This probably wouldn't
> > be terribly effective.
> 
> I don't think the sorting is relevant.
> 
> Once you batch things, you already would get most of the locality
> advantage in the cache if it exists (and the batch isn't insanely
> large so that one batch already causes cache overflows).
> 
> The problem - I suspect - is that we don't batch at all. Or rather,
> the "batching" does exist at a high level, but it's so high that
> there's just tons of stuff going on between single pages. It is at
> the shrink_page_list() level, which is pretty high up and basically
> does one page at a time with locking and a lot of tests for each
> page, and then we do "__remove_mapping()" (which does some more work)
> one at a time before we actually get to __delete_from_page_cache().
> 
> So it's "batched", but it's in a huge loop, and even at that huge
> loop level the batch size is fairly small. We limit it to
> SWAP_CLUSTER_MAX, which is just 32.
> 
> Thinking about it, that SWAP_CLUSTER_MAX may make sense in some other
> circumstances, but not necessarily in the "shrink clean inactive
> pages" thing. I wonder if we could just batch clean pages a _lot_
> more aggressively. Yes, our batching loop is still very big and it
> might not help at an L1 level, but it might help in the L2, at least.
> 
> In kswapd, when we have 28 GB of pages on the inactive list, a batch
> of 32 pages at a time is pretty small ;)

Yeah, that's pretty poor. I just read through it, and even if pages
are in order on the page list, they're not going to batch nicely.
It'd be nice to accumulate them and call delete_from_page_cache_batch(),
but we need to put shadow entries in to replace them, so we'd need a
variant of that which took two pagevecs.

> > 2. We could change struct page to point to the xa_node that holds
> > them. Looking up the page mapping would be page->xa_node->array and
> > then offsetof(i_pages) to get the mapping.
> 
> I don't think we have space in 'struct page', and I'm pretty sure we
> don't want to grow it. That's one of the more common data structures
> in the kernel.

Oh, I wasn't clear. I meant replace page->mapping with page->xa_node.
We could still get from page to mapping, but it would be an extra
dereference. I did say it was a _bad_ idea.
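
For the two-pagevec variant of delete_from_page_cache_batch() I mention
above, something like the below is what I have in mind. It's a completely
untested sketch: the function name is made up, I've shown the shadow
entries as a plain array rather than a second pagevec just to keep it
short, and the accounting calls are simply the static helpers the
existing delete_from_page_cache_batch() in mm/filemap.c already uses.

/* mm/filemap.c - sketch only, not even compile-tested */

/*
 * Like delete_from_page_cache_batch(), but store a caller-supplied
 * shadow entry in place of each page instead of erasing it.
 * shadows[i] is the shadow for pvec->pages[i].  Ignores THP tail
 * pages, which the real page_cache_delete_batch() has to cope with.
 */
void delete_from_page_cache_shadow_batch(struct address_space *mapping,
					 struct pagevec *pvec,
					 void **shadows)
{
	XA_STATE(xas, &mapping->i_pages, 0);
	unsigned long flags;
	int i;

	if (!pagevec_count(pvec))
		return;

	xa_lock_irqsave(&mapping->i_pages, flags);
	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		/* same per-page accounting the existing batch delete does */
		unaccount_page_cache_page(mapping, page);

		/* replace the page with its shadow, one store per page */
		xas_set(&xas, page->index);
		xas_store(&xas, shadows[i]);
		page->mapping = NULL;
	}
	/* shadows are value entries, so they count as exceptional */
	mapping->nrexceptional += pagevec_count(pvec);
	mapping->nrpages -= pagevec_count(pvec);
	xa_unlock_irqrestore(&mapping->i_pages, flags);

	for (i = 0; i < pagevec_count(pvec); i++)
		page_cache_free_page(mapping, pvec->pages[i]);
}

The caller side would then be shrink_page_list() accumulating clean
pages and the shadows from workingset_eviction() into the two vectors
and flushing them in one go under a single i_pages lock, instead of
taking and dropping it per page in __remove_mapping().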
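
And to spell out what I meant by the page->xa_node idea: struct xa_node
already has a back pointer to the xarray it belongs to, and i_pages is
embedded in struct address_space, so getting back to the mapping is one
container_of() away. Again just a sketch - there is no page->xa_node
today, and the helper name below is invented:

#include <linux/xarray.h>
#include <linux/fs.h>

/* sketch: recover the address_space from the node holding the page */
static inline struct address_space *node_to_mapping(struct xa_node *node)
{
	/* node->array points at the mapping's i_pages xarray */
	return container_of(node->array, struct address_space, i_pages);
}

The downsides are the obvious ones: every page_mapping() lookup pays
that extra pointer chase, and the node pointer has to be kept coherent
when the tree splits or merges nodes. Hence "a _bad_ idea".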