On Thu, Dec 12, 2019 at 10:29:02AM -0800, Linus Torvalds wrote:
> On Thu, Dec 12, 2019 at 9:52 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > 1. We could semi-sort the pages on the LRU list. If we know we're
> > going to remove a bunch of pages, we could take a batch of them off
> > the list, sort them and remove them in-order. This probably wouldn't
> > be terribly effective.
> 
> I don't think the sorting is relevant.
> 
> Once you batch things, you already would get most of the locality
> advantage in the cache if it exists (and the batch isn't insanely
> large so that one batch already causes cache overflows).
> 
> The problem - I suspect - is that we don't batch at all. Or rather,
> the "batching" does exist at a high level, but it's so high that
> there's just tons of stuff going on between single pages. It is at
> the shrink_page_list() level, which is pretty high up and basically
> does one page at a time with locking and a lot of tests for each
> page, and then we do "__remove_mapping()" (which does some more work)
> one at a time before we actually get to __delete_from_page_cache().
> 
> So it's "batched", but it's in a huge loop, and even at that huge
> loop level the batch size is fairly small. We limit it to
> SWAP_CLUSTER_MAX, which is just 32.
> 
> Thinking about it, that SWAP_CLUSTER_MAX may make sense in some other
> circumstances, but not necessarily in the "shrink clean inactive
> pages" thing. I wonder if we could just batch clean pages a _lot_
> more aggressively. Yes, our batching loop is still very big and it
> might not help at an L1 level, but it might help in the L2, at least.
> 
> In kswapd, when we have 28 GB of pages on the inactive list, a batch
> of 32 pages at a time is pretty small ;)

Yeah, that's pretty poor. I just read through it, and even if pages
are in order on the page list, they're not going to batch nicely.
It'd be nice to accumulate them and call delete_from_page_cache_batch(),
but we need to put shadow entries in to replace them, so we'd need a
variant of that which took two pagevecs.

> > 2. We could change struct page to point to the xa_node that holds
> > them. Looking up the page mapping would be page->xa_node->array and
> > then offsetof(i_pages) to get the mapping.
> 
> I don't think we have space in 'struct page', and I'm pretty sure we
> don't want to grow it. That's one of the more common data structures
> in the kernel.

Oh, I wasn't clear. I meant replace page->mapping with page->xa_node.
We could still get from page to mapping, but it would be an extra
dereference. I did say it was a _bad_ idea.
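
For the two-pagevec variant of delete_from_page_cache_batch() I mention
above, something like the below is what I have in mind. It's a completely
untested sketch: the function name is made up, I've shown the shadow
entries as a plain array rather than a second pagevec just to keep it
short, and the accounting calls are simply the static helpers the
existing delete_from_page_cache_batch() in mm/filemap.c already uses.

/* mm/filemap.c - sketch only, not even compile-tested */

/*
 * Like delete_from_page_cache_batch(), but store a caller-supplied
 * shadow entry in place of each page instead of erasing it.
 * shadows[i] is the shadow for pvec->pages[i].  Ignores THP tail
 * pages, which the real page_cache_delete_batch() has to cope with.
 */
void delete_from_page_cache_shadow_batch(struct address_space *mapping,
					 struct pagevec *pvec,
					 void **shadows)
{
	XA_STATE(xas, &mapping->i_pages, 0);
	unsigned long flags;
	int i;

	if (!pagevec_count(pvec))
		return;

	xa_lock_irqsave(&mapping->i_pages, flags);
	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		/* same per-page accounting the existing batch delete does */
		unaccount_page_cache_page(mapping, page);

		/* replace the page with its shadow, one store per page */
		xas_set(&xas, page->index);
		xas_store(&xas, shadows[i]);
		page->mapping = NULL;
	}
	/* shadows are value entries, so they count as exceptional */
	mapping->nrexceptional += pagevec_count(pvec);
	mapping->nrpages -= pagevec_count(pvec);
	xa_unlock_irqrestore(&mapping->i_pages, flags);

	for (i = 0; i < pagevec_count(pvec); i++)
		page_cache_free_page(mapping, pvec->pages[i]);
}

The caller side would then be shrink_page_list() accumulating clean
pages and the shadows from workingset_eviction() into the two vectors
and flushing them in one go under a single i_pages lock, instead of
taking and dropping it per page in __remove_mapping().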
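
And to spell out what I meant by the page->xa_node idea: struct xa_node
already has a back pointer to the xarray it belongs to, and i_pages is
embedded in struct address_space, so getting back to the mapping is one
container_of() away. Again just a sketch - there is no page->xa_node
today, and the helper name below is invented:

#include <linux/xarray.h>
#include <linux/fs.h>

/* sketch: recover the address_space from the node holding the page */
static inline struct address_space *node_to_mapping(struct xa_node *node)
{
	/* node->array points at the mapping's i_pages xarray */
	return container_of(node->array, struct address_space, i_pages);
}

The downsides are the obvious ones: every page_mapping() lookup pays
that extra pointer chase, and the node pointer has to be kept coherent
when the tree splits or merges nodes. Hence "a _bad_ idea".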