In spite of https://meyerweb.com/eric/comment/chech.html I must pay homage to https://dl.acm.org/doi/10.1145/362929.362947

Modern CPUs are bad at walking linked lists. When you can issue 6 instructions per clock tick and an L3 cache miss takes 100ns, the ability of your CPU to predict your next memory reference and prefetch it as far ahead of time as possible is critical.

I had originally planned to write a missive pleading for a better data structure than a linked list for storing pages in an LRU. In my investigation, that turned out to be the wrong idea by a factor of about 3-4x. I now believe that _any_ virtual scanning is a mistake and that we should be scanning memory physically. And I have the code to back up that belief.

First, the code that simulates walking an LRU list vs an LRU array vs walking the physical array directly: https://www.infradead.org/~willy/linux/scan.c

If you're going to run it for yourself, I recommend running it with -v to check how long the physical scan takes; if it's too short, you may get unstable results. I like to specify -c 10 (which simulates 40GB of memory) to get more stable numbers. On my Core i7-1165G7 @ 2.8GHz, I get:

$ ./scan -c 10
Physical scan is 37.962013x faster than ListLRU
Physical scan is 3.847013x faster than ArrayLRU

That bounces around a bit, but generally it's 37-38x faster to do a physical scan than a ListLRU scan. I asked a few people to try it on their machines:

<ziy> 17.2x with -c 50 on E5-2650 V4 @ 2.20GHz
<ziy> 31.6x with -c 50 on Threadripper 3960X
<ziy> 17.7x on cortex-A72. basically raspberrypi4.
<ljs> on a coffee lake i9-9900k I see ~20 - 25 with a fair bit of variance with no -c specified, with -c 50 around ~28 with less variance

The relative speed of memory and CPU, along with the effectiveness and aggressiveness of the CPU prefetchers, will affect the exact numbers here, but you can see that modern machines all show an advantage.
I had intended to investigate using a radix tree or B-tree to store the LRU list, but a linear array of pointers to struct page should beat any data structure that is essentially a segmented array of pointers to struct page. And that is still about 3.8x slower than a physical scan, so I see no need to carry out the experiment of storing pages in an XArray, Maple Tree, or any other data structure.

Admittedly, this is a very crude benchmark. In no way does it model the various tests done on the pages/folios that are scanned. It doesn't allocate large folios; they're all single pages. There are probably half a dozen other things wrong with it. Nevertheless, a 4x speedup on LRU scan would be amazing. A 40x speedup would be stunning.

I don't have time to take on this project. I need to work on folios and other projects that I have started. I haven't done enough research to understand the various options we have for switching to a physical scan (and the downsides of those approaches!). I have no well-founded opinions on how we might go about this. Hopefully the potential win here is enough to motivate someone to want to work on this.