On Wed, Jan 8, 2025 at 1:46 PM Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> wrote:
>
> Hi all,
>
> Not too long ago I took some time to investigate the possibility of
> scanning physical memory directly by traversing the memory map rather
> than the LRU linked list.
>
> This was inspired by a post from Matthew [0] wherein he demonstrated
> just how significant the difference is between traversing arrays of
> contiguous data on a modern system vs. the almost worst-case scenario
> of traversing a linked list.
>
> I tested how this might look by implementing code which simply
> traverses and filters the memory map for LRU pages, simplifying as
> much as possible.
>
> However, no matter which machine I tried (ranging from 16 GB to
> 192 GB), whether virtualised or on real hardware, the results were
> unfortunately disappointing - having to scan such a large range of
> memory resulted in performance significantly worse than a typical LRU
> scan at low memory utilisation, and at best matching LRU scanning at
> high memory utilisation (simulating higher memory pressure).
>
> There are a number of factors at play here, and perhaps the shrinkage
> of struct page (allowing for denser placement in cache lines), or an
> improved algorithm, might lead to more promising results.
>
> Having discussed this with Matthew, he suggested I put forward a
> proposal to discuss this area, so that we can learn from this should
> the approach prove unworkable, or perhaps determine whether there is
> something here we might still salvage.
>
> I intend to do some more research and generate more specific numbers
> (feel free to give feedback here) before LSF so we have something
> more concrete to talk about.
>
> I always envisioned this approach being somehow integrated with
> MGLRU, and I wonder whether some hybrid of this approach and the
> MGLRU one might make sense, which could also be another area of
> discussion.

When I read this proposal, the first thing that came to mind was memcg
reclaim. Scanning all of physical memory to find any reclaim candidate
already seems inefficient; having to find candidates that belong to a
specific memcg seems even more so. We'll also probably do a lot of
repeated scanning as we iterate memcgs. The per-memcg per-node LRUs
save us from this.

Do you have an idea about handling memcg reclaim efficiently? A rough
sketch of the contrast I have in mind is below.
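This is only an illustrative sketch, not real kernel code: the helpers
scan_memcg_physical() and scan_memcg_lru() are made-up names, and
locking, folio isolation and accounting are all omitted. The only
interfaces assumed are the existing pfn_valid(), pfn_to_page(),
page_folio(), folio_memcg(), PageLRU(), mem_cgroup_lruvec() and
for_each_evictable_lru().

#include <linux/memcontrol.h>
#include <linux/mm_inline.h>
#include <linux/mmzone.h>
#include <linux/page-flags.h>

/*
 * Illustration only: a memcg-targeted physical scan would have to
 * visit every pfn in the node and discard everything not charged to
 * the memcg under reclaim.
 */
static void scan_memcg_physical(struct mem_cgroup *memcg,
                                struct pglist_data *pgdat)
{
        unsigned long end = pgdat->node_start_pfn + pgdat->node_spanned_pages;
        unsigned long pfn;

        for (pfn = pgdat->node_start_pfn; pfn < end; pfn++) {
                struct page *page;

                if (!pfn_valid(pfn))
                        continue;
                page = pfn_to_page(pfn);
                if (!PageLRU(page))
                        continue;
                /* For a small memcg almost every page fails this check. */
                if (folio_memcg(page_folio(page)) != memcg)
                        continue;
                /* ... consider the folio for reclaim ... */
        }
}

/*
 * Today the per-memcg per-node lruvec means we only ever touch folios
 * already charged to this memcg on this node.
 */
static void scan_memcg_lru(struct mem_cgroup *memcg,
                           struct pglist_data *pgdat)
{
        struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
        struct folio *folio;
        enum lru_list lru;

        for_each_evictable_lru(lru) {
                list_for_each_entry(folio, &lruvec->lists[lru], lru) {
                        /* ... consider the folio for reclaim ... */
                }
        }
}

The physical walk would also have to be repeated for every memcg we
iterate, whereas each lruvec walk only ever sees that memcg's folios.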