On Mon, Apr 04, 2022 at 12:08:25PM -0700, Roman Gushchin wrote: > On Mon, Apr 04, 2022 at 11:09:48AM +1000, Dave Chinner wrote: > > i.e. the amount of work that shrinkers need to do in a periodic scan > > is largerly determined by the rate of shrinkable cache memory usage > > growth rather than memory reclaim priority as it is now. Hence there > > needs to be different high level "shrinker needs to do X amount of > > work" calculation for periodic reclaim than there is now. > > > > e.g. we calculate a rolling average of the size of the cache and a > > rate of change over a series of polling operations (i.e. calling > > ->scan_count) and then when sustained growth is detected we start > > trying to shrink the cache to limit the rate of growth of the cache. > > > > If the cache keeps growing, then it's objects are being repeatedly > > referenced and it *should* keep growing. If it's one-off objects > > that are causing the growth of the cache and so objects are being > > reclaimed by the shrinker, then matching the periodic shrink scan to > > the growth rate will substantially reduce the rate of growth of that > > cache. > > A clever idea! > > It seems like we need to add some stats to the list_lru API or maybe to > the shrinker API (and let list_lru to use it). > E.g. total/scanned/reclaimed, maybe with a time decay > > I'm also thinking about: > 1) adding a sysfs/debugfs interface to expose shrinkers current size and > statistics, with an ability to call into the reclaim manually. I've thought about it, too, and can see where it could be useful. However, when I consider the list_lru memcg integration, I suspect it becomes a "can't see the forest for the trees" problem. We're going to end up with millions of sysfs objects with no obvious way to navigate, iterate or search them if we just take the naive "sysfs object + stats per list_lru instance" approach. Also, if you look at commit 6a6b7b77cc0f mm: ("list_lru: transpose the array of per-node per-memcg lru lists") that went into 5.18-rc1, you'll get an idea of the amount of memory overhead just tracking the list_lru x memcg infrastructure consumes at scale: I had done a easy test to show the optimization. I create 10k memory cgroups and mount 10k filesystems in the systems. We use free command to show how many memory does the systems comsumes after this operation (There are 2 numa nodes in the system). +-----------------------+------------------------+ | condition | memory consumption | +-----------------------+------------------------+ | without this patchset | 24464 MB | +-----------------------+------------------------+ | after patch 1 | 21957 MB | <--------+ +-----------------------+------------------------+ | | after patch 10 | 6895 MB | | +-----------------------+------------------------+ | | after patch 12 | 4367 MB | | +-----------------------+------------------------+ | | The more the number of nodes, the more obvious the effect---+ If we now add sysfs objects and stats arrays to each of the list_lrus that we initiate even now on 5.18-rc1, we're going to massively blow out the memory footprint again. So, nice idea, but I'm not sure we can make it useful and not consume huge amounts of memory.... What might be more useful is a way of getting the kernel to tell us what the, say, 20 biggest slab caches are in the system and then provide a way to selectively shrink them. We generally don't really care about tiny caches, just the ones consuming all the memory. This isn't my idea - Kent has been looking at this because of how useless OOM kill output is for debugging slab cache based OOM triggers. See this branch for an example: https://evilpiepirate.org/git/bcachefs.git/log/?h=shrinker_to_text Even a sysfs entry that you echo a number into and it returns the "top N" largest LRU lists. echo -1 into it and it returns every single one if you want all the information. The whole "one sysfs file, one value" architecture falls completely apart when we might have to indexing *millions* of internal structures with many parameters per structure... > 2) formalizing a reference bit/counter API on the shrinkers level, so that > shrinker users can explicitly mark objects (re)-"activation". Not 100% certain what you are refering to here - something to do with active object rotation? Or an active/inactive list split with period demotion like we have for the page LRUs? Or workingset refault detection? Can you explain in more detail? > 3) _maybe_ we need to change the shrinkers API from a number of objects to > bytes, so that releasing a small number of large objects can compete with > a releasing on many small objects. But I'm not sure. I think I suggested something similar a long time ago. We have shrinkers that track things other than slab objects. e.g. IIRC the ttm graphics allocator shrinker tracks sets of pages and frees pages, not slab objects. The XFS buffer cache tracks variable sized objects, from 512 bytes to 64KB in length, so the amount of memory it frees is variable even if the number of handles it scans and reclaims is fixed and consistent. Other subsystems have different "non object" shrinker needs as well. The long and short of it is that two shrinkers might have the same object count, but one might free 10x the amount of memory than the other for the same amount of shrinking work. Being able to focus reclaim work on caches that can free a lot of memory much more quickly would be a great idea. It also means that a shrinker that scans a fragmented slab can keep going until a set number of slab pages have been freed, rather than a set number of slab objects. We can push reclaim of fragmented slab caches much harder when necessary if we are reclaiming by freed byte counts... So, yeah, byte-count based reclaim definitely has merit compared to what we currently do. It's more generic and more flexible... > 4) also some shrinkers (e.g. shadow nodes) lying about the total size of > objects and I have an uneasy feeling about this approach. Yeah, never been a fan about use cases like that, but IIRC it was really the only option at the time to only trigger reclaim when the kernel was absolutely desperate because we really need the working set shadow nodes to do their job well when memory is very low... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx