On Mon, 4 Apr 2022 11:09:48 +1000 Dave Chinner wrote:
> On Sun, Apr 03, 2022 at 08:56:18AM +0800, Hillf Danton wrote:
> > On Sat, 2 Apr 2022 10:54:36 -0700 Roman Gushchin wrote:
> > > Hello Hillf!
> > >
> > Hello Roman,
> >
> > > Thank you for sharing it, really interesting! I'm actually working
> > > on the same problem.
> >
> > Good to know you have some interest in it.
> > Feel free to let me know if you would like to take it over, to avoid
> > repeating work on both sides.
> >
> > >
> > > No code to share yet, but here are some of my thoughts:
> > > 1) If there is a "natural" memory pressure, no additional slab
> > > scanning is needed.
> >
> > Agree - the periodic shrinker can be canceled once kswapd wakes up.
>
> I think we should be waking up per-node kswapd to do the periodic
> shrinking, not adding yet another way of executing (thousands of)
> shrinkers (across hundreds of nodes) from a single threaded context.

Kswapd is mainly responsible for maintaining the high watermark, which
is a different target from reclaiming cold slab objects - I am inclined
to stay a safe distance from any code churn in that area.

>
> Indeed, the problem of "no reclaim when there is no memory
> pressure" also affects the page cache, not just shrinkable caches
> and we really don't want periodic reclaim to have a completely
> different behaviour to general memory reclaim.

The page cache also has its flushers, another target for the periodic
shrinker to shun.

>
> i.e. the amount of work that shrinkers need to do in a periodic scan
> is largely determined by the rate of shrinkable cache memory usage
> growth rather than memory reclaim priority as it is now. Hence there

Those are two ways of saying the same thing - no slab objects get
shrunk in a periodic scan unless the object count climbs over some bar,
whether that bar is 1.5 meters high or five, and how many seconds or
minutes it takes to reach the bar does not matter.

The reclaim priority, though, is about trying to avoid a brunch of a
hundred Big Mac burgers.

> needs to be a different high level "shrinker needs to do X amount of
> work" calculation for periodic reclaim than there is now.

Is it a layering violation for kswapd to spend two weeks learning where
and how cold slab objects are piling up in xfs? Is any gc worker
scheduled before free pages fall below the high watermark?

>
> e.g. we calculate a rolling average of the size of the cache and a
> rate of change over a series of polling operations (i.e. calling
> ->scan_count) and then when sustained growth is detected we start
> trying to shrink the cache to limit the rate of growth of the cache.

Is that more difficult or easier than estimating the size of the
active file LRU?

>
> If the cache keeps growing, then its objects are being repeatedly
> referenced and it *should* keep growing. If it's one-off objects
> that are causing the growth of the cache and so objects are being
> reclaimed by the shrinker, then matching the periodic shrink scan to
> the growth rate will substantially reduce the rate of growth of that
> cache.

What is the xfs gc worker, if any, doing wrt detecting one-off objects?

>
> And if it's integrated into the existing do_shrink_slab
> calculations, the moment actual memory reclaim calls the shrinker
> the periodic scan calculations can be reset back to zero and it
> starts again...
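If I read the rolling average idea right, the high level code would keep
something like the sketch below per shrinker. Only for thoughts - the
names (periodic_state, periodic_scan_target) and the 1/8-weight moving
average are made up here for illustration, nothing below is from the RFC
or the tree.

	/*
	 * Illustrative only: per-shrinker bookkeeping for the rolling
	 * average idea. Names are invented for this sketch.
	 */
	struct periodic_state {
		unsigned long avg_count;	/* smoothed object count */
		unsigned long last_count;	/* count at the previous poll */
		unsigned int  grew;		/* consecutive growing polls */
	};

	/* cheap rolling average: EMA with 1/8 weight for the newest sample */
	static unsigned long ema_update(unsigned long avg, unsigned long sample)
	{
		return avg - (avg >> 3) + (sample >> 3);
	}

	/*
	 * Called once per periodic poll with the object count reported by
	 * the shrinker's count callback. Returns how much work this poll
	 * should ask for: nothing while the cache is stable, roughly the
	 * growth above the average once growth has been sustained.
	 */
	static unsigned long periodic_scan_target(struct periodic_state *ps,
						  unsigned long count)
	{
		unsigned long target = 0;

		ps->avg_count = ema_update(ps->avg_count, count);

		if (count > ps->last_count)
			ps->grew++;
		else
			ps->grew = 0;

		if (ps->grew >= 3 && count > ps->avg_count)
			target = count - ps->avg_count;

		ps->last_count = count;
		return target;
	}

	/*
	 * The moment real reclaim runs the shrinker, reset the periodic
	 * bookkeeping so the two paths do not double-account.
	 */
	static void periodic_reset(struct periodic_state *ps, unsigned long count)
	{
		ps->avg_count = count;
		ps->last_count = count;
		ps->grew = 0;
	}

do_shrink_slab() could then hand the returned target to ->scan_objects
exactly as it does for regular reclaim.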
>
> > > 2) From a power perspective it's better to scan more at once, but
> > > less often.
> >
> > The shrinker proposed is a catapult on the vmscan side without knowing
> > where the cold slab objects are piling up in Dave's backyard, but he is
> > free to take different actions than the regular shrinker - IOW this
> > shrinker alone does not make much sense wrt shooting six birds without
> > the stone on the slab owner side.
> >
> > It is currently scanning *every* slab cache at an arbitrary frequency,
> > once every 30 seconds - I am open to a minute or whatever.
>
> Sorry, I don't understand what "Dave's backyard" is or why it would
> ever need to be special cased?

How many xfs shrinkers are currently registered? Is it unusual for xfs
to have over a million slab objects, including two cold ones, without
kswapd being woken up?

>
> > > 3) Maybe we need a feedback loop with the slab allocator: e.g. if
> > > slabs are almost full there is more sense to do a proactive scanning
> > > and free up some memory, otherwise we'll end up allocating more
> > > slabs. But it's tricky.
> >
> > There are 31 bits available in the periodic flag added to shrink control.
> >
> > > 4) If the scanning is not resulting in any memory reclaim, maybe we
> > > should (temporarily) exclude the corresponding shrinker from the
> > > scanning.
> >
> > Given the periodic flag, Dave is free to ignore the scan request, and
> > the scan result is currently dropped on the vmscan side, because what
> > is considered is the cold slab objects that have, for instance, been
> > inactive for more than 30 seconds in every slab cache, rather than
> > kswapd's cake.
>
> I don't understand how passing a "periodic" flag to individual
> shrinkers is really useful here. How does the shrinker
> implementation use this to determine how much work it needs to do?

No dirty pages, no flush. Periodic means this is not the kswapd
bulldozer, and the shrinker is free to recycle cold objects with
sc->nr_to_scan ignored.

>
> i.e. The amount of work a shrinker needs to perform is calculated by
> the high level slab scanning code based on relative cache size and
> reclaim priority. If there's a periodic scanner, it should be
> calculating a specific amount of work for the shrinkers to do way up
> in do_shrink_slab() and then asking the shrinker to perform that
> work in exactly the way it does now - the shrinker itself doesn't
> need to know anything about whether it's a periodic memory reclaim
> scan or whether there's actual memory pressure - it just needs to
> scan the oldest objects in its cache to try to reclaim them.

More than fine - SHRINK_BATCH >> 2 can do the job if the shrinker
prefers nr_to_scan over priority and periodic.

Thanks
Hillf
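Only for thoughts, a sketch of the compromise above. The struct mirrors
shrink control instead of quoting the kernel definition, and the
periodic bit plus the cold-object helper are invented here for
illustration; none of this is the RFC patch itself.

	/* Mirrors of the relevant pieces, not the kernel definitions. */
	#define SKETCH_SHRINK_BATCH	128	/* stands in for SHRINK_BATCH */

	struct sketch_shrink_control {
		unsigned long nr_to_scan;	/* work asked for by vmscan */
		unsigned int periodic:1;	/* set only by the periodic scanner */
		unsigned int reserved:31;	/* the 31 spare bits mentioned above */
	};

	/*
	 * Stand-in for the cache owner's walk that frees up to nr objects
	 * which have been idle for, say, more than 30 seconds.
	 */
	static unsigned long reclaim_cold_objects(unsigned long nr)
	{
		/* cache-specific LRU walk elided in this sketch */
		return 0;
	}

	/*
	 * What a scan callback could do with the hint: a periodic call is
	 * not the kswapd bulldozer, so ignore nr_to_scan and only nibble
	 * at objects the cache itself considers cold; regular reclaim
	 * keeps its current behaviour.
	 */
	static unsigned long sketch_scan_objects(struct sketch_shrink_control *sc)
	{
		if (sc->periodic)
			return reclaim_cold_objects(SKETCH_SHRINK_BATCH >> 2);

		return reclaim_cold_objects(sc->nr_to_scan);
	}

If the shrinker would rather not look at the hint at all, the periodic
caller can simply pass nr_to_scan = SHRINK_BATCH >> 2 and the callback
stays exactly as it is today.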