On Tue, Apr 26, 2022 at 04:02:19PM +1000, Dave Chinner wrote:
> On Fri, Apr 22, 2022 at 01:26:37PM -0700, Roman Gushchin wrote:
> > There are 50+ different shrinkers in the kernel, many with their own bells and
> > whistles. Under the memory pressure the kernel applies some pressure on each of
> > them in the order of which they were created/registered in the system. Some
> > of them can contain only few objects, some can be quite large. Some can be
> > effective at reclaiming memory, some not.
> >
> > The only existing debugging mechanism is a couple of tracepoints in
> > do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They aren't
> > covering everything though: shrinkers which report 0 objects will never show up,
> > there is no support for memcg-aware shrinkers. Shrinkers are identified by their
> > scan function, which is not always enough (e.g. hard to guess which super
> > block's shrinker it is having only "super_cache_scan").
>
> In general, I've had no trouble identifying individual shrinker
> instances because I'm always looking at individual subsystem
> shrinker tracepoints, too. Hence I've almost always got the
> identification information in the traces I need to trace just the
> individual shrinker tracepoints and a bit of sed/grep/awk and I've
> got something I can feed to gnuplot or a python script to graph...

You've spent a lot of time working on shrinkers in general and on xfs-specific
shrinkers in particular, no question about that. But imagine someone who is not
a core-mm developer and is adding a new shrinker.

> > They are a passive
> > mechanism: there is no way to call into counting and scanning of an individual
> > shrinker and profile it.
>
> IDGI. profiling shrinkers iunder ideal conditions when there isn't
> memory pressure is largely a useless exercise because execution
> patterns under memory pressure are vastly different.
>
> All the problems with shrinkers show up when progress cannot be made
> as fast as memory reclaim wants memory to be reclaimed. How do you
> trigger priority windup causing large amounts of deferred processing
> because shrinkers are running in GFP_NOFS/GFP_NOIO context? How do
> you simulate objects getting dirtied in memory so they can't be
> immediately reclaimed so the shrinker can't make any progress at all
> until IO completes? How do you simulate the unbound concurrency that
> direct reclaim can drive into the shrinkers that causes massive lock
> contention on shared structures and locks that need to be accessed
> to free objects?

These are valid points, and I assume we can find ways to emulate some of these
conditions, e.g. by allowing the scanning to be run in GFP_NOFS context. I
thought about it but decided to leave it for further improvements.

> IOWs, if all you want to do is profile shrinkers running in the
> absence of memory pressure, then you can do that perfectly well with
> the existing 'echo 2 > /proc/sys/vm/drop_caches' mechanism. We don't
> need some complex debugfs API just to profile the shrinker
> behaviour.

And then we need to somehow separate the shrinkers in the result? (A sketch of
what that involves today follows a bit further down.)

> So why do we need any of the complexity and potential for abuse that
> comes from exposing control of shrinkers directly to userspace like
> these patches do?

I feel like the added complexity is minimal (unlike slab's sysfs, for example).
If the config option is off (as it is by default), there is no additional risk
and no overhead either.
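To make the drop_caches point concrete, here is roughly what profiling via
drop_caches with the existing tracepoints looks like today. Treat it as a
sketch rather than a recipe; the paths and event names are those found on a
typical current kernel:

$ cd /sys/kernel/tracing
$ echo 1 > events/vmscan/mm_shrink_slab_start/enable
$ echo 1 > events/vmscan/mm_shrink_slab_end/enable
$ echo > trace
$ echo 2 > /proc/sys/vm/drop_caches
$ grep super_cache_scan trace | head

Every super block shrinker shows up here as "super_cache_scan", so it takes
extra correlation work (or subsystem-specific tracepoints) to tell which mount
each line belongs to.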
> > To provide a better visibility and debug options for memory shrinkers
> > this patchset introduces a /sys/kernel/debug/shrinker interface, to some extent
> > similar to /sys/kernel/slab.
>
> /sys/kernel/slab contains read-only usage information - it is
> analagous for visibility arguments, but it is not equivalent for
> the rest of the "active" functionality you want to add here....
>
> > For each shrinker registered in the system a directory is created. The directory
> > contains "count" and "scan" files, which allow to trigger count_objects()
> > and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers
> > count_memcg, scan_memcg, count_node, scan_node, count_memcg_node
> > and scan_memcg_node are additionally provided. They allow to get per-memcg
> > and/or per-node object count and shrink only a specific memcg/node.
>
> Great, but why does the shrinker introspection interface need active
> scan control functions like these?

It makes testing of (new) shrinkers easier. For instance, the shadow entries
shrinker hides its objects by returning a count of 0 most of the time (unless
the memory consumed exceeds a certain fraction of the total memory), so
"echo 2 > /proc/sys/vm/drop_caches" won't even trigger the scanning. (A sketch
of how the scan files help with this follows a bit further down.)

> > To make debugging more pleasant, the patchset also names all shrinkers,
> > so that debugfs entries can have more meaningful names.
> >
> > Usage examples:
> >
> > 1) List registered shrinkers:
> > $ cd /sys/kernel/debug/shrinker/
> > $ ls
> > dqcache-16 sb-cgroup2-30 sb-hugetlbfs-33 sb-proc-41 sb-selinuxfs-22 sb-tmpfs-40 sb-zsmalloc-19
> > kfree_rcu-0 sb-configfs-23 sb-iomem-12 sb-proc-44 sb-sockfs-8 sb-tmpfs-42 shadow-18
> > sb-aio-20 sb-dax-11 sb-mqueue-21 sb-proc-45 sb-sysfs-26 sb-tmpfs-43 thp_deferred_split-10
> > sb-anon_inodefs-15 sb-debugfs-7 sb-nsfs-4 sb-proc-47 sb-tmpfs-1 sb-tmpfs-46 thp_zero-9
> > sb-bdev-3 sb-devpts-28 sb-pipefs-14 sb-pstore-31 sb-tmpfs-27 sb-tmpfs-49 xfs_buf-37
> > sb-bpf-32 sb-devtmpfs-5 sb-proc-25 sb-rootfs-2 sb-tmpfs-29 sb-tracefs-13 xfs_inodegc-38
> > sb-btrfs-24 sb-hugetlbfs-17 sb-proc-39 sb-securityfs-6 sb-tmpfs-35 sb-xfs-36 zspool-34
>
> Ouch. That's not going to be useful for humans debugging a system as
> there's no way to cross reference a "superblock" with an actual
> filesystem mount point. Nor is there any way to reallly know that
> all the shrinkers in one filesystem are related.
>
> We normally solve this by ensuring that the fs related object has
> the short bdev name appended to them. e.g:
>
> $ pgrep xfs
> 1 I root 36 2 0 60 -20 - 0 - Apr19 ? 00:00:10 [kworker/0:1H-xfs-log/dm-3]
> 1 I root 679 2 0 60 -20 - 0 - Apr19 ? 00:00:00 [xfsalloc]
> 1 I root 680 2 0 60 -20 - 0 - Apr19 ? 00:00:00 [xfs_mru_cache]
> 1 I root 681 2 0 60 -20 - 0 - Apr19 ? 00:00:00 [xfs-buf/dm-1]
> .....
>
> Here we have a kworker process running log IO completion work on
> dm-3, two global workqueue rescuer tasks (alloc, mru) and a rescuer
> task for xfs-buf workqueue on dm-1.
>
> We need the same name discrimination for shrinker information here,
> too - just saying "this is an XFS superblock shrinker" is just not
> sufficient when there are hundreds of XFS mount points with a
> handful of shrinkers each.

Good point. I think it's doable, and I really like the idea.
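To make the shadow entries example above concrete, a sketch of how the scan
files could be used to exercise such a shrinker directly. The directory name is
taken from the listing above; whether "scan" takes a plain object count or
something richer is exactly the kind of detail still being discussed in this
thread, so treat the write format as an assumption:

$ cd /sys/kernel/debug/shrinker/shadow-18
$ cat count          # what the shrinker currently reports
$ echo 500 > scan    # ask it to scan up to 500 objects
$ cat count          # check whether anything was actually reclaimed

Nothing like this is reachable through drop_caches while count_objects() keeps
returning 0.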
> > 2) Get information about a specific shrinker:
> > $ cd sb-btrfs-24/
> > $ ls
> > count count_memcg count_memcg_node count_node scan scan_memcg scan_memcg_node scan_node
> >
> > 3) Count objects on the system/root cgroup level
> > $ cat count
> > 212
> >
> > 4) Count objects on the system/root cgroup level per numa node (on a 2-node machine)
> > $ cat count_node
> > 209 3
>
> So a single space separated line with a number per node?
>
> When you have a few hundred nodes and hundreds of thousands of objects per
> node, we overrun the 4kB page size with a single line. What then?

With the seq_buf API we don't have the 4kB limit, do we?

> > 5) Count objects for each memcg (output format: cgroup inode, count)
> > $ cat count_memcg
> > 1 212
> > 20 96
> > 53 817
> > 2297 2
> > 218 13
> > 581 30
> > 911 124
> > <CUT>
>
> What does "<CUT>" mean?

It just means that I've shortened the lengthy output; it's not part of the
original output.

> Also, this now iterates separate memcg per line. A parser now needs
> to know the difference between count/count_node and
> count_memcg/count_memcg_node because they are subtly different file
> formats. These files should have the same format, otherwise it just
> creates needless complexity.
>
> Indeed, why do we even need count/count_node? They are just the
> "index 1" memcg output, so are totally redundant.

Ok, but then we'd need a flag to indicate that a shrinker is memcg-aware?
Anyway, I got your point and I (partially) agree. Do you think we're fine with
just one interface and don't need an aggregation over nodes, i.e. just
count_memcg_node?

> > 6) Same but with a per-node output
> > $ cat count_memcg_node
> > 1 209 3
> > 20 96 0
> > 53 810 7
> > 2297 2 0
> > 218 13 0
> > 581 30 0
> > 911 124 0
> > <CUT>
>
> So now we have a hundred nodes in the machine and thousands of
> memcgs. And the information we want is in the numerically largest
> memcg that is last in the list. ANd we want to graph it's behaviour
> over time at high resolution (say 1Hz). Now we burn huge amounts
> of CPU counting memcgs that we don't care about and then throwing
> away most of the information. That's highly in-efficient and really
> doesn't scale.

For this case we can provide an interface which allows specifying both the node
and the memcg and getting just that single count. Personally I don't have a
machine with hundreds of nodes, so it's not on my radar. If you find it useful,
I'm happy to add it (in the meantime, see the P.S. below for how a single memcg
can be pulled out of the current output).

Thanks!

Roman
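P.S. A sketch of extracting one memcg's per-node counts from count_memcg_node
and mapping the cgroup inode back to a cgroup path. The inode value 53 is just
an example taken from the output above, and the cgroup2 hierarchy is assumed to
be mounted at /sys/fs/cgroup:

$ ino=53
$ awk -v ino=$ino '$1 == ino' count_memcg_node
53 810 7
$ find /sys/fs/cgroup -xdev -inum $ino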