On Fri, Dec 01, 2023 at 12:18:44PM +1100, Dave Chinner wrote:
> On Thu, Nov 30, 2023 at 11:01:23AM -0800, Roman Gushchin wrote:
> > On Wed, Nov 29, 2023 at 10:21:49PM -0500, Kent Overstreet wrote:
> > > On Thu, Nov 30, 2023 at 11:09:42AM +0800, Qi Zheng wrote:
> > > > For non-bcachefs developers, who knows what those statistics mean?
> > >
> > > Ok, a simple question then:
> >
> > why can't you dump /proc/slabinfo after the OOM?
>
> Taken to its logical conclusion, we arrive at:
>
> OOM-kill doesn't need to output anything at all except for
> what it killed, because we can dump
> /proc/{mem,zone,vmalloc,buddy,slab}info after the OOM....
>
> As it is, even asking such a question shows that you haven't looked
> at the OOM kill output for a long time - it already reports the slab
> cache usage information for caches that are reclaimable.
>
> That is, if too much accounted slab cache based memory consumption
> is detected at OOM-kill, it will call dump_unreclaimable_slab() to
> dump all the SLAB_RECLAIM_ACCOUNT caches (i.e. those with shrinkers)
> to the console as part of the OOM-kill output.

You are right, I missed that, partially because most of the OOMs I had
to deal with recently were memcg OOMs. This changes my perspective on
Kent's patches: if we already dump this information, it might not be a
bad idea to do it more nicely. So I take my words back here.

> The problem Kent is trying to address is that this output *isn't
> sufficient to debug shrinker based memory reclaim issues*. It hasn't
> been for a long time, and so we've all got our own special debug
> patches and methods for checking that shrinkers are doing what they
> are supposed to. Kent is trying to formalise one of the more useful
> general methods for exposing that internal information when OOM
> occurs...
>
> Indeed, I can think of several uses for shrinker->to_text() output
> that are simply not possible right now.
>
> Any shrinker that does garbage collection on something that is not a
> pure slab cache (e.g. xfs buffer cache, xfs inode gc subsystem,
> graphics memory allocators, binder, etc) has no visibility of the
> actual memory being used by the subsystem in the OOM-kill output.
> This information isn't in /proc/slabinfo, it's not accounted by a
> SLAB_RECLAIM_ACCOUNT cache, and it's not accounted by anything in
> the core mm statistics.
>
> e.g. How does anyone other than an XFS expert know that the 500k of
> active xfs_buf handles in the slab cache actually pins 15GB of
> cached metadata allocated directly from the page allocator, not just
> the 150MB of slab cache the handles take up?
>
> Another example is that an inode can pin lots of heap memory (e.g.
> for in-memory extent lists) and that may not be freeable until the
> inode is reclaimed. So while the slab cache might not be excessively
> large, we might have a million inodes with a billion cumulative
> extents cached in memory, and it is the heap memory consumed by the
> cached extents that is consuming the 30GB of "missing" kernel memory
> that is causing OOM-kills to occur.
>
> How is a user or developer supposed to know when one of these
> situations has occurred given the current lack of memory usage
> introspection into subsystems?

What would be the proper solution to this problem from your point of
view? What functionality/API can mm provide to make the life of fs
developers better here?
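For concreteness, the kind of hook being discussed could look roughly
like the sketch below. The names, the exact signature and the fields
are my own illustration here, not taken from the actual patches:

/*
 * Hypothetical: let a shrinker describe, at OOM time, the memory it
 * actually pins, including memory /proc/slabinfo cannot see.
 */
struct shrinker {
	/* existing fields elided */
	void (*to_text)(struct seq_buf *out, struct shrinker *s);
};

/* e.g. an fs could report the page-allocator/heap memory it pins: */
static void example_shrinker_to_text(struct seq_buf *out,
				     struct shrinker *s)
{
	struct example_cache *c = s->private_data;

	/* locking and accuracy elided; OOM output is best-effort */
	seq_buf_printf(out, "cached objects:    %lu\n", c->nr_cached);
	seq_buf_printf(out, "heap bytes pinned: %lu\n", c->bytes_pinned);
}

The OOM path would then walk the registered shrinkers and append each
hook's output to the report it already emits.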
> These are the sorts of situations that shrinker->to_text() would
> allow us to enumerate when it is necessary (i.e. at OOM-kill). At
> any other time, it just doesn't matter, but when we're at OOM, having
> a mechanism to report somewhat accurate subsystem memory consumption
> would be very useful indeed.
>
> > Unlike anon memory, slab memory (fs caches in particular) should
> > not be heavily affected by killing some userspace task.
>
> Whether tasks get killed or not is completely irrelevant. The issue
> is that not all memory that is reclaimed by shrinkers is either pure
> slab cache memory or directly accounted as reclaimable to the mm
> subsystem....

My problem with the current OOM reporting infrastructure (and it's a
bit offtopic here) is that it's good for manually looking into these
reports, but not particularly great for automatic collection and
analysis at scale. So this is where I was coming from.

Thanks!
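P.S. To make "automatic collection and analysis" concrete: what would
help at scale is output built from stable key=value pairs rather than
free-form text. A purely invented example of what such a helper might
look like, not code from any existing patch:

/* emit one machine-parseable line per shrinker in the OOM report */
static void report_shrinker(struct seq_buf *out, const char *name,
			    unsigned long objects, unsigned long bytes)
{
	seq_buf_printf(out, "shrinker=%s objects=%lu pinned_bytes=%lu\n",
		       name, objects, bytes);
}

Fleet-wide collectors could then parse OOM reports mechanically
instead of maintaining per-subsystem regexes.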