Frustration when debugging OOMs, memory usage, and memory reclaim behaviour is a topic I think a lot of us can relate to. I think it might be worth having a talk to collectively air our frustrations and collect ideas for improvements.

To start with: on memory allocation failure or OOM, we currently don't have a lot to go on. We get information about the allocation that failed, and only very coarse-grained information about how memory is being tied up - page-granular information, aka show_mem(), is nigh useless in most situations, and slab-granular information is only slightly better.

I have a couple of ideas I want to float:

 - An old idea I've had and mentioned to some people before is to steal dynamic debug's trick of statically allocating tracking structs in a special ELF section, and use it to wrap kmalloc(), alloc_pages() etc. calls for memory allocation tracking _per call site_, with the results then available in debugfs, broken out by file and line number.

   This would be cheap enough that it could be always on in production, unlike doing the same sort of thing with tracepoints.

   The cost would be another pointer of overhead for each allocation. For page allocations we've got CONFIG_PAGE_OWNER, which does something like this (in a much more expensive fashion), and the pointer it uses could be repurposed. For slub/slab I think something analogous exists, but last I looked it'd probably need help from those developers (in both cases, really; mm code is hairy).

 - In bcachefs, I've been evolving a 'printbuf' thingy - heap-allocated strings that you can pass around and append to. They make it really convenient to write pretty-printers for lots of things and pass them around, which in turn has made my life considerably easier in the debugging realm.
I think that could be useful here: on a typical system, shrinkers own a significant fraction of non-pagecache kernel memory, and each shrinker has internal state of its own that determines how much memory is currently freeable (dirtiness, locking issues). Imagine if shrinkers all had .to_text() methods; then on memory allocation failure we could call those and print the output for the top 10 shrinkers by memory owned - in addition to sticking it in sysfs or debugfs.
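
To make the per-call-site tracking idea concrete, here is a rough userspace sketch of the ELF section trick. It is not kernel code and not a proposed patch: struct alloc_tag, tracked_malloc(), tagged_alloc(), tagged_free() and the two demo call sites are all hypothetical names made up for illustration. It assumes GCC or Clang on an ELF target, where the linker provides __start_<name>/__stop_<name> symbols for a section whose name is a valid C identifier. The per-allocation header here also stores the size, which a real allocator would already know, so the kernel-side cost should be closer to the single pointer mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* One statically allocated record per call site (hypothetical layout). */
struct alloc_tag {
	const char *file;
	int line;
	size_t live_bytes;	/* bytes currently allocated from this site */
	size_t live_allocs;	/* allocations currently live from this site */
};

/* Section bounds, provided by the linker (GNU ld, lld) on ELF targets. */
extern struct alloc_tag __start_alloc_tags[];
extern struct alloc_tag __stop_alloc_tags[];

/* Per-allocation overhead: a pointer back to the call site's tag. */
struct alloc_hdr {
	struct alloc_tag *tag;
	size_t size;	/* a real allocator would already know this */
};

void *tagged_alloc(struct alloc_tag *tag, size_t size)
{
	struct alloc_hdr *h = malloc(sizeof(*h) + size);

	if (!h)
		return NULL;
	h->tag = tag;
	h->size = size;
	tag->live_bytes += size;
	tag->live_allocs++;
	return h + 1;
}

void tagged_free(void *p)
{
	struct alloc_hdr *h;

	if (!p)
		return;
	h = (struct alloc_hdr *)p - 1;
	h->tag->live_bytes -= h->size;
	h->tag->live_allocs--;
	free(h);
}

/* Each expansion emits its own static tag into the "alloc_tags" section. */
#define tracked_malloc(size) ({						\
	static struct alloc_tag _tag					\
		__attribute__((section("alloc_tags"), used,		\
			       aligned(__alignof__(struct alloc_tag))))	\
		= { __FILE__, __LINE__, 0, 0 };				\
	tagged_alloc(&_tag, (size));					\
})

/* Walk every call site in the section and sum what is still live. */
size_t alloc_tags_live_bytes(void)
{
	size_t total = 0;

	for (struct alloc_tag *t = __start_alloc_tags; t < __stop_alloc_tags; t++)
		total += t->live_bytes;
	return total;
}

/* The moral equivalent of the debugfs file: usage by file and line. */
void alloc_tags_report(FILE *out)
{
	for (struct alloc_tag *t = __start_alloc_tags; t < __stop_alloc_tags; t++)
		fprintf(out, "%s:%d: %zu bytes in %zu allocations\n",
			t->file, t->line, t->live_bytes, t->live_allocs);
}

/* Two demo call sites, so the section is non-empty. */
void *alloc_inode_buf(void) { return tracked_malloc(4096); }
void *alloc_name_buf(void) { return tracked_malloc(64); }
```

Calling alloc_tags_report(stderr) at any point prints one line per call site. The accounting is just a couple of non-atomic increments on a static struct in this sketch; a kernel version would presumably want per-cpu counters.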
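
For the shrinker reporting idea, something along these lines is what I have in mind. Again this is a userspace sketch, not the bcachefs printbuf code and not the kernel's struct shrinker: struct printbuf as laid out here, pb_printf(), struct demo_shrinker and shrinkers_report() are stand-ins invented for illustration, and the "dirty" field is just an example of shrinker-specific state.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* A heap-allocated string that can be passed around and appended to. */
struct printbuf {
	char *buf;
	size_t len;	/* bytes used, not counting the trailing NUL */
	size_t size;	/* bytes allocated */
	int failed;	/* set if an append could not be completed */
};

void pb_printf(struct printbuf *pb, const char *fmt, ...)
{
	va_list ap;
	int n;

	/* First pass: measure how much room the formatted text needs. */
	va_start(ap, fmt);
	n = vsnprintf(NULL, 0, fmt, ap);
	va_end(ap);
	if (n < 0) {
		pb->failed = 1;
		return;
	}

	if (pb->len + (size_t)n + 1 > pb->size) {
		size_t new_size = pb->size ? pb->size : 64;
		char *p;

		while (new_size < pb->len + (size_t)n + 1)
			new_size *= 2;
		p = realloc(pb->buf, new_size);
		if (!p) {
			pb->failed = 1;
			return;
		}
		pb->buf = p;
		pb->size = new_size;
	}

	/* Second pass: append in place. */
	va_start(ap, fmt);
	vsnprintf(pb->buf + pb->len, pb->size - pb->len, fmt, ap);
	va_end(ap);
	pb->len += (size_t)n;
}

/* Hypothetical stand-in for a shrinker that can describe itself. */
struct demo_shrinker {
	const char *name;
	size_t bytes_owned;
	size_t dirty;	/* example of state that affects freeability */
	void (*to_text)(struct printbuf *, const struct demo_shrinker *);
};

void demo_to_text(struct printbuf *out, const struct demo_shrinker *s)
{
	pb_printf(out, "%s: %zu bytes owned, %zu dirty\n",
		  s->name, s->bytes_owned, s->dirty);
}

static int cmp_owned_desc(const void *a, const void *b)
{
	const struct demo_shrinker *l = a, *r = b;

	return (l->bytes_owned < r->bytes_owned) -
	       (l->bytes_owned > r->bytes_owned);
}

/* Sort by memory owned (in place) and print the top_n largest. */
void shrinkers_report(struct printbuf *out, struct demo_shrinker *s,
		      size_t nr, size_t top_n)
{
	qsort(s, nr, sizeof(*s), cmp_owned_desc);
	for (size_t i = 0; i < nr && i < top_n; i++)
		s[i].to_text(out, &s[i]);
}
```

The point of going through a printbuf rather than printing directly is that the same .to_text() output can go to the log on allocation failure and to a sysfs or debugfs file, without each shrinker caring which.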