[LSF/MM TOPIC] Improving OOM debugging

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Mon, 21 Mar 2022 20:51:01 -0400

Frustration when debugging OOMs, memory usage, and memory reclaim behaviour is a
topic I think a lot of us can relate to.

I think it might be worth having a talk to collectively air our frustrations and
collect ideas for improvements.

To start with: on memory allocation failure or OOM, we currently don't have a
lot to go on. We get information about the allocation that failed, and only very
coarse-grained information about how memory is being tied up - page-granular
information, aka show_mem(), is nigh useless in most situations, and
slab-granular information is only slightly better.

I have a couple ideas I want to float:
 - An old idea I've had and mentioned to some people before is to steal dynamic
   debug's trick of statically allocating tracking structs in a special ELF
   section, and use it to wrap kmalloc(), alloc_pages() etc. calls for memory
   allocation tracking _per call site_, with the results then available in
   debugfs, broken out by file and line number.

   This would be cheap enough that it could be always on in production, unlike
   doing the same sort of thing with tracepoints. The cost would be another
   pointer of overhead for each allocation - for page allocations we've got
   CONFIG_PAGE_OWNER that does something like this (in a much more expensive
   fashion), and the pointer it uses could be repurposed. For slub/slab I think
   something analogous exists, but last I looked it'd probably need help from
   those developers (in both cases, really; mm code is hairy).

 - In bcachefs, I've been evolving a 'printbuf' thingy - heap allocated strings
   that you can pass around and append to. They make it really convenient to
   write pretty-printers for lots of things and pass them around, which in turn
   has made my life considerably easier in the debugging realm.

   I think that could be useful here: on a typical system, shrinkers own a
   significant fraction of non-pagecache kernel memory, and each shrinker has
   internal state of its own that's relevant to how much memory is currently
   freeable (dirtiness, locking issues).

   Imagine if shrinkers all had .to_text() methods, and then on memory
   allocation failure we could call those and print the output for the top 10
   shrinkers by memory owned - in addition to sticking it in sysfs or debugfs.
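
To make the per-call-site tracking idea concrete, here is a minimal userspace
sketch, assuming GCC or Clang with GNU ld or lld. Every name in it (struct
alloc_tag, tracked_malloc(), the "alloc_tags" section) is hypothetical, and
malloc() stands in for kmalloc(); it is not how dynamic debug or
CONFIG_PAGE_OWNER is actually implemented, and it ignores locking entirely.

```c
/* Userspace sketch only; all names are hypothetical, not kernel APIs. */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* One statically allocated counter per call site. */
struct alloc_tag {
	const char *file;
	int line;
	size_t bytes;	/* bytes currently allocated from this site */
	size_t calls;	/* allocations currently live from this site */
};

/* Per-allocation overhead: a pointer back to the call site (plus size). */
struct alloc_hdr {
	_Alignas(max_align_t) struct alloc_tag *tag;
	size_t size;
};

static void *tagged_alloc(struct alloc_tag *tag, size_t size)
{
	struct alloc_hdr *hdr = malloc(sizeof(*hdr) + size);

	if (!hdr)
		return NULL;
	hdr->tag = tag;
	hdr->size = size;
	tag->bytes += size;
	tag->calls++;
	return hdr + 1;
}

static void tagged_free(void *p)
{
	struct alloc_hdr *hdr;

	if (!p)
		return;
	hdr = (struct alloc_hdr *)p - 1;
	assert(hdr->tag->calls > 0);
	hdr->tag->bytes -= hdr->size;
	hdr->tag->calls--;
	free(hdr);
}

/*
 * GNU statement expression: each expansion gets its own static tag, placed
 * in the "alloc_tags" section so that all tags form one contiguous array.
 */
#define tracked_malloc(size) ({						\
	static struct alloc_tag _tag					\
		__attribute__((section("alloc_tags"), used,		\
			       aligned(__alignof__(struct alloc_tag)))) = \
		{ .file = __FILE__, .line = __LINE__ };			\
	tagged_alloc(&_tag, (size));					\
})

/* The linker defines these for sections whose names are C identifiers. */
extern struct alloc_tag __start_alloc_tags[];
extern struct alloc_tag __stop_alloc_tags[];

/* Print one line per call site; returns total bytes currently allocated. */
static size_t alloc_tags_report(FILE *out)
{
	size_t total = 0;

	for (struct alloc_tag *t = __start_alloc_tags;
	     t < __stop_alloc_tags; t++) {
		fprintf(out, "%s:%d: %zu bytes in %zu allocations\n",
			t->file, t->line, t->bytes, t->calls);
		total += t->bytes;
	}
	return total;
}
```

Callers would use tracked_malloc() and tagged_free() in place of malloc() and
free(), and alloc_tags_report(stdout) plays the role the debugfs file would
play in the kernel. The fast path only bumps two counters, which is why
something like this could plausibly stay enabled in production.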
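
The shrinker reporting idea could look roughly like the userspace sketch
below. It is not the bcachefs printbuf code or the kernel's struct shrinker:
struct printbuf, pr_buf(), struct shrinker_sketch and its fields are all
stand-ins invented for illustration, and "freeable = owned - dirty" is just an
assumed example of shrinker-specific state.

```c
/* Userspace sketch only; all names are hypothetical, not kernel APIs. */
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Heap-allocated string that can be passed around and appended to. */
struct printbuf {
	char *buf;
	size_t len;	/* bytes used, excluding the terminating NUL */
	size_t cap;	/* bytes allocated */
};

static void pr_buf(struct printbuf *out, const char *fmt, ...)
{
	va_list ap;
	int n;

	/* First pass: measure how much room the formatted text needs. */
	va_start(ap, fmt);
	n = vsnprintf(NULL, 0, fmt, ap);
	va_end(ap);
	if (n < 0)
		return;

	if (out->len + (size_t)n + 1 > out->cap) {
		size_t cap = (out->len + (size_t)n + 1) * 2;
		char *buf = realloc(out->buf, cap);

		if (!buf)
			return;	/* drop this output rather than fail */
		out->buf = buf;
		out->cap = cap;
	}

	/* Second pass: append at the current end of the buffer. */
	va_start(ap, fmt);
	vsnprintf(out->buf + out->len, out->cap - out->len, fmt, ap);
	va_end(ap);
	out->len += (size_t)n;
}

struct shrinker_sketch {
	const char *name;
	size_t bytes_owned;
	size_t bytes_dirty;	/* example of shrinker-specific state */
	void (*to_text)(struct printbuf *out,
			const struct shrinker_sketch *s);
};

static void cache_to_text(struct printbuf *out,
			  const struct shrinker_sketch *s)
{
	pr_buf(out, "%s: owned %zu dirty %zu freeable %zu\n", s->name,
	       s->bytes_owned, s->bytes_dirty,
	       s->bytes_owned - s->bytes_dirty);
}

/* qsort() comparator: largest bytes_owned first. */
static int cmp_owned_desc(const void *a, const void *b)
{
	const struct shrinker_sketch *x = *(struct shrinker_sketch *const *)a;
	const struct shrinker_sketch *y = *(struct shrinker_sketch *const *)b;

	return (x->bytes_owned < y->bytes_owned) -
	       (x->bytes_owned > y->bytes_owned);
}

/* Sort the pointer array in place, then report the top_n largest owners. */
static void shrinkers_to_text(struct printbuf *out,
			      struct shrinker_sketch **s,
			      size_t nr, size_t top_n)
{
	qsort(s, nr, sizeof(*s), cmp_owned_desc);
	for (size_t i = 0; i < nr && i < top_n; i++)
		if (s[i]->to_text)
			s[i]->to_text(out, s[i]);
}
```

On allocation failure the caller would build one printbuf with
shrinkers_to_text() and print it; the same .to_text() method could back a
sysfs or debugfs file. A real version would probably want a preallocated
buffer, since growing a heap string while reporting an OOM is questionable.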