On Thu, Nov 30, 2023 at 11:09:42AM +0800, Qi Zheng wrote:
> 
> 
> On 2023/11/30 07:11, Kent Overstreet wrote:
> > On Wed, Nov 29, 2023 at 10:14:54AM +0100, Michal Hocko wrote:
> > > On Tue 28-11-23 16:34:35, Roman Gushchin wrote:
> > > > On Tue, Nov 28, 2023 at 02:23:36PM +0800, Qi Zheng wrote:
> > > [...]
> > > > > Now I think adding this method might not be a good idea. If we allow
> > > > > shrinkers to report their own private information, OOM logs may become
> > > > > cluttered. Most people only care about some general information when
> > > > > troubleshooting OOM problems, but not the private information of a
> > > > > shrinker.
> > > > 
> > > > I agree with that.
> > > > 
> > > > It seems that the feature is mostly useful for kernel developers and it's
> > > > easily achievable by attaching a bpf program to the oom handler. If it
> > > > requires a bit of work on the bpf side, we can do that instead, but probably
> > > > not. And this solution can potentially provide way more information in a
> > > > more flexible way.
> > > > 
> > > > So I'm not convinced it's a good idea to make the generic oom handling code
> > > > more complicated and fragile for everybody, as well as making oom reports
> > > > differ more between kernel versions and configurations.
> > > 
> > > Completely agreed! From my many years of experience analysing oom reports
> > > from production systems I would conclude the following categories:
> > >   - clear runaways (and/or memory leaks)
> > >     - userspace consumers - either shmem or anonymous memory
> > >       predominantly consumes the memory, swap is either depleted
> > >       or not configured.
> > >       The OOM report is usually useful to pinpoint those as we
> > >       have the required counters available.
> > >     - kernel memory consumers - if we are lucky they are
> > >       using the slab allocator and unreclaimable slab is a huge
> > >       part of the memory consumption. If this is a page
> > >       allocator user the oom report only helps to deduce
> > >       the fact by looking at how much memory user + slab + page
> > >       tables etc. account for. But identifying the root cause is
> > >       close to impossible without something like page_owner
> > >       or a crash dump.
> > >   - misbehaving memory reclaim
> > >     - a minority of issues, and the oom report is usually
> > >       insufficient to drill down to the root cause. If the
> > >       problem is reproducible then collecting vmstat data
> > >       can give a much better clue.
> > >     - a high number of slab reclaimable objects or free swap
> > >       are good indicators. Shrinker data could potentially be
> > >       helpful in the slab case but I really have a hard time
> > >       remembering any such situation.
> > > On non-production systems the situation is quite different. I can see
> > > how it could be very beneficial to add very specific debugging data
> > > for a subsystem/shrinker which is being developed and could cause the
> > > OOM. For that purpose the proposed scheme is rather inflexible AFAICS.
> > 
> > Considering that you're an MM guy, and that shrinkers are pretty much
> > universally used by _filesystem_ people - I'm not sure your experience
> > is the most relevant here?
> > 
> > The general attitude I've been seeing in this thread has been one of
> > dismissiveness towards filesystem people. Roman too; back when he was
> 
> Oh, please don't say that, it seems like you are the only one causing
> the fight. We deeply respect the opinions of file system developers, so
> I invited Dave to this thread from the beginning. And you didn't CC
> linux-fsdevel@xxxxxxxxxxxxxxx yourself.
> 
> > working on his shrinker debug feature I reached out to him, explained
> > that I was working on my own, and asked about collaborating - got
> > crickets in response...
> 
> Hmm..
> 
> > Besides that, I haven't seen anything whatsoever out of you guys to
> > make our lives easier, regarding OOM debugging, nor do you guys even
> > seem interested in the needs and perspectives of the filesystem people.
> > Roman, your feature didn't help one bit for OOM debugging - didn't even
> > come with documentation or hints as to what it's for.
> > 
> > BPF? Please.
> 
> (Disclaimer, no intention to start a fight, here are some objective
> views.)
> 
> Why not? In addition to printk, there are many good debugging tools
> worth trying, such as BPF-related tools, drgn, etc.
> 
> For non-bcachefs developers, who knows what those statistics mean?
> 
> You can use BPF or drgn to traverse in advance to get the address of the
> bcachefs shrinker structure, and then during OOM, find the bcachefs
> private structure through the shrinker->private_data member, and then
> dump the bcachefs private data. Is there any problem with this?

No, BPF is not an excuse for not improving our OOM/allocation failure
reports. BPF/tracing are secondary tools; whenever we're logging
information about a problem we should strive to log enough information
to debug the issue.

We've got junk in there we don't need: as mentioned before, there's no
need to be dumping information on _every_ slab - we can pick the ones
using the most memory and show those.

Similarly for shrinkers, we're not going to be printing all of them -
the patchset picks the top 10 by objects and prints those. It could
probably be ~4, since there are fewer shrinkers than slabs; and if we
can get shrinkers to report the memory they own in bytes, that will
also help with deciding what information is pertinent.

That's not a huge amount of information to be dumping, and it makes it
easier to debug something that has historically been a major pain
point.

There's a lot more that could be done to make our OOM reports more
readable and useful to non-mm developers. Unfortunately, any time
changing the show_mem report comes up, the immediate reaction seems to
be "but that will break my log parsing/change what I'm used to!"...
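
[Editor's note: for concreteness, the kind of traversal Qi describes
above could be sketched with drgn roughly as follows. This is only an
illustrative sketch, not code from the patchset or from bcachefs: it
assumes a kernel where struct shrinker instances sit on the global
shrinker_list and carry a ->name (CONFIG_SHRINKER_DEBUG) and a
->private_data member, as in the dynamically-allocated shrinker series;
member names may differ between kernel versions, and what private_data
points to is subsystem-specific.]

    #!/usr/bin/env drgn
    # Illustrative sketch: walk the global shrinker_list on a live
    # kernel or a vmcore and print each shrinker's address and its
    # ->private_data pointer. A subsystem-aware script could then cast
    # private_data to the subsystem's own type and dump its state.
    # Assumes CONFIG_SHRINKER_DEBUG (for ->name) and ->private_data.

    from drgn.helpers.linux.list import list_for_each_entry

    for shrinker in list_for_each_entry(
            "struct shrinker", prog["shrinker_list"].address_of_(),
            "list"):
        name = shrinker.name.string_().decode() if shrinker.name else "?"
        print("%-32s shrinker=0x%x private_data=0x%x"
              % (name, shrinker.value_(),
                 shrinker.private_data.value_()))

[Run via the drgn CLI against the live kernel (with debuginfo) or
against a crash dump; whether that is an acceptable substitute for a
summary in the OOM report itself is exactly what is being argued
above.]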