On 2023/11/30 11:21, Kent Overstreet wrote:
On Thu, Nov 30, 2023 at 11:09:42AM +0800, Qi Zheng wrote:
On 2023/11/30 07:11, Kent Overstreet wrote:
On Wed, Nov 29, 2023 at 10:14:54AM +0100, Michal Hocko wrote:
On Tue 28-11-23 16:34:35, Roman Gushchin wrote:
On Tue, Nov 28, 2023 at 02:23:36PM +0800, Qi Zheng wrote:
[...]
Now I think adding this method might not be a good idea. If we allow
shrinkers to report their own private information, OOM logs may become
cluttered. Most people only care about general information when
troubleshooting an OOM problem, not about the private details of a
particular shrinker.
I agree with that.
It seems that the feature is mostly useful for kernel developers, and it's easily
achievable by attaching a bpf program to the oom handler. If it requires a bit
of work on the bpf side, we can do that instead, though it probably doesn't. And this
approach can potentially provide far more information in a more flexible way.
So I'm not convinced it's a good idea to make the generic oom handling code
more complicated and fragile for everybody, as well as making oom reports differ
more between kernel versions and configurations.
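For illustration, a minimal BCC (Python) sketch of such an attach point - purely
a hypothetical example, not an existing tool or posted patch; a real debugging
program would collect and dump the relevant subsystem state from the probe
instead of just printing a marker:

    #!/usr/bin/env python3
    # Hypothetical sketch: attach a kprobe to out_of_memory() with BCC and
    # emit a trace line whenever the kernel OOM handler runs.  A real tool
    # would walk and dump whatever subsystem/shrinker state it cares about
    # at this point instead.
    from bcc import BPF

    bpf_text = r"""
    int kprobe__out_of_memory(void *ctx)
    {
        bpf_trace_printk("out_of_memory() invoked\n");
        return 0;
    }
    """

    b = BPF(text=bpf_text)
    print("Tracing out_of_memory()... Ctrl-C to exit")
    b.trace_print()

Run as root on a kernel with kprobes and BPF enabled; the output shows up via
the trace pipe.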
Completely agreed! From my many years of experience analysing OOM reports
from production systems I would conclude the following categories:

- clear runaways (and/or memory leaks)
    - userspace consumers - either shmem or anonymous memory
      predominantly consumes the memory, swap is either depleted
      or not configured. The OOM report is usually useful to
      pinpoint those as we have the required counters available.
    - kernel memory consumers - if we are lucky they are using
      the slab allocator and unreclaimable slab is a huge part of
      the memory consumption. If this is a page allocator user,
      the OOM report only helps to deduce the fact by looking at
      how much user + slab + page tables etc. account for, but
      identifying the root cause is close to impossible without
      something like page_owner or a crash dump.
- misbehaving memory reclaim
    - a minority of issues, and the OOM report is usually
      insufficient to drill down to the root cause. If the
      problem is reproducible then collecting vmstat data can
      give a much better clue.
    - a high number of reclaimable slab objects or free swap are
      good indicators. Shrinker data could potentially be helpful
      in the slab case, but I have a hard time remembering any
      such situation.
On non-production systems the situation is quite different. I can see
how it could be very beneficial to add very specific debugging data
for the subsystem/shrinker which is being developed and could cause the OOM. For
that purpose the proposed scheme is rather inflexible AFAICS.
Considering that you're an MM guy, and that shrinkers are pretty much
universally used by _filesystem_ people - I'm not sure your experience
is the most relevant here?
The general attitude I've been seeing in this thread has been one of
dismissiveness towards filesystem people. Roman too; back when he was
Oh, please don't say that, it seems like you are the only one causing
the fight. We deeply respect the opinions of file system developers, so
I invited Dave to this thread from the beginning. And you didn’t CC
linux-fsdevel@xxxxxxxxxxxxxxx yourself.
working on his shrinker debug feature I reached out to him, explained
that I was working on my own, and asked about collaborating - got
crickets in response...
Hmm..
Besides that, I haven't seen anything whatsoever out of you guys to
make our lives easier regarding OOM debugging, nor do you guys even
seem interested in the needs and perspectives of the filesystem people.
Roman, your feature didn't help one bit with OOM debugging - it didn't even
come with documentation or hints as to what it's for.
BPF? Please.
(Disclaimer: no intention to start a fight; here are some objective
views.)
Why not? In addition to printk, there are many good debugging tools
worth trying, such as BPF-related tools, drgn, etc.
For non-bcachefs developers, who knows what those statistics mean?
You can use BPF or drgn to traverse the shrinker list in advance and get the
address of the bcachefs shrinker structure, then, during OOM, find the bcachefs
private structure through the shrinker->private_data member and dump the
bcachefs private data. Is there any problem with this?
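For example, a minimal drgn sketch along these lines - an illustrative
assumption rather than a finished tool, relying on a kernel with
CONFIG_SHRINKER_DEBUG (for ->name) and the dynamically allocated shrinker API
(for ->private_data); symbol and member names may differ on other kernel
versions:

    #!/usr/bin/env drgn
    # Hypothetical sketch: walk the global shrinker_list and print each
    # shrinker's name together with its ->private_data pointer.  A filesystem
    # developer could then cast private_data to their fs-private type and
    # dump whatever counters they care about.
    from drgn.helpers.linux.list import list_for_each_entry

    for shrinker in list_for_each_entry("struct shrinker",
                                        prog["shrinker_list"].address_of_(),
                                        "list"):
        name = shrinker.name.string_().decode() if shrinker.name else "<unnamed>"
        print(f"{name}: private_data = 0x{shrinker.private_data.value_():x}")

It can be run against the live kernel (/proc/kcore) or a crash dump, so the
same script works both before the OOM and on a post-mortem image.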
No, BPF is not a substitute for improving our OOM/allocation failure
reports. BPF/tracing are secondary tools; whenever we're logging
information about a problem we should strive to log enough information
to debug the issue.
We've got junk in there we don't need: as mentioned before, there's no
need to dump information on _every_ slab; we can pick the ones
using the most memory and show those.
Similarly for shrinkers, we're not going to print all of them -
the patchset picks the top 10 by object count and prints those. It could
probably be ~4, since there are fewer shrinkers than slabs; also, if we can get
shrinkers to report the memory they own in bytes, that will also help
with deciding which information is pertinent.
I'm not worried about a shrinker's general data. What I'm worried
about is its private data. Except for the corresponding
developers, nobody knows what the private statistics mean,
and we have no control over how much private data gets printed or in
what form. This may indeed clutter the OOM log and break automated
parsing. Any thoughts on this?
That's not a huge amount of information to be dumping, and it makes it
easier to debug something that has historically been a major pain point.
There's a lot more that could be done to make our OOM reports more
readable and useful to non-mm developers. Unfortunately, any time the
show_mem report is changed, the immediate reaction seems to be "but
that will break my log parsing / change what I'm used to!"...