Re: [PATCH 00/10] OOM Debug print selection and additional information

On Tue, Aug 27, 2019 at 11:59 PM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>
> On Tue 27-08-19 18:07:54, Edward Chron wrote:
> > On Tue, Aug 27, 2019 at 12:15 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> > >
> > > On Mon 26-08-19 12:36:28, Edward Chron wrote:
> > > [...]
> > > > Extensibility using OOM debug options
> > > > -------------------------------------
> > > > What is needed is an extensible system to optionally configure
> > > > debug options as needed and to then dynamically enable and disable
> > > > them. Also for options that produce multiple lines of entry based
> > > > output, to configure which entries to print based on how much
> > > > memory they use (or optionally all the entries).
> > >
> > > With a patch this large and adding a lot of new stuff we need a more
> > > detailed usecases described I believe.
> >
> > I guess it would make sense to explain the motivation for each OOM debug
> > option I've sent separately.
> > I see there are comments on the patches; I will try to add more information there.
> >
> > An overview would be that we've been collecting information on OOMs
> > over the last 12 years or so.
> > These are from switches, other embedded devices, and servers both large and small.
> > We ask for feedback on what information was helpful or could be helpful.
> > We try to add it to make root-causing issues easier.
> >
> > These OOM debug options are some of the options we've created.
> > I didn't port all of them to 5.3, but these are representative.
> > Our latest kernel is a bit behind 5.3.
> >
> > >
> > >
> > > [...]
> > >
> > > > Use of debugfs to allow dynamic controls
> > > > ----------------------------------------
> > > > By providing a debugfs interface that allows options to be configured,
> > > > enabled and where appropriate to set a minimum size for selecting
> > > > entries to print, the output produced when an OOM event occurs can be
> > > > dynamically adjusted to produce as little or as much detail as needed
> > > > for a given system.
> > >
> > > Who is going to consume this information and why would that consumer be
> > > unreasonable to demand further maintenance of that information in future
> > > releases? In other words, debugfs is not considered a stable API, which
> > > is OK here, but the side effect of any change to these files is
> > > user-visible behavior, and we consider that more or less stable as long
> > > as there are consumers.
> > >
> > > > OOM debug options can be added to the base code as needed.
> > > >
> > > > Currently we have the following OOM debug options defined:
> > > >
> > > > * System State Summary
> > > >   --------------------
> > > >   One line of output that includes:
> > > >   - Uptime (days, hour, minutes, seconds)
> > >
> > > We do have timestamps in the log so why is this needed?
> >
> >
> > Here is how an OOM report looks when we get it to look at:
> >
> > Aug 26 09:06:34 coronado kernel: oomprocs invoked oom-killer:
> > gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0,
> > oom_score_adj=1000
> > Aug 26 09:06:34 coronado kernel: CPU: 1 PID: 2795 Comm: oomprocs Not
> > tainted 5.3.0-rc6+ #33
> > Aug 26 09:06:34 coronado kernel: Hardware name: Compulab Ltd.
> > IPC3/IPC3, BIOS 5.12_IPC3K.PRD.0.25.7 08/09/2018
> >
> > This shows the date and time, not the time since the last boot. The
> > /var/log/messages output is what we often have to look at, not raw
> > dmesg output.
>
> This looks more like a configuration of the logging than a kernel
> problem. Kernel does provide timestamps for logs. E.g.
> $ tail -n1 /var/log/kern.log
> Aug 28 08:27:46 tiehlicka kernel: <1054>[336340.954345] systemd-udevd[7971]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
>
> [...]
> > > >   Example output when configured and enabled:
> > > >
> > > > Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 forks_since_boot:2786 procs_runable:2 procs_iowait:0
> > > >
> > > > * ARP Table and/or Neighbour Discovery Table Summary
> > > >   --------------------------------------------------
> > > >   One line of output each for ARP and ND that includes:
> > > >   - Table name
> > > >   - Table size (max # entries)
> > > >   - Key Length
> > > >   - Entry Size
> > > >   - Number of Entries
> > > >   - Last Flush (in seconds)
> > > >   - hash grows
> > > >   - entry allocations
> > > >   - entry destroys
> > > >   - Number lookups
> > > >   - Number of lookup hits
> > > >   - Resolution failures
> > > >   - Garbage Collection Forced Runs
> > > >   - Table Full
> > > >   - Proxy Queue Length
> > > >
> > > >   Example output when configured and enabled (for both):
> > > >
> > > > ... kernel: neighbour: Table: arp_tbl size:   256 keyLen:  4 entrySize: 360 entries:     9 lastFlush:  1721s hGrows:     1 allocs:     9 destroys:     0 lookups:   204 hits:   199 resFailed:    38 gcRuns/Forced: 111 /  0 tblFull:  0 proxyQlen:  0
> > > >
> > > > ... kernel: neighbour: Table:  nd_tbl size:   128 keyLen: 16 entrySize: 368 entries:     6 lastFlush:  1720s hGrows:     0 allocs:     7 destroys:     1 lookups:     0 hits:     0 resFailed:     0 gcRuns/Forced: 110 /  0 tblFull:  0 proxyQlen:  0
> > >
> > > Again, why is this needed particularly for the OOM event? I do
> > > understand this might be useful system health diagnostic information but
> > > how does this contribute to the OOM?
> > >
> >
> > It is an example of some of the system table information we print.
> > Other adjustable table information may be useful as well.
> > These table sizes are often adjustable and collecting stats on usage
> > helps determine if settings are appropriate.
> > The value during OOM events is very useful as usage varies.
> > We also collect the same stats periodically from user code
> > and can compare the two.
>
> I suspect that this is a very narrow use case, and there are more like
> it; I can imagine somebody with a different workload could come up
> with yet another set of useful information to print. The more I think of these
> additional modules, the more I am convinced that this "plugin" architecture
> is the wrong approach. Why? Mostly because the code maintenance burden
> is likely not worth all the niche use cases. This all has to be more
> dynamic and ideally scriptable so that the code in the kernel just
> provides the basic information and everybody can just hook in there and
> dump whatever additional information is needed. Sounds like something
> that eBPF could fit in, no? Have you considered that?
>

Just a comment: I have no doubt that the people commenting here have felt
the pain of trying to root cause OOM issues, and you know how difficult it can
be. Since you can't easily reproduce the problem, having the information you
need at the time of the event is really your best hope. As much as I wish we
could take a full dump of the system when an OOM event occurs, that just isn't
practical on production systems (it would make my job and other people's jobs easier).

What we coded was quite reasonable given that we started back with 2.6.38
and then 3.4. We've been tailoring the OOM report to give us the information
we need since then. It isn't perfect, but it's a significant improvement. So
again, we've offered it up to start this discussion, with the goal of figuring
out "how to get there", as Michal says.

Going forward, we remain flexible as to the implementation.
We "just want to get there" too.

Yes, we have thought about eBPF.
I mentioned this in the discussion with Qian Cai, but with the caveat that
running an eBPF script isn't standard Linux operating procedure, at this
point in time anyway, and so will not be well received in the data center.

Our belief is that if you really think eBPF is the preferred mechanism,
then the OOM reporting itself should move to an eBPF script.
I mentioned this before, but I will reiterate it here.

So how do we get there? Let's look at the existing report, which we know
has issues.

Other than a few essential messages the OOM code should always produce,
such as the "Killed process" message sequence, you could move the entire
OOM report into an eBPF script and thereby make it customizable,
configurable or, if you prefer, programmable.

Why? Because, as we all agree, you'll never have a perfect OOM report.
So if you believe this, then, if you will, put your money where your mouth
is (so to speak) and make the entire OOM report an eBPF script.
We'd be willing to help with this.
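
As a rough illustration of what hooking the OOM path from eBPF could look
like, here is a minimal libbpf/CO-RE style sketch. It is not part of the
posted patches; the program name, the choice of oom_kill_process() as the
attach point, and the bpf_printk() output format are all just assumptions
for the example:

// SPDX-License-Identifier: GPL-2.0
/* Minimal sketch only: hook the OOM kill path and emit one extra line per
 * OOM kill.  Everything here (names, output format) is illustrative and is
 * not part of the posted patch series.
 * Build roughly as: clang -O2 -g -target bpf -D__TARGET_ARCH_x86 -c ...
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

char LICENSE[] SEC("license") = "GPL";

SEC("kprobe/oom_kill_process")
int BPF_KPROBE(oom_report_extra, struct oom_control *oc)
{
	/* Task that triggered the OOM and the chosen victim. */
	__u32 trigger_pid = bpf_get_current_pid_tgid() >> 32;
	__u32 victim_pid  = BPF_CORE_READ(oc, chosen, pid);
	unsigned long totalpages = BPF_CORE_READ(oc, totalpages);

	/* Output lands in /sys/kernel/debug/tracing/trace_pipe. */
	bpf_printk("oom: trigger=%d victim=%d totalpages=%lu",
		   trigger_pid, victim_pid, totalpages);
	return 0;
}

The bcc/bpftrace "oomkill" tool already does roughly this today; the open
question is how much of the rest of the report can reasonably be pulled out
the same way.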

I'll give specific reasons why you want to do this. If you:
  • don't want to maintain a lot of code in the kernel (eBPF code doesn't count),
  • can't produce an ideal OOM report,
  • don't like configuring things but favor programmatic solutions,
  • agree the existing OOM report doesn't work for all environments,
  • want to allow flexibility but can't support everything people might want,
then installing an eBPF script for OOM reporting isn't just an option, it's
required. That requirement is huge for people who live in a world of large
data centers. Data center managers are very conservative. They don't want to
deviate from standard operating procedure unless absolutely necessary.
If loading an OOM reporting eBPF script is the standard way to get OOM
report output, then they'll accept that.

For anyone who argues that the existing report should remain as-is
with no changes, well, our experience tells us otherwise.

I could go on about this for longer than you'll care to read, but I'll
highlight a few issues that should be obvious.

Having only the choice of dumping either all user processes or none doesn't
work in general. For some environments or situations it makes sense to dump
only the largest memory-using tasks. As I've already mentioned, you can
configure this choice, or you could make the decision based on the number
of processes on the system, or use some other decision process. This can
help avoid having kernel printk rate limiting make the print decisions for
you. Having an eBPF script lets you program whatever policy you want.
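
To make that concrete, below is a sketch of the kind of size filter we have
in mind for the task dump. oom_dump_min_kib is a hypothetical knob
(debugfs-backed or otherwise); it does not exist in mainline, and the helper
is only meant to illustrate where a check could sit in the dump_tasks()
walk in mm/oom_kill.c:

#include <linux/mm.h>
#include <linux/sched.h>

/* Hypothetical sketch: skip tasks below a configurable RSS threshold when
 * dumping the eligible task list in the OOM report.  oom_dump_min_kib is an
 * assumed knob, not an existing kernel tunable; 0 means dump every task.
 */
static unsigned long oom_dump_min_kib;

static bool oom_dump_task_too_small(struct task_struct *tsk)
{
	unsigned long rss_kib;

	if (!oom_dump_min_kib || !tsk->mm)
		return false;

	rss_kib = get_mm_rss(tsk->mm) << (PAGE_SHIFT - 10);
	return rss_kib < oom_dump_min_kib;
}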

Printing slab usage only when slabs use more memory than user processes
isn't sufficient. We've had a number of cases where slabs use 30% or 40% of
system memory, which is abnormal for that system, and we want slab usage
at the time of the OOM event. Putting that in an eBPF script makes the
threshold adjustable.
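
For reference, mainline already gates this: dump_unreclaimable_slab() is
only called from the OOM header when unreclaimable slab exceeds the LRU
pages. Below is a sketch of the percentage-of-RAM trigger we would find
more useful; oom_should_dump_slab() and the oom_dump_slab_pct knob are
hypothetical, and the node stat name matches 5.3:

#include <linux/mm.h>
#include <linux/vmstat.h>

/* Sketch of a percentage-of-RAM trigger for dumping slab info in the OOM
 * report.  oom_dump_slab_pct is a hypothetical knob; mainline (5.3) instead
 * dumps unreclaimable slab only when it exceeds the LRU pages.
 */
static unsigned int oom_dump_slab_pct = 30;	/* dump when slab > 30% of RAM */

static bool oom_should_dump_slab(void)
{
	unsigned long slab = global_node_page_state(NR_SLAB_UNRECLAIMABLE);

	return slab * 100 > totalram_pages() * oom_dump_slab_pct;
}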

We have found printing large vmalloc allocations helpful, and an eBPF script
would allow us to add that.
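
Outside the OOM path the same data is already visible in /proc/vmallocinfo.
Purely as an illustration of how little is needed to pick out the large ones
after the fact, a trivial user-space filter could look like the sketch below
(the 1 MiB cut-off is an arbitrary example):

/* Illustration only: print /proc/vmallocinfo entries above a size
 * threshold.  Needs root to see the real addresses and sizes.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/vmallocinfo", "r");
	char line[512];
	unsigned long size;

	if (!f) {
		perror("/proc/vmallocinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* Format: "0x...-0x... <size-in-bytes> <caller> ..." */
		if (sscanf(line, "%*s %lu", &size) == 1 && size >= 1UL << 20)
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}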

We'd like to dump the status of a few kernel tables, and we should be able
to add that to an eBPF OOM reporting script.

Including the system task status information, as we do, is essential
for us. We've debugged so many issues because of this information.
We could add this to the OOM reporting eBPF script as well.

So yeah, we agree with using eBPF if you agree to using it as well.

If you don't agree, then we still want to configure or otherwise modify your
code, because we know our environment far better than you do,
and sadly this stuff matters or we wouldn't be having this discussion.

> [...]
>
> Skipping over a lot of useful stuff. I can reassure you that my experience
> with OOM debugging has been a real pain at times (e.g. when there is
> simply no way to find out who has eaten all the memory because it is not
> accounted anywhere) as well and I completely understand where you are
> coming from. There is definitely a room for improvements we just have to
> find a way how to get there.
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs

Agreed. What do you think?

Thank-you,

Edward Chron
Arista Networks
