Re: [PATCH] mm,page_alloc: softlockup on warn_alloc on

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Fri, 15 Sep 2017 23:12:24 +0900

Michal Hocko wrote:
> On Fri 15-09-17 21:09:29, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 15-09-17 20:38:49, Tetsuo Handa wrote:
> > > [...]
> > > > You said "identify _why_ we see the lockup triggering in the first
> > > > place" without providing means to identify it. Unless you provide
> > > > means to identify it (in a form which can be immediately and easily
> > > > backported to 4.9 kernels; that is, backporting not-yet-accepted
> > > > printk() offloading patchset is not a choice), this patch cannot be
> > > > refused.
> > > 
> > > I fail to see why. It simply workarounds an existing problem elsewhere
> > > in the kernel without deeper understanding on where the problem is. You
> > > can add your own instrumentation to debug and describe the problem. This
> > > is no different to any other kernel bugs...
> > 
> > Please do show us your patch for that. Normal users cannot afford developing
> > such instrumentation to debug and describe the problem.
> 
> Stop this nonsense already! Any kernel bug/lockup needs a debugging
> which might be non-trivial and it is necessary to understand the real
> culprit. We do not add random hacks to silence a problem. We aim at
> fixing it!

Assuming that Wang Yu's trace has a

  RIP: 0010:[<...>]  [<...>] dump_stack+0x.../0x...

line in the omitted part (as Cong Wang's trace did), I suspect that the thread
holding dump_lock cannot leave console_unlock(), called from printk(), for a long
time because many other threads keep calling printk() from warn_alloc() and
consume all the CPU time.

Thus, keeping other threads from consuming CPU time and calling printk() is a step
toward isolating the cause. If the problem persists even after we make the other
threads sleep, the real cause is somewhere else. Unfortunately, Cong Wang has not yet
succeeded in reproducing the problem. If Wang Yu is able to reproduce it,
we can try writing 1 to /proc/sys/kernel/softlockup_all_cpu_backtrace so that
we can see what the other CPUs are doing.
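
For reference, a minimal sketch of how that knob could be set before running the
reproducer. The helper name enable_all_cpu_backtrace and its optional path
argument are hypothetical (added only so the function can be tried against a
scratch file); the knob itself is only present on kernels built with the
soft-lockup detector, and writing it needs root.

```shell
# Hypothetical helper (not part of any patch in this thread): ask the
# soft-lockup watchdog to dump backtraces of all CPUs, not only the stuck one.
# The optional first argument overrides the knob path for dry runs.
enable_all_cpu_backtrace() {
    knob="${1:-/proc/sys/kernel/softlockup_all_cpu_backtrace}"
    if [ ! -w "$knob" ]; then
        echo "cannot write $knob (need root, or knob not available)" >&2
        return 1
    fi
    echo 1 > "$knob"
}

# Usage, as root, before starting the reproducer:
#   enable_all_cpu_backtrace
# The same setting should also be reachable via:
#   sysctl -w kernel.softlockup_all_cpu_backtrace=1
```

With this enabled, a soft-lockup report should be followed by backtraces from the
other CPUs, which would show whether they are spinning in warn_alloc()/printk().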

>  
> > > If our printk implementation is so weak it cannot cope with writers then
> > > that should be fixed without spreading hacks in different subsystems. If
> > > the lockup is a real problem under normal workloads (rather than
> > > artificial ones) then we should try to throttle more aggressively.
> > 
> > No throttle please. Throttling makes warn_alloc() more and more useless.
> 
> so does try_lock approach...

There is the mutex_lock() approach, but you do not agree to using it.
