Michal Hocko wrote:
> On Fri 02-06-17 20:13:32, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote:
> > >> Michal Hocko wrote:
> > >>> On Thu 01-06-17 20:43:47, Tetsuo Handa wrote:
> > >>>> Cong Wang has reported a lockup when running LTP memcg_stress test [1].
> > >>>
> > >>> This seems to be on an old and not pristine kernel. Does it happen also
> > >>> on the vanilla up-to-date kernel?
> > >>
> > >> 4.9 is not an old kernel! It might be close to the kernel version which
> > >> enterprise distributions would choose for their next long term supported
> > >> version.
> > >>
> > >> And please stop saying "can you reproduce your problem with latest
> > >> linux-next (or at least latest linux)?" Not everybody can use the vanilla
> > >> up-to-date kernel!
> > >
> > > The changelog mentioned that the source of stalls is not clear so this
> > > might be out-of-tree patches doing something wrong and dump_stack
> > > showing up just because it is called often. This wouldn't be the first
> > > time I have seen something like that. I am not really keen on adding
> > > heavy lifting for something that is not clearly debugged and based on
> > > hand waving and speculations.
> >
> > You are asking users to prove that the problem is indeed in the MM subsystem,
> > but you are thinking that kmallocwd which helps users to check whether the
> > problem is indeed in the MM subsystem is not worth merging into mainline.
> > As a result, we have to try things based on what you think handwaving and
> > speculations. This is a catch-22. If you don't want handwaving/speculations,
> > please please do provide a mechanism for checking (a) and (b) shown later.
>
> configure watchdog to bug on soft lockup, take a crash dump, see what
> is going on there and you can draw a better picture of what is going on
> here. Seriously I am fed up with all the "let's do the async thing
> because it would tell much more" side discussions. You are trying to fix
> a soft lockup which alone is not a deadly condition.

Wrong and unacceptable. I'm trying to collect information on not only the
cause of the allocation stall reported by Cong Wang but also any allocation
stall (e.g. the infinite too_many_isolated() trap from shrink_inactive_list()
and the infinite mempool_alloc() loop case). I was indeed surprised to learn
how severe a race condition on a system with many CPUs can be; see
http://lkml.kernel.org/r/201409192053.IHJ35462.JLOMOSOFFVtQFH@xxxxxxxxxxxxxxxxxxx .
Therefore, as you suspect, it is possible that dump_stack() serialization
might be unfair on some hardware, especially when a lot of allocating threads
are calling warn_alloc(). But if you read Cong's report, you will find that
the allocation stalls began before the first soft lockup caused by
dump_stack() serialization was detected. This suggests that the allocation
stalls had already begun before the system entered a state where a lot of
allocating threads were calling warn_alloc(). Thus I think it is natural to
suspect that the first culprit is the MM subsystem. If you believe that you
can debug and explain why the OOM killer was not invoked in Cong's case
without comparing how the situation changes over time (i.e. with a crash
dump only), please explain it in that thread.

> If the system is
> overwhelmed it can happen and if that is the case then we should care
> whether it gets resolved or it is a permanent livelock situation.

What I'm trying to check is exactly whether the stall gets resolved or is a
permanent livelock situation, by allowing users to take multiple snapshots
of the memory image and compare how the situation changes over time. Taking
a crash dump via kernel panic does not allow users to take multiple
snapshots of the memory image.

> If yes
> then we need to isolate which path is not preempting and why and place
> the cond_resched there. The page allocator contains preemption points,
> if we are lacking some for some pathological paths let's add them.

What I'm talking about is not the lack of cond_resched(). As far as we know,
cond_resched() is inserted wherever necessary. What I'm talking about is
that cond_resched() cannot yield enough CPU time. Looping while calling
cond_resched() might yield only e.g. 1% of CPU time compared to looping
without calling cond_resched(). But if e.g. 100 threads are looping with
only cond_resched(), a thread waiting for a wakeup from
schedule_timeout_killable(1) can be delayed a lot. If that thread has idle
priority, the delay can become far too long to wait for.
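
To make this concrete, below is an untested sketch of the kind of test I
have in mind. It is only an illustration: the thread count, the thread
names and the nice level are arbitrary, and nice 19 is merely a crude
stand-in for SCHED_IDLE. The point is that the sleeper asks for one jiffy
but, once enough CPU-bound loopers are running, is woken up much later
than that.

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/jiffies.h>
#include <linux/printk.h>
#include <linux/err.h>

#define NR_LOOPERS 100	/* arbitrary; meant to exceed the number of CPUs */

static struct task_struct *loopers[NR_LOOPERS];
static struct task_struct *sleeper;

/* Mimics threads stuck in a retry loop which politely call cond_resched(). */
static int looper_func(void *unused)
{
	while (!kthread_should_stop())
		cond_resched();
	return 0;
}

/* Mimics a low priority thread which expects to be woken up after one jiffy. */
static int sleeper_func(void *unused)
{
	set_user_nice(current, 19);	/* crude stand-in for idle priority */
	while (!kthread_should_stop()) {
		unsigned long start = jiffies;

		schedule_timeout_killable(1);
		pr_info("sleeper: asked for 1 jiffy, got %u msecs\n",
			jiffies_to_msecs(jiffies - start));
	}
	return 0;
}

static int __init stalltest_init(void)
{
	int i;

	for (i = 0; i < NR_LOOPERS; i++)
		loopers[i] = kthread_run(looper_func, NULL, "looper/%d", i);
	sleeper = kthread_run(sleeper_func, NULL, "stall-sleeper");
	return 0;
}

static void __exit stalltest_exit(void)
{
	int i;

	if (!IS_ERR_OR_NULL(sleeper))
		kthread_stop(sleeper);
	for (i = 0; i < NR_LOOPERS; i++)
		if (!IS_ERR_OR_NULL(loopers[i]))
			kthread_stop(loopers[i]);
}

module_init(stalltest_init);
module_exit(stalltest_exit);
MODULE_LICENSE("GPL");

Watching how the reported delay grows as NR_LOOPERS is increased relative
to the number of CPUs would show whether cond_resched() alone keeps the
sleeper's wakeup latency bounded.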
> For
> some reason you seem to be focused only on the warn_alloc path, though,
> while the real issue might be somewhere completely else.

Not only does uncontrolled concurrent warn_alloc() cause flooding of
printk() and prevent users from knowing what is going on, but warn_alloc()
also cannot be called when we hit e.g. the infinite too_many_isolated()
loop case or the infinite mempool_alloc() loop case, which likewise
prevents users from knowing what is going on. You are not providing users
with a means to skip warn_alloc() when the flooding of printk() is just
junk (i.e. the real issue is somewhere completely else), or to call
warn_alloc() when all printk() messages are needed (i.e. the issue is in
the MM subsystem). The current implementation of warn_alloc() can act as a
launcher of corrupted spam messages. Without comparing how the situation
changes over time, by reliably saving all printk() messages which might
potentially be relevant, we can't judge whether the warn_alloc() flooding
is just a collateral phenomenon of a real issue somewhere completely else.
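
For example, the kind of user-controllable knob I mean could look roughly
like the untested sketch below. Every name in it (warn_alloc_mode,
warn_alloc_serialize_lock, report_alloc_stall) is made up for illustration;
this is not a patch against the current warn_alloc().

#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/printk.h>
#include <linux/sched.h>

/*
 * 0: suppress stall reports entirely (the flood is known to be a side
 *    effect of a problem somewhere else).
 * 1: let only one context report at a time, so that concurrent callers
 *    do not interleave and corrupt each other's output.
 * 2: let every caller report.
 */
static int warn_alloc_mode = 1;
module_param(warn_alloc_mode, int, 0644);

static DEFINE_MUTEX(warn_alloc_serialize_lock);

static void report_alloc_stall(const char *reason)
{
	if (!warn_alloc_mode)
		return;
	if (warn_alloc_mode == 1 &&
	    !mutex_trylock(&warn_alloc_serialize_lock))
		return;	/* somebody else is already dumping */
	pr_warn("%s: allocation stall: %s\n", current->comm, reason);
	dump_stack();
	if (warn_alloc_mode == 1)
		mutex_unlock(&warn_alloc_serialize_lock);
}

static int __init warn_alloc_knob_demo_init(void)
{
	report_alloc_stall("demo report");
	return 0;
}

static void __exit warn_alloc_knob_demo_exit(void)
{
}

module_init(warn_alloc_knob_demo_init);
module_exit(warn_alloc_knob_demo_exit);
MODULE_LICENSE("GPL");

Mode 0 would let users silence the flood when they already know the real
issue is elsewhere, while mode 1 would at least keep concurrent reports
from interleaving so that the messages which do get printed remain usable.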