----- On Jan 10, 2018, at 12:02 PM, Tejun Heo tj@xxxxxxxxxx wrote:

> Hello, Linus, Andrew.
>
> On Wed, Jan 10, 2018 at 05:29:00PM +0100, Petr Mladek wrote:
>> Where is the acceptable compromise? I am not sure. So far, the most
>> forceful people (Linus) did not see softlockups as a big problem.
>> They rather wanted to see the messages.
>
> Can you please chime in? Would you be opposed to offloading to an
> independent context even if it were only for cases where we were
> already punting? The thing with the current offloading is that we
> don't know who we're offloading to. It might end up in faster or
> slower context, or more importantly a dangerous one.
>
> The particular case that we've been seeing regularly in the fleet was
> the following scenario.
>
> 1. Console is IPMI emulated serial console. Super slow. Also
>    netconsole is in use.
> 2. System runs out of memory, OOM triggers.
> 3. OOM handler is printing out OOM debug info.
> 4. While trying to emit the messages for netconsole, the network stack
>    / driver tries to allocate memory and then fail, which in turn
>    triggers allocation failure or other warning messages. printk was
>    already flushing, so the messages are queued on the ring.
> 5. OOM handler keeps flushing but 4 repeats and the queue is never
>    shrinking. Because OOM handler is trapped in printk flushing, it
>    never manages to free memory and no one else can enter OOM path
>    either, so the system is trapped in this state.

Hi Tejun,

There appear to be two problems at hand. One is making sure a console
buffer owner only flushes a bounded amount of data. The patches from
Steven & Co. seem to address this.

The second problem you describe here appears to be related to the
side effects of console drivers, namely netconsole in this scenario.
Its use of the network stack can allocate memory, which can fail,
and therefore trigger more printk.

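To make the second problem concrete, here is a minimal user-space
sketch in C of that feedback loop. It is a toy model, not kernel code:
every name in it (struct sim, sim_log, sim_console_write, sim_flush and
the in_console_driver flag) is hypothetical and chosen only for
illustration. Each flushed message makes the modeled console driver
fail an allocation and log a new warning, so the queue never shrinks
unless logging from inside the driver is suppressed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model. None of these names exist in the kernel. */
struct sim {
	size_t pending;          /* messages waiting in the ring */
	size_t suppressed;       /* messages dropped by the guard */
	bool in_console_driver;  /* hypothetical "inside console driver" flag */
	bool guard_enabled;      /* whether sim_log() honours that flag */
	bool alloc_fails;        /* models memory pressure (OOM) */
};

/* Models printk(): queue one message unless the guard suppresses it. */
void sim_log(struct sim *s)
{
	if (s->guard_enabled && s->in_console_driver) {
		s->suppressed++;
		return;
	}
	s->pending++;
}

/*
 * Models a console driver such as netconsole emitting one message.
 * Under memory pressure its allocation fails and it logs a warning.
 */
void sim_console_write(struct sim *s)
{
	s->in_console_driver = true;
	if (s->alloc_fails)
		sim_log(s);
	s->in_console_driver = false;
}

/*
 * Models the flusher: emit messages until the ring is empty or
 * max_steps messages have been emitted. Returns the number emitted.
 */
size_t sim_flush(struct sim *s, size_t max_steps)
{
	size_t steps = 0;

	while (s->pending > 0 && steps < max_steps) {
		s->pending--;
		sim_console_write(s);
		steps++;
	}
	return steps;
}
```

In this model, starting with 5 pending messages and alloc_fails set,
sim_flush() without the guard uses up its whole step budget and still
leaves 5 messages pending. With guard_enabled set it drains the ring in
5 steps and counts 5 suppressed messages. A real implementation would
also have to decide what to do with the suppressed messages, which this
sketch does not address.
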
One possible solution that comes to mind is having a way to detect
that code is called directly from a printk console driver, and making
sure that error handling in those situations is _not_ done by pushing
more printk messages to that same driver.

The problem you describe seems to be _another_ issue of the current
printk implementation which Steven's approach does not address, but I
don't think that Steven's changes prevent further improvements on the
netconsole driver front.

I also don't see what's wrong with the incremental approach proposed
by Steven. Even though it does not fix your console driver problem,
his patchset appears to address some real-world latency issues.

Thanks,

Mathieu

>
> The system usually never recovers in time once this sort of condition
> hits and the following was the patch that I suggested which only punts
> when messages are already being punted and we can easily make it less
> punty by delaying the punting by N messages.
>
> http://lkml.kernel.org/r/20171102135258.GO3252168@xxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> We definitely can fix the above described case by e.g. preventing
> printk flushing task from queueing more messages or whatever, but it
> just seems really dumb for the system to die from things like this in
> general and it doesn't really take all that much to trigger the
> condition.
>
> Thanks.
>
> --
> tejun

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com