Hi,

On 06/09/2019 03:48, Ming Lei wrote:

[ ... ]

>> You did not share yet the analysis of the problem (the kernel warnings
>> give the symptoms) and gave the reasoning for the solution. It is hard
>> to understand what you are looking for exactly and how to connect the dots.
>
> Let me explain it one more time:
>
> When one IRQ flood happens on one CPU:
>
> 1) softirq handling on this CPU can't make progress
>
> 2) kernel thread bound to this CPU can't make progress
>
> For example, network may require softirq to xmit packets, or another irq
> thread for handling keyboards/mice or whatever, or rcu_sched may depend
> on that CPU for making progress, then the irq flood stalls the whole
> system.
>
>>
>> AFAIU, there are fast medium where the responses to requests are faster
>> than the time to process them, right?
>
> Usually medium may not be faster than CPU, now we are talking about
> interrupts, which can be originated from lots of devices concurrently,
> for example, in Long Li's test, there are 8 NVMe drives involved.
>
>>
>> I don't see how detecting IRQ flooding and use a threaded irq is the
>> solution, can you explain?
>
> When IRQ flood is detected, we reserve a little time for providing
> chance to make softirq/threads scheduled by scheduler, then the above
> problem can be avoided.
>
>>
>> If the responses are coming at a very high rate, whatever the solution
>> (interrupts, threaded interrupts, polling), we are still in the same
>> situation.
>
> When we move the interrupt handling into irq thread, other softirq/
> threaded interrupt/thread gets chance to be scheduled, so we can avoid
> to stall the whole system.

Ok, so the real problem is the tasks bound to a single CPU.

I share Thomas's opinion about a NAPI-like approach.

I do believe you should also rely on IRQ_TIME_ACCOUNTING (maybe get it
optimized) so that interrupt time contributes to the CPU load and
enforces task migration at load balance.
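To illustrate the NAPI-like idea mentioned above, here is a toy
user-space model. It is not kernel code and not the proposed patch: the
function name, the tick model and the budget parameter are all made up
for illustration. One tick is one unit of CPU time, spent either
handling one interrupt event or running other work (softirq, bound
threads).

```python
def simulate(ticks, arrivals_per_tick, budget=None):
    """Toy model of one CPU under an interrupt flood (hypothetical).

    budget=None : events always preempt other work (plain hard IRQ).
    budget=N    : after N events handled back to back, one tick is
                  yielded to other work (NAPI-like budgeted polling).
    Returns (events_handled, other_work_ticks, backlog).
    """
    pending = 0      # events waiting to be handled
    handled = 0      # events processed so far
    other_work = 0   # ticks given to softirq/threads on this CPU
    streak = 0       # events handled since the last yield

    for _ in range(ticks):
        pending += arrivals_per_tick
        if pending > 0 and (budget is None or streak < budget):
            pending -= 1
            handled += 1
            streak += 1
        else:
            other_work += 1
            streak = 0

    return handled, other_work, pending


if __name__ == "__main__":
    print(simulate(1000, 1))            # flood, no budget
    print(simulate(1000, 1, budget=9))  # flood, budget of 9
```

In this model the unbudgeted case gives no tick at all to other work,
while the budgeted case does, at the price of a growing backlog when
events keep arriving faster than they are processed.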
>> My suggestion was initially to see if the interrupt load will be taken
>> into account in the cpu load and favor task migration with the
>> scheduler load balance to a less loaded CPU, thus the CPU processing
>> interrupts will end up doing only that while other CPUs will handle the
>> "threaded" side.
>>
>> Besides that, I'm wondering if the block scheduler should be somehow
>> involved in that [1]
>
> For NVMe or any multi-queue storage, the default scheduler is 'none',
> which basically does nothing except for submitting IO asap.
>
>
> Thanks,
> Ming
>

--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs