Hi,

On Tue, Dec 31, 2019 at 11:48:06AM +0800, Ming Lei wrote:
> On Thu, Dec 19, 2019 at 11:43:47AM +0100, Daniel Wagner wrote:
> get_util_irq() only works in case of HAVE_SCHED_AVG_IRQ which depends
> on IRQ_TIME_ACCOUNTING or PARAVIRT_TIME_ACCOUNTING.
>
> Also rq->avg_irq.util_avg is only updated when there is scheduler
> activities. However, when interrupt flood happens, scheduler can't
> have chance to be called. Looks get_util_irq() can't be relied on
> for this task.

I am not totally sold on the idea of doing as much work as possible in
the IRQ context. I started to play with the patches from Keith [1],
which move the work to a proper kernel thread.

> > ps: A customer observes the same problem as Ming is reporting.
>
> Actually this issue should be more serious on ARM64 system, in which
> there are more CPU cores, and each CPU core is often slower than
> x86's, and each interrupt is only delivered to single CPU target.
>
> Meantime the storage device performance is same for the two kinds of
> systems.

As it turns out, we had missed one fix in our enterprise kernel,
2887e41b910b ("blk-wbt: Avoid lock contention and thundering herd
issue in wbt_wait"). It helps but doesn't address the root cause. As I
said, moving the work out of the IRQ context would address all those
problems. Obviously there is no free lunch, so let's see if we can find
a way to address all the performance issues.

Thanks,
Daniel

[1] https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@xxxxxxxxxx/