Re: [RFC PATCH 2/3] softirq: implement interrupt flood detection

Daniel Wagner <dwagner@xxxxxxx> · Thu, 2 Jan 2020 11:28:07 +0100

Hi,

On Tue, Dec 31, 2019 at 11:48:06AM +0800, Ming Lei wrote:
> On Thu, Dec 19, 2019 at 11:43:47AM +0100, Daniel Wagner wrote:
> get_util_irq() only works in case of HAVE_SCHED_AVG_IRQ which depends
> on IRQ_TIME_ACCOUNTING or PARAVIRT_TIME_ACCOUNTING.
> 
> Also rq->avg_irq.util_avg is only updated when there is scheduler
> activities. However, when interrupt flood happens, scheduler can't
> have chance to be called. Looks get_util_irq() can't be relied on
> for this task.

I am not totally sold on the idea of doing as much work as possible in
IRQ context. I started to play with the patches from Keith [1], which
move the work to a proper kernel thread.

> > ps: A customer observes the same problem as Ming is reporting.
> 
> Actually this issue should be more serious on ARM64 system, in which
> there are more CPU cores, and each CPU core is often slower than
> x86's, and each interrupt is only delivered to single CPU target.
> 
> Meantime the storage device performance is same for the two kinds of
> systems.

As it turns out, we missed one fix, 2887e41b910b ("blk-wbt: Avoid lock
contention and thundering herd issue in wbt_wait"), in our enterprise
kernel. It helps but doesn't address the root cause. But as I said,
moving the work out of the IRQ context will address all those
problems. Obviously there is no free lunch; let's see if we can find a
way to address all the performance issues.

Thanks,
Daniel

[1] https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@xxxxxxxxxx/