On Tue, Aug 20, 2019 at 10:33:38AM -0700, Sagi Grimberg wrote:
>
> > From: Long Li <longli@xxxxxxxxxxxxx>
> >
> > When a NVMe hardware queue is mapped to several CPU queues, it is possible
> > that the CPU this hardware queue is bound to is flooded by returning I/O for
> > other CPUs.
> >
> > For example, consider the following scenario:
> > 1. CPU 0, 1, 2 and 3 share the same hardware queue
> > 2. the hardware queue interrupts CPU 0 for I/O response
> > 3. processes from CPU 1, 2 and 3 keep sending I/Os
> >
> > CPU 0 may be flooded with interrupts from NVMe device that are I/O responses
> > for CPU 1, 2 and 3. Under heavy I/O load, it is possible that CPU 0 spends
> > all the time serving NVMe and other system interrupts, but doesn't have a
> > chance to run in process context.
> >
> > To fix this, CPU 0 can schedule a work to complete the I/O request when it
> > detects the scheduler is not making progress. This serves multiple purposes:
> >
> > 1. This CPU has to be scheduled to complete the request. The other CPUs can't
> > issue more I/Os until some previous I/Os are completed. This helps this CPU
> > get out of NVMe interrupts.
> >
> > 2. This acts a throttling mechanisum for NVMe devices, in that it can not
> > starve a CPU while servicing I/Os from other CPUs.
> >
> > 3. This CPU can make progress on RCU and other work items on its queue.
>
> The problem is indeed real, but this is the wrong approach in my mind.
>
> We already have irqpoll which takes care proper budgeting polling
> cycles and not hogging the cpu.

The issue isn't unique to NVMe. It can happen with any fast device that
interrupts a CPU too frequently: if the interrupt/softirq handler also
takes a bit too long, the CPU can easily be locked up by the handler,
especially when multiple submission CPUs share a single completion CPU.
Some SCSI devices have the same problem.

Could we consider adding one generic mechanism to cover this kind of
problem?

One approach I thought of is to allocate a backup thread for handling
such an interrupt, which drivers can request by marking it
IRQF_BACKUP_THREAD. Inside do_IRQ(), irqtime is accounted, so before
calling action->handler() we can check whether this CPU has spent too
long handling IRQs (interrupt or softirq) and is at risk of locking up.
If so, wake up the backup thread to handle the interrupt, to avoid
locking up this CPU.

The threaded interrupt framework is there, so this could be easier to
implement. Meanwhile, most of the time the handler still runs in
interrupt context, so we may avoid the performance loss when the CPU
isn't busy.

Any comments on this approach?

Thanks,
Ming
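
To make the idea above a bit more concrete, here is a rough userspace
model of the check that would sit in front of action->handler(). It is
only a sketch under assumptions, not kernel code: IRQF_BACKUP_THREAD is
the flag proposed above and does not exist today, and the struct names,
dispatch_irq(), simulate(), the 1 ms window and the 80% budget are all
made up for illustration. Waking the backup thread is modelled as a
counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag from the proposal; not an existing kernel flag. */
#define IRQF_BACKUP_THREAD 0x1u

/* Assumed tuning values, for illustration only. */
#define WINDOW_NS     1000000ULL /* 1 ms accounting window */
#define IRQ_BUDGET_NS  800000ULL /* max irq time per window (80%) */

/* Per-CPU irq time accounting (hypothetical layout). */
struct cpu_irq_stat {
	uint64_t window_start_ns;
	uint64_t irq_ns; /* irq + softirq time in the current window */
};

/* Stand-in for an irq action; counters replace real handlers. */
struct irq_action {
	unsigned int flags;
	unsigned int handled_inline;    /* ran in interrupt context */
	unsigned int handled_in_thread; /* deferred to backup thread */
};

/* Start a fresh accounting window once the current one has expired. */
static void irq_stat_roll(struct cpu_irq_stat *st, uint64_t now_ns)
{
	if (now_ns - st->window_start_ns >= WINDOW_NS) {
		st->window_start_ns = now_ns;
		st->irq_ns = 0;
	}
}

/* True when this CPU has used up its irq budget for the window. */
static bool cpu_irq_overloaded(const struct cpu_irq_stat *st)
{
	return st->irq_ns >= IRQ_BUDGET_NS;
}

/*
 * Models the check before action->handler(). cost_ns is how long the
 * handler would run. Returns true if handled in interrupt context,
 * false if handed to the backup thread.
 */
static bool dispatch_irq(struct cpu_irq_stat *st, struct irq_action *act,
			 uint64_t now_ns, uint64_t cost_ns)
{
	assert(st && act);
	irq_stat_roll(st, now_ns);

	if ((act->flags & IRQF_BACKUP_THREAD) && cpu_irq_overloaded(st)) {
		act->handled_in_thread++; /* would wake the backup thread */
		return false;
	}

	st->irq_ns += cost_ns; /* charge handler time to this CPU */
	act->handled_inline++;
	return true;
}

/* Feed n interrupts, one every gap_ns, each costing cost_ns. */
static void simulate(struct cpu_irq_stat *st, struct irq_action *act,
		     unsigned int n, uint64_t gap_ns, uint64_t cost_ns)
{
	uint64_t now_ns = 0;

	for (unsigned int i = 0; i < n; i++) {
		dispatch_irq(st, act, now_ns, cost_ns);
		now_ns += gap_ns;
	}
}
```

With these assumed numbers, a flood of 100 interrupts arriving every
10 us and costing 9 us each runs inline until the budget is used up and
then goes to the thread, while the same handler at one interrupt per
100 us never leaves interrupt context. That is the property I care
about: no extra cost unless the CPU is really being hogged.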