Re: [RFC PATCH] nvme: always return IRQ_HANDLED

Jens Axboe <axboe@xxxxxxxxx> · Thu, 17 Aug 2017 14:17:08 -0600

On 08/17/2017 02:15 PM, Keith Busch wrote:
> On Thu, Aug 17, 2017 at 01:32:20PM -0600, Jens Axboe wrote:
>> We currently have an issue with nvme when polling is used. Just
>> ran some testing on 4.13-rc5, and it's trivial to trigger an IRQ
>> disable ala:
>>
>> [   52.412851] irq 77: nobody cared (try booting with the "irqpoll" option)
>> [   52.415310] irq 70: nobody cared (try booting with the "irqpoll" option)
>>
>> when running a few processes polling. The reason is pretty obvious - if
>> we're effective at polling, the triggered IRQ will never find any
>> events. If this happens enough times in a row, the kernel disables our
>> IRQ since we keep returning IRQ_NONE.
> 
> If you're seeing IRQ_NONE returned, the NVMe driver didn't poll any
> completions since the last time nvme_irq was called. The cqe_seen on
> polled compeletions is sticky until the IRQ handler is run, in which
> case it returns IRQ_HANDLED even when no completions were handled during
> that interrupt.
> 
> The only way it should be able to return IRQ_NONE is if no completions
> were observed (polled or otherwise) since the last time the IRQ handler
> was called.

The polling do not update the cqe_seen. So it's possible that every time
the IRQ handler does trigger, there are no entries found. Maybe a better
or simpler fix would be to have the polling set cqe_seen to true, and
leave the clearing to the interrupt handler as is done now.

>> Question is, what's the best way to solve this. Ideally we should not be
>> triggering an IRQ at all, but that's still not in mainline. Can we
>> safely just return IRQ_HANDLED always? That should work except for the
>> case where we happen to run into an IRQ flood where DO want to turn off
>> the nvme irq. For now, I think that's small price to pay, since the
>> current issue is much worse and leaves the device in a weird non-working
>> state where some queue interrupts are turned off.
> 
> My recommended way to get this handled is to enable interrupt coalescing
> and have controllers behave as the specification describes to suppress
> interrupts when polling is active. From section 5.21.1.8:
> 
>   Specifically, if the Completion Queue Head Doorbell register is being
>   updated that is associated with a particular interrupt vector, then
>   the controller has a positive indication that completion queue entries
>   are already being processed. In this case, the aggregation time and/or
>   the aggregation threshold may be reset/restarted upon the associated
>   register write. This may result in interrupts being delayed indefinitely
>   in certain workloads where the aggregation time or aggregation threshold
>   is non-zero.

That would indeed be much better, and not require is to fiddle with IRQ
less completion queues for more efficient polling. Of the test devices that
I do use, not all of them support coalescing, and of the ones that do, some
of them implement it in a pretty poor fashion.

-- 
Jens Axboe