On Wed, May 26, 2021 at 11:34 AM Willem de Bruijn <willemdebruijn.kernel@xxxxxxxxx> wrote:
>
> On Wed, May 26, 2021 at 4:24 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> >
> >
> > With the implementation of napi-tx in virtio driver, we clean tx
> > descriptors from rx napi handler, for the purpose of reducing tx
> > complete interrupts. But this introduces a race where tx complete
> > interrupt has been raised, but the handler finds there is no work to do
> > because we have done the work in the previous rx interrupt handler.
> > A similar issue exists with polling from start_xmit, it is however
> > less common because of the delayed cb optimization of the split ring -
> > but will likely affect the packed ring once that is more common.
> >
> > In particular, this was reported to lead to the following warning msg:
> > [ 3588.010778] irq 38: nobody cared (try booting with the
> > "irqpoll" option)
> > [ 3588.017938] CPU: 4 PID: 0 Comm: swapper/4 Not tainted
> > 5.3.0-19-generic #20~18.04.2-Ubuntu
> > [ 3588.017940] Call Trace:
> > [ 3588.017942] <IRQ>
> > [ 3588.017951] dump_stack+0x63/0x85
> > [ 3588.017953] __report_bad_irq+0x35/0xc0
> > [ 3588.017955] note_interrupt+0x24b/0x2a0
> > [ 3588.017956] handle_irq_event_percpu+0x54/0x80
> > [ 3588.017957] handle_irq_event+0x3b/0x60
> > [ 3588.017958] handle_edge_irq+0x83/0x1a0
> > [ 3588.017961] handle_irq+0x20/0x30
> > [ 3588.017964] do_IRQ+0x50/0xe0
> > [ 3588.017966] common_interrupt+0xf/0xf
> > [ 3588.017966] </IRQ>
> > [ 3588.017989] handlers:
> > [ 3588.020374] [<000000001b9f1da8>] vring_interrupt
> > [ 3588.025099] Disabling IRQ #38
> >
> > This patchset attempts to fix this by cleaning up a bunch of races
> > related to the handling of sq callbacks (aka tx interrupts).
> > Somewhat tested but I couldn't reproduce the original issues
> > reported, sending out for help with testing.
> >
> > Wei, does this address the spurious interrupt issue you are
> > observing? Could you confirm please?
>
> Thanks for working on this, Michael. Wei is on leave. I'll try to reproduce.

The original report was generated with five GCE virtual machines
sharing a sole-tenant node, together sending up to 160 netperf
tcp_stream connections to 16 other instances. The instances ran
Ubuntu 20.04-LTS with kernel 5.4.0-1034-gcp.

But the issue can also be reproduced with just two n2-standard-16
instances, running neper tcp_stream with high parallelism
(-T 16 -F 240).

It's a bit faster to trigger by reducing the interrupt count
threshold from 99.9K/100K to 9.9K/10K. I also added logging that
reports the unhandled rate even when it stays below the threshold.

The unhandled interrupt rate scales with the number of queue pairs
(`ethtool -L $DEV combined $NUM`). It is essentially absent at 8
queues and around 90% at 14 queues. By default these GCE instances
have one rx and one tx interrupt per core, so 16 each, with the rx
and tx interrupts for a given virtio-queue pinned to the same core.

Unfortunately, commit 3/4 did not have a significant impact on these
numbers. I have to think a bit more about possible mitigations. At
least I'll be able to test this more easily now.
_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
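For readers unfamiliar with the 99.9K/100K threshold mentioned above, here is a rough Python model of the accounting done in the kernel's note_interrupt() path, as far as I understand it. It is a simplified sketch, not the kernel code: the class name, field names, and the feed() helper are made up for illustration, and it ignores the kernel's time-based reset of the unhandled counter. The last_window_unhandled field stands in for the extra logging described above, so rates below the limit are still visible.

```python
from dataclasses import dataclass


@dataclass
class IrqStats:
    """Simplified, hypothetical model of per-IRQ spurious accounting."""
    window: int = 100_000   # interrupts per evaluation window
    limit: int = 99_900     # unhandled count above which the IRQ is disabled
    irq_count: int = 0
    irqs_unhandled: int = 0
    last_window_unhandled: int = 0  # stand-in for the added logging
    disabled: bool = False

    def note_interrupt(self, handled: bool) -> None:
        if self.disabled:
            return
        if not handled:
            self.irqs_unhandled += 1
        self.irq_count += 1
        if self.irq_count < self.window:
            return
        # End of window: record the count, then decide whether to disable.
        self.last_window_unhandled = self.irqs_unhandled
        if self.irqs_unhandled > self.limit:
            self.disabled = True  # corresponds to "Disabling IRQ #N"
        self.irq_count = 0
        self.irqs_unhandled = 0


def feed(stats: IrqStats, n: int, handled_every: int) -> None:
    """Deliver n interrupts; every handled_every-th one is handled (0 = none)."""
    for i in range(n):
        handled = handled_every > 0 and i % handled_every == 0
        stats.note_interrupt(handled)
```

With the reduced 9.9K/10K setting, a queue whose interrupts are 90% unhandled never trips the limit in this model, which is why reporting the per-window count separately is useful when measuring the effect of a patch.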