On Thu, 2024-09-26 at 11:43 +0200, Marc Kleine-Budde wrote:
> On 26.09.2024 11:19:53, Matthias Schiffer wrote:
> > On Tue, 2024-09-24 at 08:08 +0200, Markus Schneider-Pargmann wrote:
> > >
> > > On Mon, Sep 23, 2024 at 05:32:16PM GMT, Matthias Schiffer wrote:
> > > > The interrupt line of PCI devices is interpreted as edge-triggered,
> > > > however the interrupt signal of the m_can controller integrated in Intel
> > > > Elkhart Lake CPUs appears to be generated level-triggered.
> > > >
> > > > Consider the following sequence of events:
> > > >
> > > > - IR register is read, interrupt X is set
> > > > - A new interrupt Y is triggered in the m_can controller
> > > > - IR register is written to acknowledge interrupt X. Y remains set in IR
> > > >
> > > > As at no point in this sequence no interrupt flag is set in IR, the
> > > > m_can interrupt line will never become deasserted, and no edge will ever
> > > > be observed to trigger another run of the ISR. This was observed to
> > > > result in the TX queue of the EHL m_can to get stuck under high load,
> > > > because frames were queued to the hardware in m_can_start_xmit(), but
> > > > m_can_finish_tx() was never run to account for their successful
> > > > transmission.
> > > >
> > > > To fix the issue, repeatedly read and acknowledge interrupts at the
> > > > start of the ISR until no interrupt flags are set, so the next incoming
> > > > interrupt will also result in an edge on the interrupt line.
> > > >
> > > > Fixes: cab7ffc0324f ("can: m_can: add PCI glue driver for Intel Elkhart Lake")
> > > > Signed-off-by: Matthias Schiffer <matthias.schiffer@xxxxxxxxxxxxxxx>
> > >
> > > Just a few comment nitpicks below. Otherwise:
> > >
> > > Reviewed-by: Markus Schneider-Pargmann <msp@xxxxxxxxxxxx>
> > >
> >
> > We have received a report that while this patch fixes a stuck queue issue reproducible with cangen,
> > the problem has not disappeared with our customer's application. I will hold off sending a new
> > version of the patch while we're investigating whether there is a separate issue with the same
> > symptoms or the patch is insufficient.
> >
> > Patch 1/2 should be good to go and could be applied independently.
>
> Can you post the reproducer here, too. So that we can add it to the
> patch or at least reference to it.
>
> regards,
> Marc

Something like the following results in a stuck queue after a few minutes
without this patch, but ran without issue for 2.5h with the patch (with can0
and can1 of the Elkhart Lake connected to each other):

---
ip link set can0 up type can bitrate 1000000
ip link set can1 up type can bitrate 1000000

cangen can1 -g 2 -I 100 -L 8 &
cangen can1 -g 2 -I 101 -L 8 &
cangen can1 -g 2 -I 102 -L 8 &
cangen can1 -g 2 -I 103 -L 8 &
cangen can1 -g 2 -I 104 -L 8 &
cangen can1 -g 2 -I 105 -L 8 &
cangen can1 -g 2 -I 106 -L 8 &
cangen can1 -g 2 -I 107 -L 8 &

cangen can0 -g 2 -I 000 -L 8 &
cangen can0 -g 2 -I 001 -L 8 &
cangen can0 -g 2 -I 002 -L 8 &
cangen can0 -g 2 -I 003 -L 8 &
cangen can0 -g 2 -I 004 -L 8 &
cangen can0 -g 2 -I 005 -L 8 &
cangen can0 -g 2 -I 006 -L 8 &
cangen can0 -g 2 -I 007 -L 8 &

stress-ng --matrix 0 &
---

I will add the reproducer to the commit message in v4. I'm also not sure
whether stress-ng actually has any effect; I'll verify that before the next
version of the patch.

Matthias

-- 
TQ-Systems GmbH | Mühlstraße 2, Gut Delling | 82229 Seefeld, Germany
Amtsgericht München, HRB 105018
Geschäftsführer: Detlef Schneider, Rüdiger Stahl, Stefan Schneider
https://www.tq-group.com/
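
For readers following the thread, the race described in the quoted commit
message can be illustrated with a small user-space model. This is only a
sketch under stated assumptions, not the driver code: struct irq_model, the
FLAG_* values and all function names are hypothetical, and the model covers
only the X/Y sequence from the commit message. It does not say anything about
the still-open customer report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values, not the m_can IR bit layout. */
#define FLAG_X 0x1u
#define FLAG_Y 0x2u
#define FLAG_Z 0x4u

struct irq_model {
	uint32_t ir;     /* pending interrupt flags */
	bool line;       /* level-triggered line: asserted while ir != 0 */
	unsigned edges;  /* rising edges an edge-triggered receiver would see */
};

/* Recompute the line level and count a rising edge if one occurred. */
static void update_line(struct irq_model *m)
{
	bool level = (m->ir != 0);

	if (level && !m->line)
		m->edges++;
	m->line = level;
	assert(m->line == (m->ir != 0));
}

/* The controller sets a new flag. */
static void raise_flag(struct irq_model *m, uint32_t flag)
{
	m->ir |= flag;
	update_line(m);
}

/* The driver acknowledges (clears) the flags it has read. */
static void ack_flags(struct irq_model *m, uint32_t flags)
{
	m->ir &= ~flags;
	update_line(m);
}

/*
 * One ISR run. `racing` is raised between the first read of IR and its
 * acknowledgement, as in the sequence from the commit message. With
 * drain == false the ISR reads and acks once; with drain == true it
 * repeats until IR reads back as 0. Returns the flags that were seen.
 */
static uint32_t isr(struct irq_model *m, uint32_t racing, bool drain)
{
	uint32_t handled = 0;
	uint32_t ir;

	while ((ir = m->ir) != 0) {
		raise_flag(m, racing);  /* no effect once racing == 0 */
		racing = 0;
		ack_flags(m, ir);
		handled |= ir;
		if (!drain)
			break;
	}
	return handled;
}
```

Starting from an empty model with FLAG_X raised, isr(&m, FLAG_Y, false)
returns only FLAG_X and leaves the line asserted, so a later FLAG_Z produces
no new edge. With drain set to true the same call returns FLAG_X | FLAG_Y,
the line drops, and FLAG_Z is seen as a second edge.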