Re: [PATCH v3 2/2] can: m_can: fix missed interrupts with m_can_pci

On Thu, 2024-09-26 at 11:43 +0200, Marc Kleine-Budde wrote:
> On 26.09.2024 11:19:53, Matthias Schiffer wrote:
> > On Tue, 2024-09-24 at 08:08 +0200, Markus Schneider-Pargmann wrote:
> > > 
> > > On Mon, Sep 23, 2024 at 05:32:16PM GMT, Matthias Schiffer wrote:
> > > > The interrupt line of PCI devices is interpreted as edge-triggered,
> > > > however the interrupt signal of the m_can controller integrated in Intel
> > > > Elkhart Lake CPUs appears to be generated level-triggered.
> > > > 
> > > > Consider the following sequence of events:
> > > > 
> > > > - IR register is read, interrupt X is set
> > > > - A new interrupt Y is triggered in the m_can controller
> > > > - IR register is written to acknowledge interrupt X. Y remains set in IR
> > > > 
> > > > As at no point in this sequence are all interrupt flags in IR clear, the
> > > > m_can interrupt line will never become deasserted, and no edge will ever
> > > > be observed to trigger another run of the ISR. This was observed to
> > > > result in the TX queue of the EHL m_can to get stuck under high load,
> > > > because frames were queued to the hardware in m_can_start_xmit(), but
> > > > m_can_finish_tx() was never run to account for their successful
> > > > transmission.
> > > > 
> > > > To fix the issue, repeatedly read and acknowledge interrupts at the
> > > > start of the ISR until no interrupt flags are set, so the next incoming
> > > > interrupt will also result in an edge on the interrupt line.
> > > > 
> > > > Fixes: cab7ffc0324f ("can: m_can: add PCI glue driver for Intel Elkhart Lake")
> > > > Signed-off-by: Matthias Schiffer <matthias.schiffer@xxxxxxxxxxxxxxx>
> > > 
> > > Just a few comment nitpicks below. Otherwise:
> > > 
> > > Reviewed-by: Markus Schneider-Pargmann <msp@xxxxxxxxxxxx>
> > 
> > 
> > We have received a report that while this patch fixes a stuck queue issue reproducible with cangen,
> > the problem has not disappeared with our customer's application. I will hold off sending a new
> > version of the patch while we're investigating whether there is a separate issue with the same
> > symptoms or the patch is insufficient.
> > 
> > Patch 1/2 should be good to go and could be applied independently.
> 
> Can you post the reproducer here, too, so that we can add it to the
> patch or at least reference it?
> 
> regards,
> Marc

Something like the following results in a stuck queue after a few minutes without this patch, and
ran without issue for 2.5h with the patch (with can0 and can1 of the Elkhart Lake connected to each
other):

---
ip link set can0 up type can bitrate 1000000
ip link set can1 up type can bitrate 1000000

cangen can1 -g 2 -I 100 -L 8 &
cangen can1 -g 2 -I 101 -L 8 &
cangen can1 -g 2 -I 102 -L 8 &
cangen can1 -g 2 -I 103 -L 8 &
cangen can1 -g 2 -I 104 -L 8 &
cangen can1 -g 2 -I 105 -L 8 &
cangen can1 -g 2 -I 106 -L 8 &
cangen can1 -g 2 -I 107 -L 8 &

cangen can0 -g 2 -I 000 -L 8 &
cangen can0 -g 2 -I 001 -L 8 &
cangen can0 -g 2 -I 002 -L 8 &
cangen can0 -g 2 -I 003 -L 8 &
cangen can0 -g 2 -I 004 -L 8 &
cangen can0 -g 2 -I 005 -L 8 &
cangen can0 -g 2 -I 006 -L 8 &
cangen can0 -g 2 -I 007 -L 8 &

stress-ng --matrix 0 &
---

I will add the reproducer to the commit message in v4. I'm also not sure whether stress-ng actually
has any effect; I'll verify that before the next version of the patch.
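
To notice the stuck queue while the reproducer is running, something like the following could be
used. It is only a minimal sketch and not part of the patch: the helper names tx_packets and
tx_advancing and the SYSFS_NET override are made up for illustration, and it assumes that the
tx_packets counter in sysfs stops advancing once TX completions are no longer processed. A counter
that stands still is only a hint, as it would also stop if the cangen processes had exited.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the patch): report whether a network
# interface still completes transmissions, based on its sysfs TX counter.

# Base directory of the per-interface sysfs entries. Overridable so the
# helpers can be tried against a fake directory tree.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

# Print the current tx_packets counter of interface $1.
tx_packets() {
    cat "$SYSFS_NET/$1/statistics/tx_packets"
}

# Return 0 if tx_packets of interface $1 changed within $2 seconds,
# 1 if it stayed the same (possible stall), 2 if it could not be read.
tx_advancing() {
    before=$(tx_packets "$1") || return 2
    sleep "$2"
    after=$(tx_packets "$1") || return 2
    [ "$after" -ne "$before" ]
}

# Example use, run alongside the reproducer:
#   while tx_advancing can0 10; do :; done; echo "can0 TX no longer advancing"
```

The 10 second window in the example is an arbitrary choice; with frames generated every 2 ms, any
window of a few seconds should be long enough to tell a stalled counter from a busy one.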

Matthias



-- 
TQ-Systems GmbH | Mühlstraße 2, Gut Delling | 82229 Seefeld, Germany
Amtsgericht München, HRB 105018
Geschäftsführer: Detlef Schneider, Rüdiger Stahl, Stefan Schneider
https://www.tq-group.com/




