On Tue, 2024-07-02 at 07:37 +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>
> On 01.07.24 16:34, Markus Schneider-Pargmann wrote:
> > On Mon, Jul 01, 2024 at 02:12:55PM GMT, Linux regression tracking (Thorsten Leemhuis) wrote:
> > > [CCing the regression list, as it should be in the loop for regressions:
> > > https://docs.kernel.org/admin-guide/reporting-regressions.html]
> > >
> > > Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> > > for once, to make this easily accessible to everyone.
> > >
> > > Hmm, looks like there was not even a single reply to the regression
> > > report below. But it also seems Markus hasn't posted anything archived
> > > on Lore for about three weeks now, so he might be on vacation.
> > >
> > > Marc, do you have an idea what's wrong with the culprit? Or do we
> > > expect Markus to be back in action soon?
> >
> > Great, ping here.
>
> Thx for replying!
>
> > @Matthias: Thanks for debugging and sorry for breaking it. If you have a
> > fix for this, let me know. I have a lot of work right now, so I am not
> > sure when I will have a proper fix ready. But it is on my todo list.
>
> Thx. This made me wonder: is "revert the culprit to resolve this quickly
> and reapply it later together with a fix" something that we should
> consider if a proper fix takes some time? Or is this not worth it in
> this case or extremely hard? Or would it cause a regression on its own
> for users of 6.9?
>
> Ciao, Thorsten

Hi,

I think on 6.9 a revert is not easily possible (without reverting several other
commits adding new features), but it should be considered for 6.6.

I don't think further regressions are possible by reverting, as on 6.6 the timer
is only used for platforms without an m_can IRQ, and on these platforms the
current behavior is "the kernel reproducibly deadlocks in atomic context", so
there is not much room for making it worse.

Like Markus, I have writing a proper fix for this on my TODO list, but I'm not
sure when I can get to it - hopefully next week.

Best regards,
Matthias

> > > > On 18.06.24 18:12, Matthias Schiffer wrote:
> > > > Hi Markus,
> > > >
> > > > we've found that recent kernels hang on the TI AM62x SoC (where no m_can interrupt is available and
> > > > thus the polling timer is used), always a few seconds after the CAN interfaces are set up.
> > > >
> > > > I have bisected the issue to commit a163c5761019b ("can: m_can: Start/Cancel polling timer together
> > > > with interrupts"). Both master and 6.6 stable (which received a backport of the commit) are
> > > > affected. On 6.6 the commit is easy to revert, but on master a lot has happened on top of that
> > > > change.
> > > >
> > > > As far as I can tell, the reason is that hrtimer_cancel() tries to cancel the timer synchronously,
> > > > which will deadlock when called from the hrtimer callback itself (hrtimer_callback -> m_can_isr ->
> > > > m_can_disable_all_interrupts -> hrtimer_cancel).
> > > >
> > > > I can try to come up with a fix, but I think you are much more familiar with the driver code. Please
> > > > let me know if you need any more information.
> > > >
> > > > Best regards,
> > > > Matthias

-- 
TQ-Systems GmbH | Mühlstraße 2, Gut Delling | 82229 Seefeld, Germany
Amtsgericht München, HRB 105018
Geschäftsführer: Detlef Schneider, Rüdiger Stahl, Stefan Schneider
https://www.tq-group.com/
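
For readers following the thread, the deadlock described in the original report
(a synchronous cancel issued from inside the timer's own callback) can be
illustrated with a small userspace model. This is only a sketch and not the
m_can driver or the kernel's hrtimer code: struct toy_timer and all toy_*
functions are hypothetical names made up for this illustration, and the retry
loop is bounded by max_spins so that the demo terminates, whereas a real
synchronous cancel would keep waiting. The return values of toy_try_to_cancel()
are modelled on the documented ones of hrtimer_try_to_cancel() (0 = not active,
1 = was active, -1 = callback currently running).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a timer; NOT the kernel's struct hrtimer. */
struct toy_timer {
	bool queued;            /* armed, callback has not started yet */
	bool callback_running;  /* callback is executing right now */
	bool cancel_ok;         /* result of the cancel done in the callback */
	void (*fn)(struct toy_timer *t);
};

/* 0 = timer was not active, 1 = timer was active and is now cancelled,
 * -1 = callback is running, so the timer cannot be stopped right now. */
static int toy_try_to_cancel(struct toy_timer *t)
{
	if (t->callback_running)
		return -1;
	if (t->queued) {
		t->queued = false;
		return 1;
	}
	return 0;
}

/* Synchronous cancel: retry until the callback is no longer running.
 * max_spins is an artificial bound for this demo only.
 * Returns true on success, false if it gave up (i.e. would wait forever). */
static bool toy_cancel_sync(struct toy_timer *t, int max_spins)
{
	for (int i = 0; i < max_spins; i++) {
		if (toy_try_to_cancel(t) >= 0)
			return true;
	}
	return false;
}

/* Mirrors the shape of the reported chain: callback -> ... -> cancel. */
static void toy_callback(struct toy_timer *t)
{
	t->cancel_ok = toy_cancel_sync(t, 1000);
}

/* Simulates timer expiry: run the callback with "running" set. */
static void toy_fire(struct toy_timer *t)
{
	assert(t->fn != NULL);
	t->queued = false;
	t->callback_running = true;
	t->fn(t);
	t->callback_running = false;
}

/* Cancel issued from inside the callback: can never succeed, because the
 * cancel waits for the very callback it is running in. */
static bool toy_cancel_from_callback(void)
{
	struct toy_timer t = { .queued = true, .fn = toy_callback };

	toy_fire(&t);
	return t.cancel_ok;
}

/* Cancel issued from outside the callback: succeeds immediately. */
static bool toy_cancel_from_outside(void)
{
	struct toy_timer t = { .queued = true, .fn = toy_callback };

	return toy_cancel_sync(&t, 1000);
}
```

Called from a small main(), toy_cancel_from_callback() reports failure while
toy_cancel_from_outside() reports success. In this model, a non-waiting cancel
such as toy_try_to_cancel() returns -1 inside the callback instead of spinning;
whether that is the right approach for the actual driver fix is not something
this sketch can answer.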