On 02.07.24 12:03, Matthias Schiffer wrote:
> On Tue, 2024-07-02 at 07:37 +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>> On 01.07.24 16:34, Markus Schneider-Pargmann wrote:
>>> On Mon, Jul 01, 2024 at 02:12:55PM GMT, Linux regression tracking (Thorsten Leemhuis) wrote:
>
>>> @Matthias: Thanks for debugging and sorry for breaking it. If you have a
>>> fix for this, let me know. I have a lot of work right now, so I am not
>>> sure when I will have a proper fix ready. But it is on my todo list.
>>
>> Thx. This made me wonder: is "revert the culprit to resolve this quickly
>> and reapply it later together with a fix" something that we should
>> consider if a proper fix takes some time? Or is this not worth it in
>> this case or extremely hard? Or would it cause a regression on its own
>> for users of 6.9?
>
> I think on 6.9 a revert is not easily possible (without reverting several other commits adding new
> features), but it should be considered for 6.6.
>
> I don't think further regressions are possible by reverting, as on 6.6 the timer is only used for
> platforms without an m_can IRQ, and on these platforms the current behavior is "the kernel
> reproducibly deadlocks in atomic context", so there is not much room for making it worse.

Often Greg does not revert commits in stable branches when they cause the same problem in mainline. But I suspect this case is different. Still, I guess he would prefer to hear "please revert 887407b622f8e4 ("can: m_can: Start/Cancel polling timer together with interrupts")" coming from Markus, hence:

Markus, if you agree that a revert from 6.6.y might be best, could you simply ask for a revert in a reply to this mail while CCing Greg and the stable list? tia!

Ciao, Thorsten

> Like Markus, I have writing a proper fix for this on my TODO list, but I'm not sure when I can get
> to it, hopefully next week.
>
> Best regards,
> Matthias
>
>>
>>>> On 18.06.24 18:12, Matthias Schiffer wrote:
>>>>> Hi Markus,
>>>>>
>>>>> we've found that recent kernels hang on the TI AM62x SoC (where no m_can interrupt is available and
>>>>> thus the polling timer is used), always a few seconds after the CAN interfaces are set up.
>>>>>
>>>>> I have bisected the issue to commit a163c5761019b ("can: m_can: Start/Cancel polling timer together
>>>>> with interrupts"). Both master and 6.6 stable (which received a backport of the commit) are
>>>>> affected. On 6.6 the commit is easy to revert, but on master a lot has happened on top of that
>>>>> change.
>>>>>
>>>>> As far as I can tell, the reason is that hrtimer_cancel() tries to cancel the timer synchronously,
>>>>> which will deadlock when called from the hrtimer callback itself (hrtimer_callback -> m_can_isr ->
>>>>> m_can_disable_all_interrupts -> hrtimer_cancel).
>>>>>
>>>>> I can try to come up with a fix, but I think you are much more familiar with the driver code. Please
>>>>> let me know if you need any more information.
>>>>>
>>>>> Best regards,
>>>>> Matthias
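For readers less familiar with hrtimer semantics, the self-cancel deadlock described in the report above can be illustrated with a small userspace model. This is a minimal sketch under stated assumptions, not kernel code: `struct model_timer`, `model_try_to_cancel()`, `model_cancel_sync()`, `model_fire()` and the two callbacks are hypothetical names. The return codes follow the `hrtimer_try_to_cancel()` convention (0 = timer not active, 1 = active timer removed, -1 = callback currently running), and `model_cancel_sync()` keeps retrying while the callback runs, as `hrtimer_cancel()` does. Since a single-threaded model cannot usefully hang, it gives up after a bounded number of retries and returns a made-up `MODEL_DEADLOCK` code where the real kernel would wait forever.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a timer; NOT the kernel hrtimer code. */
struct model_timer {
	bool active;           /* armed and not yet fired */
	bool callback_running; /* callback is executing right now */
	int (*callback)(struct model_timer *t);
};

#define MODEL_SPIN_LIMIT 1000000L
#define MODEL_DEADLOCK   (-2) /* assumption: sentinel for "would hang" */

/* Mirrors the hrtimer_try_to_cancel() return convention:
 * 0 = not active, 1 = was active and removed, -1 = callback running. */
static int model_try_to_cancel(struct model_timer *t)
{
	if (t->callback_running)
		return -1;
	if (t->active) {
		t->active = false;
		return 1;
	}
	return 0;
}

/* Synchronous cancel in the style of hrtimer_cancel(): retry until the
 * callback has finished. If the caller IS the callback, callback_running
 * never clears; the real kernel would block forever, this model stops
 * after MODEL_SPIN_LIMIT retries and reports MODEL_DEADLOCK. */
static int model_cancel_sync(struct model_timer *t)
{
	for (long i = 0; i < MODEL_SPIN_LIMIT; i++) {
		int ret = model_try_to_cancel(t);
		if (ret >= 0)
			return ret;
	}
	return MODEL_DEADLOCK;
}

/* Fire the timer: mark the callback as running, call it, then clear. */
static int model_fire(struct model_timer *t)
{
	int ret;

	t->active = false;
	t->callback_running = true;
	ret = t->callback(t);
	t->callback_running = false;
	return ret;
}

/* Mirrors the reported chain: the polling callback ends up synchronously
 * cancelling the very timer that is currently executing it. */
static int isr_cancels_sync(struct model_timer *t)
{
	return model_cancel_sync(t);
}

/* Non-waiting variant: a try-to-cancel just reports -1 and returns. */
static int isr_tries_to_cancel(struct model_timer *t)
{
	return model_try_to_cancel(t);
}
```

In this model, firing a timer whose callback calls `model_cancel_sync()` on itself yields `MODEL_DEADLOCK`, a callback using `model_try_to_cancel()` gets -1 back immediately, and `model_cancel_sync()` called from outside the callback on an armed timer returns 1. It only illustrates why a synchronous cancel cannot be issued from the timer's own callback; it is not a proposal for the actual m_can fix.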