Hi All, On 5/31/24 12:22 PM, Hans de Goede wrote: > Hi, > > On 5/31/24 11:50 AM, Hans de Goede wrote: >> Hi, >> >> On 5/31/24 10:39 AM, Linux regression tracking (Thorsten Leemhuis) wrote: >>> [adding the LED folks and the regressions list to the list of recipients] >>> >>> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting >>> for once, to make this easily accessible to everyone. >>> >>> Lee, Pavel, could you look into below regression report please? Thread >>> starts here: >>> https://lore.kernel.org/all/9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@xxxxxxxxxxxx/ >>> >>> Another report with somewhat similar symptom can be found here: >>> https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@xxxxxxxxx/ >>> >>> See also Russell's analysis of that report below (many many thx for >>> that, much appreciated Russel!). >>> >>> To my untrained eyes all of this sounds a lot like we still have a 6.9 >>> regression related to the LED code somewhere. Reminder, we had earlier >>> trouble, but that was avoided through other measures: >>> >>> * 3d913719df14c2 ("wifi: iwlwifi: Use request_module_nowait") / >>> https://lore.kernel.org/lkml/30f757e3-73c5-5473-c1f8-328bab98fd7d@xxxxxxxxxxxxxxx/ >>> >>> * c04d1b9ecce565 ("igc: Fix LED-related deadlock on driver unbind") / >>> https://lore.kernel.org/all/ZhRD3cOtz5i-61PB@mail-itl/ >>> >>> * 19fa4f2a85d777 ("r8169: fix LED-related deadlock on module removal") >>> >>> That iwlwifi commit even calls it self "work around". The developer that >>> submitted it bisected the problem to a LED merge, but sadly that was the >>> end of it. :-/ >> >> I actually have been looking at a ledtrig-netdev lockdep warning yesterday >> which I believe is the same thing. I'll include the lockdep trace below. >> >> According to lockdep there indeed is a ABBA (ish) cyclic deadlock with >> the rtnl mutex vs led-triggers related locks. I believe that this problem >> may be a pre-existing problem but this now actually gets hit in kernels >= >> 6.9 because of commit 66601a29bb23 ("leds: class: If no default trigger is >> given, make hw_control trigger the default trigger"). Before that commit >> the "netdev" trigger would not be bound / set as phy LEDs trigger by default. >> >> +Cc Heiner Kallweit who authored that commit. >> >> The netdev trigger typically is not needed because the PHY LEDs are typically >> under hw-control and the netdev trigger even tries to leave things that way >> so setting it as the active trigger for the LED class device is basically >> a no-op. I guess the goal of that commit is correctly have the triggers >> file content reflect that the LED is controlled by a netdev and to allow >> changing the hw-control mode without the user first needing to set netdev >> as trigger before being able to change the mode. >> >> But there is a price to this, besides the locking problem this also >> causes the ledtrig-netdev module to load on pretty much everyones >> systems (when build as a module) even though 99.999% of our users >> likely does not need this at all... >> >> Given this price and the troubles this is causing I think it might be best >> to revert 66601a29bb23. There might still be a locking issue when setting >> the trigger to netdev manually (I'll check and follow up) but this should >> fix the regression users are hitting since typically users do not set >> the trigger manually. > > Ok, I can confirm that the lockdep warning is gone for me with 66601a29bb23 > reverted. Unfortunately it does still happen after a "modprobe ledtrig_netdev" > (to add it to the list of available triggers) and then setting the trigger > for /sys/class/leds/enp42s0-0::lan to netdev manually. > > Still reverting 66601a29bb23 should avoid the problem getting triggered > and this would seem like a safe fix especially for the 6.9 series and > then the necessary time can be taken to fix the actual underlying locking > issue which 66601a29bb23 exposes. I recently wrote a new input-events LED trigger: https://lore.kernel.org/linux-leds/20240531135910.168965-2-hdegoede@xxxxxxxxxx/ and I just found out this has a very similar deadlock problem. It seems there it a generic problem with LEDs or LED triggers getting registered by subsystems while holding a global lock from that subsystem vs the activate / deactivate method of the trigger calling functions of that same subsystem which require that same global lock. I came up with a fix but that just fixed the activate() path leaving a similar deadlock in place in the deactivate path: https://lore.kernel.org/linux-leds/20240601195528.48308-1-hdegoede@xxxxxxxxxx/ https://lore.kernel.org/linux-leds/20240601195528.48308-3-hdegoede@xxxxxxxxxx/ So this seems to be a non trivial problem to solve. For the new input-events trigger I plan to solve this by switching to a single input_handler which will be registered at module_init() time instead of having a handler per LED and registering that handler at activate() time. But I have no idea how to solve this for the netdev trigger I just wanted to share my experience with the input-events trigger in case it is useful. Regards, Hans