Hi Stefan,

On Wed, Dec 21, 2022 at 10:31:09PM +0100, Stefan Wahren wrote:
> Hi Matthias,
>
> On 21.12.22 at 20:02, Matthias Kaehlcke wrote:
> > Hi Stefan,
> >
> > On Wed, Dec 21, 2022 at 07:00:41PM +0100, Stefan Wahren wrote:
> > > I will try to play with lock debugging.
> >
> > Thanks, hopefully that can provide some hint.
>
> DETECT_HUNG_TASK reveals this in the error case:
>
> [  243.676253] INFO: task kworker/2:1:44 blocked for more than 122 seconds.
> [  243.676284]       Not tainted 6.1.0-00007-g22fada783b9f #32
> [  243.676294] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  243.676303] task:kworker/2:1     state:D stack:0     pid:44    ppid:2     flags:0x00000000
> [  243.676329] Workqueue: events onboard_hub_attach_usb_driver [onboard_usb_hub]
> [  243.676388]  __schedule from schedule+0x58/0xf8
> [  243.676419]  schedule from schedule_preempt_disabled+0x1c/0x2c
> [  243.676445]  schedule_preempt_disabled from __mutex_lock.constprop.0+0x29c/0x948
> [  243.676474]  __mutex_lock.constprop.0 from __driver_attach+0x74/0x188
> [  243.676503]  __driver_attach from bus_for_each_dev+0x70/0xb0
> [  243.676532]  bus_for_each_dev from onboard_hub_attach_usb_driver+0xc/0x28 [onboard_usb_hub]
> [  243.676587]  onboard_hub_attach_usb_driver [onboard_usb_hub] from process_one_work+0x1f8/0x520
> [  243.676637]  process_one_work from worker_thread+0x40/0x55c
> [  243.676663]  worker_thread from kthread+0xf0/0x110
> [  243.676685]  kthread from ret_from_fork+0x14/0x2c
> [  243.676705] Exception stack(0xf091dfb0 to 0xf091dff8)
> [  243.676720] dfa0:                                     00000000 00000000 00000000 00000000
> [  243.676737] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> [  243.676752] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
> [  243.676788] INFO: task systemd-udevd:148 blocked for more than 122 seconds.
> [  243.676800]       Not tainted 6.1.0-00007-g22fada783b9f #32
> [  243.676809] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  243.676817] task:systemd-udevd   state:D stack:0     pid:148   ppid:144   flags:0x00000081
> [  243.676839]  __schedule from schedule+0x58/0xf8
> [  243.676864]  schedule from schedule_timeout+0xb4/0x15c
> [  243.676893]  schedule_timeout from __wait_for_common+0xc4/0x228
> [  243.676922]  __wait_for_common from __flush_work+0x1a8/0x360
> [  243.676949]  __flush_work from __cancel_work_timer+0x10c/0x1e4
> [  243.676975]  __cancel_work_timer from onboard_hub_remove+0x28/0xbc [onboard_usb_hub]
> [  243.677021]  onboard_hub_remove [onboard_usb_hub] from platform_remove+0x20/0x4c
> [  243.677067]  platform_remove from device_release_driver_internal+0x194/0x21c
> [  243.677092]  device_release_driver_internal from bus_remove_device+0xcc/0xf8
> [  243.677124]  bus_remove_device from device_del+0x16c/0x468
> [  243.677159]  device_del from platform_device_del.part.0+0x10/0x74
> [  243.677187]  platform_device_del.part.0 from platform_device_unregister+0x18/0x24
> [  243.677216]  platform_device_unregister from of_platform_device_destroy+0x98/0xa8
> [  243.677249]  of_platform_device_destroy from onboard_hub_destroy_pdevs+0x48/0x6c
> [  243.677287]  onboard_hub_destroy_pdevs from hub_disconnect+0x104/0x174
> [  243.677321]  hub_disconnect from usb_unbind_interface+0x78/0x26c
> [  243.677356]  usb_unbind_interface from device_release_driver_internal+0x194/0x21c
> [  243.677388]  device_release_driver_internal from bus_remove_device+0xcc/0xf8
> [  243.677419]  bus_remove_device from device_del+0x16c/0x468
> [  243.677452]  device_del from usb_disable_device+0xcc/0x178
> [  243.677486]  usb_disable_device from usb_set_configuration+0x4ec/0x8d0
> [  243.677523]  usb_set_configuration from usb_unbind_device+0x24/0x7c
> [  243.677560]  usb_unbind_device from device_release_driver_internal+0x194/0x21c
> [  243.677590]  device_release_driver_internal from device_reprobe+0x18/0x90
> [  243.677620]  device_reprobe from __usb_bus_reprobe_drivers+0x40/0x6c
> [  243.677648]  __usb_bus_reprobe_drivers from bus_for_each_dev+0x70/0xb0
> [  243.677676]  bus_for_each_dev from usb_register_device_driver+0x9c/0xd0
> [  243.677713]  usb_register_device_driver from onboard_hub_init+0x30/0x1000 [onboard_usb_hub]
> [  243.677765]  onboard_hub_init [onboard_usb_hub] from do_one_initcall+0x40/0x204
> [  243.677811]  do_one_initcall from do_init_module+0x44/0x1d4
> [  243.677840]  do_init_module from sys_finit_module+0xbc/0xf8
> [  243.677865]  sys_finit_module from __sys_trace_return+0x0/0x10
> [  243.677887] Exception stack(0xf4659fa8 to 0xf4659ff0)
> [  243.677904] 9fa0:                   bf369800 0051dba8 00000006 b6e438e0 00000000 b6e443f4
> [  243.677921] 9fc0: bf369800 0051dba8 00000000 0000017b 00531658 0051a1dc 00526398 00000000
> [  243.677935] 9fe0: befbb160 befbb150 b6e3a9d8 b6f2aae0

Thanks, that's useful!
The flow is something like this:

- USB root hub is instantiated
- core hub driver calls onboard_hub_create_pdevs(), which creates the
  platform device for the 1st level hub
  - the platform device is created even though the onboard hub driver
    hasn't been loaded yet, because onboard_hub_create/destroy_pdevs()
    is linked into the USB core
- 1st level Microchip hub is probed by the core hub driver
- core hub driver calls onboard_hub_create_pdevs(), which creates the
  platform device for the 2nd level hub
- onboard_hub platform driver is registered
- platform device for 1st level hub is probed
  - schedules 'attach' work
- platform device for 2nd level hub is probed
  - schedules 'attach' work
- onboard_hub USB driver is registered
  - device (and parent) lock of the Microchip hub is held while the
    device is re-probed
- 'attach' work (running in another thread) calls driver_attach(), which
  blocks on one of the hub device locks
- onboard_hub_destroy_pdevs() is called by the core hub driver when one
  of the Microchip hubs is detached
- destroying the pdevs invokes onboard_hub_remove(), which waits for the
  'attach' work to complete
  - waits forever, since the 'attach' work can't acquire the device lock

(the first sketch at the end of this mail shows the dependency in code)

For the RPi 3 B Plus and boards with similar configurations it should be
enough to not create the onboard_hub platform devices, which is the right
thing to do anyway. I'll send patches for this.

The above race condition could also impact boards which actually should
use the onboard_hub driver, so not creating the pdevs for some boards
won't be enough. Off the top of my head I can't think of a clean
solution. The onboard hub driver doesn't control the locks involved.
Detaching the driver is necessary to make sure the onboard_hub USB
devices don't hold stale pointers to their platform device. Re-attaching
is needed because of the detach.

One option could be to change the 'attach' work from being a member of
struct onboard_hub to a static variable owned by the driver. With that
onboard_hub_remove() wouldn't have to wait for the work to finish. An
inconvenience is that only one instance of the work can run at any time,
which could result in a race condition when multiple onboard hubs are
probed in close succession. It could happen that the USB device of the
2nd (or 3rd, ...) hub isn't re-attached if it wasn't on the system-wide
USB 'bus' yet when the 'attach' work of the 1st hub runs. Probably a
rare condition within the (as of now) rare scenario of multiple onboard
hubs, but it could happen.

A mitigation could be to enter a sleep-and-retry loop when
schedule_work() returns false (i.e. the work is still pending from an
earlier probe) and schedule it again until it is actually queued on
behalf of the platform device in question (second sketch below). I might
go for that if I don't get a better idea.
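To make the circular dependency easier to see, this is roughly where the
two hung tasks from the log above end up (abridged from the driver as I
remember it, so not verbatim):

/* 'attach' work, runs on a kworker (the first hung task) */
static void onboard_hub_attach_usb_driver(struct work_struct *work)
{
	/*
	 * driver_attach() walks the USB bus and takes device_lock() on
	 * each device. It blocks on the Microchip hub's lock, which is
	 * held by usb_register_device_driver() while the hub re-probes.
	 */
	driver_attach(&onboard_hub_usbdev_driver.drvwrap.driver);
}

static int onboard_hub_remove(struct platform_device *pdev)
{
	struct onboard_hub *hub = dev_get_drvdata(&pdev->dev);

	/*
	 * Reached from hub_disconnect() in the thread that still holds
	 * the hub's device lock (the second hung task). This waits for
	 * the work above, the work waits for the lock: deadlock.
	 */
	cancel_work_sync(&hub->attach_usb_driver_work);

	/* ... unbind the USB devices, power off the hub, etc. ... */
	return 0;
}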
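And the mitigation would look something like this (just a sketch to
illustrate the idea, the retry interval is made up):

/* work owned by the driver instead of being part of struct onboard_hub */
static void attach_usb_driver(struct work_struct *work);
static DECLARE_WORK(attach_usb_driver_work, attach_usb_driver);

static int onboard_hub_probe(struct platform_device *pdev)
{
	/* ... existing setup ... */

	/*
	 * schedule_work() returns false while the work is still pending
	 * from an earlier probe. Keep retrying until a run is queued
	 * after this hub appeared, so its USB device can't be missed.
	 */
	while (!schedule_work(&attach_usb_driver_work))
		msleep(10);

	return 0;
}

Happy holidays!

m.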