On Mon, Mar 04, 2024 at 08:10:02PM +0800, Sam Sun wrote: > Dear developers and maintainers, > > We encountered a task hung in function hub_activate(). It was reported > before by Syzbot several years ago > (https://groups.google.com/g/syzkaller-lts-bugs/c/_komEgHj03Y/m/rbcVKyLXBwAJ), > but no repro at that time. We have a C repro this time and kernel > config is attached to this email. The bug report is listed below. Never mind the rest of the kernel log; I figured out what's going on. Here are the important parts: > ppid:8106 flags:0x00000006 > Call Trace: > <TASK> > context_switch kernel/sched/core.c:5376 [inline] > __schedule+0xcea/0x59e0 kernel/sched/core.c:6688 > __schedule_loop kernel/sched/core.c:6763 [inline] > schedule+0xe9/0x270 kernel/sched/core.c:6778 > schedule_preempt_disabled+0x13/0x20 kernel/sched/core.c:6835 > __mutex_lock_common kernel/locking/mutex.c:679 [inline] > __mutex_lock+0x509/0x940 kernel/locking/mutex.c:747 > device_lock include/linux/device.h:992 [inline] > usb_deauthorize_interface+0x4d/0x130 drivers/usb/core/message.c:1789 > interface_authorized_store+0xaf/0x110 drivers/usb/core/sysfs.c:1178 > dev_attr_store+0x54/0x80 drivers/base/core.c:2366 usb_deauthorize_interface() starts by calling device_lock() on the usb_interface's parent usb_device. > ppid:8109 flags:0x00004006 > Call Trace: > <TASK> > context_switch kernel/sched/core.c:5376 [inline] > __schedule+0xcea/0x59e0 kernel/sched/core.c:6688 > __schedule_loop kernel/sched/core.c:6763 [inline] > schedule+0xe9/0x270 kernel/sched/core.c:6778 > kernfs_drain+0x36c/0x550 fs/kernfs/dir.c:505 > __kernfs_remove+0x280/0x650 fs/kernfs/dir.c:1465 > kernfs_remove_by_name_ns+0xb4/0x130 fs/kernfs/dir.c:1673 > kernfs_remove_by_name include/linux/kernfs.h:623 [inline] > remove_files+0x96/0x1c0 fs/sysfs/group.c:28 > sysfs_remove_group+0x8b/0x180 fs/sysfs/group.c:292 > sysfs_remove_groups fs/sysfs/group.c:316 [inline] > sysfs_remove_groups+0x60/0xa0 fs/sysfs/group.c:308 > device_remove_groups drivers/base/core.c:2734 [inline] > device_remove_attrs+0x192/0x290 drivers/base/core.c:2909 > device_del+0x391/0xa30 drivers/base/core.c:3813 > usb_disable_device+0x360/0x7b0 drivers/usb/core/message.c:1416 > usb_set_configuration+0x1243/0x1c40 drivers/usb/core/message.c:2063 > usb_deauthorize_device+0xe4/0x110 drivers/usb/core/hub.c:2638 > authorized_store+0x122/0x140 drivers/usb/core/sysfs.c:747 > dev_attr_store+0x54/0x80 drivers/base/core.c:2366 Among other things, usb_disable_device() calls device_del() for the usb_device's child interfaces. For brevity, let A be the parent usb_device and let B be the child usb_interface. Then in broad terms, we have: CPU 0 CPU 1 ----------------------------- ---------------------------- usb_deauthorize_device(A) device_lock(A) usb_deauthorize_interface(B) usb_set_configuration(A, -1) device_lock(A) usb_disable_device(B) device_del(B) sysfs_remove_group(B, intf_attrs) The problem now is: 1. The kernfs core (kernfs_drain) on CPU 0 can't remove the intf_attrs sysfs attribute group while CPU 1 is in the middle of running a callback routine for one of the attribute files in that group. 2. The callback routine on CPU 1 can't grab A's lock while CPU 0 is holding it. Result: deadlock. This seems to be the only case where an interface sysfs callback routine tries to acquire the parent device's lock. That lock is needed here because when an interface is deauthorized, the kernel has to unbind the driver for that interface -- and binding or unbinding a USB interface driver requires that the parent device's lock be held. Three ideas stand out. First, the device_lock() call should be interruptible, because it is called when a user process writes to the "authorized" attribute file. But that alone won't fix the problem. Second, we could avoid the deadlock by adding a timeout to this device_lock() call. But we probably don't want a deauthorize operation to fail because of a timeout from a contested lock. Third, this must be a generic problem. It will occur any time a sysfs attribute callback tries to lock its device while another process is trying to unregister that device. We faced this sort of problem some years ago when we were worrying about "suicidal" attributes -- ones which would unregister their own devices. I don't remember what the fix was or how it worked. But we need something like it here. Greg and Tejun, any ideas? Is it possible somehow for an attribute file to be removed while its callback is still running? Alan Stern