On Wed, May 10, 2023 at 11:51:31AM -0700, Stephen Boyd wrote: > Quoting Dmitry Torokhov (2023-05-05 17:06:07) > > On Fri, May 05, 2023 at 04:24:16PM -0700, Stephen Boyd wrote: > > > > > ... > > > Unfortunately, the hid google hammer driver hand rolls a devm function > > > to call hid_hw_stop() when the driver is unbound and implements an > > > hid_driver::remove() function. The driver core doesn't call the devm > > > release functions until _after_ the bus unbinds the driver, so the order > > > of operations is like this: > > > > Excellent analysis, but the problem is not limited to the hammer driver > > (potentially) and shalt be dealt with appropriately, at the HID bus > > level. > > Thanks. I thought of the bus level approach as well, but I was trying to > keep the fix isolated to the driver that had the problem. I'd like to > get the fix into the stable kernel, as this fixes a regression > introduced by commit d950db3f80a8 ("HID: google: switch to devm when > registering keyboard backlight LED") in v5.18. > > Is the bus level approach going to be acceptable as a stable backport? Sure, why not given the kind of stuff flowing into stable kernels. At least this would be fixing real issue that can be triggered with a real device. > > Is it a problem to call hid_hw_stop() directly? I suppose for the > hid-google-hammer driver we don't want to leave the led sysfs node > hanging around after the hid_hw_stop() function has been called either, > so some sort of forced ejection of the devm led device is needed and the > bus level approach helps there. > > I was curious to see if anything else had this problem so I did this > poor grep to find code that isn't calling hid_hw_stop() from probe or > remove: > > git grep -W hid_hw_stop | grep .c= | grep -v probe | grep -v remove > > and I got this list (minus hid core which doesn't matter): > > drivers/hid/hid-google-hammer.c=static void hammer_stop(void *hdev) > drivers/hid/hid-mcp2221.c=static void mcp2221_hid_unregister(void *ptr) > drivers/hid/hid-wiimote-core.c=static void wiimote_destroy(struct > wiimote_data *wdata) > drivers/hid/wacom_sys.c=static int wacom_parse_and_register(struct > wacom *wacom, bool wireless) > drivers/hid/wacom_sys.c=static void wacom_wireless_work(struct > work_struct *work) > drivers/hid/wacom_sys.c=static void wacom_mode_change_work(struct > work_struct *work) > > The wacom_sys.c ones look OK because they're during workqueues that are > probably flushed, and wiimote_destroy() is called from an error path or > driver remove, so it is also OK. But mcp2221_hid_unregister() has the > same problem. > > If you look at drivers/hid/hid-mcp2221.c you'll see this comment above > mcp2221_remove() too: > > /* This is needed to be sure hid_hw_stop() isn't called twice by the > subsystem */ > static void mcp2221_remove(struct hid_device *hdev) > > which is kinda weird. Why can't we have a devm_hid_hw_start() API that > tells the hid bus to not call hid_hw_stop() at all in > hid_device_remove()? That would allow us to avoid this pitfall where > everything is moved to devm and the driver has no remove function at all > and we forget to populate an empty one. Instead, the bus layer can know > that hardware will be stopped with devm later. So yes, this is another option: all bus code should exclusively use devm* API and can not use non-managed resources. This for HID includes disconnecting hiddev, hidraw and hidinput handlers/drivers. FTR, I think having devm_hid_hw_start() would be nice. > > > > > Actually, it is not even limited to HID, but exists in most buses with > > non-trivial ->remove() implementation. For example I fixed similar issue > > in I2C in 5b5475826c52 ("i2c: ensure timely release of driver-allocated > > resources"). I tried fixing it in SPI but Mark has some objections, and > > wanted to fix it in the driver core, so I was thinking about it and then > > dropped the ball. At this time I do not think fixing it at driver core > > makes logic any clearer, so I think we just need to fix a handful of > > buses. > > Do you have a link to that discussion? https://lore.kernel.org/lkml/YFf2RD931nq3RudJ@xxxxxxxxxx/ > > ------- > > This got me thinking that maybe both of these approaches are wrong. > Maybe the call to hid_close_report() should be removed from > hid_device_remove() instead. > > The device is being removed from the bus when hid_device_remove() is > called, but it hasn't been released yet. Other devices like the hidinput > device are referencing the hdev device because they set the hdev as > their parent. Basically, child devices are still bound to some sort of > driver or subsystem when the parent hdev is unbound from its driver, > leading to a state where the child drivers could still access the hdev > while it is being destroyed. If we remove the hid_close_report() call > from this function it will eventually be called by hid_device_release() > when the last reference to the device is dropped, i.e. when the child > devices all get destroyed. In the case of hid-google-hammer, that would > be when hid_hw_stop() is called from the devm release function by driver > core. > > The benefit of this approach is that we don't allocate a devres group > for all the hid devices when only two drivers need it. The possible > downside is that we keep the report around while the device exists but > has no driver bound to it. > > Here's a totally untested patch for that. > > ---8<---- > diff --git a/drivers/hid/hid-core.c b/drivers/hid/hid-core.c > index 22623eb4f72f..93905e200cae 100644 > --- a/drivers/hid/hid-core.c > +++ b/drivers/hid/hid-core.c > @@ -1211,8 +1211,8 @@ int hid_open_report(struct hid_device *device) > hid_parser_reserved > }; > > - if (WARN_ON(device->status & HID_STAT_PARSED)) > - return -EBUSY; > + if (device->status & HID_STAT_PARSED) > + hid_close_report(device); > > start = device->dev_rdesc; > if (WARN_ON(!start)) > @@ -2662,7 +2662,6 @@ static void hid_device_remove(struct device *dev) > hdrv->remove(hdev); > else /* default remove */ > hid_hw_stop(hdev); > - hid_close_report(hdev); > hdev->driver = NULL; > } This will probably work, but it I consider this still being fragile as at some point we might want to add some more unwinding, and we'll run into this issue again. I would feel much safer if the order of release followed (inversely) order of allocations more closely. Thanks. -- Dmitry