Hi David, On Sat, Mar 25, 2023 at 01:40:01PM +0800, David Gow wrote: > On Fri, 24 Mar 2023 at 20:32, Maxime Ripard <maxime@xxxxxxxxxx> wrote: > > > > On Fri, Mar 24, 2023 at 08:11:52AM +0200, Matti Vaittinen wrote: > > > On 3/23/23 18:36, Maxime Ripard wrote: > > > > On Thu, Mar 23, 2023 at 03:02:03PM +0200, Matti Vaittinen wrote: > > > > > On 3/23/23 14:29, Maxime Ripard wrote: > > > > > > On Thu, Mar 23, 2023 at 02:16:52PM +0200, Matti Vaittinen wrote: > > > > > > > > > > > > This is the description of what was happening: > > > > > > https://lore.kernel.org/dri-devel/20221117165311.vovrc7usy4efiytl@houat/ > > > > > > > > > > Thanks Maxime. Do I read this correcty. The devm_ unwinding not being done > > > > > when root_device_register() is used is not because root_device_unregister() > > > > > would not trigger the unwinding - but rather because DRM code on top of this > > > > > device keeps the refcount increased? > > > > > > > > There's a difference of behaviour between a root_device and any device > > > > with a bus: the root_device will only release the devm resources when > > > > it's freed (in device_release), but a bus device will also do it in > > > > device_del (through bus_remove_device() -> device_release_driver() -> > > > > device_release_driver_internal() -> __device_release_driver() -> > > > > device_unbind_cleanup(), which are skipped (in multiple places) if > > > > there's no bus and no driver attached to the device). > > > > > > > > It does affect DRM, but I'm pretty sure it will affect any framework > > > > that deals with device hotplugging by deferring the framework structure > > > > until the last (userspace) user closes its file descriptor. So I'd > > > > assume that v4l2 and cec at least are also affected, and most likely > > > > others. > > > > > > Thanks for the explanation and patience :) > > > > > > > > > > > > If this is the case, then it sounds like a DRM specific issue to me. > > > > > > > > I mean, I guess. One could also argue that it's because IIO doesn't > > > > properly deal with hotplugging. > > > > > > I must say I haven't been testing the IIO registration API. I've only tested > > > the helper API which is not backed up by any "IIO device". (This is fine for > > > the helper because it must by design be cleaned-up only after the > > > IIO-deregistration). > > > > > > After your explanation here, I am not convinced IIO wouldn't see the same > > > issue if I was testing the devm_iio_device_alloc() & co. > > > > It depends really. The issue DRM is trying to solve is that, when a > > device is gone, some application might still have an open FD and could > > still poke into the kernel, while all the resources would have been > > free'd if it was using devm. > > > > So everything is kept around until the last fd is closed, so you still > > have a reference to the device (even though it's been removed from its > > bus) until that time. > > > > It could be possible that IIO just doesn't handle that case at all. I > > guess most of the devices aren't hotpluggable, and there's not much to > > interact with from a userspace PoV iirc, so it might be why. > > > > > > I'm not sure how that helps. Those are > > > > common helpers which should accommodate every framework, > > > > > > Ok. Fair enough. Besides, if the root-device was sufficient - then I would > > > actually not see the need for a helper. People could in that case directly > > > use the root_device_register(). So, if helpers are provided they should be > > > backed up by a device with a bus then. > > > > > > > and your second > > > > patch breaks the kunit tests for DRM anyway. > > > > > > Oh, I must have made an error there. It was supposed to be just a > > > refactoring with no functional changes. Sorry about that. Anyways, that > > > patch can be forgotten as Greg opposes using the platform devices in generic > > > helpers. > > > > > > > > Whether it is a feature or bug is beyond my knowledge. Still, I would > > > > > not say using the root_device_[un]register() in generic code is not > > > > > feasible - unless all other subsytems have similar refcount handling. > > > > > > > > > > Sure thing using root_device_register() root_device_unregister() in DRM does > > > > > not work as such. This, however, does not mean the generic kunit helpers > > > > > should use platform_devices to force unwinding? > > > > > > > > platform_devices were a quick way to get a device that would have a bus > > > > and a driver bound to fall into the right patch above. We probably > > > > shouldn't use platform_devices and a kunit_device sounds like the best > > > > idea, but the test linked in the original mail I pointed you to should > > > > work with whatever we come up with. It works with multiple (platform, > > > > PCI, USB, etc) buses, so the mock we create should behave like their > > > > real world equivalents. > > > > > > Thanks for the patience and the explanation. Now I understand a generic test > > > device needs to sit on a bus. > > > > > > As I said, in my very specific IIO related test the test device does not > > > need a bus. Hence I'll drop the 'generic helpers' from this series. > > > > So, I went around and created a bunch of kunit tests that shows the > > problem without DRM being involved at all. > > > > It does three things: > > > > - It registers a device, attaches a devm action, unregisters the device > > and then checks that the action has ran. > > > > - It registers a device, gets a reference to it, attaches a devm > > action, puts back the reference, unregisters the device and then > > checks that the action has ran. > > > > - It registers a device, gets a reference to it, attaches a devm action > > that will put back the reference, unregisters the device and then > > checks that the action has ran. > > > > And in three cases: first with a root_device, then platform_device, then > > a platform_device that has been bound to a driver. > > > > Once you've applied that patch, you can run it using: > > > > ./tools/testing/kunit/kunit.py run --kunitconfig=drivers/base/test/ devm-inconsistencies > > > > You'll see that only the last case passes all the tests, even though the > > code itself is exactly the same. > > > > This illustrates the problem very nicely, thanks. > > I played around a bit with this, and I'm definitely leaning towards > this being a bug, rather than intentional behaviour. There's actually > an explicit call to devres_release_all() in device_release() which > seems to suggest that this should work: > /* > * Some platform devices are driven without driver attached > * and managed resources may have been acquired. Make sure > * all resources are released. > * > * Drivers still can add resources into device after device > * is deleted but alive, so release devres here to avoid > * possible memory leak. > */ > > My "solution" is just to call devres_release_all() in device_del() as > well, which fixes your tests (and the drm-test-managed one when ported > to use root_device_register() or my kunit_device_register() API[1]). > > --8<-- > diff --git a/drivers/base/core.c b/drivers/base/core.c > index 6878dfcbf0d6..adfec6185014 100644 > --- a/drivers/base/core.c > +++ b/drivers/base/core.c > @@ -3778,6 +3778,17 @@ void device_del(struct device *dev) > device_platform_notify_remove(dev); > device_links_purge(dev); > > + /* > + * If a device does not have a driver attached, we need to clean up any > + * managed resources. We do this in device_release(), but it's never > + * called (and we leak the device) if a managed resource holds a > + * reference to the device. So release all managed resources here, > + * like we do in driver_detach(). We still need to do so again in > + * device_release() in case someone adds a new resource after this > + * point, though. > + */ > + devres_release_all(dev); > + > bus_notify(dev, BUS_NOTIFY_REMOVED_DEVICE); > kobject_uevent(&dev->kobj, KOBJ_REMOVE); > glue_dir = get_glue_dir(dev); > > -->8-- > > It doesn't _seem_ to break anything else, and I've managed to convince > myself that it's probably the correct fix. Yeah, as an outsider, I came to the same conclusion last time... > (Albeit, as someone with a limited knowledge of this part of the code, > who also hasn't had quite enough sleep, so take that with some salt.) ... but as someone that also have a limited knowledge of that part of the code, I certainly wasn't sure it was the proper thing to do :) > Still, I'd agree with Greg that it'd be great to have versions of your > tests upstream before making any such radical changes. I just submitted them. Thanks! Maxime