Hi, Javier and I looked into it some more today, and I think we have a better idea of what is going on. On Thu, Nov 17, 2022 at 05:53:11PM +0100, Maxime Ripard wrote: > After trying to get more kunit tests for KMS, I found out that the > recent kunit helpers we merged to create a DRM device [1] are broken and > won't free their device-managed and DRM-managed resources. > > With some help from Thomas, we've dug into this and it turns out that if > we allocate a device with root_device_register, initialise our drm > device with devm_drm_dev_alloc(), register it using drm_dev_register(), > unregister it using drm_dev_unregister/drm_dev_unplug and then remove > the parent device, neither the device managed nor the DRM managed > actions are run. > > root_device_register initializes the device by eventually calling > device_initialize() which sets the initial reference count of the root > device to 1 [2]. devm_drm_dev_alloc() then comes in, drm_dev_init() will > increase the root device refcount [3] and initialize our DRM device to 1 > [4]. drm_dev_register(), through drm_minor_register() and device_add(), > will increase the root device refcount [5]. > > When unrolling things, drm_dev_unregister(), through > drm_minor_unregister() and device_del(), will give up its reference [6]. > root_device_unregister(), through device_unregister(), will also give up > its own [7]. > > So we end up with this for the reference counts: > > +------------------------+-------------+------------+ > | funcs | root device | DRM device | > +------------------------+-------------+------------+ > | root_device_register | 1 | N/A | > | devm_drm_dev_alloc | 2 | 1 | > | drm_dev_register | 3 | 1 | > | drm_dev_unregister | 2 | 1 | > | root_device_unregister | 1 | 1 | > +------------------------+-------------+------------+ > > If we go back to the list of reference taken, the root device reference > and the initial drm_device reference, both taken by devm_drm_dev_alloc() > through drm_dev_init(), haven't been put back. > > If we look at the drm_dev_init code(), we can see that it sets up a > DRM-managed action [8] that will put back the device reference [9]. The > DRM-managed code is executed by the drm_managed_cleanup() function, that > is executed as part of a release hook [10] executed once we give up the > final reference to the DRM device [11]. > > If we go back a little, the final reference to the DRM device is > actually the initial one setup by devm_drm_dev_alloc(). This function > has superseded drm_dev_alloc(), with the documentation that we do need a > final drm_dev_put() to put back our final reference [12]. > > devm_drm_dev_alloc() is a more convenient variant that has been > introduced explicitly to not require that drm_dev_put(), and states it > as such in the documentation [13]. It does so by adding a device-managed > action that will call drm_dev_put() [14]. > > Device-managed actions are ran as part devres_release_all() that is > called by device_release() [15], itself being run when the last > reference on the device is put back [16][17][18]. > > So if we sum things up, the DRM device will only give its last root > device reference when the last DRM device reference will be put back, > and the last DRM device reference will be put back when the last device > reference will be put back, which sounds very circular to me, with both > ending up in a deadlock scenario. > > I've added two kunit tests that demonstrate the issue: we register a > device, allocate and register a DRM device, register a DRM-managed > action, remove the DRM device and the parent device, and wait for the > action to execute. drm_register_unregister_with_devm_test() uses the > broken(?) devm_drm_dev_alloc and is failing. > drm_register_unregister_test uses the deprecated drm_dev_alloc() that > requires an explicit call to drm_dev_put() which works fine. > > It's also worth noting that Thomas tested with simpledrm and it seems to > work fine. Indeed, the transition from simpledrm to a full-blown DRM driver handles this properly. It's using a platform_device_unregister() [1] and eventually device_del() [2][3]. That part is handled just like root_device_unregister() [4][5]. Basically, both will call device_del(), and then device_put(), so device_del() is called while holding a reference. As we've seen before, at this point the DRM driver still holds a reference to the device as well. device_del() will call bus_remove_device() [6], which will be skipped for the root device since it doesn't have a bus [7]. It will then call device_release_driver() [8], which is basically forwarded to __device_release_driver() [9][10], that will call device_unbind_cleanup() [11]. device_unbind_cleanup() calls devres_release_all() directly [12], that runs all the device-managed actions [13]. And it does so WHILE THERE'S STILL A REFCOUNT OF 2! I would expect the call to devres_release_all to happen only in device_release, once all the device reference have been put back. Not 4 functions in as a side effect, and while there's still some active references. > Using a platform_device instead of the root_device doesn't > change anything to the outcome in my tests, so there might be a more > subtle behaviour involved. This one has the same symptom but a different cause. I was just registering a platform_device but it wasn't bound to any driver. While it's valid to do so according to that comment [13], it doesn't have any driver so the check for the driver in device_release_driver() [8], and never hits device_unbind_cleanup(). Thanks again to Thomas and Javier for their help Maxime 1: https://elixir.bootlin.com/linux/latest/source/drivers/video/aperture.c#L199 2: https://elixir.bootlin.com/linux/latest/source/drivers/base/platform.c#L793 3: https://elixir.bootlin.com/linux/latest/source/drivers/base/platform.c#L751 4: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L4153 5: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L3733 6: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L3704 7: https://elixir.bootlin.com/linux/latest/source/drivers/base/bus.c#L511 8: https://elixir.bootlin.com/linux/latest/source/drivers/base/bus.c#L529 9: https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L1298 10: https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L1275 11: https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L1255 12: https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L530 13: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L2312
Attachment:
signature.asc
Description: PGP signature