Hello Maxime, On 11/17/22 17:53, Maxime Ripard wrote: > Hi, > > After trying to get more kunit tests for KMS, I found out that the > recent kunit helpers we merged to create a DRM device [1] are broken and > won't free their device-managed and DRM-managed resources. > > With some help from Thomas, we've dug into this and it turns out that if > we allocate a device with root_device_register, initialise our drm > device with devm_drm_dev_alloc(), register it using drm_dev_register(), > unregister it using drm_dev_unregister/drm_dev_unplug and then remove > the parent device, neither the device managed nor the DRM managed > actions are run. > > root_device_register initializes the device by eventually calling > device_initialize() which sets the initial reference count of the root > device to 1 [2]. devm_drm_dev_alloc() then comes in, drm_dev_init() will > increase the root device refcount [3] and initialize our DRM device to 1 > [4]. drm_dev_register(), through drm_minor_register() and device_add(), > will increase the root device refcount [5]. > > When unrolling things, drm_dev_unregister(), through > drm_minor_unregister() and device_del(), will give up its reference [6]. > root_device_unregister(), through device_unregister(), will also give up > its own [7]. > > So we end up with this for the reference counts: > > +------------------------+-------------+------------+ > | funcs | root device | DRM device | > +------------------------+-------------+------------+ > | root_device_register | 1 | N/A | > | devm_drm_dev_alloc | 2 | 1 | > | drm_dev_register | 3 | 1 | > | drm_dev_unregister | 2 | 1 | > | root_device_unregister | 1 | 1 | > +------------------------+-------------+------------+ > > If we go back to the list of reference taken, the root device reference > and the initial drm_device reference, both taken by devm_drm_dev_alloc() > through drm_dev_init(), haven't been put back. > > If we look at the drm_dev_init code(), we can see that it sets up a > DRM-managed action [8] that will put back the device reference [9]. The > DRM-managed code is executed by the drm_managed_cleanup() function, that > is executed as part of a release hook [10] executed once we give up the > final reference to the DRM device [11]. > > If we go back a little, the final reference to the DRM device is > actually the initial one setup by devm_drm_dev_alloc(). This function > has superseded drm_dev_alloc(), with the documentation that we do need a > final drm_dev_put() to put back our final reference [12]. > > devm_drm_dev_alloc() is a more convenient variant that has been > introduced explicitly to not require that drm_dev_put(), and states it > as such in the documentation [13]. It does so by adding a device-managed > action that will call drm_dev_put() [14]. > > Device-managed actions are ran as part devres_release_all() that is > called by device_release() [15], itself being run when the last > reference on the device is put back [16][17][18]. > Thanks a lot for the write up. It was very informative and detailed. > So if we sum things up, the DRM device will only give its last root > device reference when the last DRM device reference will be put back, > and the last DRM device reference will be put back when the last device > reference will be put back, which sounds very circular to me, with both > ending up in a deadlock scenario. > So my conclusion after looking at this is the same than your, that the drivers would need to signal somehow to the DRM core when a DRM device won't be used anymore and drop the final reference to the DRM device. That is, I don't think we can get away of drivers not calling either drm_dev_put(). I think that we should try to simplify the DRM register and release API and make very clear in the documentation what should be used. Right now for example as you mentioned we have both drm_dev_unregister() and drm_dev_unplug() but AFAICT the only difference is that the latter does a sync to protect critical sections during drm_dev_{enter,exit}(). The drawback is that the DRM device will be marked as unplugged before drm_atomic_helper_shutdown(), but is this really a problem in practice? Maybe we can just rename drm_dev_unplug() to drm_dev_unregister() and drm_dev_unregister() to __drm_dev_unregister(). That way, the register path could always be: devm_drm_dev_alloc() drm_dev_register() and then in the release path: drm_dev_unregister() drm_dev_put() making both DRM-managed and device-managed resources to always work. > I've added two kunit tests that demonstrate the issue: we register a > device, allocate and register a DRM device, register a DRM-managed > action, remove the DRM device and the parent device, and wait for the > action to execute. drm_register_unregister_with_devm_test() uses the > broken(?) devm_drm_dev_alloc and is failing. > drm_register_unregister_test uses the deprecated drm_dev_alloc() that > requires an explicit call to drm_dev_put() which works fine. > Great. In my opinion we should add these Kunit tests even when they are exposing an issue in the devm_drm_dev_alloc() helper. > It's also worth noting that Thomas tested with simpledrm and it seems to > work fine. Using a platform_device instead of the root_device doesn't > change anything to the outcome in my tests, so there might be a more > subtle behaviour involved. > That's strange because AFAICT simpledrm is basically doing the same than your failing tests. I tried to look at the differences but couldn't spot anything evident... > Thanks, > Maxime > -- Best regards, Javier Martinez Canillas Core Platforms Red Hat