Hi Saravana, Luca, Nuno, On Tue, 20 Feb 2024 16:37:05 -0800 Saravana Kannan <saravanak@xxxxxxxxxx> wrote: ... > > > > diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c > > index a9a292d6d59b..5c5f808b163e 100644 > > --- a/drivers/of/overlay.c > > +++ b/drivers/of/overlay.c > > @@ -1202,6 +1202,12 @@ int of_overlay_remove(int *ovcs_id) > > goto out; > > } > > > > + /* > > + * Wait for any ongoing device link removals before removing some of > > + * nodes > > + */ > > + device_link_wait_removal(); > > + > > Nuno in his patch[1] had this "wait" happen inside > __of_changeset_entry_destroy(). Which seems to be necessary to not hit > the issue that Luca reported[2] in this patch series. Is there any > problem with doing that? > > Luca for some reason did a unlock/lock(of_mutex) in his test patch and > I don't think that's necessary. I think the unlock/lock in Luca's case and so in Nuno's case is needed. I do the device_link_wait_removal() wihout having the of_mutex locked. Now, suppose I do the device_link_wait_removal() call with the of_mutex locked. The following flow is allowed and a deadlock is present. of_overlay_remove() lock(of_mutex) device_link_wait_removal() And, from the workqueue jobs execution: ... device_put() some_driver->remove() of_overlay_remove() <--- The job will never end. It is waiting for of_mutex. Deadlock A call to of_overlay_remove() from a driver remove() function is perfectly legit. A driver can use some overlays and it is already supported. For instance: https://elixir.bootlin.com/linux/v6.8-rc6/source/drivers/of/unittest.c#L3946 Unlocking/locking the mutex for the device_link_wait_removal() call opens a window with the mutex unlocked. What are the consequences of this mutex unlocked window during this of_overlay_remove() call? > > Can you move this call to where Nuno did it and see if that works for > all of you? > > [1] - https://lore.kernel.org/all/20240205-fix-device-links-overlays-v2-2-5344f8c79d57@xxxxxxxxxx/ > [2] - https://lore.kernel.org/all/20231220181627.341e8789@booty/ > If the unlock/lock can be done, I plan to unlock/call/lock in the beginning of free_overlay_changeset(): --- 8< --- @@ -853,6 +854,14 @@ static void free_overlay_changeset(struct overlay_changeset *ovcs) { int i; + /* + * Wait for any ongoing device link removals before removing some of + * nodes. + */ + mutex_unlock(&of_mutex); + device_link_wait_removal(); + mutex_lock(&of_mutex); + if (ovcs->cset.entries.next) of_changeset_destroy(&ovcs->cset); --- 8< --- I prefer that location (drivers/of/overlay.c) instead of Nuno's one because of the unlock/call/lock need. Nuno's call is done in __of_changeset_entry_destroy() (drivers/of/dynamic.c) IMHO, I think it is easier to maintain with this lock, unlock/call/lock, unlock sequence in the same file (i.e. drivers/of/overlay.c). Didn't test yet this modification as I need to setup one of my boards in the right context to reproduce the issue on my side. Also, I need to take into account some other comments received. Best regards, Hervé