On Tue, 2024-03-05 at 11:27 +0100, Herve Codina wrote: > Hi Nuno, Saravana, Rob, > > On Tue, 05 Mar 2024 08:36:45 +0100 > Nuno Sá <noname.nuno@xxxxxxxxx> wrote: > > > On Mon, 2024-03-04 at 22:47 -0800, Saravana Kannan wrote: > > > On Mon, Mar 4, 2024 at 8:49 AM Herve Codina <herve.codina@xxxxxxxxxxx> > > > wrote: > > > > > > > > Hi Rob, > > > > > > > > On Mon, 4 Mar 2024 09:22:02 -0600 > > > > Rob Herring <robh@xxxxxxxxxx> wrote: > > > > > > > > ... > > > > > > > > > > > @@ -853,6 +854,14 @@ static void free_overlay_changeset(struct > > > > > > > overlay_changeset *ovcs) > > > > > > > { > > > > > > > int i; > > > > > > > > > > > > > > + /* > > > > > > > + * Wait for any ongoing device link removals before removing > > > > > > > some of > > > > > > > + * nodes. Drop the global lock while waiting > > > > > > > + */ > > > > > > > + mutex_unlock(&of_mutex); > > > > > > > + device_link_wait_removal(); > > > > > > > + mutex_lock(&of_mutex); > > > > > > > > > > > > I'm still not convinced we need to drop the lock. What happens if > > > > > > someone else > > > > > > grabs the lock while we are in device_link_wait_removal()? Can we > > > > > > guarantee that > > > > > > we can't screw things badly? > > > > > > > > > > It is also just ugly because it's the callers of > > > > > free_overlay_changeset() that hold the lock and now we're releasing it > > > > > behind their back. > > > > > > > > > > As device_link_wait_removal() is called before we touch anything, > > > > > can't > > > > > it be called before we take the lock? And do we need to call it if > > > > > applying the overlay fails? > > > > > > Rob, > > > > > > This[1] scenario Luca reported seems like a reason for the > > > device_link_wait_removal() to be where Herve put it. That example > > > seems reasonable. > > > > > > [1] - https://lore.kernel.org/all/20231220181627.341e8789@booty/ > > > > > > > I'm still not totally convinced about that. Why not putting the check right > > before checking the kref in __of_changeset_entry_destroy(). I'll contradict > > myself a bit because this is just theory but if we look at pci_stop_dev(), > > which > > AFAIU, could be reached from a sysfs write(), we have: > > > > device_release_driver(&dev->dev); > > ... > > of_pci_remove_node(dev); > > of_changeset_revert(np->data); > > of_changeset_destroy(np->data); > > > > So looking at the above we would hit the same issue if we flush the queue in > > free_overlay_changeset() - as the queue won't be flushed at all and we could > > have devlink removal due to device_release_driver(). Right? > > > > Again, completely theoretical but seems like a reasonable one plus I'm not > > understanding the push against having the flush in > > __of_changeset_entry_destroy(). Conceptually, it looks the best place to me > > but > > I may be missing some issue in doing it there? > > Instead of having the wait called in __of_changeset_entry_destroy() and so > called in a loop. I could move this call in the __of_changeset_entry_destroy() > caller (without any of_mutex lock drop). > Oh, good catch! At this point all the devlinks removals (related to the changeset) should have been queued so yes, we should only need to flush once. > So this will look like this: > --- 8< --- > void of_changeset_destroy(struct of_changeset *ocs) > { > struct of_changeset_entry *ce, *cen; > > device_link_wait_removal(); > > list_for_each_entry_safe_reverse(ce, cen, &ocs->entries, node) > __of_changeset_entry_destroy(ce); > } > --- 8< --- > > I already tested on my system and it works correctly with > device_link_wait_removal() only called from of_changeset_destroy() > as proposed. > > Saravana, Nuno, Rob does it seems ok for you ? > It looks good to me... - Nuno Sá >