Hi Alex, I was on travel for last 3 days, hence the slow response. Started working now. Please see inline response below. > -----Original Message----- > From: Alex Williamson <alex.williamson@xxxxxxxxxx> > Sent: Tuesday, May 21, 2019 3:42 AM > To: Parav Pandit <parav@xxxxxxxxxxxx> > Cc: Cornelia Huck <cohuck@xxxxxxxxxx>; kvm@xxxxxxxxxxxxxxx; linux- > kernel@xxxxxxxxxxxxxxx; kwankhede@xxxxxxxxxx; cjia@xxxxxxxxxx > Subject: Re: [PATCHv3 3/3] vfio/mdev: Synchronize device create/remove with > parent removal > > On Mon, 20 May 2019 19:15:15 +0000 > Parav Pandit <parav@xxxxxxxxxxxx> wrote: > > > > -----Original Message----- > > > From: Cornelia Huck <cohuck@xxxxxxxxxx> > > > Sent: Monday, May 20, 2019 6:29 AM > > > To: Parav Pandit <parav@xxxxxxxxxxxx> > > > Cc: kvm@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; > > > kwankhede@xxxxxxxxxx; alex.williamson@xxxxxxxxxx; cjia@xxxxxxxxxx > > > Subject: Re: [PATCHv3 3/3] vfio/mdev: Synchronize device > > > create/remove with parent removal > > > > > > On Fri, 17 May 2019 14:18:26 +0000 > > > Parav Pandit <parav@xxxxxxxxxxxx> wrote: > > > > > > > > > @@ -206,14 +214,27 @@ void mdev_unregister_device(struct > > > > > > device > > > *dev) > > > > > > dev_info(dev, "MDEV: Unregistering\n"); > > > > > > > > > > > > list_del(&parent->next); > > > > > > + mutex_unlock(&parent_list_lock); > > > > > > + > > > > > > + /* Release the initial reference so that new create cannot start > */ > > > > > > + mdev_put_parent(parent); > > > > > > > > > > The comment is confusing: We do drop one reference, but this > > > > > does not imply we're going to 0 (which would be the one thing > > > > > that would block creating new devices). > > > > > > > > > Ok. How about below comment. > > > > /* Balance with initial reference init */ > > > > > > Well, 'release the initial reference' is fine; it's just the second > > > part that is confusing. > > > > > > One thing that continues to irk me (and I'm sorry if I sound like a > > > broken > > > record) is that you give up the initial reference and then continue > > > to use parent. For the more usual semantics of a reference count, > > > that would be a bug (as the structure would be freed if the > > > reference count dropped to zero), even though it is not a bug here. > > > > > Well, refcount cannot drop to zero if user is using it. > > But I understand that mdev_device caches it the parent in it, and hence it > uses it. > > However, mdev_device child devices are terminated first when parent goes > away, ensuring that no more parent user is active. > > So as you mentioned, its not a bug here. > > > > > > > > > > > > + > > > > > > + /* > > > > > > + * Wait for all the create and remove references to drop. > > > > > > + */ > > > > > > + wait_for_completion(&parent->unreg_completion); > > > > > > > > > > It only reaches 0 after this wait. > > > > > > > > > Yes. > > > > > > > > > > + > > > > > > + /* > > > > > > + * New references cannot be taken and all users are done > > > > > > + * using the parent. So it is safe to unregister parent. > > > > > > + */ > > > > > > class_compat_remove_link(mdev_bus_compat_class, dev, > NULL); > > > > > > > > > > > > device_for_each_child(dev, NULL, mdev_device_remove_cb); > > > > > > > > > > > > parent_remove_sysfs_files(parent); > > > > > > - > > > > > > - mutex_unlock(&parent_list_lock); > > > > > > - mdev_put_parent(parent); > > > > > > + kfree(parent); > > > > > > + put_device(dev); > > > > > > } > > > > > > EXPORT_SYMBOL(mdev_unregister_device); > > > > > > > > > > > > @@ -237,10 +258,11 @@ int mdev_device_create(struct kobject > *kobj, > > > > > > struct mdev_parent *parent; > > > > > > struct mdev_type *type = to_mdev_type(kobj); > > > > > > > > > > > > - parent = mdev_get_parent(type->parent); > > > > > > - if (!parent) > > > > > > + if (!mdev_try_get_parent(type->parent)) > > > > > > > > > > If other calls are still running, the refcount won't be 0, and > > > > > this will succeed, even if we really want to get rid of the device. > > > > > > > > > Sure, if other calls are running, refcount won't be 0. Process > > > > creating them > > > will eventually complete, and refcount will drop to zero. > > > > And new processes won't be able to start any more. > > > > So there is no differentiation between 'already in creation stage' > > > > and > > > 'about to start' processes. > > > > > > Does it really make sense to allow creation to start if the parent > > > is going away? > > > > > Its really a small time window, on how we draw the line. > > But it has important note that if user continues to keep creating, removing, > parent is blocked on removal. > > > > > > > > > > > > return -EINVAL; > > > > > > > > > > > > + parent = type->parent; > > > > > > + > > > > > > mutex_lock(&mdev_list_lock); > > > > > > > > > > > > /* Check for duplicate */ > > > > > > @@ -287,6 +309,7 @@ int mdev_device_create(struct kobject > > > > > > *kobj, > > > > > > > > > > > > mdev->active = true; > > > > > > dev_dbg(&mdev->dev, "MDEV: created\n"); > > > > > > + mdev_put_parent(parent); > > > > > > > > > > > > return 0; > > > > > > > > > > > > @@ -306,7 +329,6 @@ int mdev_device_remove(struct device *dev) > > > > > > struct mdev_device *mdev, *tmp; > > > > > > struct mdev_parent *parent; > > > > > > struct mdev_type *type; > > > > > > - int ret; > > > > > > > > > > > > mdev = to_mdev_device(dev); > > > > > > > > > > > > @@ -330,15 +352,17 @@ int mdev_device_remove(struct device > *dev) > > > > > > mutex_unlock(&mdev_list_lock); > > > > > > > > > > > > type = to_mdev_type(mdev->type_kobj); > > > > > > - mdev_remove_sysfs_files(dev, type); > > > > > > - device_del(&mdev->dev); > > > > > > - parent = mdev->parent; > > > > > > - ret = parent->ops->remove(mdev); > > > > > > - if (ret) > > > > > > - dev_err(&mdev->dev, "Remove failed: err=%d\n", ret); > > > > > > + if (!mdev_try_get_parent(type->parent)) { > > > > > > > > > > Same here: Is there really a guarantee that the refcount is 0 > > > > > when the parent is going away? > > > > A WARN_ON after wait_for_completion or in freeing the parent is > > > > good to > > > catch bugs. > > > > > > I'd rather prefer to avoid having to add WARN_ONs :) > > > > > > This looks like it is supposed to be an early exit. > > remove() is doing early exit if it doesn't get reference to its parent. > > mdev_device_remove_common(). > > > > > However, if some > > > other thread does any create or remove operation at the same time, > > > we'll still do the remove, and we still might have have a race > > > window (and this is getting really hard to follow in the code). > > > > > Which part? > > We have only 4 functions to follow, register_device(), unregister_device(), > create() and remove(). > > > > If you meant, two removes racing with each other? > > If so, that is currently guarded using not_so_well_defined active flag. > > I will cleanup that later once this series is done. > > > > > > > > > > > > > > > > > + /* > > > > > > + * Parent unregistration have started. > > > > > > + * No need to remove here. > > > > > > + */ > > > > > > + mutex_unlock(&mdev_list_lock); > > > > > > > > > > Btw., you already unlocked above. > > > > > > > > > You are right. This unlock is wrong. I will revise the patch. > > > > > > > > > > + return -ENODEV; > > > > > > + } > > > > > > > > > > > > - /* Balances with device_initialize() */ > > > > > > - put_device(&mdev->dev); > > > > > > + parent = mdev->parent; > > > > > > + mdev_device_remove_common(mdev); > > > > > > mdev_put_parent(parent); > > > > > > > > > > > > return 0; > > > > > > diff --git a/drivers/vfio/mdev/mdev_private.h > > > > > > b/drivers/vfio/mdev/mdev_private.h > > > > > > index 924ed2274941..55ebab0af7b0 100644 > > > > > > --- a/drivers/vfio/mdev/mdev_private.h > > > > > > +++ b/drivers/vfio/mdev/mdev_private.h > > > > > > @@ -19,7 +19,11 @@ void mdev_bus_unregister(void); struct > > > > > mdev_parent > > > > > > { > > > > > > struct device *dev; > > > > > > const struct mdev_parent_ops *ops; > > > > > > - struct kref ref; > > > > > > + /* Protects unregistration to wait until create/remove > > > > > > + * are completed. > > > > > > + */ > > > > > > + refcount_t refcount; > > > > > > + struct completion unreg_completion; > > > > > > struct list_head next; > > > > > > struct kset *mdev_types_kset; > > > > > > struct list_head type_list; > > > > > > > > > > I think what's really needed is to split up the different needs > > > > > and not overload the 'refcount' concept. > > > > > > > > > Refcount tells that how many active references are present for > > > > this parent > > > device. > > > > Those active reference could be create/remove processes and mdev > > > > core > > > itself. > > > > > > > > So when parent unregisters, mdev module publishes that it is going > > > > away > > > through this refcount. > > > > Hence new users cannot start. > > > > > > But it does not actually do that -- if there are other create/remove > > > operations running, userspace can still trigger a new create/remove. > > > If it triggers enough create/remove processes, it can keep the > > > parent around (even though that really is a pathological case.) > > > > > Yes. I agree that is still possible. And an extra flag can guard it. > > I see it as try_get_parent() can be improved as incremental to implement and > honor that flag. > > Do you want to roll that flag in same patch in v4? > > > > > > > > > > > - If we need to make sure that a reference to the parent is held so > > > > > that the parent may not go away while still in use, we should > > > > > continue to use the kref (in the idiomatic way it is used before this > > > > > patch.) > > > > > - We need to protect against creation of new devices if the parent is > > > > > going away. Maybe set a going_away marker in the parent structure > for > > > > > that so that creation bails out immediately? > > > > Such marker will help to not start new processes. > > > > So an additional marker can be added to improve mdev_try_get_parent(). > > > > But I couldn't justify why differentiating those two users on time > > > > scale is > > > desired. > > > > One reason could be that user continuously tries to create mdev > > > > and > > > parent never gets a chance, to unregister? > > > > I guess, parent will run out mdev devices before this can happen. > > > > > > They can also run remove tasks in parallel (see above). > > > > > Yes, remove() is guarded using active flag. > > > > > > > > > > Additionally a stop marker is needed (counter) to tell that all > > > > users are > > > done accessing it. > > > > Both purposes are served using a refcount scheme. > > > > > > Why not stop new create/remove tasks on remove, and do the final > > > cleanup asynchronously? I think a refcount is fine to track > > > accesses, but not to block new tasks. > > > > > So a new flag will guard new create/remove tasks by enhancing > try_get_parent(). > > I just didn't see it as critical fix, but it's doable. See above. > > > > Async is certainly not a good idea. > > mdev_release_parent() in current code doesn't nothing other than freeing > memory and parent reference. > > It take away the parent from the list early on, which is also wrong, because it > was added to the list at the end. > > Unregister() sequence should be mirror image. > > Parent device files has to be removed before unregister_device() finishes, > because they were added in register_device(). > > Otherwise, parent device_del() might be done, but files are still created > under it. > > > > If we want to keep the memory around of parent, until kref drops, than we > need two refcounts. > > One ensure that create and remove are done using it, other one that ensures > that child are done using it. > > I fail to justify adding complexity of two counters, because such > two_counter_desire hints that somehow child devices may be still active even > after remove() calls are finished. > > And that should not be the case. Unless I miss such case. > > > > > > > > > > > What happens if the > > > > > creation has already started when parent removal kicks in, though? > > > > That particular creation will succeed but newer cannot start, > > > > because > > > mdev_put_parent() is done. > > > > > > > > > Do we need some child list locking and an indication whether a child > > > > > is in progress of being registered/unregistered? > > > > > - We also need to protect against removal of devices while unregister > > > > > is in progress (same mechanism as above?) The second issue you > > > > > describe above should be fixed then if the children keep a reference > > > > > of the parent. > > > > Parent unregistration publishes that its going away first, so no > > > > new device > > > removal from user can start. > > > > > > I don't think that this actually works as intended (see above). > > > > > It does work in most cases. Only if user space is creating hundreds of > processes for creating mdevs, before they actually run out of creating new one. > > But as we talked a flag will guard it. > > > > So if refcount is ok, I can enhance it for flag. > > I agree with Connie's dislike of the refcount, where it seems we're really just > using it as a read-writer lock. So why not simply use a rwsem? The parent > unregistration path would do a down_write() and all the ancillary paths would > do a down_read_trylock() as they should never see read contention unless the > parent is being removed. Ok. sounds good. I will send v4 using rwsem without removing kref. > As a bonus, we don't need to invent our own fairness > algorithm, nor do we need to remove the krefs as they're at least useful to > validate we haven't missed anyone. Thanks, Well if we really care for kref, put on parent kref should be done in mdev_device_release(). I do not see how device can be still using the parent after device_del() is done. Anyways, kref cleanup is different patch in 5.3. > > Alex