On Thu, 7 Nov 2019 10:08:26 -0600
Parav Pandit <parav@xxxxxxxxxxxx> wrote:

I guess that should be s/Improvise/improve/ in $SUBJECT, no?

> mdev creation and removal sequence synchronization with parent device
> removal is improved in [1].
>
> However such improvement using semaphore either limiting or leads to
> complex locking scheme when used across multiple subsystem such as mdev
> and devlink.
>
> When mdev devices are used with devlink eswitch device, following
> deadlock sequence can be witnessed.
>
> mlx5_core 0000:06:00.0: E-Switch: Disable: mode(OFFLOADS), nvfs(4), active vports(5)
> mlx5_core 0000:06:00.0: MDEV: Unregistering
>
> WARNING: possible circular locking dependency detected
> ------------------------------------------------------
> devlink/42094 is trying to acquire lock:
> 00000000eb6fb4c7 (&parent->unreg_sem){++++}, at: mdev_unregister_device+0xf1/0x160 [mdev]
>
> but task is already holding lock:
> 00000000efcd208e (devlink_mutex){+.+.}, at: devlink_nl_pre_doit+0x1d/0x170
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (devlink_mutex){+.+.}:
>        lock_acquire+0xbd/0x1a0
>        __mutex_lock+0x84/0x8b0
>        devlink_unregister+0x17/0x60
>        mlx5_sf_unload+0x21/0x60 [mlx5_core]
>        mdev_remove+0x1e/0x40 [mdev]
>        device_release_driver_internal+0xdc/0x1a0
>        bus_remove_device+0xef/0x160
>        device_del+0x163/0x360
>        mdev_device_remove_common+0x1e/0xa0 [mdev]
>        mdev_device_remove+0x8d/0xd0 [mdev]
>        remove_store+0x71/0x90 [mdev]
>        kernfs_fop_write+0x113/0x1a0
>        vfs_write+0xad/0x1b0
>        ksys_write+0x5c/0xd0
>        do_syscall_64+0x5a/0x270
>        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> -> #0 (&parent->unreg_sem){++++}:
>        check_prev_add+0xb0/0x810
>        __lock_acquire+0xd4b/0x1090
>        lock_acquire+0xbd/0x1a0
>        down_write+0x33/0x70
>        mdev_unregister_device+0xf1/0x160 [mdev]
>        esw_offloads_disable+0xe/0x70 [mlx5_core]
>        mlx5_eswitch_disable+0x149/0x190 [mlx5_core]
>        mlx5_devlink_eswitch_mode_set+0xd0/0x180 [mlx5_core]
>        devlink_nl_cmd_eswitch_set_doit+0x3e/0xb0
>        genl_family_rcv_msg+0x3a2/0x420
>        genl_rcv_msg+0x47/0x90
>        netlink_rcv_skb+0xc9/0x100
>        genl_rcv+0x24/0x40
>        netlink_unicast+0x179/0x220
>        netlink_sendmsg+0x2f6/0x3f0
>        sock_sendmsg+0x30/0x40
>        __sys_sendto+0xdc/0x160
>        __x64_sys_sendto+0x24/0x30
>        do_syscall_64+0x5a/0x270
>        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(devlink_mutex);
>                                lock(&parent->unreg_sem);
>                                lock(devlink_mutex);
>   lock(&parent->unreg_sem);
>
>  *** DEADLOCK ***
>
> 3 locks held by devlink/42094:
> 0: 0000000097a0c4aa (cb_lock){++++}, at: genl_rcv+0x15/0x40
> 1: 00000000baf61ad2 (genl_mutex){+.+.}, at: genl_rcv_msg+0x66/0x90
> 2: 00000000efcd208e (devlink_mutex){+.+.}, at: devlink_nl_pre_doit+0x1d/0x170
>
> To summarize,
> mdev_remove()
>   read locks -> unreg_sem [ lock-A ]
>   [..]
>   devlink_unregister();
>     mutex lock devlink_mutex [ lock-B ]
>
> devlink eswitch->switchdev-legacy mode change.
>  devlink_nl_cmd_eswitch_set_doit()
>    mutex lock devlink_mutex [ lock-B ]
>    mdev_unregister_device()
>      write locks -> unreg_sem [ lock-A]

So, this problem starts to pop up once you hook up that devlink stuff
with the mdev stuff, and previous users of mdev just did not have a
locking scheme similar to devlink?

>
> Hence, instead of using semaphore, such synchronization is achieved
> using srcu which is more flexible that eliminates nested locking.
>
> SRCU based solution is already proposed before at [2].
>
> [1] commit 5715c4dd66a3 ("vfio/mdev: Synchronize device create/remove with parent removal")
> [2] https://lore.kernel.org/patchwork/patch/1055254/

I don't quite recall the discussion there... is this a rework of a
patch you proposed before? Confused.

>
> Signed-off-by: Parav Pandit <parav@xxxxxxxxxxxx>
> ---
>  drivers/vfio/mdev/mdev_core.c    | 56 +++++++++++++++++++++++---------
>  drivers/vfio/mdev/mdev_private.h |  3 +-
>  2 files changed, 43 insertions(+), 16 deletions(-)

(...)

> @@ -207,6 +207,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>                  dev_warn(dev, "Failed to create compatibility class link\n");
>
>          list_add(&parent->next, &parent_list);
> +        rcu_assign_pointer(parent->self, parent);
>          mutex_unlock(&parent_list_lock);
>
>          dev_info(dev, "MDEV: Registered\n");
> @@ -250,14 +251,29 @@ void mdev_unregister_device(struct device *dev)
>          list_del(&parent->next);
>          mutex_unlock(&parent_list_lock);
>
> -        down_write(&parent->unreg_sem);
> +        /*
> +         * Publish that this mdev parent is unregistering. So any new
> +         * create/remove cannot start on this parent anymore by user.
> +         */
> +        rcu_assign_pointer(parent->self, NULL);
> +
> +        /*
> +         * Wait for any active create() or remove() mdev ops on the parent
> +         * to complete.
> +         */
> +        synchronize_srcu(&parent->unreg_srcu);
> +
> +        /*
> +         * At this point it is confirmed that any pending user initiated
> +         * create or remove callbacks accessing the parent are completed.
> +         * It is safe to remove the parent now.
> +         */

So, you're putting an srcu-handled self reference there and use that as
an indication whether the parent is unregistering?

>
>          class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
>
>          device_for_each_child(dev, NULL, mdev_device_remove_cb);
>
>          parent_remove_sysfs_files(parent);
> -        up_write(&parent->unreg_sem);
>
>          mdev_put_parent(parent);
>
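
As a rough illustration of the publish-then-wait idea in the hunk above (clear a self pointer, then wait for in-flight create/remove operations), here is a userspace sketch in C11. It is an analogy under stated assumptions, not the mdev code and not the kernel SRCU API: `struct toy_parent`, its `readers` counter and all `toy_parent_*` helpers are hypothetical names, and a plain counter with a busy-wait stands in for what `synchronize_srcu()` does in the patch. The reader side of the actual patch is not visible in this excerpt.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the mdev parent; not the kernel struct. */
struct toy_parent {
	_Atomic(struct toy_parent *) self; /* NULL once unregistering */
	atomic_int readers;                /* create/remove ops in flight */
};

/* Publish the parent, loosely like rcu_assign_pointer(parent->self, parent). */
static inline void toy_parent_register(struct toy_parent *p)
{
	atomic_init(&p->readers, 0);
	atomic_init(&p->self, p);
}

/*
 * Start a create/remove operation. The reader announces itself first and
 * only then checks the self pointer, so an unregister that has already
 * cleared the pointer is always noticed. Returns false if the parent is
 * going away.
 */
static inline bool toy_parent_enter(struct toy_parent *p)
{
	atomic_fetch_add(&p->readers, 1);
	if (atomic_load(&p->self) == NULL) {
		atomic_fetch_sub(&p->readers, 1);
		return false;
	}
	return true;
}

/* Finish a create/remove operation that toy_parent_enter() allowed. */
static inline void toy_parent_exit(struct toy_parent *p)
{
	int prev = atomic_fetch_sub(&p->readers, 1);

	assert(prev > 0); /* exit must pair with a successful enter */
	(void)prev;
}

/*
 * Stop new operations, then wait for those already running. The busy-wait
 * is a crude placeholder for a grace-period wait.
 */
static inline void toy_parent_unregister(struct toy_parent *p)
{
	atomic_store(&p->self, NULL);
	while (atomic_load(&p->readers) != 0)
		; /* spin until in-flight operations have exited */
}
```

In this toy, a create or remove path would bracket its work with `toy_parent_enter()` and `toy_parent_exit()` and bail out when enter returns false. Note that the waiter still blocks until every in-flight operation has exited, so the sketch does not by itself say what happens if such an operation needs a lock the waiter is holding.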