Hi Alex, > -----Original Message----- > From: Alex Williamson <alex.williamson@xxxxxxxxxx> > Sent: Tuesday, March 26, 2019 10:27 AM > To: Kirti Wankhede <kwankhede@xxxxxxxxxx> > Cc: Parav Pandit <parav@xxxxxxxxxxxx>; kvm@xxxxxxxxxxxxxxx; linux- > kernel@xxxxxxxxxxxxxxx; Neo Jia <cjia@xxxxxxxxxx> > Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence > > On Tue, 26 Mar 2019 12:36:22 +0530 > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: > > > On 3/23/2019 4:50 AM, Parav Pandit wrote: > > > There are five problems with current code structure. > > > 1. mdev device is placed on the mdev bus before it is created in the > > > vendor driver. Once a device is placed on the mdev bus without > > > creating its supporting underlying vendor device, an open() can get > > > triggered by userspace on partially initialized device. > > > Below ladder diagram highlight it. > > > > > > cpu-0 cpu-1 > > > ----- ----- > > > create_store() > > > mdev_create_device() > > > device_register() > > > ... > > > vfio_mdev_probe() > > > ...creates char device > > > vfio_mdev_open() > > > parent->ops->open(mdev) > > > vfio_ap_mdev_open() > > > matrix_mdev = NULL > > > [...] > > > parent->ops->create() > > > vfio_ap_mdev_create() > > > mdev_set_drvdata(mdev, matrix_mdev); > > > /* Valid pointer set above */ > > > > > > > VFIO interface uses sysfs path of device or PCI device's BDF where it > > checks sysfs file for that device exist. > > In case of VFIO mdev device, above situation will never happen as open > > will only get called if sysfs entry for that device exist. > > > > If you don't use VFIO interface then this situation can arise. In that > > case probe() can be used for very basic initialization then create > > actual char device from create(). > > > > > > > 2. Current creation sequence is, > > > parent->ops_create() > > > groups_register() > > > > > > Remove sequence is, > > > parent->ops->remove() > > > groups_unregister() > > > However, remove sequence should be exact mirror of creation sequence. > > > Once this is achieved, all users of the mdev will be terminated > > > first before removing underlying vendor device. > > > (Follow standard linux driver model). > > > At that point vendor's remove() ops shouldn't failed because device > > > is taken off the bus that should terminate the users. > > > > > > > If VMM or user space application is using mdev device, > > parent->ops->remove() can return failure. In that case sysfs files > > shouldn't be removed. Hence above sequence is followed for remove. > > > > Standard linux driver model doesn't allow remove() to fail, but in of > > mdev framework, interface is defined to handle such error case. > > > > > > > 3. Additionally any new mdev driver that wants to work on mdev > > > device during probe() routine registered using > > > mdev_register_driver() needs to get stable mdev structure. > > > > > > > Things that you are trying to handle with mdev structure from probe(), > > couldn't that be moved to create()? > > > > > > > 4. In following sequence, child devices created while removing mdev > > > parent device can be left out, or it may lead to race of removing > > > half initialized child mdev devices. > > > > > > issue-1: > > > -------- > > > cpu-0 cpu-1 > > > ----- ----- > > > mdev_unregister_device() > > > device_for_each_child() > > > mdev_device_remove_cb() > > > mdev_device_remove() > > > create_store() > > > mdev_device_create() [...] > > > device_register() > > > parent_remove_sysfs_files() > > > /* BUG: device added by cpu-0 > > > * whose parent is getting removed. > > > */ > > > > > > issue-2: > > > -------- > > > cpu-0 cpu-1 > > > ----- ----- > > > create_store() > > > mdev_device_create() [...] > > > device_register() > > > > > > [...] mdev_unregister_device() > > > device_for_each_child() > > > mdev_device_remove_cb() > > > mdev_device_remove() > > > > > > mdev_create_sysfs_files() > > > /* BUG: create is adding > > > * sysfs files for a device > > > * which is undergoing removal. > > > */ > > > parent_remove_sysfs_files() > > > > > > 5. Below crash is observed when user initiated remove is in progress > > > and mdev_unregister_driver() completes parent unregistration. > > > > > > cpu-0 cpu-1 > > > ----- ----- > > > remove_store() > > > mdev_device_remove() > > > active = false; > > > mdev_unregister_device() > > > remove type > > > [...] > > > mdev_remove_ops() crashes. > > > > > > This is similar race like create() racing with mdev_unregister_device(). > > > > > > mtty mtty: MDEV: Registered > > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group > > > 57 vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = > > > 57 mdev_device_remove sleep started mtty mtty: MDEV: Unregistering > > > mtty_dev: Unloaded! > > > BUG: unable to handle kernel paging request at ffffffffc027d668 PGD > > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0 > > > Oops: 0000 [#1] SMP PTI > > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted > > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro > > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016 > > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace: > > > mdev_device_remove+0xef/0x130 [mdev] > > > remove_store+0x77/0xa0 [mdev] > > > kernfs_fop_write+0x113/0x1a0 > > > __vfs_write+0x33/0x1b0 > > > ? rcu_read_lock_sched_held+0x64/0x70 > > > ? rcu_sync_lockdep_assert+0x2a/0x50 ? __sb_start_write+0x121/0x1b0 > > > ? vfs_write+0x17c/0x1b0 > > > vfs_write+0xad/0x1b0 > > > ? trace_hardirqs_on_thunk+0x1a/0x1c > > > ksys_write+0x55/0xc0 > > > do_syscall_64+0x5a/0x210 > > > > > > Therefore, mdev core is improved in following ways to overcome above > > > issues. > > > > > > 1. Before placing mdev devices on the bus, perform vendor drivers > > > creation which supports the mdev creation. > > > This ensures that mdev specific all necessary fields are initialized > > > before a given mdev can be accessed by bus driver. > > > > > > 2. During remove flow, first remove the device from the bus. This > > > ensures that any bus specific devices and data is cleared. > > > Once device is taken of the mdev bus, perform remove() of mdev from > > > the vendor driver. > > > > > > > If user space application is using the device and someone underneath > > remove the device from bus, how would use space application know that > > device is being removed? > > If DMA is setup, user space application is accessing that memory and > > device is removed from bus - how will you restrict to not to remove > > that device? If remove() is not restricted then host might crash. > > I know Linux kernel device core model doesn't allow remove() to fail, > > but we had tackled that problem for mdev devices in this framework. I > > prefer not to change this behavior. This will regress existing working > > drivers. > > > We have exactly this issue with vfio-pci, or really any vfio driver, where the > solution is that a remove request is blocked until the device becomes > unused by the user. In fact there's a notification that userspace can connect > to so that we don't need to silently wait for userspace to be done. We could > also potentially kill the userspace application using the device, or if we ever > implemented revoke support for mmaps, we could unmap the device and > the use could handle the SIGBUS. With Parav's suggestion to fix the ordering > such that the device is first removed from the bus, where the blocking > opportunity comes into play, it might be time to let go of this one-off > force/not-force behavior. Thanks, > Yes. I think we should do it. For now (for next few days), I am dropping this particular order fixing patch from the series. >From my last 8th patch, I am keeping only the fix for create/remove race with parent removal along with other fixes and cleanup. Posting the v1 in sometime to make progress on already reviewed parts and part of the 8th patch. I cannot split the remove_common() helper function to a different patch, because remove_cb() will bypass mdev->active check without srcu(). So as individual patch, its not correct behavior. Hence, that small refactor is part of srcu fix.