On 04/02/2019 07:09 PM, Alex Williamson wrote:
On Tue, 2 Apr 2019 17:27:49 -0400 Farhan Ali <alifm@xxxxxxxxxxxxx> wrote:On 04/02/2019 02:50 PM, Alex Williamson wrote:On Mon, 1 Apr 2019 15:30:27 -0400 Farhan Ali <alifm@xxxxxxxxxxxxx> wrote:vfio_dev_present() which is the condition to wait_event_interruptible_timeout(), will call vfio_group_get_device and try to acquire the mutex group->device_lock. wait_event_interruptible_timeout() will set the state of the current task to TASK_INTERRUPTIBLE, before doing the condition check. This means that we will try to accquire the mutex while already in a sleeping state. The scheduler warns us by giving the following warning: [ 4050.264464] ------------[ cut here ]------------ [ 4050.264508] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000b33c00e2>] prepare_to_wait_event+0x14a/0x188 [ 4050.264529] WARNING: CPU: 12 PID: 35924 at kernel/sched/core.c:6112 __might_sleep+0x76/0x90 .... 4050.264756] Call Trace: [ 4050.264765] ([<000000000017bbaa>] __might_sleep+0x72/0x90) [ 4050.264774] [<0000000000b97edc>] __mutex_lock+0x44/0x8c0 [ 4050.264782] [<0000000000b9878a>] mutex_lock_nested+0x32/0x40 [ 4050.264793] [<000003ff800d7abe>] vfio_group_get_device+0x36/0xa8 [vfio] [ 4050.264803] [<000003ff800d87c0>] vfio_del_group_dev+0x238/0x378 [vfio] [ 4050.264813] [<000003ff8015f67c>] mdev_remove+0x3c/0x68 [mdev] [ 4050.264825] [<00000000008e01b0>] device_release_driver_internal+0x168/0x268 [ 4050.264834] [<00000000008de692>] bus_remove_device+0x162/0x190 [ 4050.264843] [<00000000008daf42>] device_del+0x1e2/0x368 [ 4050.264851] [<00000000008db12c>] device_unregister+0x64/0x88 [ 4050.264862] [<000003ff8015ed84>] mdev_device_remove+0xec/0x130 [mdev] [ 4050.264872] [<000003ff8015f074>] remove_store+0x6c/0xa8 [mdev] [ 4050.264881] [<000000000046f494>] kernfs_fop_write+0x14c/0x1f8 [ 4050.264890] [<00000000003c1530>] __vfs_write+0x38/0x1a8 [ 4050.264899] [<00000000003c187c>] vfs_write+0xb4/0x198 [ 4050.264908] [<00000000003c1af2>] ksys_write+0x5a/0xb0 [ 4050.264916] [<0000000000b9e270>] system_call+0xdc/0x2d8 [ 4050.264925] 4 locks held by sh/35924: [ 4050.264933] #0: 000000001ef90325 (sb_writers#4){.+.+}, at: vfs_write+0x9e/0x198 [ 4050.264948] #1: 000000005c1ab0b3 (&of->mutex){+.+.}, at: kernfs_fop_write+0x1cc/0x1f8 [ 4050.264963] #2: 0000000034831ab8 (kn->count#297){++++}, at: kernfs_remove_self+0x12e/0x150 [ 4050.264979] #3: 00000000e152484f (&dev->mutex){....}, at: device_release_driver_internal+0x5c/0x268 [ 4050.264993] Last Breaking-Event-Address: [ 4050.265002] [<000000000017bbaa>] __might_sleep+0x72/0x90 [ 4050.265010] irq event stamp: 7039 [ 4050.265020] hardirqs last enabled at (7047): [<00000000001cee7a>] console_unlock+0x6d2/0x740 [ 4050.265029] hardirqs last disabled at (7054): [<00000000001ce87e>] console_unlock+0xd6/0x740 [ 4050.265040] softirqs last enabled at (6416): [<0000000000b8fe26>] __udelay+0xb6/0x100 [ 4050.265049] softirqs last disabled at (6415): [<0000000000b8fe06>] __udelay+0x96/0x100 [ 4050.265057] ---[ end trace d04a07d39d99a9f9 ]--- Let's fix this as described in the article https://lwn.net/Articles/628628/. Signed-off-by: Farhan Ali <alifm@xxxxxxxxxxxxx> --- This fixes the warning for me, but would appreciate feedback and some more testing with this patch.Hi Farhan, thanks for the patch. I was able to reproduce the original issue with vfio-pci by enabling CONFIG_DEBUG_ATOMIC_SLEEP=y in my kernel config and attempting to remove the device in the host while assigned to a guest. The patch below resolves that warning, but introduces other poor behavior which you might not notice on an mdev device since vfio_mdev_dev_ops doesn't implement a request callback...Hi Alex, Thanks for your response and testing the patch.drivers/vfio/vfio.c | 28 ++++++++++------------------ 1 file changed, 10 insertions(+), 18 deletions(-) diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index a3030cd..8a696f0 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -922,13 +922,12 @@ static bool vfio_dev_present(struct vfio_group *group, struct device *dev) * removed. Open file descriptors for the device... */ void *vfio_del_group_dev(struct device *dev) { + DEFINE_WAIT_FUNC(wait, woken_wake_function); struct vfio_device *device = dev_get_drvdata(dev); struct vfio_group *group = device->group; void *device_data = device->device_data; struct vfio_unbound_dev *unbound; unsigned int i = 0; - long ret; - bool interrupted = false;/** The group exists so long as we have a device reference. Get @@ -964,6 +963,8 @@ void *vfio_del_group_dev(struct device *dev) * interval with counter to allow the driver to take escalating * measures to release the device if it has the ability to do so. */ + add_wait_queue(&vfio.release_q, &wait); + do { device = vfio_group_get_device(group, dev); if (!device) @@ -974,23 +975,14 @@ void *vfio_del_group_dev(struct device *dev)vfio_device_put(device); - if (interrupted) {- ret = wait_event_timeout(vfio.release_q, - !vfio_dev_present(group, dev), HZ * 10); - } else { - ret = wait_event_interruptible_timeout(vfio.release_q, - !vfio_dev_present(group, dev), HZ * 10); - if (ret == -ERESTARTSYS) { - interrupted = true; - dev_warn(dev, - "Device is currently in use, task" - " \"%s\" (%d) " - "blocked until device is released", - current->comm, task_pid_nr(current)); - } - } - } while (ret <= 0); + if (!vfio_dev_present(group, dev)) + break; + + wait_woken(&wait, TASK_INTERRUPTIBLE, HZ * 10);When this gets interrupted, the interrupting signal is not cleared, so this will return immediately on the next loop. We go from casually bumping the driver to release the device and checking every 10s to burning a full core until the device is released. IIRC, this is partially why the old code switched from the interruptible timeout version of wait_event to simply the timeout version (also I thought it useful to acknowledge the first attempt to interrupt the blocked process to indicate we're not deadlocked, we're waiting for the device). Ideally we'd keep this same behavior, knowing if we've been interrupted to leave a breadcrumb in the log, but we at least need to figure out how to consume or block the signal that generated the interruption so we can go back to waiting gracefully. Thanks, AlexHrmm, we could call the wait_woken to set task state to TASK_UNINTERRUPTIBLE, which would prevent us from getting interrupted. Something like the snippet below: if (interrupted) { wait_woken(&wait, TASK_UNINTERRUPTIBLE, HZ * 10); } else { wait_woken(&wait, TASK_INTERRUPTIBLE, HZ * 10); if (signal_pending(current)) { interrupted = true; } } what do you think?Yeah, that could work. Thanks, Alex
Great, I will send out a v2 soon. Thanks for your feedback. Thanks Farhan