Re: [RFC v1 1/1] vfio: Fix WARNING "do not call blocking ops when !TASK_RUNNING"

Farhan Ali <alifm@xxxxxxxxxxxxx> · Tue, 2 Apr 2019 19:22:25 -0400

On 04/02/2019 07:09 PM, Alex Williamson wrote:
On Tue, 2 Apr 2019 17:27:49 -0400
Farhan Ali <alifm@xxxxxxxxxxxxx> wrote:

On 04/02/2019 02:50 PM, Alex Williamson wrote:
On Mon,  1 Apr 2019 15:30:27 -0400
Farhan Ali <alifm@xxxxxxxxxxxxx> wrote:
   
vfio_dev_present() which is the condition to
wait_event_interruptible_timeout(), will call vfio_group_get_device
and try to acquire the mutex group->device_lock.

wait_event_interruptible_timeout() will set the state of the current
task to TASK_INTERRUPTIBLE, before doing the condition check. This
means that we will try to accquire the mutex while already in a
sleeping state. The scheduler warns us by giving the following
warning:

[ 4050.264464] ------------[ cut here ]------------
[ 4050.264508] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000b33c00e2>] prepare_to_wait_event+0x14a/0x188
[ 4050.264529] WARNING: CPU: 12 PID: 35924 at kernel/sched/core.c:6112 __might_sleep+0x76/0x90
....

   4050.264756] Call Trace:
[ 4050.264765] ([<000000000017bbaa>] __might_sleep+0x72/0x90)
[ 4050.264774]  [<0000000000b97edc>] __mutex_lock+0x44/0x8c0
[ 4050.264782]  [<0000000000b9878a>] mutex_lock_nested+0x32/0x40
[ 4050.264793]  [<000003ff800d7abe>] vfio_group_get_device+0x36/0xa8 [vfio]
[ 4050.264803]  [<000003ff800d87c0>] vfio_del_group_dev+0x238/0x378 [vfio]
[ 4050.264813]  [<000003ff8015f67c>] mdev_remove+0x3c/0x68 [mdev]
[ 4050.264825]  [<00000000008e01b0>] device_release_driver_internal+0x168/0x268
[ 4050.264834]  [<00000000008de692>] bus_remove_device+0x162/0x190
[ 4050.264843]  [<00000000008daf42>] device_del+0x1e2/0x368
[ 4050.264851]  [<00000000008db12c>] device_unregister+0x64/0x88
[ 4050.264862]  [<000003ff8015ed84>] mdev_device_remove+0xec/0x130 [mdev]
[ 4050.264872]  [<000003ff8015f074>] remove_store+0x6c/0xa8 [mdev]
[ 4050.264881]  [<000000000046f494>] kernfs_fop_write+0x14c/0x1f8
[ 4050.264890]  [<00000000003c1530>] __vfs_write+0x38/0x1a8
[ 4050.264899]  [<00000000003c187c>] vfs_write+0xb4/0x198
[ 4050.264908]  [<00000000003c1af2>] ksys_write+0x5a/0xb0
[ 4050.264916]  [<0000000000b9e270>] system_call+0xdc/0x2d8
[ 4050.264925] 4 locks held by sh/35924:
[ 4050.264933]  #0: 000000001ef90325 (sb_writers#4){.+.+}, at: vfs_write+0x9e/0x198
[ 4050.264948]  #1: 000000005c1ab0b3 (&of->mutex){+.+.}, at: kernfs_fop_write+0x1cc/0x1f8
[ 4050.264963]  #2: 0000000034831ab8 (kn->count#297){++++}, at: kernfs_remove_self+0x12e/0x150
[ 4050.264979]  #3: 00000000e152484f (&dev->mutex){....}, at: device_release_driver_internal+0x5c/0x268
[ 4050.264993] Last Breaking-Event-Address:
[ 4050.265002]  [<000000000017bbaa>] __might_sleep+0x72/0x90
[ 4050.265010] irq event stamp: 7039
[ 4050.265020] hardirqs last  enabled at (7047): [<00000000001cee7a>] console_unlock+0x6d2/0x740
[ 4050.265029] hardirqs last disabled at (7054): [<00000000001ce87e>] console_unlock+0xd6/0x740
[ 4050.265040] softirqs last  enabled at (6416): [<0000000000b8fe26>] __udelay+0xb6/0x100
[ 4050.265049] softirqs last disabled at (6415): [<0000000000b8fe06>] __udelay+0x96/0x100
[ 4050.265057] ---[ end trace d04a07d39d99a9f9 ]---

Let's fix this as described in the article https://lwn.net/Articles/628628/.

Signed-off-by: Farhan Ali <alifm@xxxxxxxxxxxxx>
---

This fixes the warning for me, but would appreciate feedback
and some more testing with this patch.

Hi Farhan, thanks for the patch.  I was able to reproduce the original
issue with vfio-pci by enabling CONFIG_DEBUG_ATOMIC_SLEEP=y in my
kernel config and attempting to remove the device in the host while
assigned to a guest.  The patch below resolves that warning, but
introduces other poor behavior which you might not notice on an mdev
device since vfio_mdev_dev_ops doesn't implement a request callback...

Hi Alex,

Thanks for your response and testing the patch.

   
   drivers/vfio/vfio.c | 28 ++++++++++------------------
   1 file changed, 10 insertions(+), 18 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index a3030cd..8a696f0 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -922,13 +922,12 @@ static bool vfio_dev_present(struct vfio_group *group, struct device *dev)
    * removed.  Open file descriptors for the device... */
   void *vfio_del_group_dev(struct device *dev)
   {
+	DEFINE_WAIT_FUNC(wait, woken_wake_function);
   	struct vfio_device *device = dev_get_drvdata(dev);
   	struct vfio_group *group = device->group;
   	void *device_data = device->device_data;
   	struct vfio_unbound_dev *unbound;
   	unsigned int i = 0;
-	long ret;
-	bool interrupted = false;
   
   	/*
   	 * The group exists so long as we have a device reference.  Get
@@ -964,6 +963,8 @@ void *vfio_del_group_dev(struct device *dev)
   	 * interval with counter to allow the driver to take escalating
   	 * measures to release the device if it has the ability to do so.
   	 */
+	add_wait_queue(&vfio.release_q, &wait);
+
   	do {
   		device = vfio_group_get_device(group, dev);
   		if (!device)
@@ -974,23 +975,14 @@ void *vfio_del_group_dev(struct device *dev)
   
   		vfio_device_put(device);
   
-		if (interrupted) {
-			ret = wait_event_timeout(vfio.release_q,
-					!vfio_dev_present(group, dev), HZ * 10);
-		} else {
-			ret = wait_event_interruptible_timeout(vfio.release_q,
-					!vfio_dev_present(group, dev), HZ * 10);
-			if (ret == -ERESTARTSYS) {
-				interrupted = true;
-				dev_warn(dev,
-					 "Device is currently in use, task"
-					 " \"%s\" (%d) "
-					 "blocked until device is released",
-					 current->comm, task_pid_nr(current));
-			}
-		}
-	} while (ret <= 0);
+		if (!vfio_dev_present(group, dev))
+			break;
+
+		wait_woken(&wait, TASK_INTERRUPTIBLE, HZ * 10);

When this gets interrupted, the interrupting signal is not cleared, so
this will return immediately on the next loop.  We go from casually
bumping the driver to release the device and checking every 10s to
burning a full core until the device is released.  IIRC, this is
partially why the old code switched from the interruptible timeout
version of wait_event to simply the timeout version (also I thought it
useful to acknowledge the first attempt to interrupt the blocked
process to indicate we're not deadlocked, we're waiting for the
device).  Ideally we'd keep this same behavior, knowing if we've been
interrupted to leave a breadcrumb in the log, but we at least need to
figure out how to consume or block the signal that generated the
interruption so we can go back to waiting gracefully.  Thanks,

Alex

Hrmm, we could call the wait_woken to set task state to
TASK_UNINTERRUPTIBLE, which would prevent us from getting interrupted.

Something like the snippet below:

if (interrupted) {

    wait_woken(&wait, TASK_UNINTERRUPTIBLE, HZ * 10);
} else {
    wait_woken(&wait, TASK_INTERRUPTIBLE, HZ * 10);
    if (signal_pending(current)) {
        interrupted = true;
    }
}

what do you think?

Yeah, that could work.  Thanks,

Alex



Great, I will send out a v2 soon. Thanks for your feedback.

Thanks
Farhan