Re: [PATCH 10/10] vfio/qat: Add vfio_pci driver for Intel QAT VF devices

Yishai Hadas <yishaih@xxxxxxxxxx> · Sun, 11 Feb 2024 10:17:14 +0200

On 09/02/2024 14:10, Jason Gunthorpe wrote:
On Fri, Feb 09, 2024 at 08:23:32AM +0000, Zeng, Xin wrote:
Thanks for your comments, Jason.
On Tuesday, February 6, 2024 8:55 PM, Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
+
+	ops = mdev->ops;
+	if (!ops || !ops->init || !ops->cleanup ||
+	    !ops->open || !ops->close ||
+	    !ops->save_state || !ops->load_state ||
+	    !ops->suspend || !ops->resume) {
+		ret = -EIO;
+		dev_err(&parent->dev, "Incomplete device migration ops
structure!");
+		goto err_destroy;
+	}

Why are there ops pointers here? I would expect this should just be
direct function calls to the PF QAT driver.

I indeed had a version where the direct function calls are Implemented in
QAT driver, while when I look at the functions, most of them
only translate the interface to the ops pointer. That's why I put
ops pointers directly into vfio variant driver.

But why is there an ops indirection at all? Are there more than one
ops?

+static void qat_vf_pci_aer_reset_done(struct pci_dev *pdev)
+{
+	struct qat_vf_core_device *qat_vdev = qat_vf_drvdata(pdev);
+
+	if (!qat_vdev->core_device.vdev.mig_ops)
+		return;
+
+	/*
+	 * As the higher VFIO layers are holding locks across reset and using
+	 * those same locks with the mm_lock we need to prevent ABBA
deadlock
+	 * with the state_mutex and mm_lock.
+	 * In case the state_mutex was taken already we defer the cleanup work
+	 * to the unlock flow of the other running context.
+	 */
+	spin_lock(&qat_vdev->reset_lock);
+	qat_vdev->deferred_reset = true;
+	if (!mutex_trylock(&qat_vdev->state_mutex)) {
+		spin_unlock(&qat_vdev->reset_lock);
+		return;
+	}
+	spin_unlock(&qat_vdev->reset_lock);
+	qat_vf_state_mutex_unlock(qat_vdev);
+}

Do you really need this? I thought this ugly thing was going to be a
uniquely mlx5 thing..

I think that's still required to make the migration state synchronized
if the VF is reset by other VFIO emulation paths. Is it the case?
BTW, this implementation is not only in mlx5 driver, but also in other
Vfio pci variant drivers such as hisilicon acc driver and pds
driver.

It had to specifically do with the mm lock interaction that, I
thought, was going to be unique to the mlx driver. Otherwise you could
just directly hold the state_mutex here.

Yishai do you remember the exact trace for the mmlock entanglement?

I found in my in-box (from more than 2.5 years ago) the below [1] 
lockdep WARNING when the state_mutex was used directly.

Once we moved to the 'deferred_reset' mode it solved that.

[1]
[  +1.063822] ======================================================
[  +0.000732] WARNING: possible circular locking dependency detected
[  +0.000747] 5.15.0-rc3 #236 Not tainted
[  +0.000556] ------------------------------------------------------
[  +0.000714] qemu-system-x86/7731 is trying to acquire lock:
[  +0.000659] ffff888126c64b78 (&vdev->vma_lock){+.+.}-{3:3}, at: 
vfio_pci_mmap_fault+0x35/0x140 [vfio_pci_core]
[  +0.001127]
              but task is already holding lock:
[  +0.000803] ffff888105f4c5d8 (&mm->mmap_lock#2){++++}-{3:3}, at: 
vaddr_get_pfns+0x64/0x240 [vfio_iommu_type1]
[  +0.001119]
              which lock already depends on the new lock.

[  +0.001086]
              the existing dependency chain (in reverse order) is:
[  +0.000910]
              -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
[  +0.000844]        __might_fault+0x56/0x80
[  +0.000572]        _copy_to_user+0x1e/0x80
[  +0.000556]        mlx5vf_pci_mig_rw.cold+0xa1/0x24f [mlx5_vfio_pci]
[  +0.000732]        vfs_read+0xa8/0x1c0
[  +0.000547]        __x64_sys_pread64+0x8c/0xc0
[  +0.000580]        do_syscall_64+0x3d/0x90
[  +0.000566]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000682]
              -> #2 (&mvdev->state_mutex){+.+.}-{3:3}:
[  +0.000899]        __mutex_lock+0x80/0x9d0
[  +0.000566]        mlx5vf_reset_done+0x2c/0x40 [mlx5_vfio_pci]
[  +0.000697]        vfio_pci_core_ioctl+0x585/0x1020 [vfio_pci_core]
[  +0.000721]        __x64_sys_ioctl+0x436/0x9a0
[  +0.000588]        do_syscall_64+0x3d/0x90
[  +0.000584]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000674]
              -> #1 (&vdev->memory_lock){+.+.}-{3:3}:
[  +0.000843]        down_write+0x38/0x70
[  +0.000544]        vfio_pci_zap_and_down_write_memory_lock+0x1c/0x30 
[vfio_pci_core]
[  +0.003705]        vfio_basic_config_write+0x1e7/0x280 [vfio_pci_core]
[  +0.000744]        vfio_pci_config_rw+0x1b7/0x3af [vfio_pci_core]
[  +0.000716]        vfs_write+0xe6/0x390
[  +0.000539]        __x64_sys_pwrite64+0x8c/0xc0
[  +0.000603]        do_syscall_64+0x3d/0x90
[  +0.000572]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000661]
              -> #0 (&vdev->vma_lock){+.+.}-{3:3}:
[  +0.000828]        __lock_acquire+0x1244/0x2280
[  +0.000596]        lock_acquire+0xc2/0x2c0
[  +0.000556]        __mutex_lock+0x80/0x9d0
[  +0.000580]        vfio_pci_mmap_fault+0x35/0x140 [vfio_pci_core]
[  +0.000709]        __do_fault+0x32/0xa0
[  +0.000556]        __handle_mm_fault+0xbe8/0x1450
[  +0.000606]        handle_mm_fault+0x6c/0x140
[  +0.000624]        fixup_user_fault+0x6b/0x100
[  +0.000600]        vaddr_get_pfns+0x108/0x240 [vfio_iommu_type1]
[  +0.000726]        vfio_pin_pages_remote+0x326/0x460 [vfio_iommu_type1]
[  +0.000736]        vfio_iommu_type1_ioctl+0x43b/0x15a0 [vfio_iommu_type1]
[  +0.000752]        __x64_sys_ioctl+0x436/0x9a0
[  +0.000588]        do_syscall_64+0x3d/0x90
[  +0.000571]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000677]
              other info that might help us debug this:

[  +0.001073] Chain exists of:
                &vdev->vma_lock --> &mvdev->state_mutex --> 
&mm->mmap_lock#2

[  +0.001285]  Possible unsafe locking scenario:

[  +0.000808]        CPU0                    CPU1
[  +0.000599]        ----                    ----
[  +0.000593]   lock(&mm->mmap_lock#2);
[  +0.000530]                                lock(&mvdev->state_mutex);
[  +0.000725]                                lock(&mm->mmap_lock#2);
[  +0.000712]   lock(&vdev->vma_lock);
[  +0.000532]
               *** DEADLOCK ***

[  +0.000922] 2 locks held by qemu-system-x86/7731:
[  +0.000597]  #0: ffff88810c9bec88 (&iommu->lock#2){+.+.}-{3:3}, at: 
vfio_iommu_type1_ioctl+0x189/0x15a0 [vfio_iommu_type1]
[  +0.001177]  #1: ffff888105f4c5d8 (&mm->mmap_lock#2){++++}-{3:3}, at: 
vaddr_get_pfns+0x64/0x240 [vfio_iommu_type1]
[  +0.001153]
              stack backtrace:
[  +0.000689] CPU: 1 PID: 7731 Comm: qemu-system-x86 Not tainted 
5.15.0-rc3 #236
[  +0.000932] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[  +0.001192] Call Trace:
[  +0.000454]  dump_stack_lvl+0x45/0x59
[  +0.000527]  check_noncircular+0xf2/0x110
[  +0.000557]  __lock_acquire+0x1244/0x2280
[  +0.002462]  lock_acquire+0xc2/0x2c0
[  +0.000534]  ? vfio_pci_mmap_fault+0x35/0x140 [vfio_pci_core]
[  +0.000684]  ? lock_is_held_type+0x98/0x110
[  +0.000565]  __mutex_lock+0x80/0x9d0
[  +0.000542]  ? vfio_pci_mmap_fault+0x35/0x140 [vfio_pci_core]
[  +0.000676]  ? vfio_pci_mmap_fault+0x35/0x140 [vfio_pci_core]
[  +0.000681]  ? mark_held_locks+0x49/0x70
[  +0.000551]  ? lock_is_held_type+0x98/0x110
[  +0.000575]  vfio_pci_mmap_fault+0x35/0x140 [vfio_pci_core]
[  +0.000679]  __do_fault+0x32/0xa0
[  +0.000700]  __handle_mm_fault+0xbe8/0x1450
[  +0.000571]  handle_mm_fault+0x6c/0x140
[  +0.000564]  fixup_user_fault+0x6b/0x100
[  +0.000556]  vaddr_get_pfns+0x108/0x240 [vfio_iommu_type1]
[  +0.000666]  ? lock_is_held_type+0x98/0x110
[  +0.000572]  vfio_pin_pages_remote+0x326/0x460 [vfio_iommu_type1]
[  +0.000710]  vfio_iommu_type1_ioctl+0x43b/0x15a0 [vfio_iommu_type1]
[  +0.001050]  ? find_held_lock+0x2b/0x80
[  +0.000561]  ? lock_release+0xc2/0x2a0
[  +0.000537]  ? __fget_files+0xdc/0x1d0
[  +0.000541]  __x64_sys_ioctl+0x436/0x9a0
[  +0.000556]  ? lockdep_hardirqs_on_prepare+0xd4/0x170
[  +0.000641]  do_syscall_64+0x3d/0x90
[  +0.000529]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000670] RIP: 0033:0x7f54255bb45b
[  +0.000541] Code: 0f 1e fa 48 8b 05 2d aa 2c 00 64 c7 00 26 00 00 00 
48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fd a9 2c 00 f7 d8 64 89 01 48
[  +0.001843] RSP: 002b:00007f5415ca3d38 EFLAGS: 00000206 ORIG_RAX: 
0000000000000010
[  +0.000929] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 
00007f54255bb45b
[  +0.000781] RDX: 00007f5415ca3d70 RSI: 0000000000003b71 RDI: 
0000000000000012
[  +0.000775] RBP: 00007f5415ca3da0 R08: 0000000000000000 R09: 
0000000000000000
[  +0.000777] R10: 00000000fe002000 R11: 0000000000000206 R12: 
0000000000000004
[  +0.000775] R13: 0000000000000000 R14: 00000000fe000000 R15: 
00007f5415ca47c0

Yishai