RE: [PATCH 10/10] vfio/qat: Add vfio_pci driver for Intel QAT VF devices

On Sunday, February 11, 2024 4:17 PM, Yishai Hadas <yishaih@xxxxxxxxxx> wrote:
> >>>> +static void qat_vf_pci_aer_reset_done(struct pci_dev *pdev)
> >>>> +{
> >>>> +	struct qat_vf_core_device *qat_vdev = qat_vf_drvdata(pdev);
> >>>> +
> >>>> +	if (!qat_vdev->core_device.vdev.mig_ops)
> >>>> +		return;
> >>>> +
> >>>> +	/*
> >>>> +	 * As the higher VFIO layers are holding locks across reset and using
> >>>> +	 * those same locks with the mm_lock we need to prevent ABBA
> >>> deadlock
> >>>> +	 * with the state_mutex and mm_lock.
> >>>> +	 * In case the state_mutex was taken already we defer the cleanup work
> >>>> +	 * to the unlock flow of the other running context.
> >>>> +	 */
> >>>> +	spin_lock(&qat_vdev->reset_lock);
> >>>> +	qat_vdev->deferred_reset = true;
> >>>> +	if (!mutex_trylock(&qat_vdev->state_mutex)) {
> >>>> +		spin_unlock(&qat_vdev->reset_lock);
> >>>> +		return;
> >>>> +	}
> >>>> +	spin_unlock(&qat_vdev->reset_lock);
> >>>> +	qat_vf_state_mutex_unlock(qat_vdev);
> >>>> +}
> >>>
> >>> Do you really need this? I thought this ugly thing was going to be a
> >>> uniquely mlx5 thing..
> >>
> >> I think that's still required to keep the migration state synchronized
> >> if the VF is reset by other VFIO emulation paths. Is that the case?
> >> BTW, this implementation is not only in the mlx5 driver, but also in
> >> other VFIO PCI variant drivers such as the hisilicon acc and pds
> >> drivers.
> >
> > It had to specifically do with the mm lock interaction that, I
> > thought, was going to be unique to the mlx driver. Otherwise you could
> > just directly hold the state_mutex here.
> >
> > Yishai do you remember the exact trace for the mmlock entanglement?
> >
> 
> I found in my in-box (from more than 2.5 years ago) the below [1]
> lockdep WARNING when the state_mutex was used directly.
> 
> Once we moved to the 'deferred_reset' mode it solved that.
> 
> [1]
> [  +1.063822] ======================================================
> [  +0.000732] WARNING: possible circular locking dependency detected
> [  +0.000747] 5.15.0-rc3 #236 Not tainted
> [  +0.000556] ------------------------------------------------------
> [  +0.000714] qemu-system-x86/7731 is trying to acquire lock:
> [  +0.000659] ffff888126c64b78 (&vdev->vma_lock){+.+.}-{3:3}, at:
> vfio_pci_mmap_fault+0x35/0x140 [vfio_pci_core]
> [  +0.001127]
>                but task is already holding lock:
> [  +0.000803] ffff888105f4c5d8 (&mm->mmap_lock#2){++++}-{3:3}, at:
> vaddr_get_pfns+0x64/0x240 [vfio_iommu_type1]
> [  +0.001119]
>                which lock already depends on the new lock.
> 
> [  +0.001086]
>                the existing dependency chain (in reverse order) is:
> [  +0.000910]
>                -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
> [  +0.000844]        __might_fault+0x56/0x80
> [  +0.000572]        _copy_to_user+0x1e/0x80
> [  +0.000556]        mlx5vf_pci_mig_rw.cold+0xa1/0x24f [mlx5_vfio_pci]
> [  +0.000732]        vfs_read+0xa8/0x1c0
> [  +0.000547]        __x64_sys_pread64+0x8c/0xc0
> [  +0.000580]        do_syscall_64+0x3d/0x90
> [  +0.000566]        entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  +0.000682]
>                -> #2 (&mvdev->state_mutex){+.+.}-{3:3}:
> [  +0.000899]        __mutex_lock+0x80/0x9d0
> [  +0.000566]        mlx5vf_reset_done+0x2c/0x40 [mlx5_vfio_pci]
> [  +0.000697]        vfio_pci_core_ioctl+0x585/0x1020 [vfio_pci_core]
> [  +0.000721]        __x64_sys_ioctl+0x436/0x9a0
> [  +0.000588]        do_syscall_64+0x3d/0x90
> [  +0.000584]        entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  +0.000674]
>                -> #1 (&vdev->memory_lock){+.+.}-{3:3}:
> [  +0.000843]        down_write+0x38/0x70
> [  +0.000544]        vfio_pci_zap_and_down_write_memory_lock+0x1c/0x30
> [vfio_pci_core]
> [  +0.003705]        vfio_basic_config_write+0x1e7/0x280 [vfio_pci_core]
> [  +0.000744]        vfio_pci_config_rw+0x1b7/0x3af [vfio_pci_core]
> [  +0.000716]        vfs_write+0xe6/0x390
> [  +0.000539]        __x64_sys_pwrite64+0x8c/0xc0
> [  +0.000603]        do_syscall_64+0x3d/0x90
> [  +0.000572]        entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  +0.000661]
>                -> #0 (&vdev->vma_lock){+.+.}-{3:3}:
> [  +0.000828]        __lock_acquire+0x1244/0x2280
> [  +0.000596]        lock_acquire+0xc2/0x2c0
> [  +0.000556]        __mutex_lock+0x80/0x9d0
> [  +0.000580]        vfio_pci_mmap_fault+0x35/0x140 [vfio_pci_core]
> [  +0.000709]        __do_fault+0x32/0xa0
> [  +0.000556]        __handle_mm_fault+0xbe8/0x1450
> [  +0.000606]        handle_mm_fault+0x6c/0x140
> [  +0.000624]        fixup_user_fault+0x6b/0x100
> [  +0.000600]        vaddr_get_pfns+0x108/0x240 [vfio_iommu_type1]
> [  +0.000726]        vfio_pin_pages_remote+0x326/0x460 [vfio_iommu_type1]
> [  +0.000736]        vfio_iommu_type1_ioctl+0x43b/0x15a0 [vfio_iommu_type1]
> [  +0.000752]        __x64_sys_ioctl+0x436/0x9a0
> [  +0.000588]        do_syscall_64+0x3d/0x90
> [  +0.000571]        entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  +0.000677]
>                other info that might help us debug this:
> 
> [  +0.001073] Chain exists of:
>                  &vdev->vma_lock --> &mvdev->state_mutex -->
> &mm->mmap_lock#2
> 
> [  +0.001285]  Possible unsafe locking scenario:
> 
> [  +0.000808]        CPU0                    CPU1
> [  +0.000599]        ----                    ----
> [  +0.000593]   lock(&mm->mmap_lock#2);
> [  +0.000530]                                lock(&mvdev->state_mutex);
> [  +0.000725]                                lock(&mm->mmap_lock#2);
> [  +0.000712]   lock(&vdev->vma_lock);
> [  +0.000532]
>                 *** DEADLOCK ***

Thanks for this information, but it is not clear to me why this flow
causes a deadlock. From the trace, CPU0 is not waiting for any resource
held by CPU1, so once CPU0 releases mmap_lock, CPU1 can continue
to run. Am I missing something?
Meanwhile, it seems the trace above was triggered with migration
protocol v1, and the CPU1 context listed also looks protocol v1
specific. Is that the case?
Thanks,
Xin

> 
> [  +0.000922] 2 locks held by qemu-system-x86/7731:
> [  +0.000597]  #0: ffff88810c9bec88 (&iommu->lock#2){+.+.}-{3:3}, at:
> vfio_iommu_type1_ioctl+0x189/0x15a0 [vfio_iommu_type1]
> [  +0.001177]  #1: ffff888105f4c5d8 (&mm->mmap_lock#2){++++}-{3:3}, at:
> vaddr_get_pfns+0x64/0x240 [vfio_iommu_type1]
> [  +0.001153]
>                stack backtrace:
> [  +0.000689] CPU: 1 PID: 7731 Comm: qemu-system-x86 Not tainted
> 5.15.0-rc3 #236
> [  +0.000932] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> [  +0.001192] Call Trace:
> [  +0.000454]  dump_stack_lvl+0x45/0x59
> [  +0.000527]  check_noncircular+0xf2/0x110
> [  +0.000557]  __lock_acquire+0x1244/0x2280
> [  +0.002462]  lock_acquire+0xc2/0x2c0
> [  +0.000534]  ? vfio_pci_mmap_fault+0x35/0x140 [vfio_pci_core]
> [  +0.000684]  ? lock_is_held_type+0x98/0x110
> [  +0.000565]  __mutex_lock+0x80/0x9d0
> [  +0.000542]  ? vfio_pci_mmap_fault+0x35/0x140 [vfio_pci_core]
> [  +0.000676]  ? vfio_pci_mmap_fault+0x35/0x140 [vfio_pci_core]
> [  +0.000681]  ? mark_held_locks+0x49/0x70
> [  +0.000551]  ? lock_is_held_type+0x98/0x110
> [  +0.000575]  vfio_pci_mmap_fault+0x35/0x140 [vfio_pci_core]
> [  +0.000679]  __do_fault+0x32/0xa0
> [  +0.000700]  __handle_mm_fault+0xbe8/0x1450
> [  +0.000571]  handle_mm_fault+0x6c/0x140
> [  +0.000564]  fixup_user_fault+0x6b/0x100
> [  +0.000556]  vaddr_get_pfns+0x108/0x240 [vfio_iommu_type1]
> [  +0.000666]  ? lock_is_held_type+0x98/0x110
> [  +0.000572]  vfio_pin_pages_remote+0x326/0x460 [vfio_iommu_type1]
> [  +0.000710]  vfio_iommu_type1_ioctl+0x43b/0x15a0 [vfio_iommu_type1]
> [  +0.001050]  ? find_held_lock+0x2b/0x80
> [  +0.000561]  ? lock_release+0xc2/0x2a0
> [  +0.000537]  ? __fget_files+0xdc/0x1d0
> [  +0.000541]  __x64_sys_ioctl+0x436/0x9a0
> [  +0.000556]  ? lockdep_hardirqs_on_prepare+0xd4/0x170
> [  +0.000641]  do_syscall_64+0x3d/0x90
> [  +0.000529]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  +0.000670] RIP: 0033:0x7f54255bb45b
> [  +0.000541] Code: 0f 1e fa 48 8b 05 2d aa 2c 00 64 c7 00 26 00 00 00
> 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f
> 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fd a9 2c 00 f7 d8 64 89 01 48
> [  +0.001843] RSP: 002b:00007f5415ca3d38 EFLAGS: 00000206 ORIG_RAX:
> 0000000000000010
> [  +0.000929] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> 00007f54255bb45b
> [  +0.000781] RDX: 00007f5415ca3d70 RSI: 0000000000003b71 RDI:
> 0000000000000012
> [  +0.000775] RBP: 00007f5415ca3da0 R08: 0000000000000000 R09:
> 0000000000000000
> [  +0.000777] R10: 00000000fe002000 R11: 0000000000000206 R12:
> 0000000000000004
> [  +0.000775] R13: 0000000000000000 R14: 00000000fe000000 R15:
> 00007f5415ca47c0
> 
> Yishai
