> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
> Sent: Friday, July 7, 2017 5:10 AM
> To: Dong, Chuanxiao <chuanxiao.dong@xxxxxxxxx>
> Cc: kwankhede@xxxxxxxxxx; kvm@xxxxxxxxxxxxxxx; 'Zhenyu Wang'
> <zhenyuw@xxxxxxxxxxxxxxx>
> Subject: Re: possible recursive locking issue
>
> On Thu, 6 Jul 2017 09:39:41 +0000
> "Dong, Chuanxiao" <chuanxiao.dong@xxxxxxxxx> wrote:
>
> > Hello,
> >
> > We hit a possible recursive locking issue and are seeking a way to
> > resolve it. The log looks like this:
> >
> > [ 5102.127454] ============================================
> > [ 5102.133379] WARNING: possible recursive locking detected
> > [ 5102.139304] 4.12.0-rc4+ #3 Not tainted
> > [ 5102.143483] --------------------------------------------
> > [ 5102.149407] qemu-system-x86/1620 is trying to acquire lock:
> > [ 5102.155624]  (&container->group_lock){++++++}, at: [<ffffffff817768c6>] vfio_unpin_pages+0x96/0xf0
> > [ 5102.165626]
> > but task is already holding lock:
> > [ 5102.172134]  (&container->group_lock){++++++}, at: [<ffffffff8177728f>] vfio_fops_unl_ioctl+0x5f/0x280
> > [ 5102.182522]
> > other info that might help us debug this:
> > [ 5102.189806]  Possible unsafe locking scenario:
> >
> > [ 5102.196411]        CPU0
> > [ 5102.199136]        ----
> > [ 5102.201861]   lock(&container->group_lock);
> > [ 5102.206527]   lock(&container->group_lock);
> > [ 5102.211191]
> >  *** DEADLOCK ***
> >
> > [ 5102.217796]  May be due to missing lock nesting notation
> >
> > [ 5102.225370] 3 locks held by qemu-system-x86/1620:
> > [ 5102.230618]  #0:  (&container->group_lock){++++++}, at: [<ffffffff8177728f>] vfio_fops_unl_ioctl+0x5f/0x280
> > [ 5102.241482]  #1:  (&(&iommu->notifier)->rwsem){++++..}, at: [<ffffffff810de775>] __blocking_notifier_call_chain+0x35/0x70
> > [ 5102.253713]  #2:  (&vgpu->vdev.cache_lock){+.+...}, at: [<ffffffff8157b007>] intel_vgpu_iommu_notifier+0x77/0x120
> > [ 5102.265163]
> > stack backtrace:
> > [ 5102.270022] CPU: 5 PID: 1620 Comm: qemu-system-x86 Not tainted 4.12.0-rc4+ #3
> > [ 5102.277991] Hardware name: Intel Corporation S1200RP/S1200RP, BIOS S1200RP.86B.03.01.APER.061220151418 06/12/2015
> > [ 5102.289445] Call Trace:
> > [ 5102.292175]  dump_stack+0x85/0xc7
> > [ 5102.295871]  validate_chain.isra.21+0x9da/0xaf0
> > [ 5102.300925]  __lock_acquire+0x405/0x820
> > [ 5102.305202]  lock_acquire+0xc7/0x220
> > [ 5102.309191]  ? vfio_unpin_pages+0x96/0xf0
> > [ 5102.313666]  down_read+0x2b/0x50
> > [ 5102.317259]  ? vfio_unpin_pages+0x96/0xf0
> > [ 5102.321732]  vfio_unpin_pages+0x96/0xf0
> > [ 5102.326024]  intel_vgpu_iommu_notifier+0xe5/0x120
> > [ 5102.331283]  notifier_call_chain+0x4a/0x70
> > [ 5102.335851]  __blocking_notifier_call_chain+0x4d/0x70
> > [ 5102.341490]  blocking_notifier_call_chain+0x16/0x20
> > [ 5102.346935]  vfio_iommu_type1_ioctl+0x87b/0x920
> > [ 5102.351994]  vfio_fops_unl_ioctl+0x81/0x280
> > [ 5102.356660]  ? __fget+0xf0/0x210
> > [ 5102.360261]  do_vfs_ioctl+0x93/0x6a0
> > [ 5102.364247]  ? __fget+0x111/0x210
> > [ 5102.367942]  SyS_ioctl+0x41/0x70
> > [ 5102.371542]  entry_SYSCALL_64_fastpath+0x1f/0xbe
> >
> > The call stack is:
> > vfio_fops_unl_ioctl -> vfio_iommu_type1_ioctl -> vfio_dma_do_unmap ->
> > blocking_notifier_call_chain -> intel_vgpu_iommu_notifier ->
> > vfio_unpin_pages.
> >
> > container->group_lock is first taken in vfio_fops_unl_ioctl, and is
> > then taken again in vfio_unpin_pages.
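> >
> > For reference, the two read-lock sites look roughly like this,
> > abridged and simplified from drivers/vfio/vfio.c in 4.12-rc (see the
> > tree for the full context):
> >
> > static long vfio_fops_unl_ioctl(struct file *filep,
> > 				unsigned int cmd, unsigned long arg)
> > {
> > 	...
> > 	default:
> > 		/* 1st read-lock; VFIO_IOMMU_UNMAP_DMA is passed through
> > 		 * to the iommu driver here and reaches vfio_unpin_pages()
> > 		 * via the notifier chain shown above. */
> > 		down_read(&container->group_lock);
> > 		ret = driver->ops->ioctl(data, cmd, arg);
> > 		up_read(&container->group_lock);
> > 	...
> > }
> >
> > int vfio_unpin_pages(struct device *dev, unsigned long *user_pfn,
> > 		     int npage)
> > {
> > 	...
> > 	/* 2nd read-lock on the same rwsem, taken by the same task */
> > 	down_read(&container->group_lock);
> > 	ret = driver->ops->unpin_pages(container->iommu_data, user_pfn,
> > 				       npage);
> > 	up_read(&container->group_lock);
> > 	...
> > }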
>
> This doesn't make sense to me, but then lockdep splats usually don't to
> me at first.  If we're passing through vfio_fops_unl_ioctl() for a
> VFIO_IOMMU_UNMAP_DMA, then we'll be holding a read-lock on
> container->group_lock.  vfio_unpin_pages() also takes a read-lock on
> the same.  Why is this a problem?  We should be able to nest read-locks.

Hi Alex,

This trace is captured with CONFIG_DEBUG_LOCK_ALLOC enabled. From the
following comment in include/linux/rwsem.h:

#ifdef CONFIG_DEBUG_LOCK_ALLOC
/*
 * nested locking. NOTE: rwsems are not allowed to recurse
 * (which occurs if the same task tries to acquire the same
 * lock instance multiple times), but multiple locks of the
 * same lock class might be taken, if the order of the locks
 * is always the same. This ordering rule can be expressed
 * to lockdep via the _nested() APIs, but enumerating the
 * subclasses that are used. (If the nesting relationship is
 * static then another method for expressing nested locking is
 * the explicit definition of lock class keys and the use of
 * lockdep_set_class() at lock initialization time.
 * See Documentation/locking/lockdep-design.txt for more details.)
 */
extern void down_read_nested(struct rw_semaphore *sem, int subclass);

So it looks like using down_read_nested() in vfio_unpin_pages() is a
better choice than down_read(), since we know these rwsems can be
nested. When CONFIG_DEBUG_LOCK_ALLOC is not set, down_read_nested() is
the same as down_read().
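Concretely, the change I have in mind is roughly the following untested
sketch against 4.12-rc (the diff context is approximate;
SINGLE_DEPTH_NESTING is just the generic subclass from
include/linux/lockdep.h, and a dedicated subclass for group_lock may
turn out to be more appropriate):

--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ (in vfio_unpin_pages(), context approximate) @@
 	container = group->container;
-	down_read(&container->group_lock);
+	/*
+	 * This read-lock can nest inside the read-lock that
+	 * vfio_fops_unl_ioctl() already holds on container->group_lock
+	 * on the VFIO_IOMMU_UNMAP_DMA path, so annotate it for lockdep.
+	 */
+	down_read_nested(&container->group_lock, SINGLE_DEPTH_NESTING);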
What do you think?

Thanks
Chuanxiao

> > Regarding this, putting vfio_unpin_pages in another thread can
> > resolve the recursive locking. In this way, vfio_unpin_pages becomes
> > asynchronous with vfio_dma_do_unmap. But then it is possible to
> > trigger the kernel panic below due to this asynchrony:
>
> This is an invalid solution and the code is punishing you for it ;) The
> user is requesting to unmap pages and we must release those pages
> before the kernel ioctl returns.  As you can see near the BUG_ON hit
> below, we'll retrigger the blocking notifier call chain 10 times to try
> to get the page we need released.  If each one of those starts a
> thread, there's no guarantee that any of them will run before we hit
> our retry limit.  The below is completely expected in that case.
> Thanks,
>
> Alex
>
> > [ 4468.975091] ------------[ cut here ]------------
> > [ 4468.976145] kernel BUG at drivers/vfio/vfio_iommu_type1.c:833!
> > [ 4468.977193] invalid opcode: 0000 [#1] SMP
> > [ 4468.978232] Modules linked in: bridge stp llc nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel joydev input_leds crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd hci_uart glue_helper shpchp btbcm cryptd btqca winbond_cir rc_core btintel ipmi_ssif mei_me mei intel_pch_thermal bluetooth acpi_als kfifo_buf industrialio soc_button_array intel_vbtn ipmi_devintf ipmi_msghandler intel_hid ecdh_generic intel_lpss_acpi intel_lpss spidev sparse_keymap acpi_power_meter mac_hid sunrpc parport_pc ppdev lp parport autofs4 kvmgt vfio_mdev mdev vfio_iommu_type1 vfio kvm irqbypass hid_generic usbhid i915 drm_kms_helper igb e1000e syscopyarea sysfillrect sysimgblt dca fb_sys_fops ptp pps_core drm i2c_algo_bit ahci libahci wmi video pinctrl_sunrisepoint
> > [ 4468.982783]  pinctrl_intel i2c_hid hid
> > [ 4468.983995] CPU: 3 PID: 1549 Comm: qemu-system-x86 Not tainted 4.12.0-rc7+ #1
> > [ 4468.985132] Hardware name: Intel Corporation Kabylake Greenlow Refresh UP Server Platform/Zumba Beach Server EV, BIOS KBLSE2R1.R00.0006.B08.1702011304 02/
> > [ 4468.986350] task: ffff8cd9afa28000 task.stack: ffffb136c2f68000
> > [ 4468.987545] RIP: 0010:vfio_iommu_type1_ioctl+0x894/0x910 [vfio_iommu_type1]
> > [ 4468.988864] RSP: 0018:ffffb136c2f6bd58 EFLAGS: 00010202
> > [ 4468.990106] RAX: 0000000000100000 RBX: 00007f80f55b1410 RCX: 000000007ff00000
> > [ 4468.991290] RDX: ffff8cd9af105d00 RSI: ffff8cd9b4835e40 RDI: 000000000000000b
> > [ 4468.992536] RBP: ffffb136c2f6be30 R08: 0000000000100000 R09: 0000000080000000
> > [ 4468.993749] R10: ffffb136c2f6bd30 R11: 000000000000013b R12: 0000000000000000
> > [ 4468.994991] R13: ffff8cd9ad813b80 R14: ffff8cd9afa28000 R15: ffffb136c2f6bdc8
> > [ 4468.996131] FS:  00007f80f55b2700(0000) GS:ffff8cd9c7d80000(0000) knlGS:0000000000000000
> > [ 4468.997305] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 4468.998525] CR2: 000001f016dd2218 CR3: 0000000472bfc000 CR4: 00000000003426e0
> > [ 4468.999668] Call Trace:
> > [ 4469.000916]  ? kvm_set_memory_region+0x38/0x60 [kvm]
> > [ 4469.002072]  vfio_fops_unl_ioctl+0x7b/0x260 [vfio]
> > [ 4469.003220]  do_vfs_ioctl+0xa1/0x5d0
> > [ 4469.004443]  ? SyS_futex+0x7f/0x180
> > [ 4469.005567]  SyS_ioctl+0x79/0x90
> > [ 4469.006661]  entry_SYSCALL_64_fastpath+0x1e/0xa9
> >
> > Can you help to check this recursive locking issue?
> >
> > Thanks
> > Chuanxiao
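P.S. For anyone following along, the retry logic Alex refers to looks
roughly like this in 4.12's vfio_dma_do_unmap(), abridged from
drivers/vfio/vfio_iommu_type1.c (the BUG_ON is the line 833 hit above):

	if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
		struct vfio_iommu_type1_dma_unmap nb_unmap;

		/* No progress on the same vfio_dma after 10 retries? */
		if (dma_last == dma) {
			BUG_ON(++retries > 10);
		} else {
			dma_last = dma;
			retries = 0;
		}

		nb_unmap.iova = dma->iova;
		nb_unmap.size = dma->size;

		/*
		 * Ask the vendor driver (kvmgt in our case) to unpin the
		 * pages it holds in this range, then retry the unmap.
		 */
		mutex_unlock(&iommu->lock);
		blocking_notifier_call_chain(&iommu->notifier,
					     VFIO_IOMMU_NOTIFY_DMA_UNMAP,
					     &nb_unmap);
		goto again;
	}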