On Tue, Mar 27, 2018 at 08:19:33AM +0100, Chris Wilson wrote:
> Quoting Daniel Vetter (2018-03-27 07:48:00)
> > On Mon, Mar 26, 2018 at 11:38:55PM +0100, Chris Wilson wrote:
> > > Quoting Chris Wilson (2018-03-26 21:08:33)
> > > > Quoting Patchwork (2018-03-26 17:53:44)
> > > > > Test gem_userptr_blits:
> > > > >         Subgroup coherency-unsync:
> > > > >                 pass -> INCOMPLETE (shard-hsw)
> > > >
> > > > Forgot that obj->userptr.mn may not exist.
> > > >
> > > > >         Subgroup dmabuf-sync:
> > > > >                 pass -> DMESG-WARN (shard-hsw)
> > > >
> > > > But this is the tricky lockdep one, warning of the recursion from gup
> > > > into mmu_invalidate_range, i.e.
> > > >
> > > > down_read(&i915_mmu_notifier->sem);
> > > > down_read(&mm_struct->mmap_sem);
> > > > gup();
> > > > down_write(&i915_mmu_notifier->sem);
> > > >
> > > > That seems a genuine deadlock... So I wonder how we managed to get a
> > > > lockdep splat and not a dead machine. Maybe gup never triggers the
> > > > recursion for our set of flags? Hmm.
> > >
> > > In another universe, CI found
> > >
> > > [ 255.666496] ======================================================
> > > [ 255.666498] WARNING: possible circular locking dependency detected
> > > [ 255.666500] 4.16.0-rc6-CI-Trybot_1944+ #1 Tainted: G U W
> > > [ 255.666502] ------------------------------------------------------
> > > [ 255.666503] gem_userptr_bli/4794 is trying to acquire lock:
> > > [ 255.666505] (fs_reclaim){+.+.}, at: [<00000000e1b95c73>] fs_reclaim_acquire.part.12+0x0/0x30
> > > [ 255.666510]
> > > but task is already holding lock:
> > > [ 255.666512] (&mn->sem){+.+.}, at: [<000000007c59ba79>] i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
> > > [ 255.666553]
> > > which lock already depends on the new lock.
> > >
> > > [ 255.666555]
> > > the existing dependency chain (in reverse order) is:
> > > [ 255.666557]
> > > -> #2 (&mn->sem){+.+.}:
> > > [ 255.666578]        i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
> > > [ 255.666581]        __mmu_notifier_invalidate_range_start+0x73/0xb0
> > > [ 255.666584]        zap_page_range_single+0xcc/0xe0
> > > [ 255.666586]        unmap_mapping_pages+0xd4/0x110
> > > [ 255.666606]        i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> > > [ 255.666625]        i915_vma_unbind+0x60a/0xa10 [i915]
> > > [ 255.666644]        i915_gem_object_set_tiling+0xf6/0x5b0 [i915]
> > > [ 255.666662]        i915_gem_set_tiling_ioctl+0x262/0x2f0 [i915]
> > > [ 255.666665]        drm_ioctl_kernel+0x60/0xa0
> > > [ 255.666667]        drm_ioctl+0x27e/0x320
> > > [ 255.666669]        do_vfs_ioctl+0x8a/0x670
> > > [ 255.666670]        SyS_ioctl+0x36/0x70
> > > [ 255.666672]        do_syscall_64+0x65/0x1a0
> > > [ 255.666675]        entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > > [ 255.666676]
> > > -> #1 (&mapping->i_mmap_rwsem){++++}:
> > > [ 255.666680]        unmap_mapping_pages+0x3d/0x110
> > > [ 255.666698]        i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> > > [ 255.666716]        i915_vma_unbind+0x60a/0xa10 [i915]
> > > [ 255.666734]        i915_gem_object_unbind+0xa0/0x130 [i915]
> > > [ 255.666751]        i915_gem_shrink+0x2d1/0x5d0 [i915]
> > > [ 255.666767]        i915_drop_caches_set+0x92/0x190 [i915]
> > > [ 255.666770]        simple_attr_write+0xab/0xc0
> > > [ 255.666772]        full_proxy_write+0x4b/0x70
> > > [ 255.666774]        __vfs_write+0x1e/0x130
> > > [ 255.666776]        vfs_write+0xbd/0x1b0
> > > [ 255.666778]        SyS_write+0x40/0xa0
> > > [ 255.666779]        do_syscall_64+0x65/0x1a0
> > > [ 255.666781]        entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > > [ 255.666783]
> > > -> #0 (fs_reclaim){+.+.}:
> > > [ 255.666786]        fs_reclaim_acquire.part.12+0x24/0x30
> > > [ 255.666788]        __alloc_pages_nodemask+0x1f1/0x11d0
> > > [ 255.666790]        __get_free_pages+0x9/0x40
> > > [ 255.666792]        __pud_alloc+0x25/0xb0
> > > [ 255.666794]        copy_page_range+0xa75/0xaf0
> > > [ 255.666796]        copy_process.part.7+0x1267/0x1d90
> > > [ 255.666798]        _do_fork+0xc0/0x6b0
> > > [ 255.666800]        do_syscall_64+0x65/0x1a0
> > > [ 255.666801]        entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > > [ 255.666803]
> > > other info that might help us debug this:
> > >
> > > [ 255.666805] Chain exists of:
> > >   fs_reclaim --> &mapping->i_mmap_rwsem --> &mn->sem
> > >
> > > [ 255.666809] Possible unsafe locking scenario:
> > >
> > > [ 255.666811]       CPU0                    CPU1
> > > [ 255.666812]       ----                    ----
> > > [ 255.666814]  lock(&mn->sem);
> > > [ 255.666815]                               lock(&mapping->i_mmap_rwsem);
> > > [ 255.666817]                               lock(&mn->sem);
> > > [ 255.666819]  lock(fs_reclaim);
> > > [ 255.666821]
> > >
> > > So a shrinker deadlock. That doesn't look easy to wriggle out of, as we
> > > have a random chunk of code that's between invalidate_range_start and
> > > invalidate_range_end.
> >
> > Christian König said something like "with this design you can't allocate
> > anything while holding locks you might need from the mmu notifier".
> > Because reclaim eats into the mmu notifiers.
>
> Oh, we aren't allocating from under the locks. That's the first thing I
> double checked. Afaict, the only window is the code in the caller that's
> between range_start/range_end. If that also can't touch fs_reclaim, then
> this is just a red herring...

Where is that happening? You left out the backtrace for the fs_reclaim hit
that closed the loop. Is it our code, or just something else going on?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
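
The circular wait lockdep is pointing at is easier to see with the kernel
specifics stripped away. Below is a minimal userspace sketch (hypothetical,
not taken from this thread): three pthread mutexes stand in for fs_reclaim,
mapping->i_mmap_rwsem and mn->sem, one thread plays the
invalidate_range_start side that ends up in reclaim while holding mn->sem,
and the other condenses lockdep's #1 and #2 histories into the single chain
fs_reclaim -> i_mmap_rwsem -> mn->sem. Run together, the two threads wedge
the way the CPU0/CPU1 table above describes.

/*
 * Illustrative sketch only, not kernel code: the circular wait from the
 * lockdep report above, reduced to plain pthread mutexes. The mutex names
 * are stand-ins for the kernel locks.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t fs_reclaim   = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t i_mmap_rwsem = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t mn_sem       = PTHREAD_MUTEX_INITIALIZER;

/* "CPU0": holds mn->sem, then needs fs_reclaim (allocating under the notifier). */
static void *invalidate_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&mn_sem);
	sleep(1);			/* give the other thread time to build its chain */
	pthread_mutex_lock(&fs_reclaim);	/* blocks: shrinker_path holds it */
	pthread_mutex_unlock(&fs_reclaim);
	pthread_mutex_unlock(&mn_sem);
	return NULL;
}

/* "CPU1": the shrinker side, fs_reclaim -> i_mmap_rwsem -> mn->sem. */
static void *shrinker_path(void *arg)
{
	(void)arg;
	usleep(100 * 1000);		/* let invalidate_path take mn_sem first */
	pthread_mutex_lock(&fs_reclaim);
	pthread_mutex_lock(&i_mmap_rwsem);
	pthread_mutex_lock(&mn_sem);		/* blocks: invalidate_path holds it */
	pthread_mutex_unlock(&mn_sem);
	pthread_mutex_unlock(&i_mmap_rwsem);
	pthread_mutex_unlock(&fs_reclaim);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, invalidate_path, NULL);
	pthread_create(&b, NULL, shrinker_path, NULL);
	pthread_join(a, NULL);		/* never returns: the two threads deadlock */
	pthread_join(b, NULL);
	printf("not reached\n");
	return 0;
}

Built with something like "gcc -pthread deadlock-sketch.c" (filename made up),
this almost always hangs within a second or two, which is the dead-machine
outcome the splat is warning about once the timing lines up for real.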