On Tue, Mar 07, 2017 at 11:03:03AM +0000, Tvrtko Ursulin wrote: > > On 07/03/2017 09:13, Chris Wilson wrote: > >On Tue, Mar 07, 2017 at 08:42:26AM +0000, Tvrtko Ursulin wrote: > >> > >>On 06/03/2017 15:36, Chris Wilson wrote: > >>>If we allow the user to convert a GTT mmap address into a userptr, we > >>>may end up in recursion hell, where currently we hit a mutex deadlock > >>>but other possibilities include use-after-free during the > >>>unbind/cancel_userptr. > >> > >>I thought we already disallowed that and indeed > >>igt/gem_userptr_blits/invalid-gtt-mapping succeeds here - what am I > >>missing? > > > >Michał presented this deadlock: > > > >[ 143.203989] gem_userptr_bli D 0 902 898 0x00000000 > >[ 143.204054] Call Trace: > >[ 143.204137] __schedule+0x511/0x1180 > >[ 143.204195] ? pci_mmcfg_check_reserved+0xc0/0xc0 > >[ 143.204274] schedule+0x57/0xe0 > >[ 143.204327] schedule_timeout+0x383/0x670 > >[ 143.204374] ? trace_hardirqs_on_caller+0x187/0x280 > >[ 143.204457] ? trace_hardirqs_on_thunk+0x1a/0x1c > >[ 143.204507] ? usleep_range+0x110/0x110 > >[ 143.204657] ? irq_exit+0x89/0x100 > >[ 143.204710] ? retint_kernel+0x2d/0x2d > >[ 143.204794] ? trace_hardirqs_on_caller+0x187/0x280 > >[ 143.204857] ? _raw_spin_unlock_irq+0x33/0x60 > >[ 143.204944] wait_for_common+0x1f0/0x2f0 > >[ 143.205006] ? out_of_line_wait_on_atomic_t+0x170/0x170 > >[ 143.205103] ? wake_up_q+0xa0/0xa0 > >[ 143.205159] ? flush_workqueue_prep_pwqs+0x15a/0x2c0 > >[ 143.205237] wait_for_completion+0x1d/0x20 > >[ 143.205292] flush_workqueue+0x2e9/0xbb0 > >[ 143.205339] ? flush_workqueue+0x163/0xbb0 > >[ 143.205418] ? __schedule+0x533/0x1180 > >[ 143.205498] ? check_flush_dependency+0x1a0/0x1a0 > >[ 143.205681] i915_gem_userptr_mn_invalidate_range_start+0x1c7/0x270 [i915] > >[ 143.205865] ? i915_gem_userptr_dmabuf_export+0x40/0x40 [i915] > >[ 143.205955] __mmu_notifier_invalidate_range_start+0xc6/0x120 > >[ 143.206044] ? __mmu_notifier_invalidate_range_start+0x51/0x120 > >[ 143.206123] zap_page_range_single+0x1c7/0x1f0 > >[ 143.206171] ? unmap_single_vma+0x160/0x160 > >[ 143.206260] ? unmap_mapping_range+0xa9/0x1b0 > >[ 143.206308] ? vma_interval_tree_subtree_search+0x75/0xd0 > >[ 143.206397] unmap_mapping_range+0x18f/0x1b0 > >[ 143.206444] ? zap_vma_ptes+0x70/0x70 > >[ 143.206524] ? __pm_runtime_resume+0x67/0xa0 > >[ 143.206723] i915_gem_release_mmap+0x1ba/0x1c0 [i915] > >[ 143.206846] i915_vma_unbind+0x5c2/0x690 [i915] > >[ 143.206925] ? __lock_is_held+0x52/0x100 > >[ 143.207076] i915_gem_object_set_tiling+0x1db/0x650 [i915] > >[ 143.207236] i915_gem_set_tiling_ioctl+0x1d3/0x3b0 [i915] > >[ 143.207377] ? i915_gem_set_tiling_ioctl+0x5/0x3b0 [i915] > >[ 143.207457] drm_ioctl+0x36c/0x670 > >[ 143.207535] ? debug_lockdep_rcu_enabled.part.0+0x1a/0x30 > >[ 143.207730] ? i915_gem_object_set_tiling+0x650/0x650 [i915] > >[ 143.207793] ? drm_getunique+0x120/0x120 > >[ 143.207875] ? __handle_mm_fault+0x996/0x14a0 > >[ 143.207939] ? vm_insert_page+0x340/0x340 > >[ 143.208028] ? up_write+0x28/0x50 > >[ 143.208086] ? vm_mmap_pgoff+0x160/0x190 > >[ 143.208163] do_vfs_ioctl+0x12c/0xa60 > >[ 143.208218] ? debug_lockdep_rcu_enabled+0x35/0x40 > >[ 143.208267] ? ioctl_preallocate+0x150/0x150 > >[ 143.208353] ? __do_page_fault+0x36a/0x6e0 > >[ 143.208400] ? mark_held_locks+0x23/0xc0 > >[ 143.208479] ? up_read+0x1f/0x40 > >[ 143.208526] ? entry_SYSCALL_64_fastpath+0x5/0xc6 > >[ 143.208669] ? __fget_light+0xa7/0xc0 > >[ 143.208747] SyS_ioctl+0x41/0x70 > >[ 143.208808] entry_SYSCALL_64_fastpath+0x23/0xc6 > > That would mean another process never exits cancel_userptr, correct? > Do we have a trace of the other end? Our worker, who is waiting on struct_mutex (in order to unbind) which we are holding ourselves as we are in the middle of an unbind. The nasty part here is that we can recurse into unbind on the same object (let alone the mutex deadlock). That's impossible if we prevent the object from activating itself on a GTT mmap. ARGH. The deadlock is between a non-userptr and a set of userptr (that explains how we get to the unbind). It is just that the other userptr are still in the process of running their workers when the time comes to cancel the work. (Avoids the dilemma of how we ended up with bound userptr of the GTT). Simplest patch is then fun with obj->userptr.work, i.e. something like if (xchg(&obj->userptr.work, NULL)) return 0; -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx