Quoting Thomas Hellström (Intel) (2020-07-29 14:44:41)
>
> On 7/29/20 2:17 PM, Tvrtko Ursulin wrote:
> >
> > On 28/07/2020 12:17, Thomas Hellström (Intel) wrote:
> >> On 7/16/20 5:53 PM, Tvrtko Ursulin wrote:
> >>> On 15/07/2020 16:43, Maarten Lankhorst wrote:
> >>>> Op 15-07-2020 om 13:51 schreef Chris Wilson:
> >>>>> Our goal is to pull all memory reservations (next iteration
> >>>>> obj->ops->get_pages()) under a ww_mutex, and to align those
> >>>>> reservations with other drivers, i.e. control all such allocations
> >>>>> with the reservation_ww_class. Currently, this is under the purview
> >>>>> of the obj->mm.mutex, and while obj->mm remains an embedded struct
> >>>>> we can "simply" switch to using the reservation_ww_class
> >>>>> obj->base.resv->lock
> >>>>>
> >>>>> The major consequence is the impact on the shrinker paths, as the
> >>>>> reservation_ww_class is used to wrap allocations, and a ww_mutex
> >>>>> does not support subclassing, so we cannot do our usual trick of
> >>>>> knowing that we never recurse inside the shrinker and instead have
> >>>>> to finish the reclaim with a trylock. This may result in us failing
> >>>>> to release the pages after having released the vma. This will have
> >>>>> to do until a better idea comes along.
> >>>>>
> >>>>> However, this step only converts the mutex over and continues to
> >>>>> treat everything as a single allocation and pinning the pages. With
> >>>>> the ww_mutex in place we can remove the temporary pinning, as we
> >>>>> can then reserve all storage en masse.
> >>>>>
> >>>>> One last thing to do: kill the implicit page pinning for active
> >>>>> vma. This will require us to invalidate the vma->pages when the
> >>>>> backing store is removed (and we expect that while the vma is
> >>>>> active, we mark the backing store as active so that it cannot be
> >>>>> removed while the HW is busy.)
> >>>>>
> >>>>> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >>>
> >>> [snip]
> >>>
> >>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> >>>>> index dc8f052a0ffe..4e928103a38f 100644
> >>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> >>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
> >>>>> @@ -47,10 +47,7 @@ static bool unsafe_drop_pages(struct drm_i915_gem_object *obj,
> >>>>>  	if (!(shrink & I915_SHRINK_BOUND))
> >>>>>  		flags = I915_GEM_OBJECT_UNBIND_TEST;
> >>>>>
> >>>>> -	if (i915_gem_object_unbind(obj, flags) == 0)
> >>>>> -		__i915_gem_object_put_pages(obj);
> >>>>> -
> >>>>> -	return !i915_gem_object_has_pages(obj);
> >>>>> +	return i915_gem_object_unbind(obj, flags) == 0;
> >>>>>  }
> >>>>>
> >>>>>  static void try_to_writeback(struct drm_i915_gem_object *obj,
> >>>>> @@ -199,14 +196,14 @@ i915_gem_shrink(struct drm_i915_private *i915,
> >>>>>
> >>>>>  			spin_unlock_irqrestore(&i915->mm.obj_lock, flags);
> >>>>>
> >>>>> -			if (unsafe_drop_pages(obj, shrink)) {
> >>>>> -				/* May arrive from get_pages on another bo */
> >>>>> -				mutex_lock(&obj->mm.lock);
> >>>>> +			if (unsafe_drop_pages(obj, shrink) &&
> >>>>> +			    i915_gem_object_trylock(obj)) {
> >>>>
> >>>> Why trylock? Because of the nesting? In that case, still use the ww
> >>>> ctx if provided, please.
> >>>
> >>> By "if provided" you mean for code paths where we are calling the
> >>> shrinker ourselves, as opposed to reclaim, like shmem_get_pages?
> >>>
> >>> That indeed sounds like the right thing to do, since all the
> >>> get_pages from execbuf are in the reservation phase, collecting a
> >>> list of GEM objects to lock; the ones to shrink sound like they
> >>> should be on that list.
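
To make "use the ww ctx if provided" concrete, the idea would be for the
shrink path to take an optional acquire context and only fall back to a
trylock when entered from reclaim, roughly along these lines. This is
only a sketch to illustrate the suggestion, not what this patch
implements; the helper name and the plumbing of the context through the
callers are hypothetical:

#include <linux/dma-resv.h>

#include "i915_gem_object.h" /* struct drm_i915_gem_object, driver header */

/*
 * Hypothetical helper, for illustration only: lock the object's
 * reservation either under the caller's ww acquire context (a direct
 * shrink from a get_pages path that is already collecting reservations)
 * or with a trylock when called from reclaim, where we must neither
 * block nor recurse.
 */
static bool shrinker_lock_object(struct drm_i915_gem_object *obj,
				 struct ww_acquire_ctx *ww)
{
	if (!ww)
		return dma_resv_trylock(obj->base.resv);

	/* -EDEADLK backoff and retry by the caller is omitted for brevity */
	return dma_resv_lock(obj->base.resv, ww) == 0;
}

The reclaim entry points (kswapd, direct reclaim, the oom notifier) would
simply pass NULL and keep the trylock behaviour.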
> >>>>> +				__i915_gem_object_put_pages(obj);
> >>>>>  				if (!i915_gem_object_has_pages(obj)) {
> >>>>>  					try_to_writeback(obj, shrink);
> >>>>>  					count += obj->base.size >> PAGE_SHIFT;
> >>>>>  				}
> >>>>> -				mutex_unlock(&obj->mm.lock);
> >>>>> +				i915_gem_object_unlock(obj);
> >>>>>  			}
> >>>>>
> >>>>>  			scanned += obj->base.size >> PAGE_SHIFT;
> >>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
> >>>>> index ff72ee2fd9cd..ac12e1c20e66 100644
> >>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
> >>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
> >>>>> @@ -265,7 +265,6 @@ i915_gem_object_set_tiling(struct drm_i915_gem_object *obj,
> >>>>>  	 * pages to prevent them being swapped out and causing corruption
> >>>>>  	 * due to the change in swizzling.
> >>>>>  	 */
> >>>>> -	mutex_lock(&obj->mm.lock);
> >>>>>  	if (i915_gem_object_has_pages(obj) &&
> >>>>>  	    obj->mm.madv == I915_MADV_WILLNEED &&
> >>>>>  	    i915->quirks & QUIRK_PIN_SWIZZLED_PAGES) {
> >>>>> @@ -280,7 +279,6 @@ i915_gem_object_set_tiling(struct drm_i915_gem_object *obj,
> >>>>>  			obj->mm.quirked = true;
> >>>>>  		}
> >>>>>  	}
> >>>>> -	mutex_unlock(&obj->mm.lock);
> >>>>>
> >>>>>  	spin_lock(&obj->vma.lock);
> >>>>>  	for_each_ggtt_vma(vma, obj) {
> >>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> >>>>> index e946032b13e4..80907c00c6fd 100644
> >>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> >>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> >>>>> @@ -129,8 +129,15 @@ userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> >>>>>  		ret = i915_gem_object_unbind(obj,
> >>>>>  					     I915_GEM_OBJECT_UNBIND_ACTIVE |
> >>>>>  					     I915_GEM_OBJECT_UNBIND_BARRIER);
> >>>>> -		if (ret == 0)
> >>>>> -			ret = __i915_gem_object_put_pages(obj);
> >>>>> +		if (ret == 0) {
> >>>>> +			/* ww_mutex and mmu_notifier is fs_reclaim tainted */
> >>>>> +			if (i915_gem_object_trylock(obj)) {
> >>>>> +				ret = __i915_gem_object_put_pages(obj);
> >>>>> +				i915_gem_object_unlock(obj);
> >>>>> +			} else {
> >>>>> +				ret = -EAGAIN;
> >>>>> +			}
> >>>>> +		}
> >>>>
> >>>> I'm not sure upstream will agree with this kind of API:
> >>>>
> >>>> 1. It will deadlock when RT tasks are used.
> >>>
> >>> It will or it can? Which part? It will break out of the loop if the
> >>> trylock fails.
> >>>
> >>>> 2. You start throwing -EAGAIN because you don't have the correct
> >>>> ordering of locking; this needs fixing first.
> >>>
> >>> Is it about correct ordering of locks or something else? If memory
> >>> allocation is allowed under dma_resv.lock, then the opposite order
> >>> cannot be taken in any case.
> >>>
> >>> I've had a brief look at the amdgpu solution and maybe I
> >>> misunderstood something, but it looks like a BKL approach with the
> >>> device-level notifier_lock. Their userptr notifier blocks on that
> >>> one, not on the dma_resv lock, but that also means their command
> >>> submission (amdgpu_cs_submit) blocks on the same lock while
> >>> obtaining backing store.
> >>
> >> If I read Christian right, it blocks on that lock only just before
> >> command submission, to validate that sequence number. If there is a
> >> mismatch, it needs to rerun the CS. I'm not sure how common userptr
> >> buffers are, but if a device-wide mutex hurts too much, there are
> >> perhaps more fine-grained solutions. (Like an rw semaphore, and
> >> unlock before the fence wait in the notifier: CS which are unaffected
> >> shouldn't need to wait...)
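
For reference, the sequence-number scheme being described maps onto the
mmu_interval_notifier API roughly as below; the invalidate callback has
the mmu_interval_notifier_ops signature. This is a condensed sketch of
the pattern rather than amdgpu's actual code: struct my_device,
dev->notifier_lock, get_backing_pages() and publish_fences_locked() are
made-up placeholders.

#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

/* Placeholders for this sketch; the real driver has its own equivalents. */
extern struct my_device {
	struct mutex notifier_lock;	/* device-wide, like amdgpu's */
} *dev;
int get_backing_pages(struct mmu_interval_notifier *mni);
int publish_fences_locked(struct mmu_interval_notifier *mni);

/*
 * Invalidation side: bump the interval's sequence number under the
 * device-wide notifier lock (waiting for already-published fences and
 * the mmu_notifier_range_blockable() handling are omitted here).
 */
static bool userptr_invalidate(struct mmu_interval_notifier *mni,
			       const struct mmu_notifier_range *range,
			       unsigned long cur_seq)
{
	mutex_lock(&dev->notifier_lock);
	mmu_interval_set_seq(mni, cur_seq);
	mutex_unlock(&dev->notifier_lock);
	return true;
}

/*
 * Submission side: sample the sequence number, grab the backing pages
 * outside any notifier/reservation lock, and only take notifier_lock
 * briefly at the end to check for a race before publishing fences.
 */
static int submit_userptr(struct mmu_interval_notifier *mni)
{
	unsigned long seq;
	int ret;

retry:
	seq = mmu_interval_read_begin(mni);

	ret = get_backing_pages(mni);	/* may sleep, allocate, call gup */
	if (ret)
		return ret;

	mutex_lock(&dev->notifier_lock);
	if (mmu_interval_read_retry(mni, seq)) {
		/* an invalidation raced with us: pages are stale, rerun the CS */
		mutex_unlock(&dev->notifier_lock);
		goto retry;
	}
	ret = publish_fences_locked(mni);	/* hand the batch to the HW */
	mutex_unlock(&dev->notifier_lock);

	return ret;
}

The point of the scheme is that the notifier never has to take the
reservation/ww locks at all; it only synchronises against fence
publication through notifier_lock and the sequence number.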
> >
> > Right, I wasn't familiar with the mmu notifier seqno business. Hm,
> > i915 and amdgpu seem to use different mmu notifier hooks. Could we use
> > the same, and does it mean "locking order" is irrelevant to the
> > sub-story of userptr?
> >
> >>
> >>>
> >>> So it looks like a big hammer approach not directly related to the
> >>> story of dma_resv locking. Maybe we could do the same big hammer
> >>> approach, although I am not sure how it is deadlock free.
> >>>
> >>> What happens for instance if someone submits an userptr batch which
> >>> gets unmapped while amdgpu_cs_submit is holding the notifier_lock?
> >>
> >> My understanding is the unmapping operation blocks on the
> >> notifier_lock in the mmu notifier?
> >
> > Yes, the key will be understanding the difference between
> > invalidate_range_start and invalidate+seqno, and whether we can use
> > the same from i915 to avoid the trylock.
> >
> > But given how the ww mutex is fs_reclaim tainted, does that mean we
> > would also need to ensure no memory allocations under the ww mutex?
>
> So with the fs_reclaim, we can do memory allocations under ww_mutex, but
> need trylock in shrinkers and mmu_notifiers. However, get_user_pages()
> must not run under ww_mutex due to mmap_sem inversion.

There's no real reason for the shrinker to call into mmu-notifiers; it
certainly doesn't work for us, where we handle the shrinking via another
shrinker. The debate is between adding a skip for page_maybe_dma_pinned
to the lru shrink, and never adding pinned dma pages to the lru list in
the first place.

> But with the AMD solution I figure we don't really need the ww_mutex in
> the mmu_notifier at all? That would possibly mean holding on to and
> exposing to the GPU pages that aren't used anymore, but a shrinker would
> notice the sequence number mismatch and consider the pages purgeable,
> and also at the next CS the stale pages would be released.

You need some lock to coordinate in order to give the guarantee required
by the mmu-notifier interface we use. HMM is likely the only way around
that, by moving the goalposts entirely.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx