On Thu, Nov 19, 2015 at 09:42:17AM +0000, Tvrtko Ursulin wrote:
> 
> On 19/11/15 09:17, Daniel Vetter wrote:
> >On Wed, Nov 18, 2015 at 05:18:30PM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 17/11/15 17:56, Daniel Vetter wrote:
> >>>On Tue, Nov 17, 2015 at 05:24:01PM +0000, Tvrtko Ursulin wrote:
> >>>>
> >>>>On 17/11/15 17:08, Daniel Vetter wrote:
> >>>>>On Tue, Nov 17, 2015 at 04:54:50PM +0000, Tvrtko Ursulin wrote:
> >>>>>>
> >>>>>>On 17/11/15 16:39, Daniel Vetter wrote:
> >>>>>>>On Tue, Nov 17, 2015 at 04:27:12PM +0000, Tvrtko Ursulin wrote:
> >>>>>>>>From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> >>>>>>>>
> >>>>>>>>In the following commit:
> >>>>>>>>
> >>>>>>>>   commit e9f24d5fb7cf3628b195b18ff3ac4e37937ceeae
> >>>>>>>>   Author: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> >>>>>>>>   Date:   Mon Oct 5 13:26:36 2015 +0100
> >>>>>>>>
> >>>>>>>>       drm/i915: Clean up associated VMAs on context destruction
> >>>>>>>>
> >>>>>>>>I added a WARN_ON assertion that the VM's active list must be empty
> >>>>>>>>at the time the owning context is getting freed, but that turned
> >>>>>>>>out to be a wrong assumption.
> >>>>>>>>
> >>>>>>>>Due to the ordering of operations in i915_gem_object_retire__read,
> >>>>>>>>where contexts are unreferenced before VMAs are moved to the
> >>>>>>>>inactive list, the described situation can in fact happen.
> >>>>>>>>
> >>>>>>>>It feels wrong to do things in such an order, so this fix makes sure
> >>>>>>>>a reference to the context is held until the move to the inactive
> >>>>>>>>list is completed.
> >>>>>>>>
> >>>>>>>>v2: Rather than hold a temporary context reference, move the
> >>>>>>>>    request unreference to be the last operation. (Daniel Vetter)
> >>>>>>>>
> >>>>>>>>v3: Fix use after free. (Chris Wilson)
> >>>>>>>>
> >>>>>>>>Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> >>>>>>>>Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92638
> >>>>>>>>Cc: Michel Thierry <michel.thierry@xxxxxxxxx>
> >>>>>>>>Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >>>>>>>>Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
> >>>>>>>>---
> >>>>>>>>  drivers/gpu/drm/i915/i915_gem.c | 33 ++++++++++++++++++---------------
> >>>>>>>>  1 file changed, 18 insertions(+), 15 deletions(-)
> >>>>>>>>
> >>>>>>>>diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >>>>>>>>index 98c83286ab68..094ac17a712d 100644
> >>>>>>>>--- a/drivers/gpu/drm/i915/i915_gem.c
> >>>>>>>>+++ b/drivers/gpu/drm/i915/i915_gem.c
> >>>>>>>>@@ -2404,29 +2404,32 @@ i915_gem_object_retire__read(struct drm_i915_gem_object *obj, int ring)
> >>>>>>>>  	RQ_BUG_ON(!(obj->active & (1 << ring)));
> >>>>>>>>
> >>>>>>>>  	list_del_init(&obj->ring_list[ring]);
> >>>>>>>>- 	i915_gem_request_assign(&obj->last_read_req[ring], NULL);
> >>>>>>>>
> >>>>>>>>  	if (obj->last_write_req && obj->last_write_req->ring->id == ring)
> >>>>>>>>  		i915_gem_object_retire__write(obj);
> >>>>>>>>
> >>>>>>>>  	obj->active &= ~(1 << ring);
> >>>>>>>>- 	if (obj->active)
> >>>>>>>>- 		return;
> >>>>>>>
> >>>>>>>	if (obj->active) {
> >>>>>>>		i915_gem_request_assign(&obj->last_read_req[ring], NULL);
> >>>>>>>		return;
> >>>>>>>	}
> >>>>>>>
> >>>>>>>Would result in less churn in the code and drop the unnecessary indent
> >>>>>>>level. Also a comment is missing as to why we need to do things in a
> >>>>>>>specific order.
> >>>>>>
> >>>>>>Actually I think I changed my mind and that v1 is the way to go.
> >>>>>>
> >>>>>>Just re-ordering the code here still makes it possible for the context
> >>>>>>destructor to run with VMAs on the active list, I think.
> >>>>>>
> >>>>>>If we hold the context then it is 100% clear it is not possible.
> >>>>>
> >>>>>request_assign _is_ the function which adjusts the refcounts for us,
> >>>>>which means if we drop that reference too early then grabbing a temp
> >>>>>reference is just papering over the real bug.
> >>>>>
> >>>>>Written out, your patch looks something like
> >>>>>
> >>>>>	a_reference(a);
> >>>>>	a_unreference(a);
> >>>>>
> >>>>>	/* more cleanup code that should get run before a_unreference but isn't */
> >>>>>
> >>>>>	a_unreference(a); /* for real this time */
> >>>>>
> >>>>>Unfortunately foo_assign is a new pattern and not well-established, so
> >>>>>that connection isn't clear. Maybe we should rename it to
> >>>>>foo_reference_assign to make it more obvious. Or just drop the pretense
> >>>>>and open-code it, since we unconditionally assign NULL as the new pointer
> >>>>>value, and we know the current value of the pointer is non-NULL. So
> >>>>>there's really no benefit to the helper here, it only obfuscates. And
> >>>>>since that obfuscation tripped you up it's time to remove it ;-)
> >>>>
> >>>>Then foo_reference_unreference_assign. :)
> >>>>
> >>>>But seriously, I think it is more complicated than that...
> >>>>
> >>>>The thing it trips over is that moving VMAs to inactive does not
> >>>>correspond in time to request retirement. In fact VMAs are moved to
> >>>>inactive only when all requests associated with an object are done.
> >>>>
> >>>>This is the unintuitive thing I was working around: making sure that
> >>>>when the context destructor runs there are no active VMAs for that VM.
> >>>>
> >>>>I don't know how to guarantee that with what you propose. Perhaps I am
> >>>>missing something?
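[As an aside for readers following the thread: the "foo_assign" pattern Daniel describes, an assignment helper that takes a reference on the new value and drops the reference held through the pointer, can be sketched as a runnable toy in plain C. All names here (req, req_assign, live_objects) are invented for illustration and are not i915 code.]

```c
#include <assert.h>
#include <stdlib.h>

static int live_objects; /* counts toy objects not yet freed */

/* Toy reference-counted object, standing in for a request. */
struct req {
	int refcount;
};

static struct req *req_alloc(void)
{
	struct req *r = calloc(1, sizeof(*r));

	r->refcount = 1;
	live_objects++;
	return r;
}

static struct req *req_get(struct req *r)
{
	if (r)
		r->refcount++;
	return r;
}

static void req_put(struct req *r)
{
	if (r && --r->refcount == 0) {
		live_objects--;
		free(r);
	}
}

/*
 * The assign pattern: replace *slot, taking a reference on the new
 * value and dropping the reference held through the slot. Assigning
 * NULL is therefore exactly the point where the old reference (and
 * possibly the object itself) goes away, which is why moving such a
 * call earlier or later moves the point of a potential final free.
 */
static void req_assign(struct req **slot, struct req *new)
{
	struct req *old = *slot;

	*slot = req_get(new);
	req_put(old);
}
```

[In this toy, `req_assign(&slot, NULL)` is the moment the last reference can be released, which mirrors why moving the `i915_gem_request_assign(..., NULL)` call around changes when the context can get freed.]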
> >>>
> >>>Ok, my example was slightly off, since we have 2 objects:
> >>>
> >>>	b_reference(a->b);
> >>>	a_unreference(a); /* might unref a->b if it's the last reference */
> >>>
> >>>	/* more cleanup code that should get run before a_unreference but isn't */
> >>>
> >>>	b_unreference(a->b); /* for real this time */
> >>>
> >>>Holding the ref to a makes sure that b doesn't disappear. We rely on that
> >>>in a fundamental way (a request really needs the ctx to stick around), and
> >>>the bug really is that we drop the ref to a too early. That it's the
> >>>releasing of a->b which is eventually blowing things up doesn't really
> >>>matter.
> >>>
> >>>Btw would it be possible to have an igt for this? It should be possible to
> >>>hit this with some variant of gem_unref_active_buffers.
> >>
> >>I was trying to do that today and it is proving to be a bit tricky.
> >>
> >>I need a blitter workload which will run for long enough for the retire
> >>worker to run. So I'll try and build a big bb tomorrow which will do that.
> >>
> >>Alternatively I did this:
> >>
> >>diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> >>index 6ed7d63a0688..db51e4b42a20 100644
> >>--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> >>+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> >>@@ -699,6 +699,8 @@ i915_gem_execbuffer_reserve(struct intel_engine_cs *ring,
> >>  	int retry;
> >>
> >>  	i915_gem_retire_requests_ring(ring);
> >>+	if (i915.enable_execlists)
> >>+		intel_execlists_retire_requests(ring);
> >>
> >>  	vm = list_first_entry(vmas, struct i915_vma, exec_list)->vm;
> >>
> >>And that enables me to trigger the WARN from my igt even with the current
> >>shorter blitter workload (copy 64Mb).
> >>
> >>Going back to the original problem, how about something like this hunk for a fix?
> >>
> >>@@ -2413,19 +2416,36 @@ static void
> >>  i915_gem_object_retire__read(struct drm_i915_gem_object *obj, int ring)
> >>  {
> >>  	struct i915_vma *vma;
> >>+	struct i915_hw_ppgtt *ppgtt;
> >>
> >>  	RQ_BUG_ON(obj->last_read_req[ring] == NULL);
> >>  	RQ_BUG_ON(!(obj->active & (1 << ring)));
> >>
> >>  	list_del_init(&obj->ring_list[ring]);
> >>+
> >>+	ppgtt = obj->last_read_req[ring]->ctx->ppgtt;
> >>+	if (ppgtt) {
> >>+		list_for_each_entry(vma, &obj->vma_list, vma_link) {
> >>+			if (vma->vm == &ppgtt->base &&
> >>+			    !list_empty(&vma->mm_list)) {
> >>+				list_move_tail(&vma->mm_list,
> >>+					       &vma->vm->inactive_list);
> >>+			}
> >>+		}
> >>+	}
> >>+
> >>  	i915_gem_request_assign(&obj->last_read_req[ring], NULL);
> >>
> >>This moves VMAs immediately to inactive as requests are retired and avoids
> >>the problem with them staying on active for an undefined amount of time.
> >
> >You can't put active objects onto the inactive list, i.e. the obj->active
> 
> Hm it is inactive in this VM so who would care?

The shrinker will fall over because of some long-standing design mistake.
Well, that's only the obj->active vs. vma->active problem. The real one is
that you can have piles of concurrent read requests on a specific vma, and
retiring the first one of those doesn't make the vma inactive. So even with
the active-tracking design mistake fixed your suggestion wouldn't work.

> Perhaps then the fix is simply to remove the
> WARN_ON(!list_empty(&ppgtt->base.active_list)) from the context
> destructor.
> 
> If there are active VMAs at that point, they'll get cleaned up when they
> are retired and there is no leak.
> 
> >check is non-optional. And the if (ppgtt) case is an abstraction
> >violation. I really don't get why we can't just move the unref to the
> >right place ...
> 
> I don't see where.
Please explain why the below change (which is the one I've been proposing,
and which Chris suggested too) doesn't work:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 9552647a925d..d16b5ca042fa 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2375,14 +2375,15 @@ i915_gem_object_retire__read(struct drm_i915_gem_object *obj, int ring)
 	RQ_BUG_ON(!(obj->active & (1 << ring)));
 
 	list_del_init(&obj->ring_list[ring]);
-	i915_gem_request_assign(&obj->last_read_req[ring], NULL);
 
 	if (obj->last_write_req && obj->last_write_req->ring->id == ring)
 		i915_gem_object_retire__write(obj);
 
 	obj->active &= ~(1 << ring);
-	if (obj->active)
+	if (obj->active) {
+		i915_gem_request_assign(&obj->last_read_req[ring], NULL);
 		return;
+	}
 
 	/* Bump our place on the bound list to keep it roughly in LRU order
 	 * so that we don't steal from recently used but inactive objects
@@ -2396,6 +2397,7 @@ i915_gem_object_retire__read(struct drm_i915_gem_object *obj, int ring)
 		list_move_tail(&vma->mm_list, &vma->vm->inactive_list);
 	}
 
+	/* Only unref once we're on the inactive list. */
+	i915_gem_request_assign(&obj->last_read_req[ring], NULL);
 	i915_gem_request_assign(&obj->last_fenced_req, NULL);
 	drm_gem_object_unreference(&obj->base);
 }

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
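[Footnote for readers: the two-object ordering bug Daniel writes out above, where dropping the reference to the owner a can free the child a->b before cleanup that needs it has run, can be demonstrated with a toy refcount in plain C. Every name below (a_obj, b_obj, retire_buggy, retire_fixed) is invented for illustration; none of this is i915 code, and the "cleanup" is only a stand-in for moving VMAs to the inactive list.]

```c
#include <assert.h>
#include <stdlib.h>

static int b_alive; /* 1 while the shared child object still exists */

/* Child object ("b": roughly the VM/VMAs owned by a context). */
struct b_obj {
	int refcount;
};

/* Parent object ("a": roughly the request's context). */
struct a_obj {
	int refcount;
	struct b_obj *b;
};

static void b_unreference(struct b_obj *b)
{
	if (b && --b->refcount == 0) {
		b_alive = 0;
		free(b);
	}
}

static void a_unreference(struct a_obj *a)
{
	if (a && --a->refcount == 0) {
		b_unreference(a->b); /* dropping a can drop its b too */
		free(a);
	}
}

static struct a_obj *make_pair(void)
{
	struct a_obj *a = calloc(1, sizeof(*a));
	struct b_obj *b = calloc(1, sizeof(*b));

	a->refcount = 1;
	b->refcount = 1;
	a->b = b;
	b_alive = 1;
	return a;
}

/*
 * Stand-in for the cleanup that must run before the final
 * unreference (moving VMAs to the inactive list). It reports
 * whether the child was still alive when it ran.
 */
static int cleanup_step(void)
{
	return b_alive;
}

/* Buggy ordering: the owner is unreferenced before the cleanup. */
static int retire_buggy(struct a_obj *a)
{
	a_unreference(a);
	return cleanup_step(); /* too late: b may already be gone */
}

/* Fixed ordering: the unreference is the last operation. */
static int retire_fixed(struct a_obj *a)
{
	int ok = cleanup_step();

	a_unreference(a);
	return ok;
}
```

[retire_buggy finds the child already freed, while retire_fixed, which mirrors moving i915_gem_request_assign(..., NULL) to be the last operation, still sees it alive.]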