On 11/23/2015 02:34 AM, Tvrtko Ursulin wrote:
On 20/11/15 08:31, Daniel Vetter wrote: > On Thu, Nov 19, 2015 at 04:10:26PM -0800, yu.dai@xxxxxxxxx wrote: >> From: Alex Dai <yu.dai@xxxxxxxxx> >> >> There is a memory leak warning message from i915_gem_context_clean >> when GuC submission is enabled. The reason is that when LRC is >> released, its ppgtt could be still referenced. The assumption that >> all VMAs are unbound during release of LRC is not true. >> >> v1: Move the code inside i915_gem_context_clean() to where ppgtt is >> released because it is not cleaning context anyway but ppgtt. >> >> Signed-off-by: Alex Dai <yu.dai@xxxxxxxxx> > > retire__read drops the ctx (and hence ppgtt) reference too early, > resulting in us hitting the WARNING. See the giant thread with Tvrtko, > Chris and me: > > http://www.spinics.net/lists/intel-gfx/msg78918.html > > Would be great if someone could test the diff I posted in there. It doesn't work - I have posted my IGT snippet which I thought explained it.
I thought moving the VMA list clean up into i915_ppgtt_release() should work. However, it creates a chicken & egg problem. ppgtt_release() rely on vma_unbound() to be called first to decrease its refcount. So calling vma_unbound() inside ppgtt_release() is not right.
Problem req unreference in obj->active case. When it hits that path it will not move the VMA to inactive and the intel_execlists_retire_requests will be the last unreference from the retire worker which will trigger the WARN.
I still think the problem comes from the assumption that when lrc is released, its all VMAs should be unbound. Precisely I mean the comments made for i915_gem_context_clean() - "This context is going away and we need to remove all VMAs still around." Really the lrc life cycle is different from ppgtt / VMAs. Check the line after i915_gem_context_clean(). It is ppgtt_put(). In the case lrc is freed early, It won't release ppgtt anyway because it is still referenced by VMAs. An it will be freed when no ref of GEM obj.
I posted an IGT which hits that -> http://patchwork.freedesktop.org/patch/65369/ And posted one give up on the active VMA mem leak patch -> http://patchwork.freedesktop.org/patch/65529/
This patch will silent the warning. But I think the i915_gem_context_clean() itself is unnecessary. I don't see any issue by deleting it. The check of VMA list is inside ppgtt_release() and the unbound should be aligned to GEM obj's life cycle but not lrc life cycle.
I have no idea yet of GuC implications, I just spotted this parallel thread. And Mika has proposed something interesting - that we could just clean up the active VMA in context cleanup since we know it is a false one. However, again I don't know how that interacts with the GuC. Surely it cannot be freeing the context with stuff genuinely still active in the GuC?
There is no interacts with GuC though. Just very easy to see the warning when GuC is enabled, says when run gem_close_race. The reason is that GuC does not use the execlist_queue (execlist_retired_req_list) which is deferred to retire worker. Same as ring submission mode, when GuC is enabled, whenever driver submits a new batch, it will try to release previous request. I don't know why intel_execlists_retire_requests is not called for this case. Probably because of the unpin. Deferring the retirement may just hide the issue. I bet you will see the warning more often if you change i915_gem_retire_requests_ring() to i915_gem_retire_requests() in i915_gem_execbuffer_reserve().
Thanks, Alex _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/intel-gfx