Re: [PATCH v1] drm/i915: Fix a false alert of memory leak when free LRC

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 23/11/15 22:30, Yu Dai wrote:
On 11/23/2015 02:34 AM, Tvrtko Ursulin wrote:
On 20/11/15 08:31, Daniel Vetter wrote:
> On Thu, Nov 19, 2015 at 04:10:26PM -0800, yu.dai@xxxxxxxxx wrote:
>> From: Alex Dai <yu.dai@xxxxxxxxx>
>>
>> There is a memory leak warning message from i915_gem_context_clean
>> when GuC submission is enabled. The reason is that when LRC is
>> released, its ppgtt could be still referenced. The assumption that
>> all VMAs are unbound during release of LRC is not true.
>>
>> v1: Move the code inside i915_gem_context_clean() to where ppgtt is
>> released because it is not cleaning context anyway but ppgtt.
>>
>> Signed-off-by: Alex Dai <yu.dai@xxxxxxxxx>
>
> retire__read drops the ctx (and hence ppgtt) reference too early,
> resulting in us hitting the WARNING. See the giant thread with Tvrtko,
> Chris and me:
>
> http://www.spinics.net/lists/intel-gfx/msg78918.html
>
> Would be great if someone could test the diff I posted in there.

It doesn't work - I have posted my IGT snippet which I thought
explained it.

I thought moving the VMA list clean up into i915_ppgtt_release() should
work. However, it creates a chicken & egg problem. ppgtt_release() rely
on vma_unbound() to be called first to decrease its refcount. So calling
vma_unbound() inside ppgtt_release() is not right.
Problem req unreference in obj->active case. When it hits that path it
will not move the VMA to inactive and the
intel_execlists_retire_requests will be the last unreference from the
retire worker which will trigger the WARN.

I still think the problem comes from the assumption that when lrc is
released, its all VMAs should be unbound. Precisely I mean the comments
made for i915_gem_context_clean() - "This context is going away and we
need to remove all VMAs still around." Really the lrc life cycle is
different from ppgtt / VMAs. Check the line after
i915_gem_context_clean(). It is ppgtt_put(). In the case lrc is freed
early, It won't release ppgtt anyway because it is still referenced by
VMAs. An it will be freed when no ref of GEM obj.

Yes, but the last is just a consequence of how the code was laid out, not a hard requirement as far as I understand it.

When context destructor runs that means two things:

1. GPU is done with all VMAs belonging to the VM.
2. USerspace also cannot reach them any more either.

So VMAs have no reason to exist beyond that point. If they do, they can be left dangling under certain circumstances. See below.

I posted an IGT which hits that ->
http://patchwork.freedesktop.org/patch/65369/

And posted one give up on the active VMA mem leak patch ->
http://patchwork.freedesktop.org/patch/65529/

This patch will silent the warning. But I think the
i915_gem_context_clean() itself is unnecessary. I don't see any issue by
deleting it. The check of VMA list is inside ppgtt_release() and the
unbound should be aligned to GEM obj's life cycle but not lrc life cycle.

i915_gem_context_clean solves a memory leak which is explained in the commit.

If there is an extra reference on the GEM obj, like in the flink case, VMAs will not get unbound.

Apart from various IGTs, you can also demonstrate this leak with fbcon and Xorg. Every time you re-start Xorg one VMA will be leaked since it imports the fbcon fb via flink.

Second part of the story is that the WARN in i915_gem_context_clean is wrong because of how retirement works. So if we removed the WARN, cleanup still has some value to avoid memory leak in the above described Xorg case, or in any case where flink is in the picture and VMAs have been retired to inactive.

Bug left standing would be that if the VMA happens to be flinked and on the active list, because of say being shared between rings and contexts, it would still be leaked. But it is less leaking than without the cleanup.

I proposed another way of fixing that: http://patchwork.freedesktop.org/patch/65861/

I have no idea yet of GuC implications, I just spotted this parallel
thread.

And Mika has proposed something interesting - that we could just clean
up the active VMA in context cleanup since we know it is a false one.

However, again I don't know how that interacts with the GuC. Surely it
cannot be freeing the context with stuff genuinely still active in the
GuC?


There is no interacts with GuC though. Just very easy to see the warning
when GuC is enabled, says when run gem_close_race. The reason is that
GuC does not use the execlist_queue (execlist_retired_req_list) which is
deferred to retire worker. Same as ring submission mode, when GuC is
enabled, whenever driver submits a new batch, it will try to release
previous request. I don't know why  intel_execlists_retire_requests is
not called for this case. Probably because of the unpin. Deferring the
retirement may just hide the issue. I bet you will see the warning more
often if you change i915_gem_retire_requests_ring() to
i915_gem_retire_requests() in i915_gem_execbuffer_reserve().

Hmmm.. gem_close_race uses flink, but why would it trigger context destruction with active VMAs? How does the backtrace looks like from the context cleanup WARN?

Regards,

Tvrtko





_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]
  Powered by Linux