Re: [PATCH v4] drm/i915: Clean up associated VMAs on context destruction

Daniel Vetter <daniel@xxxxxxxx> · Tue, 6 Oct 2015 11:28:49 +0200

On Tue, Oct 06, 2015 at 10:04:31AM +0100, Chris Wilson wrote:
> On Tue, Oct 06, 2015 at 10:58:01AM +0200, Daniel Vetter wrote:
> > On Mon, Oct 05, 2015 at 01:26:36PM +0100, Tvrtko Ursulin wrote:
> > > From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> > > 
> > > Prevent leaking VMAs and PPGTT VMs when objects are imported
> > > via flink.
> > > 
> > > Scenario is that any VMAs created by the importer will be left
> > > dangling after the importer exits, or destroys the PPGTT context
> > > with which they are associated.
> > > 
> > > This is caused by object destruction not running when the
> > > importer closes the buffer object handle due the reference held
> > > by the exporter. This also leaks the VM since the VMA has a
> > > reference on it.
> > > 
> > > In practice these leaks can be observed by stopping and starting
> > > the X server on a kernel with fbcon compiled in. Every time
> > > X server exits another VMA will be leaked against the fbcon's
> > > frame buffer object.
> > > 
> > > Also on systems where flink buffer sharing is used extensively,
> > > like Android, this leak has even more serious consequences.
> > > 
> > > This version is takes a general approach from the  earlier work
> > > by Rafael Barbalho (drm/i915: Clean-up PPGTT on context
> > > destruction) and tries to incorporate the subsequent discussion
> > > between Chris Wilson and Daniel Vetter.
> > > 
> > > v2:
> > > 
> > > Removed immediate cleanup on object retire - it was causing a
> > > recursive VMA unbind via i915_gem_object_wait_rendering. And
> > > it is in fact not even needed since by definition context
> > > cleanup worker runs only after the last context reference has
> > > been dropped, hence all VMAs against the VM belonging to the
> > > context are already on the inactive list.
> > > 
> > > v3:
> > > 
> > > Previous version could deadlock since VMA unbind waits on any
> > > rendering on an object to complete. Objects can be busy in a
> > > different VM which would mean that the cleanup loop would do
> > > the wait with the struct mutex held.
> > > 
> > > This is an even simpler approach where we just unbind VMAs
> > > without waiting since we know all VMAs belonging to this VM
> > > are idle, and there is nothing in flight, at the point
> > > context destructor runs.
> > > 
> > > v4:
> > > 
> > > Double underscore prefix for __915_vma_unbind_no_wait and a
> > > commit message typo fix. (Michel Thierry)
> > > 
> > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> > > Testcase: igt/gem_ppgtt.c/flink-and-exit-vma-leak
> > > Reviewed-by: Michel Thierry <michel.thierry@xxxxxxxxx>
> > > Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
> > > Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > > Cc: Rafael Barbalho <rafael.barbalho@xxxxxxxxx>
> > > Cc: Michel Thierry <michel.thierry@xxxxxxxxx>
> > 
> > Queued for -next, thanks for the patch.
> 
> Please no, it's an awful patch and does not even fix the root cause of
> the leak (that the vma are not closed when the handle is).

It's a lose-lose situation for me as maintainer (and this holds in general
really, not just for this patch):
- Either I wield my considerable maintainer powers and force proper
  solution, which will piss of a lot of people.
- Or I just merge intermediate stuff and piss of another set of people
  (including likely our all future selves because we're slowly digging a
  tech debt grave).

What I can't do with maintainer fu is force collaboration, I can only try
to piss off everyone equally. And today the die rolled a "merge".

Oh well.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx