Re: [PATCH 1/5] drm/vmwgfx: unbind in vmw_ttm_unpopulate

Daniel Vetter <daniel@xxxxxxxx> · Tue, 27 Jul 2021 11:28:44 +0200

On Tue, Jul 27, 2021 at 06:03:12AM +1000, Dave Airlie wrote:
> On Tue, 27 Jul 2021 at 05:35, Christian König
> <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
> >
> > Am 26.07.21 um 02:03 schrieb Dave Airlie:
> > > [SNIP]
> > >> But you know, normal human: Only equipped with one head and two hands
> > >> and not cloneable.
> > > I'm the same, but I'm not seeing where this problem happens at all, do
> > > we have any backtraces or demos for this?
> >
> > I've stumbled over this while working on some patches which accidentally
> > broke delayed delete and caused random memory corruption and was
> > wondering how that even happened in the first place.
> >
> > Having stale PTEs in the GART isn't a problem unless you break other
> > things. Which is also the reason why I haven't added a CC stable yet.
> >
> > > I split bind/unbind into the driver, but the driver should still
> > > always be moving things to unbound states before an unpopulate is
> > > called, is there a driver doing something unexpected here?
> >
> > Possible, I was only testing with amdgpu and the GART handling is rather
> > special there (which was one of the reasons why we did that in the first
> > place).
> >
> > > at worst I'd like to see a WARN_ON put in first and a test in igt that
> > > triggers it, because right now I'm not see that path through the
> > > drivers/ttm that leads to unpopulated pages ending up happening while
> > > bound.
> > >
> > >  From 5.14-rc3 unpopulate is called from ttm_bo_pipeline_gutting in
> > > non-ghost path and there is no unbind,
> > > pipeline gutting is called from evict/validate, when there is no
> > > placement suggested for the object, is this case getting hit somewhere
> > > without the driver having previously unbound things?
> >
> > Yes, that will hit absolutely. Most likely with VM page tables for
> > amdgpu which uses this.
> 
> My thing is here we moved binding/unbinding to the driver, if the
> driver has a bug
> I'd expect the fix to be in the driver side here. I think giving
> drivers control over something
> and enforcing it in the core/helpers is fine, but I don't think we
> should be adding back
> the midlayering.
> 
> > > ttm_tt_destroy_common: calls unpopulate, everyone calls this directly
> > > after unbinding
> > > ttm_tt_swapout: calls unpopulate, we don't swapout objects from TT
> > > directly, we should always unbind first, this used to have an assert
> > > against that,
> > > ttm_tt_populate: call unpopulate in failure path
> >
> > unbinding was moved into the driver, so ttm_tt_swapout() won't unbind
> > anything before calling unpopulate as far as I can see.
> 
> But we won't call swapout on BOs that aren't in SYSTEM and to be in SYSTEM,
> the bo would have to go through the driver move function which will unbind it.
> 
> >
> > > So any place I can see unpopulate getting called with a bound TT
> > > should be a bug, and fixed, we could protect against it better but I'm
> > > not seeing the need for this series to outright revert things back as
> > > helping.
> >
> > I'm not reverting this because it is offhand wrong, but making sure the
> > GART is clear before unpopulating the TT object sounds like the much
> > more natural thing to do and the state machine is something I certainly
> > don't see in the backend.
> >
> 
> I don't think that's the core's responsibility anymore, I'm fine with
> adding code and
> even an flag that says if the the TT is bound/unbound that unpopulate
> can WARN_ON()
> against so we get a backtrace and track down where something is
> getting unpopulated
> without going through the driver move function to be unbound.

Yeah I think sprinkling a few WARN_ON around to make sure drivers use the
functions correctly is the right thing. Re-midlayering because we don't
trust drivers to do things correctly isn't.

Aside from this principle I'll let you two duke out what's actually going
on :-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch