On Tue, Jul 27, 2021 at 06:03:12AM +1000, Dave Airlie wrote: > On Tue, 27 Jul 2021 at 05:35, Christian König > <ckoenig.leichtzumerken@xxxxxxxxx> wrote: > > > > Am 26.07.21 um 02:03 schrieb Dave Airlie: > > > [SNIP] > > >> But you know, normal human: Only equipped with one head and two hands > > >> and not cloneable. > > > I'm the same, but I'm not seeing where this problem happens at all, do > > > we have any backtraces or demos for this? > > > > I've stumbled over this while working on some patches which accidentally > > broke delayed delete and caused random memory corruption and was > > wondering how that even happened in the first place. > > > > Having stale PTEs in the GART isn't a problem unless you break other > > things. Which is also the reason why I haven't added a CC stable yet. > > > > > I split bind/unbind into the driver, but the driver should still > > > always be moving things to unbound states before an unpopulate is > > > called, is there a driver doing something unexpected here? > > > > Possible, I was only testing with amdgpu and the GART handling is rather > > special there (which was one of the reasons why we did that in the first > > place). > > > > > at worst I'd like to see a WARN_ON put in first and a test in igt that > > > triggers it, because right now I'm not see that path through the > > > drivers/ttm that leads to unpopulated pages ending up happening while > > > bound. > > > > > > From 5.14-rc3 unpopulate is called from ttm_bo_pipeline_gutting in > > > non-ghost path and there is no unbind, > > > pipeline gutting is called from evict/validate, when there is no > > > placement suggested for the object, is this case getting hit somewhere > > > without the driver having previously unbound things? > > > > Yes, that will hit absolutely. Most likely with VM page tables for > > amdgpu which uses this. > > My thing is here we moved binding/unbinding to the driver, if the > driver has a bug > I'd expect the fix to be in the driver side here. I think giving > drivers control over something > and enforcing it in the core/helpers is fine, but I don't think we > should be adding back > the midlayering. > > > > ttm_tt_destroy_common: calls unpopulate, everyone calls this directly > > > after unbinding > > > ttm_tt_swapout: calls unpopulate, we don't swapout objects from TT > > > directly, we should always unbind first, this used to have an assert > > > against that, > > > ttm_tt_populate: call unpopulate in failure path > > > > unbinding was moved into the driver, so ttm_tt_swapout() won't unbind > > anything before calling unpopulate as far as I can see. > > But we won't call swapout on BOs that aren't in SYSTEM and to be in SYSTEM, > the bo would have to go through the driver move function which will unbind it. > > > > > > So any place I can see unpopulate getting called with a bound TT > > > should be a bug, and fixed, we could protect against it better but I'm > > > not seeing the need for this series to outright revert things back as > > > helping. > > > > I'm not reverting this because it is offhand wrong, but making sure the > > GART is clear before unpopulating the TT object sounds like the much > > more natural thing to do and the state machine is something I certainly > > don't see in the backend. > > > > I don't think that's the core's responsibility anymore, I'm fine with > adding code and > even an flag that says if the the TT is bound/unbound that unpopulate > can WARN_ON() > against so we get a backtrace and track down where something is > getting unpopulated > without going through the driver move function to be unbound. Yeah I think sprinkling a few WARN_ON around to make sure drivers use the functions correctly is the right thing. Re-midlayering because we don't trust drivers to do things correctly isn't. Aside from this principle I'll let you two duke out what's actually going on :-) -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch