On Fri, 23 Jul 2021 at 19:40, Christian König <ckoenig.leichtzumerken@xxxxxxxxx> wrote: > > Am 23.07.21 um 11:21 schrieb Daniel Vetter: > > On Fri, Jul 23, 2021 at 11:13 AM Christian König > > <ckoenig.leichtzumerken@xxxxxxxxx> wrote: > >> Am 23.07.21 um 10:47 schrieb Daniel Vetter: > >>> On Thu, Jul 22, 2021 at 02:41:23PM +0200, Christian König wrote: > >>>> Doing this in vmw_ttm_destroy() is to late. > >>>> > >>>> It turned out that this is not a good idea at all because it leaves pointers > >>>> to freed up system memory pages in the GART tables of the drivers. > >>> So I wanted to review this series, and I can't reconcile your claim here > >>> with the demidlayering Dave has done. The driver patches here don't > >>> ouright undo what Dave has done, but that means the bug has been > >>> preexisting since forever (or is due to some other change?), and your > >>> commit message is a bit confusing here. > >>> > >>> The final patch just undoes the demidlayering from Dave, and I really > >>> don't see where there's even a functional change there. > >>> > >>> And even these patches here don't really change a hole lot with the > >>> calling sequence for at least final teardown: ttm_tt_destroy_common calls > >>> ttm_tt_unpopulate as the first thing, so at least there there's no change. > >>> > >>> Can you pls elaborate more clearly what exactly you're fixing and what > >>> exactly needs to be reordered and where this bug is from (commit sha1)? As > >>> is I'm playing detective and the evidence presented is extremely since and > >>> I can't reconcile it at all. > >>> > >>> I mean I know you don't like typing commit message and documentation, but > >>> it does get occasionally rather frustrating on the reviewer side if I have > >>> to interpolate between some very sparse hints for this stuff :-/ > >> Yeah, when have seen the history it's rather obvious what's wrong here > >> and I expected Dave to review it himself. > >> > >> Previously we had three states in TTM for a tt object: Allocated -> > >> Populated -> Bound which on destruction where done in the order unbind > >> -> unpopulate -> free. > >> > >> Dave moved handling of the bound state into the drivers since it is > >> basically a driver decision and not a TTM decision what should be bound > >> and what not (that part perfectly makes sense). > > I haven't reviewed all the patches from Dave, only the one you pointed > > at (in the last patch). And that one I still can't match up with your > > description. If there's other commits relevant, can you pls dig them > > out? > > > > Like it all makes sense what you're saying and matches the code, I > > just can't match it up with the commit you're referencing. > > That is the patch directly following the one I've mentioned: > > commit 37bff6542c4e140a11657406c1bab50a40329cc1 > Author: Dave Airlie <airlied@xxxxxxxxxx> > Date: Thu Sep 17 13:24:50 2020 +1000 > > drm/ttm: move unbind into the tt destroy. > > This moves unbind into the driver side on destroy paths. > > I will add a Fixes tag to make that clear. > > But this patch also just moves the undbind from the TTM destroy path to > the driver destroy path. > > To be honest I'm not 100% sure either when the when the unbind moved > from the unpopulate path into the destroy path, but I think that this > wasn't always the case. Let me try to dig that up. > > >> The problem is that he also moved doing the unbind into the free > >> callback instead of the unpopulate callback. This result in stale page > >> pointers in the GART if that unpopulate operation isn't immediately > >> followed by a free. > >> > >> Thinking more about it if we do populated->unpopulated->populated then > >> we would also have stale pointers to the old pages which is even worse. > >> > >> This is also not de-midlayering since we already have a proper > >> ttm_tt_init()/ttm_tt_fini() functions which should work nicely for the > >> tt object. > > Well you're last patch moves the ttm_tt_destroy_common stuff back into > > ttm, which kinda is de-demidlayering. So I'm confused. > > Ah, yes that is correct. I've also considered to move this in > ttm_tt_fini instead of there. > > But that would be a larger change and I wanted to fix the problem at > hand first, potentially even adding a CC stable tag. > > > Other bit: I think it'd be good to document this properly in the > > callbacks, and maybe ideally go about and kerneldoc-ify the entire > > ttm_tt.h header. Otherwise when we eventually (never?) get around to > > that, everyone has forgotten these semantic details and issues again. > > Already working towards including more of the TTM headers and code files > in kerneldoc. But not quite there yet. > > But you know, normal human: Only equipped with one head and two hands > and not cloneable. I'm the same, but I'm not seeing where this problem happens at all, do we have any backtraces or demos for this? I split bind/unbind into the driver, but the driver should still always be moving things to unbound states before an unpopulate is called, is there a driver doing something unexpected here? at worst I'd like to see a WARN_ON put in first and a test in igt that triggers it, because right now I'm not see that path through the drivers/ttm that leads to unpopulated pages ending up happening while bound. >From 5.14-rc3 unpopulate is called from ttm_bo_pipeline_gutting in non-ghost path and there is no unbind, pipeline gutting is called from evict/validate, when there is no placement suggested for the object, is this case getting hit somewhere without the driver having previously unbound things? ttm_tt_destroy_common: calls unpopulate, everyone calls this directly after unbinding ttm_tt_swapout: calls unpopulate, we don't swapout objects from TT directly, we should always unbind first, this used to have an assert against that, ttm_tt_populate: call unpopulate in failure path So any place I can see unpopulate getting called with a bound TT should be a bug, and fixed, we could protect against it better but I'm not seeing the need for this series to outright revert things back as helping. Dave.