Re: [PATCH 1/6] drm/ttm: move the LRU into resource handling v4

Daniel Vetter <daniel@xxxxxxxx> · Wed, 23 Mar 2022 16:22:07 +0100

On Wed, Mar 23, 2022 at 02:44:17PM +0100, Christian König wrote:
> Am 23.03.22 um 14:36 schrieb Daniel Vetter:
> > On Wed, 23 Mar 2022 at 13:20, Christian König
> > <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
> > > Am 23.03.22 um 12:59 schrieb Daniel Vetter:
> > > > On Mon, Mar 21, 2022 at 02:25:56PM +0100, Christian König wrote:
> > > > > This way we finally fix the problem that new resource are
> > > > > not immediately evict-able after allocation.
> > > > > 
> > > > > That has caused numerous problems including OOM on GDS handling
> > > > > and not being able to use TTM as general resource manager.
> > > > > 
> > > > > v2: stop assuming in ttm_resource_fini that res->bo is still valid.
> > > > > v3: cleanup kerneldoc, add more lockdep annotation
> > > > > v4: consistently use res->num_pages
> > > > > 
> > > > > Signed-off-by: Christian König <christian.koenig@xxxxxxx>
> > > > > Tested-by: Bas Nieuwenhuizen <bas@xxxxxxxxxxxxxxxxxxx>
> > > > > +/**
> > > > > + * struct ttm_lru_bulk_move
> > > > > + *
> > > > > + * @tt: first/last lru entry for resources in the TT domain
> > > > > + * @vram: first/last lru entry for resources in the VRAM domain
> > > > > + *
> > > > > + * Helper structure for bulk moves on the LRU list.
> > > > > + */
> > > > > +struct ttm_lru_bulk_move {
> > > > > +    struct ttm_lru_bulk_move_pos tt[TTM_MAX_BO_PRIORITY];
> > > > > +    struct ttm_lru_bulk_move_pos vram[TTM_MAX_BO_PRIORITY];
> > > > Not really needed, just a thought: Should we track the associated dma_resv
> > > > object here to make sure the locking is all done correctly (and also check
> > > > that the bulk move bo have the same dma_resv)? It wouldn't really be any
> > > > overhead for the !CONFIG_LOCKDEP case and we could sprinkle a lot more
> > > > dma_resv_held all over the place.
> > > You made a similar comment on the last revision and I already tried to
> > > play around with that idea a bit.
> > > 
> > > But I've completely abandoned that idea after realizing that the BOs in
> > > the bulk move actually don't need to have the same dma_resv object, nor
> > > do they all need to be locked.
> > Uh how does that work? If you evict that bo while someone else is
> > doing a bulk move, then at least the result might be rather funny, and
> > I have no idea how it could work.
> 
> The LRU is still protected by the common spinlock.
> 
> So that will synchronize any modification to both the bulk move structure as
> well as the individual list_heads making up the LRU.
> 
> > 
> > Like even if you then make the rule that you have to lock all bos for
> > the bulk move, the bo could still be moved independently, and that
> > would again break the bulk lru properties.
> > 
> > And if you do none of the above, there's no reason for that bo to have
> > a distinct dma_resv.
> > 
> > Like maybe the data structure wont fall apart, but semantically it
> > just doesn't make any sense to me to allow this. What would you want
> > to use this for?
> 
> Yeah, that's a good point.
> 
> It's not technically necessary as far as I can see, but I'm not sure if
> there is a valid use case either.

Yeah I think nothing obviously bad will happen to the list, but it will
very much no longer be an LRU. Which kinda defeats the point - there's
easier ways to not have an LRU like never updating anything :-)

Aside, this is actually what i915-gem folks did without realizing, so
that's also a bit why I think being strict here would be good. It's tricky
and if you're too clever it's way too easy to end up with something which
isn't anything useful anymore.
> 
> > > It just happens that amdgpu is currently using it that way, but I can't
> > > see any technical necessarily to restrict the bulk move like that.
> > Yeah we can do that later on in a follow up patch, or I figure out why
> > it's not a good idea :-) Just figured this might be good to lock down
> > before other drivers start adopting this.
> 
> I'm just wondering if it's really more defensive to restrict the handling
> like that.
> 
> On the other hand we can still lift the restriction when anybody really
> comes along with a valid use case.

Yeah if we do have a use-case we can lift this easily, it's just a few
dma_resv_assert_held. I don't think we need this included for merging, but
it'd be great as a follow-up patch while only amdgpu is using this.
-Daniel

> 
> Christian.
> 
> > -Daniel
> > 
> > > Regards,
> > > Christian.
> > > 
> > > 
> > > > -Daniel
> > 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch