Hi, Christian. Thanks for having a look.

On Thu, 2024-02-29 at 16:08 +0100, Christian König wrote:
> Am 16.02.24 um 15:20 schrieb Thomas Hellström:
> [SNIP]
> > > My approach was basically to not only lock the current BO, but
> > > also the next one. Since only a locked BO can move on the LRU we
> > > effectively created an anchor.
> > >
> > > Before I dig into the code a couple of questions:
> > These are described in the patches but brief comments inline.
> >
> > > 1. How do you distinguish BOs and iteration anchors on the LRU?
> > Using a struct ttm_lru_item, containing a struct list_head and the
> > type. List nodes embed this instead of a struct list_head. This is
> > larger than the list head but makes it explicit what we're doing.
>
> Need to look deeper into the code of this, but it would be nice if we
> could abstract that better somehow.

Do you have any specific concerns or improvements in mind? I think
patches 1 and 2 are pretty straightforward. Patch 3 is indeed a bit
hairy.

> > > 2. How do you detect that a bulk list moved on the LRU?
> > An age u64 counter on the bulk move that we're comparing against.
> > It's updated each time it moves.
> >
> > > 3. How do you remove the iteration anchors from the bulk list?
> > A function call at the end of iteration, that the function
> > iterating is required to call.
>
> Thinking quite a bit about that in the last week and I came to the
> conclusion that this might be overkill.
>
> All BOs in a bulk share the same reservation object. So when you
> acquired one you can just keep the dma-resv locked even after
> evicting the BO.
>
> Since moving BOs requires locking the dma-resv object the whole
> problem then just boils down to a list_for_each_entry_safe().
>
> That's probably a bit simpler than doing the add/remove dance.

I think the problem with the "lock the next object" approach is that
there are situations where it might not work.
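(For readers following the thread: the typed LRU node and the age
counter answered under questions 1 and 2 above can be sketched as a
small userspace simulation. All names and helpers below are
illustrative, not the actual kernel structs or API from the series.)

```c
#include <assert.h>
#include <stddef.h>

/* Userspace sketch of the idea in patches 1 and 2: LRU list nodes
 * carry an explicit type so a walker can tell real resources from
 * iteration anchors (hitches). Everything here is illustrative. */

struct list_head {
	struct list_head *prev, *next;
};

enum lru_item_type {
	LRU_RESOURCE,
	LRU_HITCH,
};

/* Embedded in both resources and hitches instead of a bare
 * list_head; slightly larger, but the node type is explicit. */
struct lru_item {
	struct list_head link;
	enum lru_item_type type;
};

struct resource {
	struct lru_item lru;
	int id;
};

static void list_init(struct list_head *h) { h->prev = h->next = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Used e.g. to remove a hitch when the iteration is done. */
static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = NULL;
}

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Count only real resources, skipping any hitches on the list. */
static int count_resources(struct list_head *lru)
{
	int n = 0;
	for (struct list_head *p = lru->next; p != lru; p = p->next) {
		struct lru_item *it = container_of(p, struct lru_item, link);
		if (it->type == LRU_RESOURCE)
			n++;
	}
	return n;
}
```

A walker would keep its hitch on the list while it drops locks, then
resume from the hitch and finally `list_del()` it, which corresponds
to the mandatory end-of-iteration call mentioned above.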
First, we're not asserting anywhere that all bulk move resources have
the same lock, and after individualization they certainly don't. (I
think I had a patch somewhere to try to enforce that, but I don't
think it ever got reviewed.) I tried to sort out the locking rules at
one point for resources switching bos to ghost objects, but I have
long since forgotten those. I guess it all boils down to the list
elements being resources, not bos.

Also I'm concerned that keeping a resv held for a huge number of
evictions will block out a higher priority ticket for a substantial
amount of time.

I think that while the suggested solution here might be a bit of an
overkill, it's simple enough to understand, whereas the locking
implications of resources switching resvs aren't. But please let me
know how we should proceed here. This is a blocker for other pending
work we have.

/Thomas

>
> Regards,
> Christian.
>
> >
> > /Thomas
> >
> > > Regards,
> > > Christian.
> > >
> > > > The restartable property is used in patch 4 to restart swapout
> > > > if needed, but the main purpose is that this paves the way for
> > > > shrinker- and exhaustive eviction.
> > > >
> > > > Cc: Christian König <christian.koenig@xxxxxxx>
> > > > Cc: <dri-devel@xxxxxxxxxxxxxxxxxxxxx>
> > > >
> > > > Thomas Hellström (4):
> > > >   drm/ttm: Allow TTM LRU list nodes of different types
> > > >   drm/ttm: Use LRU hitches
> > > >   drm/ttm: Consider hitch moves within bulk sublist moves
> > > >   drm/ttm: Allow continued swapout after -ENOSPC failure
> > > >
> > > >  drivers/gpu/drm/ttm/ttm_bo.c       |   1 +
> > > >  drivers/gpu/drm/ttm/ttm_device.c   |  33 +++--
> > > >  drivers/gpu/drm/ttm/ttm_resource.c | 202 +++++++++++++++++++++++------
> > > >  include/drm/ttm/ttm_resource.h     |  91 +++++++++++--
> > > >  4 files changed, 267 insertions(+), 60 deletions(-)
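(Appendix for readers: the bulk-move age counter from patch 3, used to
detect that a bulk sublist moved on the LRU while a walk was paused,
can be sketched like this. The names are illustrative placeholders,
not the series' actual API.)

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the age-counter scheme: each bulk move keeps a u64 that
 * is bumped whenever the whole bulk sublist moves on the LRU. A
 * walker snapshots the counter when it pauses; a mismatch on resume
 * means the sublist moved underneath it and the hitch position must
 * be re-validated. Illustrative only. */

struct bulk_move {
	uint64_t age;
};

struct lru_walk {
	uint64_t bulk_age_seen; /* snapshot taken when the walk paused */
};

/* Called whenever the bulk sublist is moved on the LRU. */
static void bulk_move_bump(struct bulk_move *bulk)
{
	bulk->age++;
}

/* On resume: did the bulk sublist move since we paused? */
static int walk_needs_revalidate(const struct lru_walk *walk,
				 const struct bulk_move *bulk)
{
	return walk->bulk_age_seen != bulk->age;
}
```

The cost is one u64 compare per resume, with no extra locking on the
fast path, which is why the series prefers it over keeping a dma-resv
held across the whole walk.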