Re: [PATCH 0/4] TTM unlockable restartable LRU list iteration

Christian König <christian.koenig@xxxxxxx> · Fri, 16 Feb 2024 15:00:11 +0100

Am 16.02.24 um 14:13 schrieb Thomas Hellström:
This patch-set is a prerequisite for a standalone TTM shrinker
and for exhaustive TTM eviction using sleeping dma_resv locks,
which is the motivation it.

Currently when unlocking the TTM lru list lock, iteration needs
to be restarted from the beginning, rather from the next LRU list
node. This can potentially be a big problem, because if eviction
or shrinking fails for whatever reason after unlock, restarting
is likely to cause the same failure over and over again.

Oh, yes please. I have been working on that problem before as well, but 
wasn't able to come up with something working.

There are various schemes to be able to continue the list
iteration from where we left off. One such scheme used by the
GEM LRU list traversal is to pull items already considered off
the LRU list and reinsert them when iteration is done.
This has the drawback that concurrent list iteration doesn't see
the complete list (which is bad for exhaustive eviction) and also
doesn't lend itself well to bulk-move sublists since these will
be split in the process where items from those lists are
temporarily pulled from the list and moved to the list tail.

Completely agree that this is not a desirable solution.

The approach taken here is that list iterators insert themselves
into the list next position using a special list node. Iteration
is then using that list node as starting point when restarting.
Concurrent iterators just skip over the special list nodes.

This is implemented in patch 1 and 2.

For bulk move sublist the approach is the same, but when a bulk
move sublist is moved to the tail, the iterator is also moved,
causing us to skip parts of the list. That is undesirable.
Patch 3 deals with that, and when iterator detects it is
traversing a sublist, it inserts a second restarting point just
after the sublist and if the sublist is moved to the tail,
it just uses the second restarting point instead.

This is implemented in patch 3.

Interesting approach, that is probably even better than what I tried.

My approach was basically to not only lock the current BO, but also the 
next one. Since only a locked BO can move on the LRU we effectively 
created an anchor.

Before I dig into the code a couple of questions:
1. How do you distinct BOs and iteration anchors on the LRU?
2. How do you detect that a bulk list moved on the LRU?
3. How do you remove the iteration anchors from the bulk list?

Regards,
Christian.

The restartable property is used in patch 4 to restart swapout if
needed, but the main purpose is this paves the way for
shrinker- and exhaustive eviction.

Cc: Christian König <christian.koenig@xxxxxxx>
Cc: <dri-devel@xxxxxxxxxxxxxxxxxxxxx>

Thomas Hellström (4):
   drm/ttm: Allow TTM LRU list nodes of different types
   drm/ttm: Use LRU hitches
   drm/ttm: Consider hitch moves within bulk sublist moves
   drm/ttm: Allow continued swapout after -ENOSPC falure

  drivers/gpu/drm/ttm/ttm_bo.c       |   1 +
  drivers/gpu/drm/ttm/ttm_device.c   |  33 +++--
  drivers/gpu/drm/ttm/ttm_resource.c | 202 +++++++++++++++++++++++------
  include/drm/ttm/ttm_resource.h     |  91 +++++++++++--
  4 files changed, 267 insertions(+), 60 deletions(-)