Am 01.03.24 um 14:45 schrieb Thomas Hellström:
On Thu, 2024-02-29 at 18:34 +0100, Thomas Hellström wrote:
Hi, Christian.
Thanks for having a look.
On Thu, 2024-02-29 at 16:08 +0100, Christian König wrote:
Am 16.02.24 um 15:20 schrieb Thomas Hellström:
[SNIP]
My approach was basically to not only lock the current BO, but
also
the
next one. Since only a locked BO can move on the LRU we
effectively
created an anchor.
Before I dig into the code a couple of questions:
These are described in the patches but brief comments inline.
1. How do you distinct BOs and iteration anchors on the LRU?
Using a struct ttm_lru_item, containing a struct list_head and
the
type. List nodes embeds this instead of a struct list_head. This
is
larger than the list head but makes it explicit what we're doing.
Need to look deeper into the code of this, but it would be nice if
we
could abstract that better somehow.
Do you have any specific concerns or improvements in mind? I think
patch 1 and 2 are pretty straigthforward. Patch 3 is indeed a bit
hairy.
Yeah, seen that as well. No idea of hand how to improve.
Amar should have time to give the patches a more in deep review, maybe
he has an idea.
2. How do you detect that a bulk list moved on the LRU?
An age u64 counter on the bulk move that we're comparing against.
It's
updated each time it moves.
3. How do you remove the iteration anchors from the bulk list?
A function call at the end of iteration, that the function
iterating is
requried to call.
Thinking quite a bit about that in the last week and I came to the
conclusion that this might be overkill.
All BOs in a bulk share the same reservation object. So when you
acquired one you can just keep the dma-resv locked even after
evicting
the BO.
Since moving BOs requires locking the dma-resv object the whole
problem
then just boils down to a list_for_each_element_safe().
That's probably a bit simpler than doing the add/remove dance.
I think the problem with the "lock the next object" approach
Stop stop, you misunderstood me. I was not suggesting to use the lock
the next object approach, this anchor approach here is certainly better.
I just wanted to note that we most likely don't need to insert a second
anchor for bulk moves.
Basically my idea is that we start to use the drm_exec object to lock
BOs and those BOs stay locked until everything is completed.
That also removes the problem that a BO might be evicted just to be
moved back in again by a concurrent submission.
is that
there are situations that it might not work. First, where not
asserting
anywhere that all bulk move resource have the same lock,
Daniel actually wanted that I add such an assert, I just couldn't find a
way to easily do this back then.
But since I reworked the bulk move since then it should now be possible.
and after
individualization they certainly don't.
Actually when they are individualized for freeing they shouldn't be part
of any bulk any more.
(I think I had a patch
somewhere to try to enforce that, but I don't think it ever got
reviewed). I tried to sort out the locking rules at one point for
resources switching bos to ghost object but I long since forgot
those.
I guess it all boils down to the list elements being resources, not
bos.
Also I'm concerned about keeping a resv held for a huge number of
evictions will block out a higher priority ticket for a substantial
amount of time.
I think while the suggested solution here might be a bit of an
overkill, it's simple enough to understand, but the locking
implications of resources switching resvs arent.
But please let me know how we should proceed here. This is a blocker
for other pending work we have.
Actually some more issues with the locking approach would be with the
intended use-cases I was planning to use this for.
For example the exhaustive eviction where we regularly unlock the
lru_lock to take the bo lock. If we need to do that for the first
element of a bulk_move list, we can't have the bo lock of the next
element when we unlock the list. For part of the list that is not a
bulk sublist, this also doesn't work AFAICT.
Well when we drop the LRU lock we should always have the anchor on the
LRU before the element we try to lock.
This way we actually don't have to move the anchor unless we found a BO
which we don't want to evict.
E.g. something like
Head -> anchor -> BO1 -> BO2 -> BO3 -> BO4
And we Evict BO1, BO2 and then find that BO3 doesn't match the
allocation pattern we need so only then is the anchor moved after BO3:
Head -> BO3 -> anchor -> BO4....
And when we moved inside a bulk with an anchor we have already locked at
least one BO of the bulk, so locking the next one is a no-op.
And finally for the tt shinking that's been pending for quite some
time, the last comment that made me temporarily shelf is was that we
should expose the lru traversal to the drivers, and the drivers
implement the shrinkers with TTM helpers, rather than having TTM being
the middle layer. So I think exposing the LRU traversal to drivers will
probably end up having pretty weird semantics if it sometimes locks or
requiring locking of the next object while traversing.
Yeah, I was just yesterday talking about that with Amar and putting him
on the task to look into tt shrinking.
And completely agree that providing the necessary toolbox for eviction
is a better approach than burying the eviction deep into the TTM logic.
But regardless of how this is solved, since I think we are agreeing
that the functionality itself is useful and needed, could we perhaps
use this implementation that is easy to verify that it works, and I
will i no way stand in the way if it turns out you come up with
something nicer. I've been thinking a bit of how to make a better
approach out of patch 3, and a possible alternative that I could
prototype would be to register list cursors that traverse a bulk
sublist with the bulk move structure using a list. At destruction of
either list cursors or bulk moves either can unregister, and on bulk
list bumping the list is traversed and the cursor is moved to the end
of the list. Probably the same amount of code but will look nicer.
Yeah, I'm just not sure if this handling here will be so simple with
multiple anchors. That sounds very fragile.
Regards,
Christian.
/Thomas
/Thomas
Regards,
Christian.
/Thomas
Regards,
Christian.
The restartable property is used in patch 4 to restart
swapout
if
needed, but the main purpose is this paves the way for
shrinker- and exhaustive eviction.
Cc: Christian König<christian.koenig@xxxxxxx>
Cc:<dri-devel@xxxxxxxxxxxxxxxxxxxxx>
Thomas Hellström (4):
drm/ttm: Allow TTM LRU list nodes of different types
drm/ttm: Use LRU hitches
drm/ttm: Consider hitch moves within bulk sublist moves
drm/ttm: Allow continued swapout after -ENOSPC falure
drivers/gpu/drm/ttm/ttm_bo.c | 1 +
drivers/gpu/drm/ttm/ttm_device.c | 33 +++--
drivers/gpu/drm/ttm/ttm_resource.c | 202
+++++++++++++++++++++++-
-----
include/drm/ttm/ttm_resource.h | 91 +++++++++++--
4 files changed, 267 insertions(+), 60 deletions(-)