On 05.10.23 10:36, Thomas Hellström wrote:
On Wed, 2023-10-04 at 14:44 +0200, Christian König wrote:
On 04.10.23 09:17, Thomas Hellström wrote:
On Wed, 2023-10-04 at 03:52 +0000, Zeng, Oak wrote:
Hi Christian,

As a follow-up to this thread:
https://www.spinics.net/lists/dri-devel/msg410740.html, I started the
work of moving the LRU out of ttm_resource_manager and making it a
common library for both TTM and SVM. While looking into the details of
the bulk_move in the TTM resource manager, I found a potential problem:

For simplicity, let's say we only have one memory type and one
priority, so the TTM resource manager only maintains one global LRU
list. Let's say this list has 10 nodes, node1 to node10. But the
lru_bulk_move is per VM. Let's say vm1 has a bulk_move covering the
node range [node4, node7] and vm2 has a bulk_move covering the node
range [node6, node9]. Notice that those two ranges overlap. Since two
VMs can simultaneously add nodes to the LRU, I think this scenario can
happen.
That can't happen. See what ttm_resource_move_to_lru_tail() does when
the BO has a bulk move associated with it.
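Roughly, a simplified sketch of the mechanism (illustrative only, not
the exact kernel code; lookup_bulk_pos() is a hypothetical helper that
finds the VM's section for a resource):

static void move_to_lru_tail_sketch(struct ttm_resource *res)
{
	struct ttm_buffer_object *bo = res->bo;

	if (bo->bulk_move) {
		/* Find this VM's [first, last] section (hypothetical
		 * helper; the real code keeps one position per memory
		 * type and priority). */
		struct ttm_lru_bulk_move_pos *pos =
			lookup_bulk_pos(bo->bulk_move, res);

		/* Stay inside the section: move behind its current
		 * last entry and let the last pointer follow. (The
		 * real code also handles res already being last.) */
		list_move(&res->lru, &pos->last->lru);
		pos->last = res;
	} else {
		struct ttm_resource_manager *man =
			ttm_manager_type(bo->bdev, res->mem_type);

		list_move_tail(&res->lru, &man->lru[bo->priority]);
	}
}

So the resources of a VM always stay contiguous on the LRU, and two
VMs' [first, last] sections can never end up interleaved as described
above.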
Now if we perform a bulk move for vm1, moving [node4, node7] to the
tail of the LRU list, the LRU after this bulk move will be: node1,
node2, node3, node8, node9, node10, node4, node5, node6, node7. Now
notice that for vm2's bulk_move, the first pointer (pointing to node6)
is actually after the last pointer (pointing to node9), which doesn't
make sense.
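For reference, the bulk move itself is essentially one splice of a
VM's [first, last] section to the list tail, along these lines (an
illustrative sketch; struct bulk_section is hypothetical, while the
splice helper list_bulk_move_tail() is the one TTM actually uses):

#include <linux/list.h>

/* Hypothetical per-VM section for the sketch. */
struct bulk_section {
	struct ttm_resource *first;	/* first node of the VM's range */
	struct ttm_resource *last;	/* last node of the VM's range */
};

static void bulk_section_move_tail(struct list_head *lru,
				   struct bulk_section *sec)
{
	/* Splice the whole [first, last] range to the LRU tail in
	 * one constant-time operation. */
	list_bulk_move_tail(lru, &sec->first->lru, &sec->last->lru);
}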
Is this a real problem? As I understand it, with this issue we only
mess up the LRU list order, but there won't be any functional problem.
If it is a real problem, should we make the bulk_move global instead
of per-VM?

Thanks,
Oak
FWIW I have a patch set that converts the TTM bulk move code to using
sublists; a list item is either a resource or a sublist, and when
performing a bulk move, essentially the sublist is moved. Bumping a
resource's LRU position within a VM would touch only the sublist.
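Conceptually something like this (a hypothetical sketch of the idea,
not the actual patch):

/* An LRU entry is either a single resource or a per-VM sublist. */
struct lru_entry {
	struct list_head link;		/* slot on the parent list */
	bool is_sublist;		/* nested per-VM sublist? */
};

struct lru_sublist {
	struct lru_entry entry;		/* one slot on the manager LRU */
	struct list_head resources;	/* this VM's entries, LRU order */
};

/* A bulk move is then a single list_move_tail() of the sublist's
 * slot; the resources inside the sublist are not touched at all. */
static void sublist_bulk_move_tail(struct list_head *man_lru,
				   struct lru_sublist *sub)
{
	list_move_tail(&sub->entry.link, man_lru);
}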
That sounds like my very first attempt at bulk moves, which we
abandoned for various reasons.

That's easily >5 years ago, but the history of that should still be on
the mailing list if I'm not completely mistaken.
This here?
https://lists.freedesktop.org/archives/amd-gfx/2018-August/025016.html
No, in that case it's very different. Or is it an even earlier version?
No, that was even earlier. Basically the first version I discussed with
Chunming.
The issue was simply that when you have a hierarchical LRU you also
need a multi-layer cursor and need to make sure that you have a single
lock for everything.

This multi-layer cursor is complicated to implement and contradicts
the idea that we want to walk the LRU with anchors, dropping locks in
between (not that we ever implemented that, but it would still be nice
to have).

In general, with a hierarchical LRU you just move the complexity from
the insert function to the walk function (see the cursor sketch
below), and I don't think we would win much with that.
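To illustrate with a hypothetical sketch (reusing the lru_entry /
lru_sublist types sketched earlier in the thread; a flat LRU walk
would be a single list_next_entry() instead):

/* The cursor must track a position on two levels at once. */
struct lru_cursor {
	struct lru_entry *outer;	/* position on the manager LRU */
	struct lru_entry *inner;	/* position in a sublist, or NULL */
};

/* Assumes non-empty sublists; returns NULL at the end of the LRU. */
static struct lru_entry *lru_cursor_next(struct list_head *man_lru,
					 struct lru_cursor *cur)
{
	/* Finish the current sublist first. */
	if (cur->inner) {
		struct lru_sublist *sub =
			container_of(cur->outer, struct lru_sublist, entry);

		if (!list_is_last(&cur->inner->link, &sub->resources)) {
			cur->inner = list_next_entry(cur->inner, link);
			return cur->inner;
		}
		cur->inner = NULL;	/* sublist exhausted, go back up */
	}

	/* Advance on the manager LRU itself. */
	if (list_is_last(&cur->outer->link, man_lru))
		return NULL;
	cur->outer = list_next_entry(cur->outer, link);

	/* Descend if the next entry is a sublist. */
	if (cur->outer->is_sublist) {
		struct lru_sublist *sub =
			container_of(cur->outer, struct lru_sublist, entry);

		cur->inner = list_first_entry(&sub->resources,
					      struct lru_entry, link);
		return cur->inner;
	}
	return cur->outer;
}

And that is before making such a cursor safe against entries going
away while locks are dropped, which is where the anchor idea would
come in.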
Regards,
Christian.
/Thomas
Regards,
Christian.
Currently the functionality and TTM API are essentially the same, but
when experimenting with LRU traversal for exhaustive WW-locking
eviction this concept was easier to use. Also, hopefully this would
reduce fragility and improve understanding, since a scenario like the
above could really never happen...

Let me know if I should send it out as an RFC.
Code is here:
https://gitlab.freedesktop.org/drm/xe/kernel/-/merge_requests/351/commits
/Thomas