Re: [PATCH v4] drm/ttm: Clarify that the TTM_PL_SYSTEM is under TTMs control

Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx> · Sat, 13 Nov 2021 12:26:01 +0100

Hi, Zack,

On 11/11/21 17:44, Zack Rusin wrote:
On Wed, 2021-11-10 at 09:50 -0500, Zack Rusin wrote:
TTM takes full control over TTM_PL_SYSTEM placed buffers. This makes
driver internal usage of TTM_PL_SYSTEM prone to errors because it
requires the drivers to manually handle all interactions between TTM
which can swap out those buffers whenever it thinks it's the right
thing to do and driver.

CPU buffers which need to be fenced and shared with accelerators
should
be placed in driver specific placements that can explicitly handle
CPU/accelerator buffer fencing.
Currently, apart, from things silently failing nothing is enforcing
that requirement which means that it's easy for drivers and new
developers to get this wrong. To avoid the confusion we can document
this requirement and clarify the solution.

This came up during a discussion on dri-devel:
https://lore.kernel.org/dri-devel/232f45e9-8748-1243-09bf-56763e6668b3@xxxxxxx

I took a slightly deeper look into this. I think we need to formalize 
this a bit more to understand pros and cons and what the restrictions 
are really all about. Anybody looking at the prevous discussion will 
mostly see arguments similar to "this is stupid and difficult" and "it 
has always been this way" which are not really constructive.

First disregarding all accounting stuff, I think this all boils down to 
TTM_PL_SYSTEM having three distinct states:
1) POPULATED
2) LIMBO (Or whatever we want to call it. No pages present)
3) SWAPPED.

The ttm_bo_move_memcpy() helper understands these, and any standalone 
driver implementation of the move() callback _currently_ needs to 
understand these as well, unless using the ttm_bo_move_memcpy() helper.

Now using a bounce domain to proxy SYSTEM means that the driver can 
forget about the SWAPPED state, it's automatically handled by the move 
setup code. However, another pitfall is LIMBO, in that if when you move 
from SYSTEM/LIMBO to your bounce domain, the BO will be populated. So 
any naive accelerated move() implementation creating a 1GB BO in fixed 
memory, like VRAM, will needlessly allocate and free 1GB of system 
memory in the process instead of just performing a clear operation. 
Looks like amdgpu suffers from this?

I think what is really needed is either

a) A TTM helper that helps move callback implementations resolve the 
issues populating system from LIMBO or SWAP, and then also formalize 
driver notification for swapping. At a minimum, I think the 
swap_notify() callback needs to be able to return a late error.

b) Make LIMBO and SWAPPED distinct memory regions. (I think I'd vote for 
this without looking into it in detail).

In both these cases, we should really make SYSTEM bindable by GPU, 
otherwise we'd just be trading one pitfall for another related without 
really resolving the root problem.

As for fencing not being supported by SYSTEM, I'm not sure why we don't 
want this, because it would for example prohibit async 
ttm_move_memcpy(), and also, async unbinding of ttm_tt memory like MOB 
on vmgfx. (I think it's still sync).

There might be an accounting issue related to this as well, but I guess 
Christian would need to chime in on this. If so, I think it needs to be 
well understood and documented (in TTM, not in AMD drivers).

Thanks,

/Thomas

Polite and gentle ping on that one. Are we ok with the wording here?

z