On Mon, 2021-11-22 at 15:15 +0100, Christian König wrote:
> On 16.11.21 at 16:53, Zack Rusin wrote:
> > > On Nov 16, 2021, at 03:20, Christian König
> > > <christian.koenig@xxxxxxx> wrote:
> > >
> > > On 16.11.21 at 08:43, Thomas Hellström wrote:
> > > > On 11/16/21 08:19, Christian König wrote:
> > > > > On 13.11.21 at 12:26, Thomas Hellström wrote:
> > > > > > Hi, Zack,
> > > > > >
> > > > > > On 11/11/21 17:44, Zack Rusin wrote:
> > > > > > > On Wed, 2021-11-10 at 09:50 -0500, Zack Rusin wrote:
> > > > > > > > TTM takes full control over TTM_PL_SYSTEM placed
> > > > > > > > buffers. This makes driver-internal usage of
> > > > > > > > TTM_PL_SYSTEM prone to errors, because it requires the
> > > > > > > > drivers to manually handle all interactions between the
> > > > > > > > driver and TTM, which can swap out those buffers
> > > > > > > > whenever it thinks it's the right thing to do.
> > > > > > > >
> > > > > > > > CPU buffers which need to be fenced and shared with
> > > > > > > > accelerators should be placed in driver-specific
> > > > > > > > placements that can explicitly handle CPU/accelerator
> > > > > > > > buffer fencing. Currently, apart from things silently
> > > > > > > > failing, nothing is enforcing that requirement, which
> > > > > > > > means that it's easy for drivers and new developers to
> > > > > > > > get this wrong. To avoid the confusion we can document
> > > > > > > > this requirement and clarify the solution.
> > > > > > > >
> > > > > > > > This came up during a discussion on dri-devel:
> > > > > > > > https://lore.kernel.org/dri-devel/232f45e9-8748-1243-09bf-56763e6668b3@amd.com
> > > > > >
> > > > > > I took a slightly deeper look into this. I think we need
> > > > > > to formalize this a bit more to understand pros and cons
> > > > > > and what the restrictions are really all about. Anybody
> > > > > > looking at the previous discussion will mostly see
> > > > > > arguments similar to "this is stupid and difficult" and
> > > > > > "it has always been this way", which are not really
> > > > > > constructive.
> > > > > >
> > > > > > First, disregarding all the accounting stuff, I think this
> > > > > > all boils down to TTM_PL_SYSTEM having three distinct
> > > > > > states:
> > > > > > 1) POPULATED
> > > > > > 2) LIMBO (or whatever we want to call it; no pages present)
> > > > > > 3) SWAPPED
> > > > > >
> > > > > > The ttm_bo_move_memcpy() helper understands these, and any
> > > > > > standalone driver implementation of the move() callback
> > > > > > _currently_ needs to understand these as well, unless it
> > > > > > uses the ttm_bo_move_memcpy() helper.
> > > > > >
> > > > > > Now, using a bounce domain to proxy SYSTEM means that the
> > > > > > driver can forget about the SWAPPED state; it's
> > > > > > automatically handled by the move setup code. However,
> > > > > > another pitfall is LIMBO, in that when you move from
> > > > > > SYSTEM/LIMBO to your bounce domain, the BO will be
> > > > > > populated. So any naive accelerated move() implementation
> > > > > > creating a 1GB BO in fixed memory, like VRAM, will
> > > > > > needlessly allocate and free 1GB of system memory in the
> > > > > > process instead of just performing a clear operation.
> > > > > > Looks like amdgpu suffers from this?
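(Aside, to make the three states above concrete: in terms of the
~v5.15 TTM structures they can be told apart roughly as below. This
is only a sketch; the sys_state() helper is made up, and the exact
flag/field and header names are from memory, so double-check them
against ttm_tt.h before relying on any of it.)

	#include <drm/ttm/ttm_bo_api.h>
	#include <drm/ttm/ttm_tt.h>

	enum sys_state { SYS_LIMBO, SYS_POPULATED, SYS_SWAPPED };

	/* Made-up helper: classify a SYSTEM BO by inspecting its
	 * ttm_tt, the way a driver's move() callback has to today. */
	static enum sys_state sys_state(struct ttm_buffer_object *bo)
	{
		struct ttm_tt *tt = bo->ttm;

		if (tt && ttm_tt_is_populated(tt))
			return SYS_POPULATED;	/* pages are present */
		if (tt && tt->swap_storage)
			return SYS_SWAPPED;	/* pages written out to shmem */
		return SYS_LIMBO;		/* no pages present at all */
	}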
> > > > > > I think what is really needed is either
> > > > > >
> > > > > > a) A TTM helper that helps move() callback implementations
> > > > > > resolve the issues of populating system memory from LIMBO
> > > > > > or SWAP, and then also formalize driver notification for
> > > > > > swapping. At a minimum, I think the swap_notify() callback
> > > > > > needs to be able to return a late error.
> > > > > >
> > > > > > b) Make LIMBO and SWAPPED distinct memory regions. (I
> > > > > > think I'd vote for this without looking into it in
> > > > > > detail.)
> > > > > >
> > > > > > In both these cases, we should really make SYSTEM bindable
> > > > > > by the GPU, otherwise we'd just be trading one pitfall for
> > > > > > another related one without really resolving the root
> > > > > > problem.
> > > > > >
> > > > > > As for fencing not being supported by SYSTEM, I'm not sure
> > > > > > why we wouldn't want this, because the restriction would
> > > > > > for example prohibit an async ttm_move_memcpy(), and also
> > > > > > async unbinding of ttm_tt memory like MOB on vmwgfx. (I
> > > > > > think it's still sync.)
> > > > > >
> > > > > > There might be an accounting issue related to this as
> > > > > > well, but I guess Christian would need to chime in on
> > > > > > that. If so, I think it needs to be well understood and
> > > > > > documented (in TTM, not in AMD drivers).
> > > > >
> > > > > I think the problem goes deeper than what has been mentioned
> > > > > here so far.
> > > > >
> > > > > Having fences attached to BOs in the system domain is
> > > > > probably ok, but the key point is that the BOs in the system
> > > > > domain are under TTM's control and should not be touched by
> > > > > the driver.
> > > > >
> > > > > What we have now is that TTM's internals, like the
> > > > > allocation state of BOs in system memory (the populated,
> > > > > limbo and swapped states you mentioned above), are leaking
> > > > > into the drivers, and I think exactly that is the part which
> > > > > doesn't work reliably here. You can of course get that
> > > > > working, but that requires knowledge of the internal state,
> > > > > which in my eyes was always illegal.
> > > >
> > > > Well, I tend to agree to some extent, but then, like said
> > > > above, even disregarding swap we will get trouble with the
> > > > limbo state, because the driver's move() callback would need
> > > > knowledge of that state to implement moves limbo -> vram
> > > > efficiently.
> > > Well, my long-term plan is to audit the code base once more and
> > > remove the limbo state from the SYSTEM domain.
> > >
> > > E.g. instead of a SYSTEM BO without pages you allocate a BO
> > > without a resource in general, which is now possible since
> > > bo->resource is a pointer.
> > >
> > > This would still allow us to allocate "empty shell" BOs, but a
> > > validation of those BOs wouldn't cause a move; it would rather
> > > just allocate the resource for the first time.
> > >
> > > The problem so far was just that we access bo->resource way too
> > > often without checking it.
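(Again as an aside: if I read the "empty shell" idea right,
validation would end up shaped roughly like the sketch below.
ttm_bo_validate_shell() is hypothetical, not existing TTM code; only
ttm_bo_mem_space() and ttm_bo_validate() are real entry points, with
their ~v5.15 signatures assumed.)

	#include <drm/ttm/ttm_bo_api.h>
	#include <drm/ttm/ttm_bo_driver.h>

	/* Hypothetical: validate a BO that may be an "empty shell",
	 * i.e. bo->resource == NULL until its first validation. */
	static int ttm_bo_validate_shell(struct ttm_buffer_object *bo,
					 struct ttm_placement *placement,
					 struct ttm_operation_ctx *ctx)
	{
		if (!bo->resource)
			/* First validation: nothing to move yet, just
			 * allocate the resource. */
			return ttm_bo_mem_space(bo, placement,
						&bo->resource, ctx);

		return ttm_bo_validate(bo, placement, ctx);
	}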
> > So all in all this would be a two-step process: 1) eliminate the
> > “limbo” state, 2) split PL_SYSTEM into PL_SYSTEM, which drivers
> > could use for really anything, and PL_SWAP, which would be under
> > complete TTM control, yes? If that’s the case, that would
> > certainly make my life a lot easier (because the drivers wouldn’t
> > need to introduce/manage their own system placements) and I think
> > it would make the code a lot easier to understand.
>
> We also have a couple of other use cases for this, yes.
>
> > That’s a bit of work though, so the question of what happens
> > until this lands comes to mind. Is the introduction of
> > VMW_PL_SYSTEM and documenting that PL_SYSTEM is under complete
> > TTM control (like this patch does) the way to go, or do we want
> > to start working on the above immediately? Because I’d love to be
> > able to unload vmwgfx without kernel oopsing =)
>
> I think documenting and getting into a clean state should be a
> prerequisite for new development (even if the fix for the unload is
> trivial).

That sounds great (both points). Could you double-check the wording
and ack it (ideally both the docs and the VMW_PL_SYSTEM patch) if you
think it's ready to go in?

z
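P.S. For anyone skimming the thread: the VMW_PL_SYSTEM patch
referenced above boils down to adding a driver-private placement next
to the existing GMR/MOB ones, roughly as below. The values mirror the
posted patch as far as I recall, so treat them as illustrative rather
than authoritative.

	#include <drm/ttm/ttm_placement.h>

	/* vmwgfx_drv.h: driver-private TTM memory types. */
	#define VMW_PL_GMR	(TTM_PL_PRIV + 0)
	#define VMW_PL_MOB	(TTM_PL_PRIV + 1)
	/* Driver-controlled system memory: fenceable, bindable by the
	 * device, and never swapped out behind the driver's back,
	 * unlike TTM_PL_SYSTEM. */
	#define VMW_PL_SYSTEM	(TTM_PL_PRIV + 2)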