Re: [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

Christian König <christian.koenig@xxxxxxx> · Sat, 22 May 2021 10:30:19 +0200

Am 21.05.21 um 20:31 schrieb Daniel Vetter:
[SNIP]
We could provide an IOCTL for the BO to change the flag.
That's not the semantics we need.

But could we first figure out the semantics we want to use here?

Cause I'm pretty sure we don't actually need those changes at all and as
said before I'm certainly NAKing things which break existing use cases.
Please read how other drivers do this and at least _try_ to understand
it. I'm really loosing my patience here with you NAKing patches you're
not even understanding (or did you actually read and fully understand
the entire story I typed up here, and your NAK is on the entire
thing?). There's not much useful conversation to be had with that
approach. And with drivers I mean kernel + userspace here.

Well to be honest I did fully read that, but I was just to emotionally 
attached to answer more appropriately in that moment.

And I'm sorry that I react emotional on that, but it is really 
frustrating that I'm not able to convince you that we have a major 
problem which affects all drivers and not just amdgpu.

Regarding the reason why I'm NAKing this particular patch, you are 
breaking existing uAPI for RADV with that. And as a maintainer of the 
driver I have simply no other choice than saying halt, stop we can't do 
it like this.

I'm perfectly aware that I've some holes in the understanding of how ANV 
or other Vulkan/OpenGL stacks work. But you should probably also admit 
that you have some holes how amdgpu works or otherwise I can't imagine 
why you suggest a patch which simply breaks RADV.

I mean we are working together for years now and I think you know me 
pretty well, do you really think I scream bloody hell we can't do this 
without a good reason?

So let's stop throwing halve backed solutions at each other and discuss 
what we can do to solve the different problems we are both seeing here.

That's the other frustration part: You're trying to fix this purely in
the kernel. This is exactly one of these issues why we require open
source userspace, so that we can fix the issues correctly across the
entire stack. And meanwhile you're steadfastily refusing to even look
at that the userspace side of the picture.

Well I do fully understand the userspace side of the picture for the AMD 
stack. I just don't think we should give userspace that much control 
over the fences in the dma_resv object without untangling them from 
resource management.

And RADV is exercising exclusive sync for amdgpu already. You can do 
submission to both the GFX, Compute and SDMA queues in Vulkan and those 
currently won't over-synchronize.

When you then send a texture generated by multiple engines to the 
Compositor the kernel will correctly inserts waits for all submissions 
of the other process.

So this already works for RADV and completely without the IOCTL Jason 
proposed. IIRC we also have unit tests which exercised that feature for 
the video decoding use case long before RADV even existed.

And yes I have to admit that I haven't thought about interaction with 
other drivers when I came up with this because the rules of that 
interaction wasn't clear to me at that time.

Also I thought through your tlb issue, why are you even putting these
tlb flush fences into the shard dma_resv slots? If you store them
somewhere else in the amdgpu private part, the oversync issues goes
away
- in your ttm bo move callback, you can just make your bo copy job
depend on them too (you have to anyway)
- even for p2p there's not an issue here, because you have the
->move_notify callback, and can then lift the tlb flush fences from
your private place to the shared slots so the exporter can see them.

Because adding a shared fence requires that this shared fence signals 
after the exclusive fence. And this is a perfect example to explain why 
this is so problematic and also why why we currently stumble over that 
only in amdgpu.

In TTM we have a feature which allows evictions to be pipelined and 
don't wait for the evicting DMA operation. Without that driver will 
stall waiting for their allocations to finish when we need to allocate 
memory.

For certain use cases this gives you a ~20% fps increase under memory 
pressure, so it is a really important feature.

This works by adding the fence of the last eviction DMA operation to BOs 
when their backing store is newly allocated. That's what the 
ttm_bo_add_move_fence() function you stumbled over is good for: 
https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692

Now the problem is it is possible that the application is terminated 
before it can complete it's command submission. But since resource 
management only waits for the shared fences when there are some there is 
a chance that we free up memory while it is still in use.

Because of this we have some rather crude workarounds in amdgpu. For 
example IIRC we manual wait for any potential exclusive fence before 
freeing memory.

We could enable this feature for radeon and nouveau as well with an one 
line change. But that would mean we need to maintain the workarounds for 
shortcomings of the dma_resv object design in those drivers as well.

To summarize I think that adding an unbound fence to protect an object 
is a perfectly valid operation for resource management, but this is 
restricted by the needs of implicit sync at the moment.

The kernel move fences otoh are a bit more nasty to wring through the
p2p dma-buf interface. That one probably needs something new.

Well the p2p interface are my least concern.

Adding the move fence means that you need to touch every place we do CS 
or page flip since you now have something which is parallel to the 
explicit sync fence.

Otherwise having the move fence separately wouldn't make much sense in 
the first place if we always set it together with the exclusive fence.

Best regards and sorry for getting on your nerves so much,
Christian.

-Daniel

Regards,
Christian.

-Daniel

Are you bored enough to type this up for radv? I'll give Jason's kernel
stuff another review meanwhile.
-Daniel

                  e->bo_va = amdgpu_vm_bo_find(vm, bo);
          }
--
2.31.0

--
Daniel Vetter
Software Engineer, Intel Corporation
https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.ffwll.ch%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7Cf0852f38c85046ca877908d91c86a719%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637572186953277692%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Vgz%2FkXFH4CD6ktZBnxnXFhHTG5tHhN1%2BDyf7pmxak6c%3D&amp;reserved=0