Re: [Nouveau] [PATCH drm-next 05/14] drm/nouveau: new VM_BIND uapi interfaces

Christian König <christian.koenig@xxxxxxx> · Mon, 6 Feb 2023 10:48:34 +0100

Am 02.02.23 um 19:31 schrieb Danilo Krummrich:
On 2/2/23 12:53, Christian König wrote:
Am 01.02.23 um 09:10 schrieb Dave Airlie:
[SNIP]
For drivers that don't intend to merge at all and (somehow) are
capable of dealing with sparse regions without knowing the sparse
region's boundaries, it'd be easy to make those gpuva_regions 
optional.
Yeah, but this then defeats the approach of having the same hw
independent interface/implementation for all drivers.
I think you are running a few steps ahead here. The plan isn't to have
an independent interface, it's to provide a set of routines and
tracking that will be consistent across drivers, so that all drivers
once using them will operate in mostly the same fashion with respect
to GPU VA tracking and VA/BO lifetimes. Already in the tree we have
amdgpu and freedreno which I think end up operating slightly different
around lifetimes. I'd like to save future driver writers the effort of
dealing with those decisions and this should drive their user api
design so to enable vulkan sparse bindings.

Ok in this case I'm pretty sure this is *NOT* a good idea.

See this means that we define the UAPI implicitly by saying to 
drivers to use a common framework for their VM implementation which 
then results in behavior A,B,C,D....

If a driver strides away from this common framework because it has 
different requirements based on how his hw work you certainly get 
different behavior again (and you have tons of hw specific 
requirements in here).

What we should do instead if we want to have some common handling 
among drivers (which I totally agree on makes sense) then we should 
define the UAPI explicitly.

By asking that I don't want to say I'm against this idea, I'm just 
wondering how it becomes easier to deal with "tons of hw specific 
requirements" by generalizing things even more?

I'm already maintaining two different GPU VM solutions in the GPU 
drivers in the kernel, radeon and amdgpu. The hw they driver is 
identical, just the UAPI is different. And only because of the different 
UAPI they can't have the same VM backend implementation.

The hw stuff is completely abstract able. That's just stuff you need to 
consider when defining the structures you pass around.

But a messed up UAPI is sometimes impossible to fix because of backward 
compatibility.

We learned that the hard way with radeon and mostly fixed it by coming 
up with a completely new implementation for amdgpu.

What makes us think that we do a better job in considering all hw 
specific requirements with a unified UAPI than with a more lightweight 
generic component for tracking VA mappings?

Because this defines the UAPI implicitly and that's seldom a good idea.

As I said before tracking is the easy part of the job. Defining this 
generic component helps a little bit writing new drivers, but it leaves 
way to much room for speculations on the UAPI.

Also, wouldn't we need something like the GPUVA manager as part of a 
unified UAPI?

Not necessarily. We can write components to help drivers implement the 
UAPI, but this isn't mandatory.

For example we could have a DRM_IOCTL_GPU_VM which takes both driver 
independent as well as driver dependent information and then has the 
documented behavior:
a) VAs do (or don't) vanish automatically when the GEM handle is closed.
b) GEM BOs do (or don't) get an additional reference for each VM they 
are used in.
c) Can handle some common use cases driver independent (BO mappings, 
readonly, writeonly, sparse etc...).
d) Has a well defined behavior when the operation is executed async. 
E.g. in/out fences.
e) Can still handle hw specific stuff like (for example) trap on 
access etc....
...

Especially d is what Bas and I have pretty much already created a 
prototype for the amdgpu specific IOCTL for, but essentially this is 
completely driver independent and actually the more complex stuff. 
Compared to that common lifetime of BOs is just nice to have.

I strongly think we should concentrate on getting this right as well.

Now if merging is a feature that makes sense to one driver maybe it
makes sense to all, however there may be reasons amdgpu gets away
without merging that other drivers might not benefit from, there might
also be a benefit to amdgpu from merging that you haven't looked at
yet, so I think we could leave merging as an optional extra driver
knob here. The userspace API should operate the same, it would just be
the gpu pagetables that would end up different sizes.

Yeah, agree completely. The point is that we should not have 
complexity inside the kernel which is not necessarily needed in the 
kernel.

So merging or not is something we have gone back and forth for 
amdgpu, one the one hand it reduces the memory footprint of the 
housekeeping overhead on the other hand it makes the handling more 
complex, error prone and use a few more CPU cycles.

For amdgpu merging is mostly beneficial when you can get rid of a 
whole page tables layer in the hierarchy, but for this you need to 
merge at least 2MiB or 1GiB together. And since that case doesn't 
happen that often we stopped doing it.

But for my understanding why you need the ranges for the merging? 
Isn't it sufficient to check that the mappings have the same type, 
flags, BO, whatever backing them?

Not entirely. Let's assume userspace creates two virtually contiguous 
buffers (VKBuffer) A and B. Userspace could bind a BO with BO offset 0 
to A (binding 1) and afterwards bind the same BO with BO offset 
length(A) to B (binding 2), maybe unlikely but AFAIK not illegal.

If we don't know about the bounds of A and B in the kernel, we detect 
that both bindings are virtually and physically contiguous and we 
merge them.

Well as far as I can see this is actually legal and desirable.

In the best case this was simply useless, because we'll need to split 
them anyway later on when A or B is destroyed, but in the worst case 
we could fault the GPU, e.g. if merging leads to a change of the page 
tables that are backing binding 1, but buffer A is already in use by 
userspace.

WOW wait a second, regions absolutely don't help you with that anyway.

You need to keep track which mappings are used or otherwise any 
modification could lead to problems.

In other words when the GPU already uses A you *must* have a fence on 
the page tables backing A to prevent their destruction.

In Nouveau, I think we could also get rid of regions and do something 
driver specific for the handling of the dual page tables, which I want 
to use for sparse regions *and* just don't merge (at least for now). 
But exactly for the sake of not limiting drivers in their HW specifics 
I thought it'd be great if merging is supported in case it makes sense 
for a specific HW, especially given the fact that memory sizes are 
increasing.

What do you mean with that?

If you want your page tables to be modifiable while the GPU is using 
them (which is basically a standard requirement from sparse bindings in 
Vulkan) you need double housekeeping anyway.

Those regions strongly sound like you are pushing stuff which should be 
handled in userspace inside the kernel.

Regards,
Christian.

Dave.