Re: Making drm_gpuvm work across gpu devices

Hi Oak,

sorry, the mail sounded like you didn't expect a reply.

And yes, the approaches outlined in the mail sound really good to me.

Regards,
Christian.

Am 08.03.24 um 05:43 schrieb Zeng, Oak:
Hello all,

Since I didn't get a reply to this one, I assume the points below are agreed. But feel free to let us know if you don't agree.

Thanks,
Oak

-----Original Message-----
From: dri-devel <dri-devel-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Zeng, Oak
Sent: Thursday, February 29, 2024 1:23 PM
To: Christian König <christian.koenig@xxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>; David Airlie <airlied@xxxxxxxxxx>
Cc: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>; Brost, Matthew <matthew.brost@xxxxxxxxx>; Felix Kuehling <felix.kuehling@xxxxxxx>; Welty, Brian <brian.welty@xxxxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx; Ghimiray, Himal Prasad <himal.prasad.ghimiray@xxxxxxxxx>; Bommu, Krishnaiah <krishnaiah.bommu@xxxxxxxxx>; Gupta, saurabhg <saurabhg.gupta@xxxxxxxxx>; Vishwanathapura, Niranjana <niranjana.vishwanathapura@xxxxxxxxx>; intel-xe@xxxxxxxxxxxxxxxxxxxxx; Danilo Krummrich <dakr@xxxxxxxxxx>; Shah, Ankur N <ankur.n.shah@xxxxxxxxx>; jglisse@xxxxxxxxxx; rcampbell@xxxxxxxxxx; apopple@xxxxxxxxxx
Subject: RE: Making drm_gpuvm work across gpu devices

Hi Christian/Daniel/Dave/Felix/Thomas, and all,

We have been refining our design internally over the past month. Below is our plan. Please let us know if you have any concerns.

1) Remove the pseudo /dev/xe-svm device. All system allocator interfaces will go through the /dev/dri/render devices, not a global interface.

2) Unify the userptr and system allocator code. We will treat userptr as a special case of the system allocator without migration capability. We will introduce the hmmptr concept for the system allocator and extend the vm_bind API to map a range A..B of the process address space to a range C..D of the GPU address space for an hmmptr. For an hmmptr, if a GPU program accesses an address that is not backed by a core mm VMA, it is a fatal error. (A rough uAPI sketch for this follows after this list.)

3) Multiple-device support. We have identified p2p use cases where we might want to leave memory on a foreign device, or direct migrations to a foreign device, and therefore might need a global structure that tracks or caches the migration state per process address space. We haven't completely settled this design; we will come back when we have more details.

4) We will first work on this code in xekmd, then look at moving some common code to the drm layer so it can also be used by other vendors.
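
As a rough illustration of item 2 above, a vm_bind extension for hmmptr could look something like the sketch below. All struct, flag, and field names here are hypothetical and chosen only to make the shape of the interface concrete; this is not the actual xe uAPI.

/* Hypothetical uAPI sketch only; not the real drm_xe_vm_bind interface. */
#include <stdint.h>

#define EXAMPLE_VM_BIND_FLAG_HMMPTR  (1u << 0)  /* system allocator range */
#define EXAMPLE_VM_BIND_FLAG_USERPTR (1u << 1)  /* userptr: hmmptr without migration */

struct example_vm_bind_op {
	uint64_t cpu_addr;  /* start of CPU range A..B (page aligned) */
	uint64_t gpu_addr;  /* start of GPU range C..D (page aligned) */
	uint64_t range;     /* size in bytes of both ranges */
	uint32_t flags;     /* EXAMPLE_VM_BIND_FLAG_* */
	uint32_t pad;       /* keep the struct 64-bit aligned */
};

With a binding like this, a GPU access inside C..D that resolves to a CPU address in A..B with no backing core mm VMA would be treated as a fatal error, as stated in item 2.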

Thomas and I still have open questions for Christian. We will follow up.

Thanks all for this discussion.

Regards,
Oak

-----Original Message-----
From: Christian König <christian.koenig@xxxxxxx>
Sent: Thursday, February 1, 2024 3:52 AM
To: Zeng, Oak <oak.zeng@xxxxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>; David
Airlie <airlied@xxxxxxxxxx>
Cc: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>; Brost, Matthew
<matthew.brost@xxxxxxxxx>; Felix Kuehling <felix.kuehling@xxxxxxx>; Welty,
Brian <brian.welty@xxxxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx; Ghimiray, Himal
Prasad <himal.prasad.ghimiray@xxxxxxxxx>; Bommu, Krishnaiah
<krishnaiah.bommu@xxxxxxxxx>; Gupta, saurabhg <saurabhg.gupta@xxxxxxxxx>;
Vishwanathapura, Niranjana <niranjana.vishwanathapura@xxxxxxxxx>; intel-
xe@xxxxxxxxxxxxxxxxxxxxx; Danilo Krummrich <dakr@xxxxxxxxxx>; Shah, Ankur N
<ankur.n.shah@xxxxxxxxx>; jglisse@xxxxxxxxxx; rcampbell@xxxxxxxxxx;
apopple@xxxxxxxxxx
Subject: Re: Making drm_gpuvm work across gpu devices

Hi Oak,

Am 31.01.24 um 21:17 schrieb Zeng, Oak:
Hi Sima, Dave,

I am well aware the nouveau driver is not what Nvidia does with their customers. The key question is: can we move forward with the concept of a shared virtual address space between CPU and GPU? This is the foundation of HMM. We already have split address space support with other driver APIs. SVM, as its name says, means shared address space. Are we allowed to implement another driver model to make SVM work, alongside the other APIs supporting split address spaces? Those two schemes can co-exist in harmony. We actually have real use cases that use both models in one application.
Hi Christian, Thomas,

In your scheme, GPU VA can != CPU VA. This does introduce some flexibility, but this scheme alone doesn't solve the problem of the proxy process / para-virtualization. You will still need a second mechanism to partition the GPU VA space between guest process1 and guest process2, because the proxy process (or the host hypervisor, whatever you call it) uses one single GPU page table for all the guest/client processes, and the GPU VAs of different guest processes can't overlap. If this second mechanism exists, we can of course use the same mechanism to partition the CPU VA space between guest processes as well; then we can still use shared VA between CPU and GPU inside one process, while process1's and process2's address spaces (for both CPU and GPU) don't overlap. This second mechanism is the key to solving the proxy process problem, not the flexibility you introduced.

That approach was suggested before, but it doesn't work. First of all, you create a massive security hole when you give the GPU full access to the QEMU CPU process which runs the virtualization.

So even if you say CPU VA == GPU VA, you still need some kind of flexibility; otherwise you can't implement this use case securely.

In addition to this, the CPU VAs are usually controlled by the OS and not some driver, so to make sure that host and guest VAs don't overlap you would need to add some kind of synchronization between the guest and host OS kernels.

In practice, your scheme also has the risk of running out of process address space because you have to partition the whole address space between processes. Apparently, allowing each guest process to own the whole address space and using separate GPU/CPU page tables for different processes is a better solution than using a single page table and partitioning the address space between processes.

Yeah, running out of address space is certainly possible. But as I said, CPUs are switching to 5-level page tables, and if you look at, for example, a "cat maps | cut -c-4 | sort -u" of a process you will find that only a handful of 4GiB segments are actually used, and thanks to recoverable page faults you can map those between host and client on demand. This gives you at least enough address space to handle a couple of thousand clients.

For Intel GPUs, para-virtualization (XenGT, see https://github.com/intel/XenGT-Preview-kernel; it is a similar idea to the proxy process in Felix's email, both being SW-based GPU virtualization technologies) is an old project. It has since been replaced with HW-accelerated SRIOV/system virtualization, and XenGT was abandoned a long time ago. So agreed, your scheme adds some flexibility. The question is, do we have a valid use case for such flexibility? I don't see a single one ATM.

Yeah, we have SRIOV functionality on AMD hw as well, but for some use cases it's just too inflexible.

I also looked into how to implement your scheme. You basically rejected the very foundation of the HMM design, which is a shared address space between CPU and GPU. In your scheme, GPU VA = CPU VA + offset. In every single place where the driver needs to call HMM facilities such as hmm_range_fault or migrate_vma_setup, and in the mmu notifier callback, you need to offset the GPU VA to get a CPU VA. From the application writer's perspective, whenever they want to use a CPU pointer in their GPU program, they have to add that offset. Do you think this is awkward?
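
To make that concrete, here is a minimal sketch of what such a call site could look like under that scheme, assuming a hypothetical driver structure with an already-registered mmu_interval_notifier and a va_offset field; the structure and function names are invented, and the usual mm refcounting and PFN consumption are omitted:

/* Hypothetical sketch; example_gpuvm and va_offset are invented names. */
#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

struct example_gpuvm {
	struct mmu_interval_notifier notifier;	/* registered elsewhere */
	u64 va_offset;				/* GPU VA = CPU VA + va_offset */
};

static int example_populate_range(struct example_gpuvm *gpuvm,
				  u64 gpu_start, u64 gpu_end,
				  unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier	= &gpuvm->notifier,
		/* Translate GPU VAs back to CPU VAs before calling HMM. */
		.start		= gpu_start - gpuvm->va_offset,
		.end		= gpu_end - gpuvm->va_offset,
		.hmm_pfns	= pfns,
		.default_flags	= HMM_PFN_REQ_FAULT,
	};
	struct mm_struct *mm = gpuvm->notifier.mm;
	int ret;

retry:
	range.notifier_seq = mmu_interval_read_begin(&gpuvm->notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret == -EBUSY)
		goto retry;
	/* A real driver would also check mmu_interval_read_retry() under
	 * its page-table lock before consuming pfns. */
	return ret;
}

The same subtraction would be needed around migrate_vma_setup(), and the invalidation range in the mmu notifier callback would need to be translated back into GPU VAs.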

What? This flexibility is there precisely to prevent the application writer from having to deal with any offset.

Finally, to implement SVM, we need some memory hint API that applies to a virtual address range across all GPU devices. For example, the user would say: for this virtual address range, I prefer the backing store memory to be on GPU deviceX (because the user knows deviceX will use this address range much more than other GPU devices or the CPU). It doesn't make sense to me to make such an API per-device. For example, if you tell device A that the preferred memory location is device B's memory, that doesn't sound correct to me, because in your scheme device A is not even aware of the existence of device B, right?

Correct, and while the additional flexibility is somewhat optional, I strongly think that not having a centralized approach for device driver settings is mandatory.

Going away from the well-defined, file-descriptor-based handling of device driver interfaces was one of the worst ideas I've ever seen in roughly thirty years of working with Unix-like operating systems. It basically broke everything, from reverse lookup handling for mmap() to file system privileges for hardware access.

As far as I can see, anything which goes in the direction of opening /dev/kfd or /dev/xe_svm or something similar and saying that this then results in implicit SVM for your render nodes is an absolute no-go and would require an explicit acknowledgement from Linus on the design to do something like that.

What you can do is have an IOCTL on the render node file descriptor which says this device should do SVM with the current CPU address space, and another IOCTL which says range A..B is preferred to migrate to this device for HMM when the device runs into a page fault.
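
For what it's worth, the shape of those two IOCTLs could be roughly as sketched below; every struct name and ioctl number here is made up purely for illustration, this is not an existing uAPI.

/* Hypothetical uAPI sketch; names and ioctl numbers are invented. */
#include <linux/ioctl.h>
#include <stdint.h>

/* "This device should do SVM with the current CPU address space." */
struct example_svm_enable {
	uint32_t flags;		/* must be 0 for now */
	uint32_t pad;
};

/* "CPU VA range [start, start + length) is preferred to migrate to this
 * device when the device runs into a page fault on it." */
struct example_svm_set_preferred {
	uint64_t start;		/* CPU VA, page aligned */
	uint64_t length;	/* bytes, page aligned */
	uint32_t flags;
	uint32_t pad;
};

#define EXAMPLE_IOCTL_SVM_ENABLE \
	_IOW('E', 0x00, struct example_svm_enable)
#define EXAMPLE_IOCTL_SVM_SET_PREFERRED \
	_IOW('E', 0x01, struct example_svm_set_preferred)

Both would be issued on the already-open render node file descriptor, which keeps everything inside the file-descriptor-based privilege model described above.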

And yes, that obviously means shitty performance for device drivers, because pages will play ping-pong if userspace gives contradicting information for migrations, but that is how it is supposed to be.

Everything else which works across the borders of a device driver's scope should be implemented as a system call, with the relevant review process around it.

Regards,
Christian.

Regards,
Oak
-----Original Message-----
From: Daniel Vetter <daniel@xxxxxxxx>
Sent: Wednesday, January 31, 2024 4:15 AM
To: David Airlie <airlied@xxxxxxxxxx>
Cc: Zeng, Oak <oak.zeng@xxxxxxxxx>; Christian König
<christian.koenig@xxxxxxx>; Thomas Hellström
<thomas.hellstrom@xxxxxxxxxxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>; Brost,
Matthew <matthew.brost@xxxxxxxxx>; Felix Kuehling
<felix.kuehling@xxxxxxx>; Welty, Brian <brian.welty@xxxxxxxxx>; dri-
devel@xxxxxxxxxxxxxxxxxxxxx; Ghimiray, Himal Prasad
<himal.prasad.ghimiray@xxxxxxxxx>; Bommu, Krishnaiah
<krishnaiah.bommu@xxxxxxxxx>; Gupta, saurabhg
<saurabhg.gupta@xxxxxxxxx>;
Vishwanathapura, Niranjana <niranjana.vishwanathapura@xxxxxxxxx>; intel-
xe@xxxxxxxxxxxxxxxxxxxxx; Danilo Krummrich <dakr@xxxxxxxxxx>; Shah,
Ankur N
<ankur.n.shah@xxxxxxxxx>; jglisse@xxxxxxxxxx; rcampbell@xxxxxxxxxx;
apopple@xxxxxxxxxx
Subject: Re: Making drm_gpuvm work across gpu devices

On Wed, Jan 31, 2024 at 09:12:39AM +1000, David Airlie wrote:
On Wed, Jan 31, 2024 at 8:29 AM Zeng, Oak <oak.zeng@xxxxxxxxx> wrote:
Hi Christian,



The Nvidia Nouveau driver uses exactly the same concept of SVM with HMM: the GPU address in the same process is exactly the same as the CPU virtual address. It is already in the upstream Linux kernel. We at Intel just follow the same direction for our customers. Why are we not allowed to?
Oak, this isn't how upstream works; you don't get to appeal to customers or internal design. nouveau isn't "NVIDIA's", and it certainly isn't something NVIDIA would ever suggest for their customers. We also likely wouldn't just accept NVIDIA's current solution upstream without some serious discussion. The implementation in nouveau was more of a sample HMM use case than a serious implementation. I suspect if we do go down the road of making nouveau an actual compute driver for SVM etc. then it would have to change severely.
Yeah, on the nouveau hmm code specifically, my gut-feeling impression is that we didn't really make friends with that among the core kernel maintainers. It's a bit too much of a tech demo just to be able to merge the hmm core APIs for nvidia's out-of-tree driver.

Also, a few years of learning and experience gaining happened in the meantime - you always have to look at an API design in the context of when it was designed, and that context changes all the time.

Cheers, Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch



