> -----Original Message-----
> From: Daniel Vetter <daniel@xxxxxxxx>
> Sent: Thursday, January 25, 2024 1:33 PM
> To: Christian König <christian.koenig@xxxxxxx>
> Cc: Zeng, Oak <oak.zeng@xxxxxxxxx>; Danilo Krummrich <dakr@xxxxxxxxxx>;
> Dave Airlie <airlied@xxxxxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>; Felix
> Kuehling <felix.kuehling@xxxxxxx>; Welty, Brian <brian.welty@xxxxxxxxx>; dri-
> devel@xxxxxxxxxxxxxxxxxxxxx; intel-xe@xxxxxxxxxxxxxxxxxxxxx; Bommu, Krishnaiah
> <krishnaiah.bommu@xxxxxxxxx>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@xxxxxxxxx>; Thomas.Hellstrom@xxxxxxxxxxxxxxx;
> Vishwanathapura, Niranjana <niranjana.vishwanathapura@xxxxxxxxx>; Brost,
> Matthew <matthew.brost@xxxxxxxxx>; Gupta, saurabhg
> <saurabhg.gupta@xxxxxxxxx>
> Subject: Re: Making drm_gpuvm work across gpu devices
>
> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
> > Am 23.01.24 um 20:37 schrieb Zeng, Oak:
> > > [SNIP]
> > > Yes most API are per device based.
> > >
> > > One exception I know is actually the kfd SVM API. If you look at the svm_ioctl
> function, it is per-process based. Each kfd_process represent a process across N
> gpu devices.
> >
> > Yeah and that was a big mistake in my opinion. We should really not do that
> > ever again.
> >
> > > Need to say, kfd SVM represent a shared virtual address space across CPU
> and all GPU devices on the system. This is by the definition of SVM (shared virtual
> memory). This is very different from our legacy gpu *device* driver which works
> for only one device (i.e., if you want one device to access another device's
> memory, you will have to use dma-buf export/import etc).
> >
> > Exactly that thinking is what we have currently found as blocker for a
> > virtualization projects. Having SVM as device independent feature which
> > somehow ties to the process address space turned out to be an extremely bad
> > idea.
> >
> > The background is that this only works for some use cases but not all of
> > them.
> >
> > What's working much better is to just have a mirror functionality which says
> > that a range A..B of the process address space is mapped into a range C..D
> > of the GPU address space.
> >
> > Those ranges can then be used to implement the SVM feature required for
> > higher level APIs and not something you need at the UAPI or even inside the
> > low level kernel memory management.
> >
> > When you talk about migrating memory to a device you also do this on a per
> > device basis and *not* tied to the process address space. If you then get
> > crappy performance because userspace gave contradicting information where
> to
> > migrate memory then that's a bug in userspace and not something the kernel
> > should try to prevent somehow.
> >
> > [SNIP]
> > > > I think if you start using the same drm_gpuvm for multiple devices you
> > > > will sooner or later start to run into the same mess we have seen with
> > > > KFD, where we moved more and more functionality from the KFD to the
> DRM
> > > > render node because we found that a lot of the stuff simply doesn't work
> > > > correctly with a single object to maintain the state.
> > > As I understand it, KFD is designed to work across devices. A single pseudo
> /dev/kfd device represent all hardware gpu devices. That is why during kfd open,
> many pdd (process device data) is created, each for one hardware device for this
> process.
> >
> > Yes, I'm perfectly aware of that. And I can only repeat myself that I see
> > this design as a rather extreme failure. And I think it's one of the reasons
> > why NVidia is so dominant with Cuda.
> >
> > This whole approach KFD takes was designed with the idea of extending the
> > CPU process into the GPUs, but this idea only works for a few use cases and
> > is not something we should apply to drivers in general.
> >
> > A very good example are virtualization use cases where you end up with CPU
> > address != GPU address because the VAs are actually coming from the guest
> VM
> > and not the host process.
> >
> > SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have
> > any influence on the design of the kernel UAPI.
> >
> > If you want to do something similar as KFD for Xe I think you need to get
> > explicit permission to do this from Dave and Daniel and maybe even Linus.
>
> I think the one and only one exception where an SVM uapi like in kfd makes
> sense, is if the _hardware_ itself, not the software stack defined
> semantics that you've happened to build on top of that hw, enforces a 1:1
> mapping with the cpu process address space.
>
> Which means your hardware is using PASID, IOMMU based translation, PCI-ATS
> (address translation services) or whatever your hw calls it and has _no_
> device-side pagetables on top. Which from what I've seen all devices with
> device-memory have, simply because they need some place to store whether
> that memory is currently in device memory or should be translated using
> PASID. Currently there's no gpu that works with PASID only, but there are
> some on-cpu-die accelerator things that do work like that.
>
> Maybe in the future there will be some accelerators that are fully cpu
> cache coherent (including atomics) with something like CXL, and the
> on-device memory is managed as normal system memory with struct page as
> ZONE_DEVICE and accelerator va -> physical address translation is only
> done with PASID ... but for now I haven't seen that, definitely not in
> upstream drivers.
>
> And the moment you have some per-device pagetables or per-device memory
> management of some sort (like using gpuva mgr) then I'm 100% agreeing with
> Christian that the kfd SVM model is too strict and not a great idea.
>

A GPU is nothing more than a piece of hardware that accelerates part of a program, just like an extra CPU core. From this perspective, a unified virtual address space across the CPU and all GPU devices (and any other accelerators) is always more convenient to program against than split address spaces between devices. In reality, GPU programming started with split address spaces. HMM is designed to provide a unified virtual address space without the advanced hardware features you listed above.

I am aware that Nvidia's newer hardware platforms, such as Grace Hopper, natively support the Unified Memory programming model through hardware-based memory coherence among all CPUs and GPUs; for such systems HMM is not required. You can think of HMM as a software-based solution for providing a unified address space between the CPU and devices. Both AMD and Nvidia have been providing a unified address space through HMM, and I think it is still valuable. A rough sketch of the hmm_range_fault() mirroring pattern I have in mind follows below, after the quoted signature.

Regards,
Oak

> Cheers, Sima
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
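
To make the software-based mirroring concrete, here is a rough sketch (not a tested implementation) of the usual hmm_range_fault() plus mmu_interval_notifier pattern a driver could use to mirror a CPU VA range into its own per-device GPU page tables. The gpu_vm type and the gpu_map_pages()/gpu_invalidate_range() helpers are hypothetical placeholders for driver-specific code; only the HMM and mmu-notifier calls are real kernel APIs, and the driver-side locking around the page table update is elided.

/*
 * Sketch only: mirror a CPU VA range into per-device GPU page tables
 * with hmm_range_fault() + an mmu_interval_notifier.  gpu_vm,
 * gpu_map_pages() and gpu_invalidate_range() are hypothetical
 * placeholders for driver-specific code.
 */
#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/slab.h>

struct gpu_vm;                          /* hypothetical per-device VM */
int gpu_map_pages(struct gpu_vm *vm, unsigned long va,
                  unsigned long *hmm_pfns, unsigned long npages);
void gpu_invalidate_range(struct gpu_vm *vm, unsigned long start,
                          unsigned long end);

struct gpu_svm_range {
        struct mmu_interval_notifier notifier;
        struct gpu_vm *vm;
};

/* Core mm calls this whenever the mirrored CPU range changes. */
static bool gpu_svm_invalidate(struct mmu_interval_notifier *mni,
                               const struct mmu_notifier_range *range,
                               unsigned long cur_seq)
{
        struct gpu_svm_range *svmr =
                container_of(mni, struct gpu_svm_range, notifier);

        mmu_interval_set_seq(mni, cur_seq);
        gpu_invalidate_range(svmr->vm, range->start, range->end);
        return true;
}

static const struct mmu_interval_notifier_ops gpu_svm_ops = {
        .invalidate = gpu_svm_invalidate,
};

/* Register the CPU range we want to mirror for this device. */
static int gpu_svm_range_init(struct gpu_svm_range *svmr, struct gpu_vm *vm,
                              unsigned long start, unsigned long length)
{
        svmr->vm = vm;
        return mmu_interval_notifier_insert(&svmr->notifier, current->mm,
                                            start, length, &gpu_svm_ops);
}

/* Fault the CPU range in and mirror it into the GPU page tables. */
static int gpu_svm_populate(struct gpu_svm_range *svmr,
                            unsigned long start, unsigned long end)
{
        unsigned long npages = (end - start) >> PAGE_SHIFT;
        unsigned long *pfns;
        struct hmm_range range = {
                .notifier = &svmr->notifier,
                .start = start,
                .end = end,
                .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
                .dev_private_owner = svmr->vm,
        };
        int ret;

        pfns = kvcalloc(npages, sizeof(*pfns), GFP_KERNEL);
        if (!pfns)
                return -ENOMEM;
        range.hmm_pfns = pfns;

again:
        range.notifier_seq = mmu_interval_read_begin(&svmr->notifier);
        mmap_read_lock(current->mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(current->mm);
        if (ret) {
                if (ret == -EBUSY)
                        goto again;
                goto out;
        }
        /*
         * A real driver holds the same lock here that the invalidate
         * callback takes, so the seqcount check and the page table
         * update happen atomically w.r.t. CPU-side invalidations.
         */
        if (mmu_interval_read_retry(&svmr->notifier, range.notifier_seq))
                goto again;
        ret = gpu_map_pages(svmr->vm, start, pfns, npages);
out:
        kvfree(pfns);
        return ret;
}

The point of the sketch is the property Christian describes above: the mirroring is per device (per gpu_vm), driven by that device's own page tables and invalidation callbacks, rather than by a single process-wide SVM object shared by all devices.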