> -----Original Message-----
> From: Daniel Vetter <daniel@xxxxxxxx>
> Sent: Thursday, January 25, 2024 1:33 PM
> To: Christian König <christian.koenig@xxxxxxx>
> Cc: Zeng, Oak <oak.zeng@xxxxxxxxx>; Danilo Krummrich <dakr@xxxxxxxxxx>;
> Dave Airlie <airlied@xxxxxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>; Felix
> Kuehling <felix.kuehling@xxxxxxx>; Welty, Brian <brian.welty@xxxxxxxxx>; dri-
> devel@xxxxxxxxxxxxxxxxxxxxx; intel-xe@xxxxxxxxxxxxxxxxxxxxx; Bommu, Krishnaiah
> <krishnaiah.bommu@xxxxxxxxx>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@xxxxxxxxx>; Thomas.Hellstrom@xxxxxxxxxxxxxxx;
> Vishwanathapura, Niranjana <niranjana.vishwanathapura@xxxxxxxxx>; Brost,
> Matthew <matthew.brost@xxxxxxxxx>; Gupta, saurabhg
> <saurabhg.gupta@xxxxxxxxx>
> Subject: Re: Making drm_gpuvm work across gpu devices
>
> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
> > Am 23.01.24 um 20:37 schrieb Zeng, Oak:
> > > [SNIP]
> > > Yes most API are per device based.
> > >
> > > One exception I know is actually the kfd SVM API. If you look at the svm_ioctl
> function, it is per-process based. Each kfd_process represent a process across N
> gpu devices.
> >
> > Yeah and that was a big mistake in my opinion. We should really not do that
> > ever again.
> >
> > > Need to say, kfd SVM represent a shared virtual address space across CPU
> and all GPU devices on the system. This is by the definition of SVM (shared virtual
> memory). This is very different from our legacy gpu *device* driver which works
> for only one device (i.e., if you want one device to access another device's
> memory, you will have to use dma-buf export/import etc).
> >
> > Exactly that thinking is what we have currently found as blocker for a
> > virtualization projects. Having SVM as device independent feature which
> > somehow ties to the process address space turned out to be an extremely bad
> > idea.
> >
> > The background is that this only works for some use cases but not all of
> > them.
> >
> > What's working much better is to just have a mirror functionality which says
> > that a range A..B of the process address space is mapped into a range C..D
> > of the GPU address space.
> >
> > Those ranges can then be used to implement the SVM feature required for
> > higher level APIs and not something you need at the UAPI or even inside the
> > low level kernel memory management.
> >
> > When you talk about migrating memory to a device you also do this on a per
> > device basis and *not* tied to the process address space. If you then get
> > crappy performance because userspace gave contradicting information where
> to
> > migrate memory then that's a bug in userspace and not something the kernel
> > should try to prevent somehow.
> >
> > [SNIP]
> > > > I think if you start using the same drm_gpuvm for multiple devices you
> > > > will sooner or later start to run into the same mess we have seen with
> > > > KFD, where we moved more and more functionality from the KFD to the
> DRM
> > > > render node because we found that a lot of the stuff simply doesn't work
> > > > correctly with a single object to maintain the state.
> > > As I understand it, KFD is designed to work across devices. A single pseudo
> /dev/kfd device represent all hardware gpu devices. That is why during kfd open,
> many pdd (process device data) is created, each for one hardware device for this
> process.
> >
> > Yes, I'm perfectly aware of that. And I can only repeat myself that I see
> > this design as a rather extreme failure. And I think it's one of the reasons
> > why NVidia is so dominant with Cuda.
> >
> > This whole approach KFD takes was designed with the idea of extending the
> > CPU process into the GPUs, but this idea only works for a few use cases and
> > is not something we should apply to drivers in general.
> >
> > A very good example are virtualization use cases where you end up with CPU
> > address != GPU address because the VAs are actually coming from the guest
> VM
> > and not the host process.
> >
> > SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have
> > any influence on the design of the kernel UAPI.
> >
> > If you want to do something similar as KFD for Xe I think you need to get
> > explicit permission to do this from Dave and Daniel and maybe even Linus.
>
> I think the one and only one exception where an SVM uapi like in kfd makes
> sense, is if the _hardware_ itself, not the software stack defined
> semantics that you've happened to build on top of that hw, enforces a 1:1
> mapping with the cpu process address space.
>
> Which means your hardware is using PASID, IOMMU based translation, PCI-ATS
> (address translation services) or whatever your hw calls it and has _no_
> device-side pagetables on top. Which from what I've seen all devices with
> device-memory have, simply because they need some place to store whether
> that memory is currently in device memory or should be translated using
> PASID. Currently there's no gpu that works with PASID only, but there are
> some on-cpu-die accelerator things that do work like that.
>
> Maybe in the future there will be some accelerators that are fully cpu
> cache coherent (including atomics) with something like CXL, and the
> on-device memory is managed as normal system memory with struct page as
> ZONE_DEVICE and accelerator va -> physical address translation is only
> done with PASID ... but for now I haven't seen that, definitely not in
> upstream drivers.
>
> And the moment you have some per-device pagetables or per-device memory
> management of some sort (like using gpuva mgr) then I'm 100% agreeing with
> Christian that the kfd SVM model is too strict and not a great idea.
>

A GPU is nothing more than a piece of hardware that accelerates part of a program, just like an extra CPU core. From this perspective, a unified virtual address space across the CPU and all GPU devices (and any other accelerators) is always more convenient to program against than split address spaces between devices. In reality, GPU programming started with split address spaces. HMM is designed to provide a unified virtual address space without the advanced hardware features you listed above.

I am aware that Nvidia's newer hardware platforms, such as Grace Hopper, natively support the Unified Memory programming model through hardware-based memory coherence among all CPUs and GPUs; for such systems HMM is not required. You can think of HMM as a software-based solution for providing a unified address space between the CPU and devices. Both AMD and Nvidia have been providing a unified address space through HMM, and I think it is still valuable. A rough sketch of the hmm_range_fault() mirroring pattern I have in mind follows below, after the quoted signature.

Regards,
Oak

> Cheers, Sima
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
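
To make the software-based mirroring concrete, here is a rough sketch (not a tested implementation) of the usual hmm_range_fault() plus mmu_interval_notifier pattern a driver could use to mirror a CPU VA range into its own per-device GPU page tables. The gpu_vm type and the gpu_map_pages()/gpu_invalidate_range() helpers are hypothetical placeholders for driver-specific code; only the HMM and mmu-notifier calls are real kernel APIs, and the driver-side locking around the page table update is elided.

/*
 * Sketch only: mirror a CPU VA range into per-device GPU page tables
 * with hmm_range_fault() + an mmu_interval_notifier.  gpu_vm,
 * gpu_map_pages() and gpu_invalidate_range() are hypothetical
 * placeholders for driver-specific code.
 */
#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/slab.h>

struct gpu_vm;                          /* hypothetical per-device VM */
int gpu_map_pages(struct gpu_vm *vm, unsigned long va,
                  unsigned long *hmm_pfns, unsigned long npages);
void gpu_invalidate_range(struct gpu_vm *vm, unsigned long start,
                          unsigned long end);

struct gpu_svm_range {
        struct mmu_interval_notifier notifier;
        struct gpu_vm *vm;
};

/* Core mm calls this whenever the mirrored CPU range changes. */
static bool gpu_svm_invalidate(struct mmu_interval_notifier *mni,
                               const struct mmu_notifier_range *range,
                               unsigned long cur_seq)
{
        struct gpu_svm_range *svmr =
                container_of(mni, struct gpu_svm_range, notifier);

        mmu_interval_set_seq(mni, cur_seq);
        gpu_invalidate_range(svmr->vm, range->start, range->end);
        return true;
}

static const struct mmu_interval_notifier_ops gpu_svm_ops = {
        .invalidate = gpu_svm_invalidate,
};

/* Register the CPU range we want to mirror for this device. */
static int gpu_svm_range_init(struct gpu_svm_range *svmr, struct gpu_vm *vm,
                              unsigned long start, unsigned long length)
{
        svmr->vm = vm;
        return mmu_interval_notifier_insert(&svmr->notifier, current->mm,
                                            start, length, &gpu_svm_ops);
}

/* Fault the CPU range in and mirror it into the GPU page tables. */
static int gpu_svm_populate(struct gpu_svm_range *svmr,
                            unsigned long start, unsigned long end)
{
        unsigned long npages = (end - start) >> PAGE_SHIFT;
        unsigned long *pfns;
        struct hmm_range range = {
                .notifier = &svmr->notifier,
                .start = start,
                .end = end,
                .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
                .dev_private_owner = svmr->vm,
        };
        int ret;

        pfns = kvcalloc(npages, sizeof(*pfns), GFP_KERNEL);
        if (!pfns)
                return -ENOMEM;
        range.hmm_pfns = pfns;

again:
        range.notifier_seq = mmu_interval_read_begin(&svmr->notifier);
        mmap_read_lock(current->mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(current->mm);
        if (ret) {
                if (ret == -EBUSY)
                        goto again;
                goto out;
        }
        /*
         * A real driver holds the same lock here that the invalidate
         * callback takes, so the seqcount check and the page table
         * update happen atomically w.r.t. CPU-side invalidations.
         */
        if (mmu_interval_read_retry(&svmr->notifier, range.notifier_seq))
                goto again;
        ret = gpu_map_pages(svmr->vm, start, pfns, npages);
out:
        kvfree(pfns);
        return ret;
}

The point of the sketch is the property Christian describes above: the mirroring is per device (per gpu_vm), driven by that device's own page tables and invalidation callbacks, rather than by a single process-wide SVM object shared by all devices.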