Re: Making drm_gpuvm work across gpu devices

Hi Thomas,

On 29.02.24 at 18:12, Thomas Hellström wrote:
Hi, Christian.

On Thu, 2024-02-29 at 10:41 +0100, Christian König wrote:
On 28.02.24 at 20:51, Zeng, Oak wrote:
The mail wasn’t indented/quoted correctly, so I have reformatted it manually.

*From:* Christian König <christian.koenig@xxxxxxx>
*Sent:* Tuesday, February 27, 2024 1:54 AM
*To:* Zeng, Oak <oak.zeng@xxxxxxxxx>; Danilo Krummrich <dakr@xxxxxxxxxx>; Dave Airlie <airlied@xxxxxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>; Felix Kuehling <felix.kuehling@xxxxxxx>; jglisse@xxxxxxxxxx
*Cc:* Welty, Brian <brian.welty@xxxxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx; intel-xe@xxxxxxxxxxxxxxxxxxxxx; Bommu, Krishnaiah <krishnaiah.bommu@xxxxxxxxx>; Ghimiray, Himal Prasad <himal.prasad.ghimiray@xxxxxxxxx>; Thomas.Hellstrom@xxxxxxxxxxxxxxx; Vishwanathapura, Niranjana <niranjana.vishwanathapura@xxxxxxxxx>; Brost, Matthew <matthew.brost@xxxxxxxxx>; Gupta, saurabhg <saurabhg.gupta@xxxxxxxxx>
*Subject:* Re: Making drm_gpuvm work across gpu devices

Hi Oak,

On 23.02.24 at 21:12, Zeng, Oak wrote:

     Hi Christian,

     I'm going back to this old email to ask a question.


Sorry, I totally missed that one.

     Quote from your email:

     “Those ranges can then be used to implement the SVM feature
     required for higher level APIs and not something you need at the
     UAPI or even inside the low level kernel memory management.”

     “SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This
     should not have any influence on the design of the kernel UAPI.”

     There are two categories of SVM:

     1. driver svm allocator: this is implemented in user space, e.g.,
     cudaMallocManaged (cuda) or zeMemAllocShared (L0) or
     clSVMAlloc (openCL). Intel already has gem_create/vm_bind in xekmd,
     and our umd implemented clSVMAlloc and zeMemAllocShared on top of
     gem_create/vm_bind. Range A..B of the process address space is
     mapped into a range C..D of the GPU address space, exactly as you
     said.

     2. system svm allocator: this doesn’t introduce an extra driver API
     for memory allocation. Any valid CPU virtual address can be used
     directly and transparently in a GPU program without any extra
     driver API call. Quote from kernel Documentation/vm/hmm.rst: “Any
     application memory region (private anonymous, shared memory, or
     regular file backed memory) can be used by a device transparently”
     and “to share the address space by duplicating the CPU page table
     in the device page table so the same address points to the same
     physical memory for any valid main memory address in the process
     address space”. With the system svm allocator, we don’t need that
     A..B -> C..D mapping.
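
     (For illustration only, a rough user-space sketch of the difference
     between the two categories. svm_alloc_managed(), umd_gem_create()
     and umd_vm_bind() below are hypothetical helper names standing in
     for the real UMD/xekmd calls, not actual uAPI:)

     #include <stdint.h>
     #include <stdlib.h>

     #define PAGE_SIZE 4096        /* assumed page size for the example */

     struct dev_ctx;                                   /* hypothetical */
     uint32_t umd_gem_create(struct dev_ctx *dev, size_t size);
     void umd_vm_bind(struct dev_ctx *dev, uint32_t bo,
                      void *cpu_va, size_t size);

     /* Category 1: driver svm allocator (clSVMAlloc/zeMemAllocShared).
      * The UMD explicitly creates a BO and binds CPU range A..B to a
      * GPU range C..D through the driver uAPI (gem_create + vm_bind).
      */
     void *svm_alloc_managed(struct dev_ctx *dev, size_t size)
     {
             void *cpu_va = aligned_alloc(PAGE_SIZE, size);    /* A..B */
             uint32_t bo = umd_gem_create(dev, size);
             umd_vm_bind(dev, bo, cpu_va, size);       /* A..B -> C..D */
             return cpu_va;
     }

     /* Category 2: system svm allocator. No extra driver call at all:
      * any valid CPU pointer (malloc, mmap, file backed, ...) is handed
      * straight to the GPU program and the kernel mirrors the CPU page
      * table into the GPU page table.
      */
     void *svm_alloc_system(size_t size)
     {
             return malloc(size);          /* usable on the GPU as-is */
     }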

     It looks like you were talking of 1). Were you?


No, even when you fully mirror the whole address space from a process
into the GPU you still need to enable this somehow with an IOCTL.

And while enabling this you absolutely should specify to which part of
the address space this mirroring applies and where it maps to.
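
(Purely illustrative: a hypothetical payload for such an enable-mirroring
IOCTL; the struct and field names below are made up for this sketch and
are not part of any existing uAPI.)

#include <stdint.h>

/* Hypothetical uAPI sketch: enabling the mirror names the CPU VA range
 * it applies to and the GPU VA it maps to, instead of implicitly
 * mirroring the whole address space.
 */
struct hypothetical_svm_enable {
        uint64_t cpu_va_start;   /* start of the mirrored CPU range (A)    */
        uint64_t cpu_va_end;     /* end of the mirrored CPU range (B)      */
        uint64_t gpu_va_start;   /* where the range maps to on the GPU (C) */
        uint64_t flags;          /* e.g. read-only, fault-on-access, ...   */
};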

*[Zeng, Oak]*

Let’s say we have a hardware platform where both CPU and GPU support a
57-bit virtual address range (just an example; the statement applies to
any address range). How do you decide “which part of the address space
this mirroring applies” to? You have to mirror the whole address space
[0~2^57-1], don’t you? As you designed it, the gigantic
vm_bind/mirroring happens at process initialization time, and at that
time you don’t know which part of the address space will be used for
the gpu program. Remember that for the system allocator, *any* valid
CPU address can be used for a GPU program. If you add an offset to
[0~2^57-1], you get an address outside the 57-bit address range. Is
this a valid concern?

Well, you can perfectly well mirror on demand. You just need something
similar to userfaultfd() for the GPU. This way you don't need to mirror
the full address space, but can rather work with large chunks created
on demand, let's say 1GiB or something like that.
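
(A minimal sketch of what chunk-wise, on-demand mirroring could look
like on the fault path: round the faulting address to a chunk boundary
and only mirror that chunk. The helper name is hypothetical.)

#include <stdint.h>

#define SVM_CHUNK_SIZE   (1ULL << 30)   /* 1 GiB, as suggested above */

/* Hypothetical helper: given a faulting GPU virtual address, compute
 * the 1 GiB aligned chunk that should be mirrored on demand instead
 * of mirroring the whole 57-bit address space up front.
 */
static inline void svm_chunk_for_addr(uint64_t fault_addr,
                                      uint64_t *chunk_start,
                                      uint64_t *chunk_end)
{
        *chunk_start = fault_addr & ~(SVM_CHUNK_SIZE - 1);
        *chunk_end   = *chunk_start + SVM_CHUNK_SIZE;
}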

What we're looking at as the current design is an augmented userptr
(A..B -> C..D mapping) which is internally sparsely populated in
chunks. The KMD manages the population using gpu pagefaults. We
acknowledge that some parts of this mirror will not have a valid CPU
mapping, i.e. no vma, so a gpu pagefault that resolves to such a mirror
address will cause an error. Would you have any concerns / objections
against such an approach? (A condensed sketch of the fault-time
population path follows below.)
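
(The sketch below is modeled on the usage pattern documented in
Documentation/mm/hmm.rst. "chunk" stands for the sparsely populated
piece of the mirror the fault landed in; mirror_populate_chunk() and
the surrounding driver structure are hypothetical, only the hmm/mmu
notifier calls are real kernel API.)

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/sched/mm.h>

static int mirror_populate_chunk(struct mmu_interval_notifier *notifier,
                                 unsigned long start, unsigned long end,
                                 unsigned long *pfns)
{
        struct mm_struct *mm = notifier->mm;
        struct hmm_range range = {
                .notifier      = notifier,
                .start         = start,
                .end           = end,
                .hmm_pfns      = pfns,
                .default_flags = HMM_PFN_REQ_FAULT,
        };
        int ret;

        if (!mmget_not_zero(mm))
                return -EFAULT;

again:
        range.notifier_seq = mmu_interval_read_begin(notifier);
        mmap_read_lock(mm);
        /* Fails with -EFAULT when there is no vma backing the range,
         * which is the "gpu pagefault on an unmapped mirror address
         * causes an error" behaviour described above.
         */
        ret = hmm_range_fault(&range);
        mmap_read_unlock(mm);
        if (ret) {
                if (ret == -EBUSY)
                        goto again;
                goto out;
        }

        /* Here the driver would take its page-table lock, check
         * mmu_interval_read_retry(), and write range.hmm_pfns into the
         * GPU page table for this chunk.
         */
out:
        mmput(mm);
        return ret;
}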

Nope, as far as I can see that sounds like a perfectly valid design to me.

Regards,
Christian.


Thanks,
Thomas