Hi, Christian.

On Thu, 2024-02-29 at 10:41 +0100, Christian König wrote:
> Am 28.02.24 um 20:51 schrieb Zeng, Oak:
> >
> > The mail wasn't indented/prefaced correctly. Manually formatted it.
> >
> > *From:* Christian König <christian.koenig@xxxxxxx>
> > *Sent:* Tuesday, February 27, 2024 1:54 AM
> > *To:* Zeng, Oak <oak.zeng@xxxxxxxxx>; Danilo Krummrich <dakr@xxxxxxxxxx>;
> > Dave Airlie <airlied@xxxxxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>;
> > Felix Kuehling <felix.kuehling@xxxxxxx>; jglisse@xxxxxxxxxx
> > *Cc:* Welty, Brian <brian.welty@xxxxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx;
> > intel-xe@xxxxxxxxxxxxxxxxxxxxx; Bommu, Krishnaiah <krishnaiah.bommu@xxxxxxxxx>;
> > Ghimiray, Himal Prasad <himal.prasad.ghimiray@xxxxxxxxx>;
> > Thomas.Hellstrom@xxxxxxxxxxxxxxx; Vishwanathapura, Niranjana
> > <niranjana.vishwanathapura@xxxxxxxxx>; Brost, Matthew <matthew.brost@xxxxxxxxx>;
> > Gupta, saurabhg <saurabhg.gupta@xxxxxxxxx>
> > *Subject:* Re: Making drm_gpuvm work across gpu devices
> >
> > Hi Oak,
> >
> > Am 23.02.24 um 21:12 schrieb Zeng, Oak:
> >
> >     Hi Christian,
> >
> >     I go back to this old email to ask a question.
> >
> > sorry, totally missed that one.
> >
> >     Quote from your email:
> >
> >     "Those ranges can then be used to implement the SVM feature required
> >     for higher level APIs and not something you need at the UAPI or even
> >     inside the low level kernel memory management."
> >
> >     "SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should
> >     not have any influence on the design of the kernel UAPI."
> >
> >     There are two categories of SVM:
> >
> >     1. Driver SVM allocator: this is implemented in user space, e.g.
> >     cudaMallocManaged (CUDA), zeMemAllocShared (L0) or clSVMAlloc (OpenCL).
> >     Intel already has gem_create/vm_bind in xekmd, and our UMD implemented
> >     clSVMAlloc and zeMemAllocShared on top of gem_create/vm_bind. Range
> >     A..B of the process address space is mapped into a range C..D of the
> >     GPU address space, exactly as you said.
> >
> >     2. System SVM allocator: this doesn't introduce an extra driver API
> >     for memory allocation. Any valid CPU virtual address can be used
> >     directly and transparently in a GPU program without any extra driver
> >     API call. Quote from kernel Documentation/vm/hmm.rst: "Any application
> >     memory region (private anonymous, shared memory, or regular file
> >     backed memory) can be used by a device transparently" and "to share
> >     the address space by duplicating the CPU page table in the device
> >     page table so the same address points to the same physical memory for
> >     any valid main memory address in the process address space". In the
> >     system SVM allocator, we don't need that A..B -> C..D mapping.
> >
> >     It looks like you were talking of 1). Were you?
> >
> > No, even when you fully mirror the whole address space from a process
> > into the GPU you still need to enable this somehow with an IOCTL.
> >
> > And while enabling this you absolutely should specify to which part of
> > the address space this mirroring applies and where it maps to.
> >
> > [Zeng, Oak]
> >
> > Let's say we have a hardware platform where both CPU and GPU support a
> > 57-bit virtual address range (used here as an example; the statement
> > applies to any address range). How do you decide "which part of the
> > address space this mirroring applies" to? You have to mirror the whole
> > address space [0~2^57-1], don't you? As you designed it, the gigantic
> > vm_bind/mirroring happens at process initialization time, and at that
> > time you don't know which part of the address space will be used for
> > the GPU program. Remember that for the system allocator, *any* valid
> > CPU address can be used for a GPU program. If you add an offset to
> > [0~2^57-1], you get an address outside the 57-bit address range. Is
> > this a valid concern?
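(A quick aside before your reply below, to make the distinction concrete:
from the application's point of view the two categories differ roughly as
sketched here. This is illustration only; svm_alloc_shared() stands in for
zeMemAllocShared()/cudaMallocManaged()/clSVMAlloc(), and launch_on_gpu()
is a made-up placeholder for submitting a GPU kernel that dereferences the
pointer. Neither is a real API.)

#include <stdlib.h>
#include <string.h>

extern void *svm_alloc_shared(size_t size);          /* hypothetical UMD call */
extern void launch_on_gpu(void *ptr, size_t size);   /* hypothetical submit   */

static void category1_driver_svm(size_t size)
{
	/*
	 * 1) Driver SVM allocator: an explicit allocation call, so the UMD
	 * knows the CPU range A..B up front and can back it with
	 * gem_create + vm_bind, i.e. an explicit A..B -> C..D GPU mapping.
	 */
	void *p = svm_alloc_shared(size);
	memset(p, 0, size);              /* usable on the CPU ... */
	launch_on_gpu(p, size);          /* ... and on the GPU    */
}

static void category2_system_svm(size_t size)
{
	/*
	 * 2) System SVM allocator: no driver allocation call at all. Any
	 * valid CPU pointer (malloc, stack, mmap'ed file, ...) may be handed
	 * to the GPU, so the ranges a GPU program will touch are not known
	 * up front, which is exactly the concern quoted above.
	 */
	void *q = malloc(size);
	memset(q, 0, size);
	launch_on_gpu(q, size);
}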
> Well you can perfectly mirror on demand. You just need something similar
> to userfaultfd() for the GPU. This way you don't need to mirror the full
> address space, but can rather work with large chunks created on demand,
> let's say 1GiB or something like that.

What we're looking at as the current design is an augmented userptr
(an A..B -> C..D mapping) which is internally sparsely populated in
chunks. The KMD manages the population using GPU pagefaults. We
acknowledge that some parts of this mirror will not have a valid CPU
mapping; that is, there is no vma, so a GPU pagefault that resolves to
such a mirror address will cause an error.

Would you have any concerns / objections against such an approach?

Thanks,
Thomas
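PS: For clarity, below is a rough sketch of the fault-driven chunk
population described above. The struct, helper names and the chunk size
are made up for illustration (the population step would use something like
hmm_range_fault() underneath, which fails with -EFAULT when no vma backs
the address); only the overall flow is what we have in mind, not actual
code.

#include <linux/types.h>
#include <linux/sizes.h>
#include <linux/align.h>

#define MIRROR_CHUNK_SIZE	SZ_2M	/* population granularity, example value only */

struct svm_mirror {
	u64 cpu_start;	/* A..B: start of the mirrored CPU VA range */
	u64 gpu_start;	/* C..D: where the range lands in the GPU VM */
	u64 size;
	/* sparse per-chunk population state would live here */
};

/* Made-up helpers: CPU page collection (hmm_range_fault()-style) and
 * GPU page-table updates. */
int svm_mirror_populate_chunk(struct svm_mirror *m, u64 cpu_addr, u64 size);
int svm_mirror_map_chunk_gpu(struct svm_mirror *m, u64 offset, u64 size);

/* Called from the GPU pagefault handler with the faulting GPU VA. */
static int svm_mirror_handle_fault(struct svm_mirror *m, u64 gpu_addr)
{
	u64 offset = ALIGN_DOWN(gpu_addr - m->gpu_start, MIRROR_CHUNK_SIZE);
	u64 cpu_addr = m->cpu_start + offset;
	int err;

	/*
	 * Fault in and collect the CPU pages backing this chunk. If there
	 * is no vma for cpu_addr this fails, and the GPU fault is reported
	 * as an error rather than populating anything.
	 */
	err = svm_mirror_populate_chunk(m, cpu_addr, MIRROR_CHUNK_SIZE);
	if (err)
		return err;

	/* Write the GPU page-table entries for the chunk and resume. */
	return svm_mirror_map_chunk_gpu(m, offset, MIRROR_CHUNK_SIZE);
}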