Danilo,

Maybe before I give up, I should also ask: currently drm_gpuvm is designed for a BO-centric world. Would it be easy to make the va range split/merge work on a plain va range, without a BO? Conceptually this should work, since we are merging/splitting virtual address ranges, which can be decoupled completely from BOs.
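To make the question concrete, below is roughly what I am imagining -- purely a sketch: the _va_ variant of the ops-create function does not exist today and its name is made up by me. The point is just that the split/merge math only needs addr/range, and the resulting ops could be walked exactly the way BO-backed ops are walked today:

/* Hypothetical BO-less variant of drm_gpuvm_sm_map_ops_create() --
 * nothing like this exists today, the name is made up: */
struct drm_gpuva_ops *
drm_gpuvm_sm_map_va_ops_create(struct drm_gpuvm *gpuvm, u64 addr, u64 range);

/* An SVM fault handler could then consume the ops the usual way: */
static int svm_map_range(struct drm_gpuvm *gpuvm, u64 addr, u64 range)
{
	struct drm_gpuva_ops *ops;
	struct drm_gpuva_op *op;

	ops = drm_gpuvm_sm_map_va_ops_create(gpuvm, addr, range);
	if (IS_ERR(ops))
		return PTR_ERR(ops);

	drm_gpuva_for_each_op(op, ops) {
		switch (op->op) {
		case DRM_GPUVA_OP_MAP:		/* brand new va range */
		case DRM_GPUVA_OP_REMAP:	/* split of an existing range */
		case DRM_GPUVA_OP_UNMAP:	/* old range to tear down */
			/* update each device's page tables here */
			break;
		default:
			break;
		}
	}

	drm_gpuva_ops_free(gpuvm, ops);
	return 0;
}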
> -----Original Message-----
> From: dri-devel <dri-devel-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Zeng, Oak
> Sent: Tuesday, January 23, 2024 10:57 PM
> To: Danilo Krummrich <dakr@xxxxxxxxxx>; Christian König <christian.koenig@xxxxxxx>; Dave Airlie <airlied@xxxxxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>; Felix Kuehling <felix.kuehling@xxxxxxx>; Welty, Brian <brian.welty@xxxxxxxxx>
> Cc: Brost, Matthew <matthew.brost@xxxxxxxxx>; Thomas.Hellstrom@xxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; Ghimiray, Himal Prasad <himal.prasad.ghimiray@xxxxxxxxx>; Gupta, saurabhg <saurabhg.gupta@xxxxxxxxx>; Bommu, Krishnaiah <krishnaiah.bommu@xxxxxxxxx>; Vishwanathapura, Niranjana <niranjana.vishwanathapura@xxxxxxxxx>; intel-xe@xxxxxxxxxxxxxxxxxxxxx
> Subject: RE: Making drm_gpuvm work across gpu devices
>
> Thanks a lot Danilo.
>
> Maybe I wasn't clear enough. In the solution I proposed, each device still has separate vm/page tables. Each device still needs to manage the mappings, page table flags, etc. It is just that in the svm use case, all devices share one drm_gpuvm instance. As I understand it, drm_gpuvm's main function is va range splitting and merging. I don't see why that wouldn't work across gpu devices.
>
> But I read more about drm_gpuvm. Its split/merge functions take a drm_gem_object parameter; see drm_gpuvm_sm_map_ops_create and drm_gpuvm_sm_map. Actually, the whole of drm_gpuvm is designed for BO-centric drivers; for example, it has a drm_gpuvm_bo concept to keep track of the 1 BO : N gpuva mapping. The whole purpose of leveraging drm_gpuvm is to re-use the va split/merge functions for SVM, but in our SVM implementation there is no buffer object at all. So I don't think our SVM code can leverage drm_gpuvm.
>
> I will give up this approach, unless Matt or Brian can see a way.
>
> A few replies inline.... @Welty, Brian, I had more thoughts inline on one of your original questions....
>
> > -----Original Message-----
> > From: Danilo Krummrich <dakr@xxxxxxxxxx>
> > Sent: Tuesday, January 23, 2024 6:57 PM
> > To: Zeng, Oak <oak.zeng@xxxxxxxxx>; Christian König <christian.koenig@xxxxxxx>; Dave Airlie <airlied@xxxxxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>; Felix Kuehling <felix.kuehling@xxxxxxx>
> > Cc: Welty, Brian <brian.welty@xxxxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx; intel-xe@xxxxxxxxxxxxxxxxxxxxx; Bommu, Krishnaiah <krishnaiah.bommu@xxxxxxxxx>; Ghimiray, Himal Prasad <himal.prasad.ghimiray@xxxxxxxxx>; Thomas.Hellstrom@xxxxxxxxxxxxxxx; Vishwanathapura, Niranjana <niranjana.vishwanathapura@xxxxxxxxx>; Brost, Matthew <matthew.brost@xxxxxxxxx>; Gupta, saurabhg <saurabhg.gupta@xxxxxxxxx>
> > Subject: Re: Making drm_gpuvm work across gpu devices
> >
> > Hi Oak,
> >
> > On 1/23/24 20:37, Zeng, Oak wrote:
> > > Thanks Christian. I have some comments inline below.
> > >
> > > Danilo, can you also take a look and give your feedback? Thanks.
> >
> > I agree with everything Christian already wrote. Except for the KFD parts, which I'm simply not familiar with, I had exactly the same thoughts after reading your initial mail.
> >
> > Please find some more comments below.
> >
> > >> -----Original Message-----
> > >> From: Christian König <christian.koenig@xxxxxxx>
> > >> Sent: Tuesday, January 23, 2024 6:13 AM
> > >> To: Zeng, Oak <oak.zeng@xxxxxxxxx>; Danilo Krummrich <dakr@xxxxxxxxxx>; Dave Airlie <airlied@xxxxxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>
> > >> Cc: Welty, Brian <brian.welty@xxxxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx; intel-xe@xxxxxxxxxxxxxxxxxxxxx; Bommu, Krishnaiah <krishnaiah.bommu@xxxxxxxxx>; Ghimiray, Himal Prasad <himal.prasad.ghimiray@xxxxxxxxx>; Thomas.Hellstrom@xxxxxxxxxxxxxxx; Vishwanathapura, Niranjana <niranjana.vishwanathapura@xxxxxxxxx>; Brost, Matthew <matthew.brost@xxxxxxxxx>
> > >> Subject: Re: Making drm_gpuvm work across gpu devices
> > >>
> > >> Hi Oak,
> > >>
> > >> On 23.01.24 04:21, Zeng, Oak wrote:
> > >>> Hi Danilo and all,
> > >>>
> > >>> During the work on Intel's SVM code, we came up with the idea of making drm_gpuvm work across multiple gpu devices. See some discussion here: https://lore.kernel.org/dri-devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd11.prod.outlook.com/
> > >>>
> > >>> The reason we are trying to do this is that for an SVM (shared virtual memory across the cpu program and all gpu programs on all gpu devices) process, the address space has to span all gpu devices. So if we make drm_gpuvm work across devices, then our SVM code can leverage drm_gpuvm as well.
> > >>>
> > >>> At first look it seems feasible, because drm_gpuvm doesn't really use the drm_device *drm pointer a lot. This param is used only for printing/warning, so I think maybe we can delete this drm field from drm_gpuvm.
> > >>>
> > >>> This way, on a multi-gpu system, for one process we can have only one drm_gpuvm instance, instead of multiple drm_gpuvm instances (one for each gpu device).
> > >>>
> > >>> What do you think?
> > >>
> > >> Well, from the GPUVM side I don't think it would make much difference whether we have the drm device or not.
> > >>
> > >> But given the experience we had with KFD, I think I should mention that we should absolutely *not* deal with multiple devices at the same time in the UAPI or VM objects inside the driver.
> > >>
> > >> The background is that all the APIs inside the Linux kernel are built around the idea that they work with only one device at a time. This applies to both low-level APIs like the DMA API as well as pretty high-level things like, for example, file system address spaces etc...
> > >
> > > Yes, most APIs are per-device based.
> > >
> > > One exception I know of is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represents a process across N gpu devices. Cc Felix.
> > >
> > > Needless to say, kfd SVM represents a shared virtual address space across the CPU and all GPU devices on the system. This follows from the definition of SVM (shared virtual memory). It is very different from our legacy gpu *device* drivers, which work for only one device (i.e., if you want one device to access another device's memory, you have to use dma-buf export/import etc).
> > >
> > > We have the same design requirement for SVM. For anyone who wants to implement the SVM concept, this is a hard requirement. Since drm now has the drm_gpuvm concept, which strictly speaking is designed for one device, I want to see whether we can extend drm_gpuvm to make it work for both a single device (as used in xe) and multiple devices (as will be used in the SVM code). That is why I brought up this topic.
> > >
> > >> So when you have multiple GPUs you either have an inseparable cluster of them, in which case you would also only have one drm_device. Or you have separate drm_devices, which also results in separate drm render nodes, separate virtual address spaces and eventually also separate IOMMU domains, which gives you separate dma_addresses for the same page and so separate GPUVM page tables....
> > >
> > > I am thinking we can still let each device have its separate drm_device/render node/iommu domains/gpu page table, just as we have today. I am not planning to change this picture.
> > >
> > > But the virtual address space will support two modes of operation:
> > > 1. One drm_gpuvm per device. This is when svm is not in the picture.
> > > 2. All devices in the process share one single drm_gpuvm, when svm is in the picture. In the xe driver design, we have to support a mixed use of legacy mode (such as gem_create and vm_bind) and svm (such as malloc'ed memory for gpu submission). So whenever SVM is in the picture, we want one single process address space across all devices. Drm_gpuvm doesn't need to be aware of these two operation modes; it is the driver's responsibility to use the right mode.
> > >
> > > For example, in mode #1 a driver's vm structure (such as xe_vm) can inherit from drm_gpuvm. In mode #2, a driver's svm structure (xe_svm in this series: https://lore.kernel.org/dri-devel/20240117221223.18540-1-oak.zeng@xxxxxxxxx/) can inherit from drm_gpuvm, while each xe_vm (still a per-device struct) will just have a pointer to the drm_gpuvm structure. This way, when svm is in play, we build a 1 process : 1 mm_struct : 1 xe_svm : N xe_vm correlation, which means a shared address space across gpu devices.
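> > > In struct form, the sketch I have in mind looks roughly like the below (simplified, field names made up by me; see the series above for the real code):
> > >
> > > /* mode #2 only: one instance per process, 1:1 with the mm_struct */
> > > struct xe_svm {
> > > 	struct drm_gpuvm gpuvm;   /* the single process-wide va tree */
> > > 	struct mm_struct *mm;
> > > 	struct list_head vm_list; /* the N per-device xe_vm's */
> > > };
> > >
> > > /* per-device state stays per-device, as today */
> > > struct xe_vm {
> > > 	struct xe_device *xe;     /* the device this vm belongs to */
> > > 	struct drm_gpuvm *gpuvm;  /* mode #2: points at xe_svm.gpuvm;
> > > 	                           * mode #1 embeds its own instead */
> > > 	/* page tables, pte flags etc. remain here, per device */
> > > 	struct list_head svm_link;
> > > };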
> >
> > With a shared GPUVM structure, how do you track actual per-device resources such as page tables? You also need to consider that the page table layout and memory mapping flags may vary from device to device due to different GPU chipsets or revisions.
>
> The per-device page tables, flags, etc. are still managed per device; that is the xe_vm in the xekmd driver.
>
> > Also, if you replace the shared GPUVM structure with a pointer to a shared one, you may run into all kinds of difficulties due to increasing complexity in terms of locking, synchronization, lifetime and potential unwind operations in error paths. I haven't thought it through yet, but I wouldn't be surprised entirely if there are cases where you simply run into circular dependencies.
>
> Makes sense; I can't see through this without proof-of-concept code either.
>
> > Also, looking at the conversation in the linked patch series:
> >
> > <snip>
> >
> > >> For example, as hmm_range_fault brings a range from host into GPU address space, what if it was already allocated and in use by VM_BIND for a GEM_CREATE allocated buffer? That is of course an application error, but KMD needs to detect it, and provide one single managed address space across all allocations from the application....
> > >
> > > This is a very good question. Yes, agreed, we should check for this application error. Fortunately, this is doable. All vm_bind virtual address ranges are tracked in the xe_vm/drm_gpuvm struct. In this case, we should iterate the drm_gpuvm rb trees of *all* gpu devices (as an xe_vm is for one device only) to see whether there is a conflict. Will make the change soon.
> >
> > <snip>
> >
> > How do you do that if xe_vm->gpuvm is just a pointer to the GPUVM structure within xe_svm?
>
> In the proposed approach, we have a single drm_gpuvm instance for one process. All devices' xe_vm's point to this drm_gpuvm instance. This drm_gpuvm's rb tree maintains all the va ranges we have in this process, so we can just walk this rb tree to see if there is a conflict.
>
> But I didn't answer Brian's question completely... In a mixed use of vm_bind and malloc/mmap, the virtual addresses used by vm_bind should first be reserved in user space using mmap. So all valid virtual addresses should be tracked by a linux kernel vma_struct.
>
> Both vm_bind'ed and malloc'ed virtual addresses can cause a gpu page fault. Our fault handler should first see whether this is a vm_bind va and, if so, service the fault accordingly; if not, serve the fault through the SVM path; if the SVM path also fails, it is an invalid address. So from the user's perspective, one can use either:
>
> 	ptr = mmap()
> 	vm_bind(ptr, bo)
> 	submit gpu kernel using ptr
>
> or:
>
> 	ptr = mmap()
> 	submit gpu kernel using ptr
>
> Whether vm_bind was called or not decides the gpu fault handler code path. Hopefully this answers @Welty, Brian's original question.
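> In kernel code, the dispatch would look something like the sketch below (every xe_* function name here is made up for illustration; drm_gpuva_find_first() is the existing gpuvm lookup helper):
>
> static int xe_gpu_fault_handler(struct xe_svm *svm, struct xe_device *xe,
> 				u64 addr)
> {
> 	struct drm_gpuva *va;
>
> 	/* 1) a vm_bind'ed va? The shared gpuvm rb tree tracks them all. */
> 	va = drm_gpuva_find_first(&svm->gpuvm, addr, 1);
> 	if (va && va->gem.obj)
> 		return xe_fault_service_bo(xe, va, addr);
>
> 	/* 2) otherwise try the SVM path (hmm_range_fault, migration etc.) */
> 	if (!xe_fault_service_svm(svm, xe, addr))
> 		return 0;
>
> 	/* 3) neither worked: invalid address, i.e. application error */
> 	return -EFAULT;
> }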
> > >
> > > This requires some changes to the drm_gpuvm design:
> > > 1. The drm_device *drm pointer: in mode #2 operation this can be NULL, meaning this drm_gpuvm is not for a specific gpu device.
> > > 2. The common dma_resv object: drm_gem_object *r_obj. *Does one dma_resv object allocated/initialized for one device work for all devices*? At first look, dma_resv is just a CPU structure maintaining dma-fences, so I guess it should work for all devices? I definitely need you to comment.
> >
> > The general rule is that drivers can share the common dma_resv across GEM objects that are only mapped within the VM owning the dma_resv, but never within another VM.
> >
> > Now, your question is whether multiple VMs can share the same common dma_resv. I think that calls for trouble, since it would create dependencies that simply aren't needed and might even introduce locking issues.
> >
> > However, that's optional; you can simply decide to not make use of the common dma_resv and all the optimizations based on it.
>
> Ok, got it.
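> So for mode #2, creation could conceivably end up looking like the below -- again just a sketch (argument order from memory of the current API, which today expects a valid drm_device and resv object, so both NULLs assume changes #1/#2 above):
>
> 	/* one process-wide gpuvm, not tied to any single device */
> 	drm_gpuvm_init(&svm->gpuvm, "xe-svm",
> 		       0,        /* no flags, e.g. no DRM_GPUVM_RESV_PROTECTED */
> 		       NULL,     /* no drm_device, change #1 above */
> 		       NULL,     /* skip the common dma_resv object and the
> 		                  * optimizations built on top of it */
> 		       0, ~0ULL, /* va range: the whole address space */
> 		       0, 0,     /* no kernel reserved va region */
> 		       &xe_svm_gpuvm_ops); /* made-up ops table */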
> > >>
> > >> It's up to you how to implement it, but I think it's pretty clear that you need separate drm_gpuvm objects to manage those.
> > >
> > > As explained above, I am thinking of one drm_gpuvm object across all devices when SVM is in the picture...
> > >
> > >> That you map the same thing in all those virtual address spaces at the same address is a completely different optimization problem, I think.
> > >
> > > Not sure I follow here... the requirement from SVM is that one virtual address points to the same physical backing store. For example, whenever the CPU or any GPU device accesses this virtual address, it refers to the same physical content. Of course the physical backing store can be migrated between host memory and any of the GPUs' device memory, but the content should be consistent.
> >
> > Technically, multiple different GPUs will have separate virtual address spaces; it's just that you create mappings within all of them such that the same virtual address resolves to the same physical content on all of them.
> >
> > So, having a single GPUVM instance representing all of them might give the illusion of a single unified address space, but you still need to maintain each device's address space backing resources, such as page tables, separately.
>
> Yes, agreed.
>
> Regards,
> Oak
>
> > - Danilo
> >
> > > So we are mapping the same physical content to the same virtual address in either the cpu page table or any gpu device's page table...
> > >
> > >> What we could certainly do is to optimize hmm_range_fault by making hmm_range a reference-counted object and using it for multiple devices at the same time if those devices request the same range of an mm_struct.
> > >
> > > I don't quite follow. If you are trying to resolve a multi-device concurrent access problem, I think we should serialize concurrent device faults to one address range. The reason is that during device fault handling we might migrate the backing store, so hmm_range->hmm_pfns[] might have changed after one device has accessed it.
> > >
> > >> I think if you start using the same drm_gpuvm for multiple devices you will sooner or later start to run into the same mess we have seen with KFD, where we moved more and more functionality from the KFD to the DRM render node because we found that a lot of the stuff simply doesn't work correctly with a single object to maintain the state.
> > >
> > > As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represents all hardware gpu devices. That is why, during kfd open, many pdds (process device data) are created, each for one hardware device for this process. Yes, the code is a little complicated.
> > >
> > > Kfd manages the shared virtual address space in the kfd driver code, like the splitting, merging, etc. Here I am looking at whether we can leverage the drm_gpuvm code for those functions.
> > >
> > > As for the shared virtual address space across gpu devices, it is a hard requirement for svm/system allocator (aka malloc for gpu programs). We need to make it work either at the driver level or the drm_gpuvm level. Drm_gpuvm is better because the work can be shared between drivers.
> > >
> > > Thanks a lot,
> > > Oak
> > >
> > >> Just one more point on your original discussion on the xe list: I think it's perfectly valid for an application to map something at the same address where you already have something else.
> > >>
> > >> Cheers,
> > >> Christian.
> > >>
> > >>> Thanks,
> > >>> Oak
> > >