Re: Making drm_gpuvm work across gpu devices

Christian König <christian.koenig@xxxxxxx> · Tue, 27 Feb 2024 07:54:05 +0100



    Hi Oak,

    
    On 23.02.24 at 21:12, Zeng, Oak wrote:

    
        Hi Christian,
         
        I am going back to this old email to ask a question.
      
    
    Sorry, I totally missed that one.

    
        Quote from your email:
        “Those ranges can then be used to implement
          the SVM feature required for higher level APIs and not
          something you need at the UAPI or even inside the low level
          kernel memory management.”
        “SVM is a high level concept of OpenCL,
          Cuda, ROCm etc. This should not have any influence on the
          design of the kernel UAPI.”
         
        There are two categories of SVM:
        
          1) Driver SVM allocator: this is implemented in user space,
             e.g., cudaMallocManaged (CUDA), zeMemAllocShared (L0), or
             clSVMAlloc (OpenCL). Intel already has gem_create/vm_bind
             in xekmd, and our UMD implements clSVMAlloc and
             zeMemAllocShared on top of gem_create/vm_bind. Range A..B
             of the process address space is mapped into a range C..D
             of the GPU address space, exactly as you said.
          2) System SVM allocator: this doesn’t introduce an extra
             driver API for memory allocation. Any valid CPU virtual
             address can be used directly and transparently in a GPU
             program without any extra driver API call. Quote from the
             kernel's Documentation/vm/hmm.rst: “Any application memory
             region (private anonymous, shared memory, or regular file
             backed memory) can be used by a device transparently” and
             “to share the address space by duplicating the CPU page
             table in the device page table so the same address points
             to the same physical memory for any valid main memory
             address in the process address space”. In the system SVM
             allocator, we don’t need that A..B C..D mapping.
        
         
        It looks like you were talking about 1). Were
          you?
      
    
    No, even when you fully mirror the whole address space from a
    process into the GPU you still need to enable this somehow with an
    IOCTL.

    
    And while enabling this you absolutely should specify to which part
    of the address space this mirroring applies and where it maps to.

    
    I see the system svm allocator as just a special case of the driver
    allocator where not fully backed buffer objects are allocated, but
    rather sparse ones which are filled and migrated on demand.
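
    The "sparse object filled on demand" view can be sketched in plain
    user-space C. This is only an illustration: struct sparse_mirror,
    sparse_mirror_init(), sparse_mirror_fault() and the 4 KiB page size
    are hypothetical names and assumptions, not an existing DRM or Xe
    interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MIRROR_PAGE_SHIFT 12   /* assumption: 4 KiB pages */
#define MIRROR_MAX_PAGES  64   /* small fixed size keeps the sketch simple */

/* Hypothetical sparse mirror: a range is declared up front, backing is not. */
struct sparse_mirror {
    uint64_t cpu_start;                   /* A: start of the CPU range */
    uint64_t gpu_start;                   /* C: start of the GPU range */
    uint64_t npages;                      /* length of the range in pages */
    bool     populated[MIRROR_MAX_PAGES]; /* which pages have backing so far */
};

/* Enabling the mirror names the range and its target, but fills nothing. */
static void sparse_mirror_init(struct sparse_mirror *m, uint64_t cpu_start,
                               uint64_t gpu_start, uint64_t npages)
{
    assert(npages <= MIRROR_MAX_PAGES);
    m->cpu_start = cpu_start;
    m->gpu_start = gpu_start;
    m->npages = npages;
    memset(m->populated, 0, sizeof(m->populated));
}

/*
 * Simulated GPU page fault.
 * Returns 1 if the page was filled by this fault, 0 if it was already
 * populated, and -1 if the address is outside the mirrored range.
 */
static int sparse_mirror_fault(struct sparse_mirror *m, uint64_t gpu_va)
{
    uint64_t idx;

    if (gpu_va < m->gpu_start)
        return -1;
    idx = (gpu_va - m->gpu_start) >> MIRROR_PAGE_SHIFT;
    if (idx >= m->npages)
        return -1;
    if (m->populated[idx])
        return 0;
    m->populated[idx] = true;   /* a real driver would fill or migrate here */
    return 1;
}

/* Number of pages that currently have backing. */
static uint64_t sparse_mirror_populated(const struct sparse_mirror *m)
{
    uint64_t i, n = 0;

    for (i = 0; i < m->npages; i++)
        n += m->populated[i];
    return n;
}
```

    In this sketch, a system svm allocator is simply the case where the
    range is large, cpu_start equals gpu_start, and every page starts
    out unpopulated.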

    
    Regards,

    Christian.

    
        Oak
        
          
              From: Christian König
                  <christian.koenig@xxxxxxx>
                  

                  Sent: Wednesday, January 24, 2024 3:33 AM

                  To: Zeng, Oak <oak.zeng@xxxxxxxxx>;
                  Danilo Krummrich <dakr@xxxxxxxxxx>; Dave Airlie
                  <airlied@xxxxxxxxxx>; Daniel Vetter
                  <daniel@xxxxxxxx>; Felix Kuehling
                  <felix.kuehling@xxxxxxx>

                  Cc: Welty, Brian <brian.welty@xxxxxxxxx>;
                  dri-devel@xxxxxxxxxxxxxxxxxxxxx;
                  intel-xe@xxxxxxxxxxxxxxxxxxxxx; Bommu, Krishnaiah
                  <krishnaiah.bommu@xxxxxxxxx>; Ghimiray, Himal
                  Prasad <himal.prasad.ghimiray@xxxxxxxxx>;
                  Thomas.Hellstrom@xxxxxxxxxxxxxxx; Vishwanathapura,
                  Niranjana <niranjana.vishwanathapura@xxxxxxxxx>;
                  Brost, Matthew <matthew.brost@xxxxxxxxx>; Gupta,
                  saurabhg <saurabhg.gupta@xxxxxxxxx>

                  Subject: Re: Making drm_gpuvm work across gpu
                  devices
            
          
          On 23.01.24 at 20:37, Zeng, Oak wrote:

            
            [SNIP] 
             
            Yes, most APIs are per-device.
             
            One exception I know of is the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represents a process across N gpu devices.
          
          
            Yeah and that was a big mistake in my opinion. We should
            really not do that ever again.

            
            I should say that kfd SVM represents a shared virtual address space across the CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver, which works for only one device (i.e., if you want one device to access another device's memory, you have to use dma-buf export/import etc.).
          
          
            Exactly that thinking is what we have currently found to be
            a blocker for virtualization projects. Having SVM as a
            device-independent feature which somehow ties to the process
            address space turned out to be an extremely bad idea.

            
            The background is that this only works for some use cases
            but not all of them.

            
            What's working much better is to just have a mirror
            functionality which says that a range A..B of the process
            address space is mapped into a range C..D of the GPU address
            space.
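
            As a rough illustration of such a mirror, the following
            user-space C sketch describes one range and translates
            addresses through it. struct mirror_range and both helpers
            are hypothetical names chosen for this example; they are
            not part of drm_gpuvm or any existing UAPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor for "A..B of the process maps to C..D of the GPU". */
struct mirror_range {
    uint64_t cpu_start;  /* A: first mirrored CPU virtual address */
    uint64_t gpu_start;  /* C: GPU virtual address that A maps to */
    uint64_t size;       /* B - A, which equals D - C, in bytes */
};

/* True if cpu_va lies inside the half-open range [A, B). */
static bool mirror_contains(const struct mirror_range *r, uint64_t cpu_va)
{
    return cpu_va >= r->cpu_start && cpu_va - r->cpu_start < r->size;
}

/* Translate a CPU virtual address inside the range to its GPU address. */
static uint64_t mirror_cpu_to_gpu(const struct mirror_range *r, uint64_t cpu_va)
{
    assert(mirror_contains(r, cpu_va));
    return r->gpu_start + (cpu_va - r->cpu_start);
}
```

            With cpu_start equal to gpu_start the range behaves like a
            classic SVM mapping where both sides use the same address.
            With different start values the same descriptor also covers
            setups where the GPU address differs from the CPU address.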

            
            Those ranges can then be used to implement the SVM feature
            required for higher level APIs and not something you need at
            the UAPI or even inside the low level kernel memory
            management.

            
            When you talk about migrating memory to a device you also do
            this on a per device basis and *not* tied to the process
            address space. If you then get crappy performance because
            userspace gave contradicting information where to migrate
            memory then that's a bug in userspace and not something the
            kernel should try to prevent somehow.
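
            A small user-space C sketch of that per-device model
            follows. Everything in it (struct device_vm,
            device_vm_migrate(), contended_pages(), the fixed
            eight-page range) is a hypothetical illustration and not an
            existing kernel interface. Each device only records its own
            placement wishes, and nothing arbitrates between devices.

```c
#include <assert.h>
#include <stdint.h>

enum placement { PLACE_SYSTEM = 0, PLACE_VRAM = 1 };

#define NUM_PAGES 8   /* assumption: a tiny fixed-size mirrored range */

/* Hypothetical per-device VM state, one instance per GPU. */
struct device_vm {
    int device_id;
    enum placement wanted[NUM_PAGES];  /* where this device wants each page */
};

/*
 * Per-device migration request: it only touches the state of the device
 * it was issued on. Returns 0 on success, -1 if the page span is invalid.
 */
static int device_vm_migrate(struct device_vm *vm, unsigned first,
                             unsigned count, enum placement where)
{
    unsigned i;

    if (first >= NUM_PAGES || count > NUM_PAGES - first)
        return -1;
    for (i = 0; i < count; i++)
        vm->wanted[first + i] = where;
    return 0;
}

/*
 * Pages that two devices both want in their own VRAM. Such contradicting
 * hints are not rejected here; they only show up as pages that would
 * bounce between the devices.
 */
static unsigned contended_pages(const struct device_vm *a,
                                const struct device_vm *b)
{
    unsigned i, n = 0;

    for (i = 0; i < NUM_PAGES; i++)
        if (a->wanted[i] == PLACE_VRAM && b->wanted[i] == PLACE_VRAM)
            n++;
    return n;
}
```

            In this sketch a nonzero result from contended_pages() is
            something a userspace runtime could check for in its own
            hints; the per-device request path itself does not try to
            detect it.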

            
            [SNIP]

            
              I think if you start using the same drm_gpuvm for multiple devices you
              will sooner or later start to run into the same mess we have seen with
              KFD, where we moved more and more functionality from the KFD to the DRM
              render node because we found that a lot of the stuff simply doesn't work
              correctly with a single object to maintain the state.
            
             
            As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represents all hardware gpu devices. That is why, during kfd open, many pdds (process device data) are created, one per hardware device for this process.
          
          
            Yes, I'm perfectly aware of that. And I can only repeat
            myself that I see this design as a rather extreme failure.
            And I think it's one of the reasons why NVidia is so
            dominant with Cuda.

            
            This whole approach KFD takes was designed with the idea of
            extending the CPU process into the GPUs, but this idea only
            works for a few use cases and is not something we should
            apply to drivers in general.

            
            A very good example is the virtualization use case, where
            you end up with CPU address != GPU address because the VAs
            are actually coming from the guest VM and not the host
            process.

            
            SVM is a high level concept of OpenCL, Cuda, ROCm etc. This
            should not have any influence on the design of the kernel
            UAPI.

            
            If you want to do something similar to KFD for Xe I think
            you need to get explicit permission to do this from Dave and
            Daniel and maybe even Linus.

            
            Regards,

            Christian.