Re: [RFC 00/29] Introduce NVIDIA GPU Virtualization (vGPU) Support

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sun, Sep 22, 2024 at 04:11:21PM +0300, Zhi Wang wrote:
> On Sun, 22 Sep 2024 05:49:22 -0700
> Zhi Wang <zhiw@xxxxxxxxxx> wrote:
> 
> +Ben.
> 
> Forget to add you. My bad. 

Please also add the driver maintainers!

I had to fetch the patchset from the KVM list, since they did not hit the
nouveau list (I'm trying to get @nvidia.com addresses whitelisted).

- Danilo

>  
> 
> > 1. Background
> > =============
> > 
> > NVIDIA vGPU[1] software enables powerful GPU performance for workloads
> > ranging from graphics-rich virtual workstations to data science and
> > AI, enabling IT to leverage the management and security benefits of
> > virtualization as well as the performance of NVIDIA GPUs required for
> > modern workloads. Installed on a physical GPU in a cloud or enterprise
> > data center server, NVIDIA vGPU software creates virtual GPUs that can
> > be shared across multiple virtual machines.
> > 
> > The vGPU architecture[2] can be illustrated as follow:
> > 
> >  +--------------------+    +--------------------+
> > +--------------------+ +--------------------+ | Hypervisor         |
> >   | Guest VM           | | Guest VM           | | Guest VM
> > | |                    |    | +----------------+ | |
> > +----------------+ | | +----------------+ | | +----------------+ |
> > | |Applications... | | | |Applications... | | | |Applications... | |
> > | |  NVIDIA        | |    | +----------------+ | | +----------------+
> > | | +----------------+ | | |  Virtual GPU   | |    |
> > +----------------+ | | +----------------+ | | +----------------+ | |
> > |  Manager       | |    | |  Guest Driver  | | | |  Guest Driver  | |
> > | |  Guest Driver  | | | +------^---------+ |    | +----------------+
> > | | +----------------+ | | +----------------+ | |        |
> > |    +---------^----------+ +----------^---------+
> > +----------^---------+ |        |           |              |
> >              |                      | |        |
> > +--------------+-----------------------+----------------------+---------+
> > |        |                          |                       |
> >              |         | |        |                          |
> >                |                      |         |
> > +--------+--------------------------+-----------------------+----------------------+---------+
> > +---------v--------------------------+-----------------------+----------------------+----------+
> > | NVIDIA                  +----------v---------+
> > +-----------v--------+ +-----------v--------+ | | Physical GPU
> >     |   Virtual GPU      | |   Virtual GPU      | |   Virtual GPU
> >  | | |                         +--------------------+
> > +--------------------+ +--------------------+ |
> > +----------------------------------------------------------------------------------------------+
> > 
> > Each NVIDIA vGPU is analogous to a conventional GPU, having a fixed
> > amount of GPU framebuffer, and one or more virtual display outputs or
> > "heads". The vGPU’s framebuffer is allocated out of the physical
> > GPU’s framebuffer at the time the vGPU is created, and the vGPU
> > retains exclusive use of that framebuffer until it is destroyed.
> > 
> > The number of physical GPUs that a board has depends on the board.
> > Each physical GPU can support several different types of virtual GPU
> > (vGPU). vGPU types have a fixed amount of frame buffer, number of
> > supported display heads, and maximum resolutions. They are grouped
> > into different series according to the different classes of workload
> > for which they are optimized. Each series is identified by the last
> > letter of the vGPU type name.
> > 
> > NVIDIA vGPU supports Windows and Linux guest VM operating systems. The
> > supported vGPU types depend on the guest VM OS.
> > 
> > 2. Proposal for upstream
> > ========================
> > 
> > 2.1 Architecture
> > ----------------
> > 
> > Moving to the upstream, the proposed architecture can be illustrated
> > as followings:
> > 
> >                             +--------------------+
> > +--------------------+ +--------------------+ | Linux VM           |
> > | Windows VM         | | Guest VM           | | +----------------+ |
> > | +----------------+ | | +----------------+ | | |Applications... | |
> > | |Applications... | | | |Applications... | | | +----------------+ |
> > | +----------------+ | | +----------------+ | ... |
> > +----------------+ | | +----------------+ | | +----------------+ | |
> > |  Guest Driver  | | | |  Guest Driver  | | | |  Guest Driver  | | |
> > +----------------+ | | +----------------+ | | +----------------+ |
> > +---------^----------+ +----------^---------+ +----------^---------+
> > |                       |                      |
> > +--------------------------------------------------------------------+
> > |+--------------------+ +--------------------+
> > +--------------------+| ||       QEMU         | |       QEMU
> > | |       QEMU         || ||                    | |
> >  | |                    || |+--------------------+
> > +--------------------+ +--------------------+|
> > +--------------------------------------------------------------------+
> > |                       |                      |
> > +-----------------------------------------------------------------------------------------------+
> > |
> > +----------------------------------------------------------------+  |
> > |                           |                                VFIO
> >                        |  | |                           |
> >                                                    |  | |
> > +-----------------------+ | +------------------------+
> > +---------------------------------+|  | | |  Core Driver vGPU     | |
> > |                        |  |                                 ||  | |
> > |       Support        <--->|                       <---->
> >                     ||  | | +-----------------------+ | | NVIDIA vGPU
> > Manager    |  | NVIDIA vGPU VFIO Variant Driver ||  | | |    NVIDIA
> > GPU Core    | | |                        |  |
> >         ||  | | |        Driver         | |
> > +------------------------+  +---------------------------------+|  | |
> > +--------^--------------+
> > +----------------------------------------------------------------+  |
> > |          |                          |                       |
> >                |          |
> > +-----------------------------------------------------------------------------------------------+
> > |                          |                       |
> >     |
> > +----------|--------------------------|-----------------------|----------------------|----------+
> > |          v               +----------v---------+
> > +-----------v--------+ +-----------v--------+ | |  NVIDIA
> >      |       PCI VF       | |       PCI VF       | |       PCI VF
> >   | | |  Physical GPU            |                    | |
> >        | |                    | | |                          |
> > (Virtual GPU)    | |   (Virtual GPU)    | |    (Virtual GPU)   | | |
> >                         +--------------------+ +--------------------+
> > +--------------------+ |
> > +-----------------------------------------------------------------------------------------------+
> > 
> > The supported GPU generations will be Ada which come with the
> > supported GPU architecture. Each vGPU is backed by a PCI virtual
> > function.
> > 
> > The NVIDIA vGPU VFIO module together with VFIO sits on VFs, provides
> > extended management and features, e.g. selecting the vGPU types,
> > support live migration and driver warm update.
> > 
> > Like other devices that VFIO supports, VFIO provides the standard
> > userspace APIs for device lifecycle management and advance feature
> > support.
> > 
> > The NVIDIA vGPU manager provides necessary support to the NVIDIA vGPU
> > VFIO variant driver to create/destroy vGPUs, query available vGPU
> > types, select the vGPU type, etc.
> > 
> > On the other side, NVIDIA vGPU manager talks to the NVIDIA GPU core
> > driver, which provide necessary support to reach the HW functions.
> > 
> > 2.2 Requirements to the NVIDIA GPU core driver
> > ----------------------------------------------
> > 
> > The primary use case of CSP and enterprise is a standalone minimal
> > drivers of vGPU manager and other necessary components.
> > 
> > NVIDIA vGPU manager talks to the NVIDIA GPU core driver, which provide
> > necessary support to:
> > 
> > - Load the GSP firmware, boot the GSP, provide commnication channel.
> > - Manage the shared/partitioned HW resources. E.g. reserving FB
> > memory, channels for the vGPU mananger to create vGPUs.
> > - Exception handling. E.g. delivering the GSP events to vGPU manager.
> > - Host event dispatch. E.g. suspend/resume.
> > - Enumerations of HW configuration.
> > 
> > The NVIDIA GPU core driver, which sits on the PCI device interface of
> > NVIDIA GPU, provides support to both DRM driver and the vGPU manager.
> > 
> > In this RFC, the split nouveau GPU driver[3] is used as an example to
> > demostrate the requirements of vGPU manager to the core driver. The
> > nouveau driver is split into nouveau (the DRM driver) and nvkm (the
> > core driver).
> > 
> > 3 Try the RFC patches
> > -----------------------
> > 
> > The RFC supports to create one VM to test the simple GPU workload.
> > 
> > - Host kernel:
> > https://github.com/zhiwang-nvidia/linux/tree/zhi/vgpu-mgr-rfc
> > - Guest driver package: NVIDIA-Linux-x86_64-535.154.05.run [4]
> > 
> >   Install guest driver:
> >   # export GRID_BUILD=1
> >   # ./NVIDIA-Linux-x86_64-535.154.05.run
> > 
> > - Tested platforms: L40.
> > - Tested guest OS: Ubutnu 24.04 LTS.
> > - Supported experience: Linux rich desktop experience with simple 3D
> >   workload, e.g. glmark2
> > 
> > 4 Demo
> > ------
> > 
> > A demo video can be found at: https://youtu.be/YwgIvvk-V94
> > 
> > [1] https://www.nvidia.com/en-us/data-center/virtual-solutions/
> > [2]
> > https://docs.nvidia.com/vgpu/17.0/grid-vgpu-user-guide/index.html#architecture-grid-vgpu
> > [3]
> > https://lore.kernel.org/dri-devel/20240613170211.88779-1-bskeggs@xxxxxxxxxx/T/
> > [4]
> > https://us.download.nvidia.com/XFree86/Linux-x86_64/535.154.05/NVIDIA-Linux-x86_64-535.154.05.run
> > 
> > Zhi Wang (29):
> >   nvkm/vgpu: introduce NVIDIA vGPU support prelude
> >   nvkm/vgpu: attach to nvkm as a nvkm client
> >   nvkm/vgpu: reserve a larger GSP heap when NVIDIA vGPU is enabled
> >   nvkm/vgpu: set the VF partition count when NVIDIA vGPU is enabled
> >   nvkm/vgpu: populate GSP_VF_INFO when NVIDIA vGPU is enabled
> >   nvkm/vgpu: set RMSetSriovMode when NVIDIA vGPU is enabled
> >   nvkm/gsp: add a notify handler for GSP event
> >     GPUACCT_PERFMON_UTIL_SAMPLES
> >   nvkm/vgpu: get the size VMMU segment from GSP firmware
> >   nvkm/vgpu: introduce the reserved channel allocator
> >   nvkm/vgpu: introduce interfaces for NVIDIA vGPU VFIO module
> >   nvkm/vgpu: introduce GSP RM client alloc and free for vGPU
> >   nvkm/vgpu: introduce GSP RM control interface for vGPU
> >   nvkm: move chid.h to nvkm/engine.
> >   nvkm/vgpu: introduce channel allocation for vGPU
> >   nvkm/vgpu: introduce FB memory allocation for vGPU
> >   nvkm/vgpu: introduce BAR1 map routines for vGPUs
> >   nvkm/vgpu: introduce engine bitmap for vGPU
> >   nvkm/vgpu: introduce pci_driver.sriov_configure() in nvkm
> >   vfio/vgpu_mgr: introdcue vGPU lifecycle management prelude
> >   vfio/vgpu_mgr: allocate GSP RM client for NVIDIA vGPU manager
> >   vfio/vgpu_mgr: introduce vGPU type uploading
> >   vfio/vgpu_mgr: allocate vGPU FB memory when creating vGPUs
> >   vfio/vgpu_mgr: allocate vGPU channels when creating vGPUs
> >   vfio/vgpu_mgr: allocate mgmt heap when creating vGPUs
> >   vfio/vgpu_mgr: map mgmt heap when creating a vGPU
> >   vfio/vgpu_mgr: allocate GSP RM client when creating vGPUs
> >   vfio/vgpu_mgr: bootload the new vGPU
> >   vfio/vgpu_mgr: introduce vGPU host RPC channel
> >   vfio/vgpu_mgr: introduce NVIDIA vGPU VFIO variant driver
> > 
> >  .../drm/nouveau/include/nvkm/core/device.h    |   3 +
> >  .../drm/nouveau/include/nvkm/engine/chid.h    |  29 +
> >  .../gpu/drm/nouveau/include/nvkm/subdev/gsp.h |   1 +
> >  .../nouveau/include/nvkm/vgpu_mgr/vgpu_mgr.h  |  45 ++
> >  .../nvidia/inc/ctrl/ctrl2080/ctrl2080gpu.h    |  12 +
> >  drivers/gpu/drm/nouveau/nvkm/Kbuild           |   1 +
> >  drivers/gpu/drm/nouveau/nvkm/device/pci.c     |  33 +-
> >  .../gpu/drm/nouveau/nvkm/engine/fifo/chid.c   |  49 +-
> >  .../gpu/drm/nouveau/nvkm/engine/fifo/chid.h   |  26 +-
> >  .../gpu/drm/nouveau/nvkm/engine/fifo/r535.c   |   3 +
> >  .../gpu/drm/nouveau/nvkm/subdev/gsp/r535.c    |  14 +-
> >  drivers/gpu/drm/nouveau/nvkm/vgpu_mgr/Kbuild  |   3 +
> >  drivers/gpu/drm/nouveau/nvkm/vgpu_mgr/vfio.c  | 302 +++++++++++
> >  .../gpu/drm/nouveau/nvkm/vgpu_mgr/vgpu_mgr.c  | 234 ++++++++
> >  drivers/vfio/pci/Kconfig                      |   2 +
> >  drivers/vfio/pci/Makefile                     |   2 +
> >  drivers/vfio/pci/nvidia-vgpu/Kconfig          |  13 +
> >  drivers/vfio/pci/nvidia-vgpu/Makefile         |   8 +
> >  drivers/vfio/pci/nvidia-vgpu/debug.h          |  18 +
> >  .../nvidia/inc/ctrl/ctrl0000/ctrl0000system.h |  30 +
> >  .../nvidia/inc/ctrl/ctrl2080/ctrl2080gpu.h    |  33 ++
> >  .../ctrl/ctrl2080/ctrl2080vgpumgrinternal.h   | 152 ++++++
> >  .../common/sdk/nvidia/inc/ctrl/ctrla081.h     | 109 ++++
> >  .../nvrm/common/sdk/nvidia/inc/dev_vgpu_gsp.h | 213 ++++++++
> >  .../common/sdk/nvidia/inc/nv_vgpu_types.h     |  51 ++
> >  .../common/sdk/vmioplugin/inc/vmioplugin.h    |  26 +
> >  .../pci/nvidia-vgpu/include/nvrm/nvtypes.h    |  24 +
> >  drivers/vfio/pci/nvidia-vgpu/nvkm.h           |  94 ++++
> >  drivers/vfio/pci/nvidia-vgpu/rpc.c            | 242 +++++++++
> >  drivers/vfio/pci/nvidia-vgpu/vfio.h           |  43 ++
> >  drivers/vfio/pci/nvidia-vgpu/vfio_access.c    | 297 ++++++++++
> >  drivers/vfio/pci/nvidia-vgpu/vfio_main.c      | 511
> > ++++++++++++++++++ drivers/vfio/pci/nvidia-vgpu/vgpu.c           |
> > 352 ++++++++++++ drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c       | 144
> > +++++ drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h       |  89 +++
> >  drivers/vfio/pci/nvidia-vgpu/vgpu_types.c     | 466 ++++++++++++++++
> >  include/drm/nvkm_vgpu_mgr_vfio.h              |  61 +++
> >  37 files changed, 3702 insertions(+), 33 deletions(-)
> >  create mode 100644 drivers/gpu/drm/nouveau/include/nvkm/engine/chid.h
> >  create mode 100644
> > drivers/gpu/drm/nouveau/include/nvkm/vgpu_mgr/vgpu_mgr.h create mode
> > 100644 drivers/gpu/drm/nouveau/nvkm/vgpu_mgr/Kbuild create mode
> > 100644 drivers/gpu/drm/nouveau/nvkm/vgpu_mgr/vfio.c create mode
> > 100644 drivers/gpu/drm/nouveau/nvkm/vgpu_mgr/vgpu_mgr.c create mode
> > 100644 drivers/vfio/pci/nvidia-vgpu/Kconfig create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/Makefile create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/debug.h create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/include/nvrm/common/sdk/nvidia/inc/ctrl/ctrl0000/ctrl0000system.h
> > create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/include/nvrm/common/sdk/nvidia/inc/ctrl/ctrl2080/ctrl2080gpu.h
> > create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/include/nvrm/common/sdk/nvidia/inc/ctrl/ctrl2080/ctrl2080vgpumgrinternal.h
> > create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/include/nvrm/common/sdk/nvidia/inc/ctrl/ctrla081.h
> > create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/include/nvrm/common/sdk/nvidia/inc/dev_vgpu_gsp.h
> > create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/include/nvrm/common/sdk/nvidia/inc/nv_vgpu_types.h
> > create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/include/nvrm/common/sdk/vmioplugin/inc/vmioplugin.h
> > create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/include/nvrm/nvtypes.h create mode
> > 100644 drivers/vfio/pci/nvidia-vgpu/nvkm.h create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/rpc.c create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/vfio.h create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/vfio_access.c create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/vfio_main.c create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/vgpu.c create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h create mode 100644
> > drivers/vfio/pci/nvidia-vgpu/vgpu_types.c create mode 100644
> > include/drm/nvkm_vgpu_mgr_vfio.h
> > 
> 




[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux