Re: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough

On 25/09/2024 12.11, Alejandro Lucero Palau wrote:
> 
> On 9/20/24 23:34, Zhi Wang wrote:
>> Hi folks:
>>
>> As promised at LPC, here is everything you need (patches, repos, a guiding
>> video, kernel configs) to build an environment to test the vfio-cxl-core.
>>
>> Thanks so much for the discussions! Enjoy and see you in the next one.
>>
>> Background
>> ==========
>>
>> Compute Express Link (CXL) is an open standard interconnect built upon
>> industry-standard PCI layers to enhance the performance and efficiency of
>> data centers by enabling high-speed, low-latency communication between
>> CPUs and various types of devices such as accelerators and memory expanders.
>>
>> It supports three key protocols: CXL.io as the control protocol, CXL.cache
>> as the cache-coherent host-device data transfer protocol, and CXL.mem as
>> the memory expansion protocol. CXL type-2 devices leverage all three
>> protocols to seamlessly integrate with host CPUs, providing a unified and
>> efficient interface for high-speed data transfer and memory sharing. This
>> integration is crucial for heterogeneous computing environments where
>> accelerators, such as GPUs and other specialized processors, are used to
>> handle intensive workloads.
>>
>> Goal
>> ====
>>
>> Although CXL is built upon the PCI layers, passing through a CXL type-2
>> device differs from passing through a plain PCI device, according to the
>> CXL specification[1]:
>>
>> - CXL type-2 device initialization. A CXL type-2 device requires an
>> additional initialization sequence on top of the PCI device initialization.
>> This sequence can be fairly complicated due to the device's hierarchy of
>> register interfaces. Thus, the standard CXL type-2 driver initialization
>> sequence provided by the kernel CXL core is used.
>>
>> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
>> (device physical address) needs to be created to access the device memory
>> directly. HDM decoders in the CXL topology need to be configured level by
>> level to manage the mapping. After the region is created, it needs to be
>> mapped to GPA via the virtual HDM decoders configured by the VM.
>>
>> - CXL reset. The CXL device reset is different from the PCI device reset;
>> the CXL spec introduces a dedicated CXL reset sequence.
>>
>> - Emulating CXL DVSECs. The CXL spec defines a set of DVSEC registers in
>> the configuration space for device enumeration and device control (e.g.
>> whether a device is capable of CXL.mem/CXL.cache, and enabling/disabling
>> those capabilities). They are owned by the kernel CXL core, and the VM
>> must not modify them.
>>
>> - Emulating CXL MMIO registers. The CXL spec defines a set of CXL MMIO
>> registers that can sit in a PCI BAR. The location of each register group
>> within the BAR is indicated by the Register Locator in the CXL DVSECs.
>> These registers are also owned by the kernel CXL core, and some of them
>> need to be emulated.
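
[Side note, for illustration only: the HPA<->DPA mapping that the HDM
decoders manage boils down, for a single committed non-interleaved decoder,
to the address math sketched below. The struct and field names are made up;
the real decoder programming goes through the kernel CXL core, and
interleaved decoders additionally fold the interleave ways/granularity into
the math.]

#include <linux/types.h>
#include <linux/errno.h>

/* Hypothetical view of one committed, non-interleaved HDM decoder. */
struct hdm_decoder {
	u64 hpa_base;	/* start of the HPA range the decoder claims */
	u64 hpa_size;	/* size of that range */
	u64 dpa_base;	/* DPA that backs the range on the device */
};

/* Translate a host physical address into a device physical address. */
static int hdm_decode(const struct hdm_decoder *dec, u64 hpa, u64 *dpa)
{
	if (hpa < dec->hpa_base || hpa >= dec->hpa_base + dec->hpa_size)
		return -ENXIO;	/* not claimed by this decoder */

	*dpa = dec->dpa_base + (hpa - dec->hpa_base);
	return 0;
}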
>>
>> Design
>> ======
>>
>> To achieve the goals above, the vfio-cxl-core is introduced to host the
>> common routines that a variant driver requires for device passthrough.
>> Similar to vfio-pci-core, the vfio-cxl-core provides common
>> vfio_device_ops routines for the variant driver to hook, and performs the
>> CXL routines behind them.
>>
>> In addition, several extra APIs are introduced for the variant driver to
>> provide the information the kernel CXL core needs to initialize the CXL
>> device, e.g., the device DPA.
>>
>> CXL is built upon the PCI layers, but with differences. Thus,
>> vfio-pci-core is reused as much as possible, with added awareness of
>> operating on a CXL device.
>>
>> A new VFIO device region is introduced to expose the CXL region to the
>> userspace. A new CXL VFIO device cap has also been introduced to convey
>> the necessary CXL device information to the userspace.
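
[Side note, for illustration only: on the userspace side, QEMU would
discover such a region through the usual VFIO region-info path. A minimal
sketch follows; CXL_REGION_INDEX_GUESS is a placeholder, and the actual
region index and capability ID are whatever UABI this series defines.]

#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Placeholder: device-specific regions usually start after the fixed
 * PCI regions; the real index comes from the vfio-cxl UABI. */
#define CXL_REGION_INDEX_GUESS	(VFIO_PCI_NUM_REGIONS + 0)

static int probe_cxl_region(int device_fd)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = CXL_REGION_INDEX_GUESS,
	};

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info) < 0)
		return -1;

	/*
	 * A capability chain (VFIO_REGION_INFO_FLAG_CAPS) is the usual way
	 * extra, class-specific details are conveyed; the CXL cap from this
	 * series would be fetched by re-issuing the ioctl with argsz grown
	 * to info.argsz.
	 */
	return (info.flags & VFIO_REGION_INFO_FLAG_CAPS) ? 0 : -1;
}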
> 
> 
> 
> Hi Zhi,
> 
> 
> As you know, I was confused by this work, but after looking at the
> patchset and thinking about all this, it makes sense now. FWIW, the most
> confusing point was using the CXL core inside the VM for creating the
> region, which implies commits to the CXL root complex/switch and any other
> switch in the path. I realize now this will happen, but on emulated
> hardware with no implication for the real one, which was already updated
> with any necessary changes, like those commits, by the vfio-cxl code in
> the host (the L1 VM in your tests).
> 
> 
> The only problem I can see with this approach is that the CXL
> initialization is left unconditionally to the hypervisor. I guess most of
> the time this will be fine, but the driver may not always be mapping/using
> the whole CXL memory. I know this could be awkward, but it is possible
> depending on device state unrelated to CXL itself.

Will this device state be a one-time on/off state or a runtime
configuration state that the guest needs to poke all the time?

There can be two paths for handling these states in a vendor-specific
variant driver: 1) the vfio_device->fops->open() path, which suits a
one-time on/off state; 2) the vfio_device->fops->{read, write}() path,
i.e. the VM exit -> QEMU -> variant driver path. The vendor-specific
driver can configure the HW based on the register accesses from the guest.

It would be nice to know more about this, e.g. how many registers the
vendor-specific driver would like to handle, so that the VFIO CXL core can
provide common helpers.
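
Roughly, for path 2), a variant driver could wire its write handler as in
the sketch below. This is only a sketch: struct my_dev and
my_dev_handle_reg() are made-up vendor names, and the vfio_cxl_core_write()
prototype here simply mirrors vfio_pci_core_write(); the real one from
PATCH 7 may differ.

#include <linux/vfio_pci_core.h>

/* Hypothetical vendor-specific device structure. */
struct my_dev {
	struct vfio_pci_core_device core;	/* embeds struct vfio_device */
	/* vendor-specific state ... */
};

/* Assumed prototype, mirroring vfio_pci_core_write(). */
ssize_t vfio_cxl_core_write(struct vfio_device *core_vdev,
			    const char __user *buf, size_t count,
			    loff_t *ppos);

/* Hypothetical vendor hook: returns true if it consumed the access. */
bool my_dev_handle_reg(struct my_dev *mdev, loff_t pos,
		       const char __user *buf, size_t count);

static ssize_t my_variant_write(struct vfio_device *core_vdev,
				const char __user *buf, size_t count,
				loff_t *ppos)
{
	struct my_dev *mdev =
		container_of(core_vdev, struct my_dev, core.vdev);

	/* Reconfigure the HW when the guest pokes a register the vendor
	 * driver wants to trap; everything else takes the common path. */
	if (my_dev_handle_reg(mdev, *ppos, buf, count))
		return count;

	return vfio_cxl_core_write(core_vdev, buf, count, ppos);
}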

> In other words, this approach
> assumes beforehand something which could not be true. What I had in mind
> was to have VM exits for any action on CXL configuration on behalf of
> that device/driver inside the device.
> 

Initially, this was an idea from Dan. I think this would be a good topic
for the next CXL open-source collaboration meeting. Kevin also commented
on this.

> 
> This is all more problematic with CXL.cache, and I think the same
> approach cannot be followed. I'm writing a document trying to share all
> my concerns about CXL.cache and DMA/IOMMU mappings, and I will cover this
> for sure. As a quick note, while DMA/IOMMU mappings have no limitations
> regarding the amount of memory to use, the same is unlikely to be possible
> for CXL.cache due to scarce host snoop cache resources. Therefore the
> CXL.cache mappings will likely need to be explicitly done by the driver
> and approved by the CXL core (along with DMA/IOMMU), and for a driver
> inside a VM that implies VM exits.
> 

Good to hear. Please CC me as well. Many thanks.

> 
> Thanks.
> 
> Alejandro.
> 
>> Patches
>> =======
>>
>> The patches are based on the cxl-type2 support RFCv2 patchset[2]. Will
>> rebase them to V3 once the cxl-type2 support v3 patch review is done.
>>
>> PATCH 1-3: Expose the necessary routines required by vfio-cxl.
>>
>> PATCH 4: Introduce the preludes of vfio-cxl, including CXL device
>> initialization, CXL region creation.
>>
>> PATCH 5: Expose the CXL region to the userspace.
>>
>> PATCH 6-7: Prepare to emulate the HDM decoder registers.
>>
>> PATCH 8: Emulate the HDM decoder registers.
>>
>> PATCH 9: Tweak vfio-cxl to be aware of working on a CXL device.
>>
>> PATCH 10: Tell vfio-pci-core to emulate CXL DVSECs.
>>
>> PATCH 11: Expose the CXL device information that userspace needs.
>>
>> PATCH 12: An example variant driver to demonstrate the usage of
>> vfio-cxl-core from the perspective of the VFIO variant driver.
>>
>> PATCH 13: A workaround that needs suggestions.
>>
>> Test
>> ====
>>
>> To test the patches and hack around, a virtual passthrough approach with
>> nested virtualization is used.
>>
>> The host QEMU emulates a CXL type-2 accel device based on Ira's
>> patches[3], with changes to emulate HDM decoders.
>>
>> While running vfio-cxl in the L1 guest, an example VFIO variant driver is
>> used to attach to the QEMU CXL accel device.
>>
>> The L2 guest can be booted via QEMU with the vfio-cxl support in the
>> VFIOStub.
>>
>> In the L2 guest, a dummy CXL device driver is provided to attach to the
>> virtual pass-through device.
>>
>> The dummy CXL type-2 device driver can be successfully loaded with the
>> kernel CXL core type-2 support. It creates a CXL region by requesting the
>> CXL core to allocate HPA and DPA and to configure the HDM decoders.
>>
>> To make sure everyone can test the patches, the kernel configs of L1 and
>> L2 are provided in the repos. The required kernel command-line parameters
>> and the QEMU command line can be found in the demonstration video[4].
>>
>> Repos
>> =====
>>
>> QEMU host: 
>> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-host
>> L1 Kernel: 
>> https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l1-kernel-rfc
>> L1 QEMU: 
>> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-l1-rfc
>> L2 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l2
>>
>> [1] https://computeexpresslink.org/cxl-specification/
>> [2] 
>> https://lore.kernel.org/netdev/20240715172835.24757-1-alejandro.lucero-palau@xxxxxxx/T/
>> [3] 
>> https://patchew.org/QEMU/20230517-rfc-type2-dev-v1-0-6eb2e470981b@xxxxxxxxx/
>> [4] https://youtu.be/zlk_ecX9bxs?si=hc8P58AdhGXff3Q7
>>
>> Feedback expected
>> =================
>>
>> - Architecture level between vfio-pci-core and vfio-cxl-core.
>> - Variant driver requirements from more hardware vendors.
>> - vfio-cxl-core UABI to QEMU.
>>
>> Moving forward
>> =============
>>
>> - Rebase the patches on top of Alejandro's PATCH v3.
>> - Get Ira's type-2 emulated device patch into upstream, as both the CXL
>>    folks and the RH folks came to talk and expect this. I had a chat with
>>    Ira and he expected me to take it over. Will start a discussion in the
>>    CXL Discord group about the design of V1.
>> - Sparse map in vfio-cxl-core.
>>
>> Known issues
>> ============
>>
>> - Teardown path. The missing teardown paths have been implemented in
>>    Alejandro's PATCH v3. This should be solved after the rebase.
>>
>> - Power down the L1 guest instead of rebooting it. The QEMU reset handler
>>    is missing in Ira's patch, so when rebooting L1, many CXL registers are
>>    not reset. This will be addressed in the formal review of the emulated
>>    CXL type-2 device support.
>>
>> Zhi Wang (13):
>>    cxl: allow a type-2 device not to have memory device registers
>>    cxl: introduce cxl_get_hdm_info()
>>    cxl: introduce cxl_find_comp_reglock_offset()
>>    vfio: introduce vfio-cxl core preludes
>>    vfio/cxl: expose CXL region to the userspace via a new VFIO device
>>      region
>>    vfio/pci: expose vfio_pci_rw()
>>    vfio/cxl: introduce vfio_cxl_core_{read, write}()
>>    vfio/cxl: emulate HDM decoder registers
>>    vfio/pci: introduce CXL device awareness
>>    vfio/pci: emulate CXL DVSEC registers in the configuration space
>>    vfio/cxl: introduce VFIO CXL device cap
>>    vfio/cxl: VFIO variant driver for QEMU CXL accel device
>>    vfio/cxl: workaround: don't take resource region when cxl is enabled.
>>
>>   drivers/cxl/core/pci.c              |  28 ++
>>   drivers/cxl/core/regs.c             |  22 +
>>   drivers/cxl/cxl.h                   |   1 +
>>   drivers/cxl/cxlpci.h                |   3 +
>>   drivers/cxl/pci.c                   |  14 +-
>>   drivers/vfio/pci/Kconfig            |   6 +
>>   drivers/vfio/pci/Makefile           |   5 +
>>   drivers/vfio/pci/cxl-accel/Kconfig  |   6 +
>>   drivers/vfio/pci/cxl-accel/Makefile |   3 +
>>   drivers/vfio/pci/cxl-accel/main.c   | 116 +++++
>>   drivers/vfio/pci/vfio_cxl_core.c    | 647 ++++++++++++++++++++++++++++
>>   drivers/vfio/pci/vfio_pci_config.c  |  10 +
>>   drivers/vfio/pci/vfio_pci_core.c    |  79 +++-
>>   drivers/vfio/pci/vfio_pci_rdwr.c    |   8 +-
>>   include/linux/cxl_accel_mem.h       |   3 +
>>   include/linux/cxl_accel_pci.h       |   6 +
>>   include/linux/vfio_pci_core.h       |  53 +++
>>   include/uapi/linux/vfio.h           |  14 +
>>   18 files changed, 992 insertions(+), 32 deletions(-)
>>   create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
>>   create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
>>   create mode 100644 drivers/vfio/pci/cxl-accel/main.c
>>   create mode 100644 drivers/vfio/pci/vfio_cxl_core.c
>>




