On 25/09/2024 12.11, Alejandro Lucero Palau wrote:
> External email: Use caution opening links or attachments
>
>
> On 9/20/24 23:34, Zhi Wang wrote:
>> Hi folks:
>>
>> As promised at the LPC, here is all you need (patches, repos, guiding
>> video, kernel config) to build an environment to test the vfio-cxl-core.
>>
>> Thanks so much for the discussions! Enjoy and see you in the next one.
>>
>> Background
>> ==========
>>
>> Compute Express Link (CXL) is an open standard interconnect built upon
>> the industry-standard PCI layers to enhance the performance and
>> efficiency of data centers by enabling high-speed, low-latency
>> communication between CPUs and various types of devices such as
>> accelerators and memory.
>>
>> It supports three key protocols: CXL.io as the control protocol,
>> CXL.cache as the cache-coherent host-device data transfer protocol, and
>> CXL.mem as the memory expansion protocol. CXL Type 2 devices leverage
>> the three protocols to seamlessly integrate with host CPUs, providing a
>> unified and efficient interface for high-speed data transfer and memory
>> sharing. This integration is crucial for heterogeneous computing
>> environments where accelerators, such as GPUs and other specialized
>> processors, are used to handle intensive workloads.
>>
>> Goal
>> ====
>>
>> Although CXL is built upon the PCI layers, passing through a CXL type-2
>> device can be different from passing through a PCI device, according to
>> the CXL specification[1]:
>>
>> - CXL type-2 device initialization. A CXL type-2 device requires an
>> additional initialization sequence besides the PCI device
>> initialization. CXL type-2 device initialization can be pretty
>> complicated due to its hierarchy of register interfaces. Thus, a
>> standard CXL type-2 driver initialization sequence provided by the
>> kernel CXL core is used.
>>
>> - Create a CXL region and map it to the VM. A mapping between HPA and
>> DPA (Device PA) needs to be created to access the device memory
>> directly. HDM decoders in the CXL topology need to be configured level
>> by level to manage the mapping. After the region is created, it needs
>> to be mapped to GPA in the virtual HDM decoders configured by the VM.
>>
>> - CXL reset. The CXL device reset is different from the PCI device
>> reset. A CXL reset sequence is introduced by the CXL spec.
>>
>> - Emulating CXL DVSECs. The CXL spec defines a set of DVSEC registers
>> in the configuration space for device enumeration and device control
>> (e.g., whether a device is capable of CXL.mem/CXL.cache, and
>> enabling/disabling those capabilities). They are owned by the kernel
>> CXL core, and the VM can not modify them.
>>
>> - Emulating CXL MMIO registers. The CXL spec defines a set of CXL MMIO
>> registers that can sit in a PCI BAR. The location of the register
>> groups within the PCI BAR is indicated by the register locator in the
>> CXL DVSECs. They are also owned by the kernel CXL core. Some of them
>> need to be emulated.
>>
>> Design
>> ======
>>
>> To achieve the purpose above, the vfio-cxl-core is introduced to host
>> the common routines that a variant driver requires for device
>> passthrough. Similar to the vfio-pci-core, the vfio-cxl-core provides
>> common vfio_device_ops routines for the variant driver to hook, and
>> performs the CXL routines behind them.
>>
>> Besides, several extra APIs are introduced for the variant driver to
>> provide the necessary information for the kernel CXL core to initialize
>> the CXL device, e.g., the device DPA.
>>
>> CXL is built upon the PCI layers but with differences.
>> Thus, the vfio-pci-core is aimed to be re-used as much as possible,
>> with the awareness of operating on a CXL device.
>>
>> A new VFIO device region is introduced to expose the CXL region to the
>> userspace. A new CXL VFIO device cap has also been introduced to convey
>> the necessary CXL device information to the userspace.
>
>
> Hi Zhi,
>
>
> As you know, I was confused with this work, but after looking at the
> patchset and thinking about all this, it makes sense now. FWIW, the most
> confusing point was using the CXL core inside the VM for creating the
> region, which implies commits to the CXL root switch/complex and any
> other switch in the path. I realize now it will happen, but on emulated
> hardware with no implication for the real one, which was updated with
> any necessary change, like those commits, by the vfio cxl code in the
> host (the L1 VM in your tests).
>
>
> The only problem I can see with this approach is that the CXL
> initialization is left unconditionally to the hypervisor. I guess most
> of the time it will be fine, but the driver might not always be
> mapping/using the whole CXL mem. I know this could be awkward, but it is
> possible depending on device state unrelated to CXL itself.

Will these device states be a one-time on/off state or a runtime
configuration state that a guest needs to poke all the time?

There can be two paths for handling these states in a vendor-specific
variant driver:

1) The vfio_device->fops->open() path, which suits a one-time on/off
state.

2) The vfio_device->fops->{read, write}() path, i.e. the VM exit ->
QEMU -> variant driver path. The vendor-specific driver can configure
the HW based on the register accesses from the guest.

It would be nice to know more about this, like how many registers the
vendor-specific driver would like to handle, so that the VFIO CXL core
can provide common helpers. (A purely illustrative sketch of these two
paths is appended at the end of this mail.)

> In other words, this approach assumes beforehand something which could
> not be true. What I had in mind was to have VM exits for any action on
> CXL configuration on behalf of that device/driver inside the VM.
>

Initially, this was an idea from Dan. I think this would be a good topic
for the next CXL open-source collaboration meeting. Kevin also commented
on this.

>
> This is all more problematic with CXL.cache, and I think the same
> approach can not be followed. I'm writing a document trying to share all
> my concerns about CXL.cache and DMA/IOMMU mappings, and I will cover
> this for sure. As a quick note, while DMA/IOMMU has no limitations
> regarding the amount of memory to use, it is unlikely the same can be
> done due to scarce host snoop cache resources. Therefore, the CXL.cache
> mappings will likely need to be explicitly done by the driver and
> approved by the CXL core (along with DMA/IOMMU), and for a driver inside
> a VM that implies VM exits.
>

Good to hear. Please CC me as well. Many thanks.

>
> Thanks.
>
> Alejandro.
>
>> Patches
>> =======
>>
>> The patches are based on the cxl-type2 support RFCv2 patchset[2]. Will
>> rebase them to V3 once the cxl-type2 support v3 patch review is done.
>>
>> PATCH 1-3: Expose the necessary routines required by vfio-cxl.
>>
>> PATCH 4: Introduce the preludes of vfio-cxl, including CXL device
>> initialization and CXL region creation.
>>
>> PATCH 5: Expose the CXL region to the userspace.
>>
>> PATCH 6-7: Prepare to emulate the HDM decoder registers.
>>
>> PATCH 8: Emulate the HDM decoder registers.
>>
>> PATCH 9: Tweak vfio-cxl to be aware of working on a CXL device.
>>
>> PATCH 10: Tell vfio-pci-core to emulate CXL DVSECs.
>>
>> PATCH 11: Expose the CXL device information that userspace needs.
>>
>> PATCH 12: An example variant driver to demonstrate the usage of
>> vfio-cxl-core from the perspective of the VFIO variant driver.
>>
>> PATCH 13: A workaround that needs suggestions.
>>
>> Test
>> ====
>>
>> To test the patches and hack around, a virtual passthrough with nested
>> virtualization approach is used.
>>
>> The host QEMU emulates a CXL type-2 accel device based on Ira's patches
>> with the changes to emulate HDM decoders.
>>
>> While running the vfio-cxl in the L1 guest, an example VFIO variant
>> driver is used to attach to the QEMU CXL accel device.
>>
>> The L2 guest can be booted via the QEMU with the vfio-cxl support in
>> the VFIOStub.
>>
>> In the L2 guest, a dummy CXL device driver is provided to attach to the
>> virtual pass-thru device.
>>
>> The dummy CXL type-2 device driver can successfully be loaded with the
>> kernel CXL core type2 support, create a CXL region by requesting the
>> CXL core to allocate HPA and DPA, and configure the HDM decoders.
>>
>> To make sure everyone can test the patches, the kernel configs of L1
>> and L2 are provided in the repos; the required kernel command params
>> and QEMU command line can be found in the demonstration video.[5]
>>
>> Repos
>> =====
>>
>> QEMU host:
>> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-host
>> L1 Kernel:
>> https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l1-kernel-rfc
>> L1 QEMU:
>> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-l1-rfc
>> L2 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l2
>>
>> [1] https://computeexpresslink.org/cxl-specification/
>> [2]
>> https://lore.kernel.org/netdev/20240715172835.24757-1-alejandro.lucero-palau@xxxxxxx/T/
>> [3]
>> https://patchew.org/QEMU/20230517-rfc-type2-dev-v1-0-6eb2e470981b@xxxxxxxxx/
>> [4] https://youtu.be/zlk_ecX9bxs?si=hc8P58AdhGXff3Q7
>>
>> Feedback expected
>> =================
>>
>> - Architecture level between vfio-pci-core and vfio-cxl-core.
>> - Variant driver requirements from more hardware vendors.
>> - vfio-cxl-core UABI to QEMU.
>>
>> Moving forward
>> ==============
>>
>> - Rebase the patches on top of Alejandro's PATCH v3.
>> - Get Ira's type-2 emulated device patch into upstream, as the CXL
>> folks and RH folks both came to talk and expect this. I had a chat
>> with Ira and he expected me to take it over. Will start a discussion
>> in the CXL discord group for the design of V1.
>> - Sparse map in vfio-cxl-core.
>>
>> Known issues
>> ============
>>
>> - Teardown path. Missing teardown paths have been implemented in
>> Alejandro's PATCH v3. It should be solved after the rebase.
>>
>> - Power down the L1 guest instead of rebooting it. The QEMU reset
>> handler is missing in Ira's patch. When rebooting L1, many CXL
>> registers are not reset. This will be addressed in the formal review
>> of the emulated CXL type-2 device support.
>>
>> Zhi Wang (13):
>>   cxl: allow a type-2 device not to have memory device registers
>>   cxl: introduce cxl_get_hdm_info()
>>   cxl: introduce cxl_find_comp_reglock_offset()
>>   vfio: introduce vfio-cxl core preludes
>>   vfio/cxl: expose CXL region to the userspace via a new VFIO device
>>     region
>>   vfio/pci: expose vfio_pci_rw()
>>   vfio/cxl: introduce vfio_cxl_core_{read, write}()
>>   vfio/cxl: emulate HDM decoder registers
>>   vfio/pci: introduce CXL device awareness
>>   vfio/pci: emulate CXL DVSEC registers in the configuration space
>>   vfio/cxl: introduce VFIO CXL device cap
>>   vfio/cxl: VFIO variant driver for QEMU CXL accel device
>>   vfio/cxl: workaround: don't take resource region when cxl is enabled.
>>
>>  drivers/cxl/core/pci.c              |  28 ++
>>  drivers/cxl/core/regs.c             |  22 +
>>  drivers/cxl/cxl.h                   |   1 +
>>  drivers/cxl/cxlpci.h                |   3 +
>>  drivers/cxl/pci.c                   |  14 +-
>>  drivers/vfio/pci/Kconfig            |   6 +
>>  drivers/vfio/pci/Makefile           |   5 +
>>  drivers/vfio/pci/cxl-accel/Kconfig  |   6 +
>>  drivers/vfio/pci/cxl-accel/Makefile |   3 +
>>  drivers/vfio/pci/cxl-accel/main.c   | 116 +++++
>>  drivers/vfio/pci/vfio_cxl_core.c    | 647 ++++++++++++++++++++++++++++
>>  drivers/vfio/pci/vfio_pci_config.c  |  10 +
>>  drivers/vfio/pci/vfio_pci_core.c    |  79 +++-
>>  drivers/vfio/pci/vfio_pci_rdwr.c    |   8 +-
>>  include/linux/cxl_accel_mem.h       |   3 +
>>  include/linux/cxl_accel_pci.h       |   6 +
>>  include/linux/vfio_pci_core.h       |  53 +++
>>  include/uapi/linux/vfio.h           |  14 +
>>  18 files changed, 992 insertions(+), 32 deletions(-)
>>  create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
>>  create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
>>  create mode 100644 drivers/vfio/pci/cxl-accel/main.c
>>  create mode 100644 drivers/vfio/pci/vfio_cxl_core.c
>>
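
P.S. Here is the rough, purely illustrative sketch of the two paths
mentioned above, in case it helps the discussion. The "my_accel" driver,
the MY_ACCEL_CTRL offset and the way the hooks are wired are made up;
only the vfio_pci_core_*() helpers are the existing vfio-pci-core API,
and the CXL-aware entry points of this series may look different.

#include <linux/module.h>
#include <linux/pci.h>
#include <linux/vfio_pci_core.h>

/* Hypothetical vendor register in a BAR that the guest pokes at runtime. */
#define MY_ACCEL_CTRL 0x100

struct my_accel {
        struct vfio_pci_core_device core;
        bool one_time_setup_done;
};

/* Path 1: one-time on/off state, programmed once when the device is opened. */
static int my_accel_open_device(struct vfio_device *vdev)
{
        struct my_accel *accel =
                container_of(vdev, struct my_accel, core.vdev);
        int ret;

        ret = vfio_pci_core_enable(&accel->core);
        if (ret)
                return ret;

        /* Vendor-specific, static device state would be set up here. */
        accel->one_time_setup_done = true;

        vfio_pci_core_finish_enable(&accel->core);
        return 0;
}

/*
 * Path 2: runtime state poked by the guest. Register writes trap to QEMU
 * and land here; the variant driver can intercept selected offsets (e.g.
 * MY_ACCEL_CTRL) and forward the rest to the common read/write helper.
 */
static ssize_t my_accel_write(struct vfio_device *vdev,
                              const char __user *buf, size_t count,
                              loff_t *ppos)
{
        /* Intercept MY_ACCEL_CTRL here if needed, then fall back. */
        return vfio_pci_core_write(vdev, buf, count, ppos);
}

static const struct vfio_device_ops my_accel_ops = {
        .name = "my-accel-vfio-cxl",
        .open_device = my_accel_open_device,
        .close_device = vfio_pci_core_close_device,
        .read = vfio_pci_core_read,
        .write = my_accel_write,
        .ioctl = vfio_pci_core_ioctl,
        .mmap = vfio_pci_core_mmap,
};

(Probe/registration boilerplate omitted. A real driver on top of the
vfio-cxl-core would hook its CXL-aware helpers, e.g. the
vfio_cxl_core_{read, write}() introduced in PATCH 7, instead of the
plain vfio-pci-core ones.)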