RE: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough

"Tian, Kevin" <kevin.tian@xxxxxxxxx> · Thu, 26 Sep 2024 06:55:06 +0000

> From: Zhi Wang <zhiw@xxxxxxxxxx>
> Sent: Tuesday, September 24, 2024 4:30 PM
> 
> On 23/09/2024 11.00, Tian, Kevin wrote:
> >
> >> From: Zhi Wang <zhiw@xxxxxxxxxx>
> >> Sent: Saturday, September 21, 2024 6:35 AM
> >>
> > [...]
> >> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
> >> (Device PA) needs to be created to access the device memory directly. HDM
> >> decoders in the CXL topology need to be configured level by level to
> >> manage the mapping. After the region is created, it needs to be mapped to
> >> GPA in the virtual HDM decoders configured by the VM.
> >
> > Any time a new address space is introduced it's worth giving more
> > context to help people who have no CXL background better understand
> > the mechanism and think about any potential holes.
> >
> > At a glance it looks like we are talking about a mapping tier:
> >
> >    GPA->HPA->DPA
> >
> > The location/size of HPA/DPA for a cxl region are decided and mapped
> > at @open_device and the HPA range is mapped to GPA at @mmap.
> >
> > In addition the guest also manages a virtual HDM decoder:
> >
> >    GPA->vDPA
> >
> > Ideally the vDPA range selected by the guest is a subset of the physical
> > cxl region, so based on the offset and vHDM the VMM can figure out
> > which offset in the cxl region is to be mmapped for the corresponding
> > GPA (which in the end maps to the desired DPA).
> >
> > Is this understanding correct?
> >
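The chain summarized above can be written down as plain address arithmetic.
This is only a minimal sketch, assuming linear (non-interleaved) decoders and
assuming the guest's vDPA uses the same numbering as the physical DPA. The
struct and function names are hypothetical and are not taken from the patch
set or from the kernel CXL core.

```c
#include <stdint.h>

/* Hypothetical: what the guest programs into a virtual HDM decoder. */
struct vhdm_decoder {
	uint64_t gpa_base;   /* start of the GPA window */
	uint64_t size;       /* length of the window */
	uint64_t vdpa_base;  /* vDPA that gpa_base decodes to */
};

/* Hypothetical: the host CXL region set up at @open_device. */
struct cxl_region {
	uint64_t hpa_base;   /* start of the HPA range */
	uint64_t dpa_base;   /* DPA that hpa_base decodes to */
	uint64_t size;       /* length of the region */
};

/*
 * GPA -> vDPA -> offset into the host region.
 * Returns 0 and stores the offset on success, -1 if the GPA is outside
 * the vHDM window or the resulting vDPA is outside the host region.
 */
static inline int gpa_to_region_offset(const struct vhdm_decoder *d,
				       const struct cxl_region *r,
				       uint64_t gpa, uint64_t *off)
{
	uint64_t vdpa;

	if (gpa < d->gpa_base || gpa - d->gpa_base >= d->size)
		return -1;
	vdpa = d->vdpa_base + (gpa - d->gpa_base);

	/* Assumption: vDPA is a subset of the region's physical DPA range. */
	if (vdpa < r->dpa_base || vdpa - r->dpa_base >= r->size)
		return -1;

	*off = vdpa - r->dpa_base;
	return 0;
}

/* Offset into the region -> HPA that the VMM would mmap for that GPA. */
static inline uint64_t region_offset_to_hpa(const struct cxl_region *r,
					    uint64_t off)
{
	return r->hpa_base + off;
}
```

Under these assumptions the VMM would take the offset returned for the start
of a vHDM window and use it as the mmap offset into the VFIO CXL region, so
the GPA ends up backed by the HPA (and therefore the DPA) the guest asked for.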
> 
> Yes. Many thanks for summarizing this. It is a design decision from a
> discussion in the CXL discord channel.
> 
> > btw is one cxl device only allowed to create one region? If multiple
> > regions are possible how will they be exposed to the guest?
> >
> 
> It is not (and shouldn't be) a requirement enforced by the VFIO cxl core.
> It is really requirement-driven. I would like to see which real use cases
> need multiple CXL regions in the host and then pass multiple regions to
> the guest.
> 
> Presumably, the host creates one large CXL region that covers the entire
> DPA, while QEMU can virtually partition it into different regions and
> map them to different virtual CXL regions if QEMU presents multiple HDM
> decoders to the guest.

Non-CXL people have no idea what a region is and how it is associated
with the backing hardware resource, e.g. it's created by software, so
when the virtual CXL device is composed, how is that software-decided
region translated back to a set of virtual CXL hw resources enumerable
by the guest, etc.

In your description, QEMU, as the virtual platform, maps the VFIO CXL
region into different virtual CXL regions. This kind of suggests regions
are created by hw, conflicting with the point of having sw create them.

We need a full picture that connects the relevant knowledge points in CXL
so the proposal can be better reviewed on the VFIO side. 😊

> 
> >>
> >> - CXL reset. The CXL device reset is different from the PCI device reset.
> >> A CXL reset sequence is introduced by the CXL spec.
> >>
> >> - Emulating CXL DVSECs. CXL spec defines a set of DVSEC registers in the
> >> configuration space for device enumeration and device control (e.g.
> >> whether a device is capable of CXL.mem/CXL.cache, and enabling/disabling
> >> a capability). They are owned by the kernel CXL core, and the VM cannot
> >> modify them.
> >
> > any side effect from emulating it purely in software (patch10), e.g. when
> > the guest desired configuration is different from the physical one?
> >
> 
> This needs a summary, and whether mediated passthrough is needed can be
> decided later. In this RFC, the goal is just to prevent the guest from
> modifying pRegs.

Looking forward to that information in a future posting.

> 
> >>
> >> - Emulate CXL MMIO registers. CXL spec defines a set of CXL MMIO registers
> >> that can sit in a PCI BAR. The location of the register groups in the PCI
> >> BAR is indicated by the register locator in the CXL DVSECs. They are also
> >> owned by the kernel CXL core. Some of them need to be emulated.
> >
> > ditto
> >
> >>
> >> In the L2 guest, a dummy CXL device driver is provided to attach to the
> >> virtual pass-thru device.
> >>
> >> The dummy CXL type-2 device driver can successfully be loaded with the
> >> kernel cxl core type2 support, and create a CXL region by requesting the
> >> CXL core to allocate HPA and DPA and configure the HDM decoders.
> >
> > It'd be good to see a real cxl device working to add confidence in
> > the core design.
> 
> To leverage the opportunity of F2F discussion at LPC, I posted this
> patchset to start the discussion and meanwhile offer an environment
> for people to try and hack around. Also, patches are a good base for
> discussion. We will see what we get. :)
> 
> There are devices already available and in progress: AMD's SFC (patches
> are under review), which I think is going to be the first variant driver
> that uses the core. NVIDIA's device is also coming and NVIDIA's variant
> driver is going upstream for sure. Plus this emulated device, I assume
> we will have three in-tree variant drivers talking to the CXL core.
> 

Yeah, this sounds like a great first step!