Re: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough

Zhi Wang <zhiw@xxxxxxxxxx> · Tue, 24 Sep 2024 08:30:17 +0000

On 23/09/2024 11.00, Tian, Kevin wrote:
>> From: Zhi Wang <zhiw@xxxxxxxxxx>
>> Sent: Saturday, September 21, 2024 6:35 AM
>>
> [...]
>> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
>> (Device PA) needs to be created to access the device memory directly. HDM
>> decoders in the CXL topology need to be configured level by level to
>> manage the mapping. After the region is created, it needs to be mapped to
>> GPA in the virtual HDM decoders configured by the VM.
> 
> Any time when a new address space is introduced it's worthy of more
> context to help people who have no CXL background better understand
> the mechanism and think any potential hole.
> 
> At a glance looks we are talking about a mapping tier:
> 
>    GPA->HPA->DPA
> 
> The location/size of HPA/DPA for a cxl region are decided and mapped
> at @open_device and the HPA range is mapped to GPA at @mmap.
> 
> In addition the guest also manages a virtual HDM decoder:
> 
>    GPA->vDPA
> 
> Ideally the vDPA range selected by guest is a subset of the physical
> cxl region so based on offset and vHDM the VMM may figure out
> which offset in the cxl region to be mmaped for the corresponding
> GPA (which in the end maps to the desired DPA).
> 
> Is this understanding correct?
> 

Yes. Many thanks for summarizing this. It is a design decision that came
out of a discussion in the CXL Discord channel.
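
To make the lookup concrete, here is a rough sketch of how a VMM could
derive the mmap offset from a guest-programmed vHDM decoder. It is not
code from the patchset: the struct and function names are hypothetical,
and it assumes a single linear decoder with no interleaving.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical types for illustration only; not from the patchset. */
struct vhdm_decoder {
	uint64_t gpa_base;	/* GPA base programmed by the guest */
	uint64_t size;		/* size of the decoded GPA window */
	uint64_t vdpa_base;	/* vDPA the guest expects at gpa_base */
};

struct cxl_region_info {
	uint64_t hpa_base;	/* HPA base of the host CXL region */
	uint64_t dpa_base;	/* DPA backing hpa_base */
	uint64_t size;		/* size of the region */
};

/*
 * Translate a GPA into the offset within the host CXL region that
 * should be mmap'ed for it. Assumes a linear, non-interleaved mapping.
 * Returns false if the GPA is outside the vHDM window or the resulting
 * vDPA is outside the physical region.
 */
static bool gpa_to_region_offset(const struct vhdm_decoder *d,
				 const struct cxl_region_info *r,
				 uint64_t gpa, uint64_t *offset)
{
	uint64_t vdpa;

	if (gpa < d->gpa_base || gpa - d->gpa_base >= d->size)
		return false;

	/* GPA -> vDPA through the virtual HDM decoder */
	vdpa = d->vdpa_base + (gpa - d->gpa_base);

	/* the vDPA range must be a subset of the physical region */
	if (vdpa < r->dpa_base || vdpa - r->dpa_base >= r->size)
		return false;

	*offset = vdpa - r->dpa_base;
	return true;
}
```

With that offset, the HPA is r->hpa_base + offset and the DPA is
r->dpa_base + offset, which gives the GPA->HPA->DPA chain above.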

> btw is one cxl device only allowed to create one region? If multiple
> regions are possible how will they be exposed to the guest?
>

It is not (and shouldn't be) a restriction enforced by the VFIO CXL
core. It is really requirement-driven. I would like to hear what
real-world use cases need multiple CXL regions in the host and then
pass multiple regions to the guest.

Presumably, the host creates one large CXL region that covers the entire
DPA, while QEMU can virtually partition it into different regions and
map them to different virtual CXL regions if QEMU presents multiple HDM
decoders to the guest.
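
As an illustration of that partitioning, a minimal sketch follows. The
names and the equal-slice policy are made up for the example and are
not what QEMU does today. The 256MB alignment reflects my understanding
of the HDM decoder base/size granularity, so treat it as an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed HDM decoder base/size granularity (256MB). */
#define HDM_ALIGN	(256ULL << 20)

/* Hypothetical descriptor for one virtual region; illustration only. */
struct vregion {
	uint64_t offset;	/* offset into the host CXL region */
	uint64_t size;		/* size of this slice */
};

/*
 * Split a host region of region_size bytes into n equal slices, each
 * rounded down to HDM_ALIGN. Returns the number of slices written to
 * out, or 0 if n is 0 or the region is too small to give every slice
 * at least HDM_ALIGN bytes.
 */
static size_t partition_region(uint64_t region_size, size_t n,
			       struct vregion *out)
{
	uint64_t slice;
	size_t i;

	if (n == 0)
		return 0;

	slice = (region_size / n) & ~(HDM_ALIGN - 1);
	if (slice == 0)
		return 0;

	for (i = 0; i < n; i++) {
		out[i].offset = i * slice;
		out[i].size = slice;
	}

	return n;
}
```

Each slice would then back one virtual HDM decoder presented to the
guest. Any tail left over after rounding down stays unassigned.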

>>
>> - CXL reset. The CXL device reset is different from the PCI device reset.
>> A CXL reset sequence is introduced by the CXL spec.
>>
>> - Emulating CXL DVSECs. CXL spec defines a set of DVSECs registers in the
>> configuration for device enumeration and device control. (E.g. if a device
>> is capable of CXL.mem CXL.cache, enable/disable capability) They are owned
>> by the kernel CXL core, and the VM can not modify them.
> 
> any side effect from emulating it purely in software (patch10), e.g. when
> the guest desired configuration is different from the physical one?
> 

This needs a summary first, and then we can decide whether mediated
pass-through is needed. In this RFC, the goal is just to prevent the
guest from modifying the physical registers (pRegs).

>>
>> - Emulate CXL MMIO registers. CXL spec defines a set of CXL MMIO registers
>> that can sit in a PCI BAR. The location of register groups sit in the PCI
>> BAR is indicated by the register locator in the CXL DVSECs. They are also
>> owned by the kernel CXL core. Some of them need to be emulated.
> 
> ditto
> 
>>
>> In the L2 guest, a dummy CXL device driver is provided to attach to the
>> virtual pass-thru device.
>>
>> The dummy CXL type-2 device driver can successfully be loaded with the
>> kernel cxl core type2 support, create CXL region by requesting the CXL
>> core to allocate HPA and DPA and configure the HDM decoders.
> 
> It'd be good to see a real cxl device working to add confidence on
> the core design.

To take advantage of the F2F discussion at LPC, I posted this patchset
to start the discussion and, meanwhile, to offer an environment for
people to try out and hack on. Patches are also a good base for
discussion. We will see what we get. :)

There are devices already out there and more on the way. AMD's SFC
(patches are under review) will, I think, be the first variant driver
that uses the core. NVIDIA's device is also coming, and NVIDIA's variant
driver is going upstream for sure. Together with this emulated device, I
assume we will have three in-tree variant drivers talking to the CXL
core.

Thanks,
Zhi.