> From: Dan Williams <dan.j.williams@xxxxxxxxx>
> On Mon, Nov 2, 2020 at 9:53 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >
> > On 02.11.20 17:17, Vikram Sethi wrote:
> > > Hi David,
> > >> From: David Hildenbrand <david@xxxxxxxxxx>
> > >> On 31.10.20 17:51, Dan Williams wrote:
> > >>> On Sat, Oct 31, 2020 at 3:21 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> > >>>>
> > >>>> On 30.10.20 21:37, Dan Williams wrote:
> > >>>>> On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi <vsethi@xxxxxxxxxx> wrote:
> > >>>>>>
> > >>>>>> Hello,
> > >>>>>>
> > >>>>>> I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2 device
> > >>>>>> coherent memory, aka Host-managed Device Memory (HDM), will work for type 2 CXL
> > >>>>>> devices which are available/plugged in at boot. A type 2 CXL device can simply be
> > >>>>>> thought of as an accelerator with coherent device memory that also has a
> > >>>>>> CXL.cache to cache system memory.
> > >>>>>>
> > >>>>>> One could envision that BIOS/UEFI could expose the HDM in the EFI memory map
> > >>>>>> as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However, at least
> > >>>>>> on some architectures (arm64) EFI conventional memory available at kernel boot
> > >>>>>> cannot be offlined, so this may not be suitable on all architectures.
> > >>>>>
> > >>>>> That seems an odd restriction. Add David, linux-mm, and linux-acpi as
> > >>>>> they might be interested / have comments on this restriction as well.
> > >>>>>
> > >>>>
> > >>>> I am missing some important details.
> > >>>>
> > >>>> a) What happens after offlining? Will the memory be remove_memory()'ed?
> > >>>> Will the device get physically unplugged?
> > >>>>
> > > Not always, IMO. If the device is getting reset, the HDM memory is going to be
> > > unavailable while the device is reset. Offlining the memory around the reset would
> >
> > Ouch, that speaks IMHO completely against exposing it as System RAM by
> > default.
> >

I should have clarified: the memory becomes unavailable on the new "CXL Reset" in
CXL 2.0. FLR does not make device memory unavailable, but there could be devices
that implement CXL Reset but not FLR, as FLR is optional.

> > > be sufficient, but depending on whether the driver had done the add_memory() in
> > > probe, it perhaps would be onerous to have to remove_memory() as well before reset,
> > > and then add it back after reset. I realize you're saying such a procedure
> > > would be abusing the hotplug framework, and we could perhaps require that memory
> > > be removed prior to reset, but it is not clear to me that it *must* be removed for
> > > correctness.
> > >
> > > Another use case for offlining without removing HDM could be around
> > > virtualization/passing an entire device with its memory to a VM. If the device was
> > > being used in the host kernel, and is then unbound and bound to vfio-pci
> > > (vfio-cxl?), would we expect vfio-pci to add_memory_driver_managed()?
> >
> > At least for passing through memory to VMs (via KVM), you don't actually
> > need struct pages / memory exposed to the buddy via
> > add_memory_driver_managed(). Actually, doing that sounds like the wrong
> > approach.
> >
> > E.g., you would "allocate" the memory via devdax/dax_hmat and directly
> > map the resulting device into guest address space. At least that's what
> > some people are doing with

How does memory_failure forwarding to guest work in that case? IIUC it doesn't
without a struct page in the host.
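Just so we are talking about the same flow when I say "in that case": my mental
model of the devdax-into-guest path is roughly the sketch below. This is only an
illustration, not how QEMU actually does it; the device path (/dev/dax0.0), the
size, and the guest physical address are all made up.

/*
 * Sketch: mmap a devdax device in the VMM and hand the mapping to KVM
 * as a guest RAM slot, without onlining the HDM as System RAM in the host.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/kvm.h>

int main(void)
{
	const size_t hdm_size = 2UL << 30;		/* pretend 2G of HDM */

	/* hypothetical devdax device created for the HDM range */
	int daxfd = open("/dev/dax0.0", O_RDWR);
	if (daxfd < 0) { perror("open dax"); return 1; }

	void *hdm = mmap(NULL, hdm_size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, daxfd, 0);
	if (hdm == MAP_FAILED) { perror("mmap dax"); return 1; }

	int kvm = open("/dev/kvm", O_RDWR);
	int vmfd = ioctl(kvm, KVM_CREATE_VM, 0);
	if (vmfd < 0) { perror("KVM_CREATE_VM"); return 1; }

	/* expose the host mapping as guest RAM at an arbitrary example GPA */
	struct kvm_userspace_memory_region region = {
		.slot = 1,
		.guest_phys_addr = 4UL << 30,
		.memory_size = hdm_size,
		.userspace_addr = (uintptr_t)hdm,
	};
	if (ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region) < 0) {
		perror("KVM_SET_USER_MEMORY_REGION");
		return 1;
	}

	/* ... create vCPUs and run the guest as usual ... */
	return 0;
}

Nothing in that path needs the HDM to be exposed to the buddy allocator in the host,
which I take to be the point.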
For normal memory, when a VM consumes poison, the host kernel signals userspace
with SIGBUS and an si_code that says Action Required (BUS_MCEERR_AR), which QEMU
then injects into the guest (roughly the flow sketched at the end of this mail).
IBM had done something like you suggest with coherent GPU memory, and IIUC
memory_failure forwarding to the guest VM does not work there.
kernel: https://lkml.org/lkml/2018/12/20/103
QEMU: https://patchwork.kernel.org/patch/10831455/
I would think we *do want* memory errors to be sent to a VM.

>
> ...and Joao is working to see if the host kernel can skip allocating
> 'struct page' or do it on demand if the guest ever requests host
> kernel services on its memory. Typically it does not, so host 'struct
> page' space for devdax memory ranges goes wasted.

Is memory_failure forwarded to and handled by the guest in that scheme?
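For reference, the host-side handling I describe above looks roughly like the
sketch below in the VMM. This is not actual QEMU code: forward_poison_to_guest()
is a made-up placeholder for the injection step, and a real VMM does more (e.g.
setting the prctl(PR_MCE_KILL) policy and translating the host VA back to a guest
physical address before injecting the error).

/*
 * Sketch: when a vCPU thread consumes poison, memory_failure() in the host
 * kernel delivers SIGBUS with BUS_MCEERR_AR to that thread; the VMM turns
 * it into an architectural memory error for the guest.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void forward_poison_to_guest(void *host_va)
{
	/* placeholder: translate host_va to a guest physical address and
	 * inject a machine check / SEA into the faulting vCPU */
	fprintf(stderr, "poison consumed at host VA %p, notifying guest\n", host_va);
}

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig; (void)ctx;

	switch (si->si_code) {
	case BUS_MCEERR_AR:	/* action required: the access just hit poison */
	case BUS_MCEERR_AO:	/* action optional: asynchronous notification */
		forward_poison_to_guest(si->si_addr);
		break;		/* a real VMM would not simply return for AR */
	default:
		abort();	/* ordinary SIGBUS, not a memory error */
	}
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction = sigbus_handler,
		.sa_flags = SA_SIGINFO,
	};
	sigaction(SIGBUS, &sa, NULL);

	/* ... start vCPU threads; on consumed poison they take the SIGBUS above ... */
	pause();
	return 0;
}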