> From: Dan Williams <dan.j.williams@xxxxxxxxx>
> On Mon, Nov 2, 2020 at 9:53 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >
> > On 02.11.20 17:17, Vikram Sethi wrote:
> > > Hi David,
> > >> From: David Hildenbrand <david@xxxxxxxxxx>
> > >> On 31.10.20 17:51, Dan Williams wrote:
> > >>> On Sat, Oct 31, 2020 at 3:21 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> > >>>>
> > >>>> On 30.10.20 21:37, Dan Williams wrote:
> > >>>>> On Wed, Oct 28, 2020 at 4:06 PM Vikram Sethi <vsethi@xxxxxxxxxx> wrote:
> > >>>>>>
> > >>>>>> Hello,
> > >>>>>>
> > >>>>>> I wanted to kick off a discussion on how Linux onlining of CXL [1] type 2 device
> > >>>>>> coherent memory, aka Host-managed Device Memory (HDM), will work for type 2 CXL
> > >>>>>> devices which are available/plugged in at boot. A type 2 CXL device can simply be
> > >>>>>> thought of as an accelerator with coherent device memory that also has a
> > >>>>>> CXL.cache to cache system memory.
> > >>>>>>
> > >>>>>> One could envision that BIOS/UEFI could expose the HDM in the EFI memory map
> > >>>>>> as conventional memory as well as in ACPI SRAT/SLIT/HMAT. However, at least
> > >>>>>> on some architectures (arm64) EFI conventional memory available at kernel boot
> > >>>>>> cannot be offlined, so this may not be suitable on all architectures.
> > >>>>>
> > >>>>> That seems an odd restriction. Add David, linux-mm, and linux-acpi as
> > >>>>> they might be interested / have comments on this restriction as well.
> > >>>>>
> > >>>>
> > >>>> I am missing some important details.
> > >>>>
> > >>>> a) What happens after offlining? Will the memory be remove_memory()'ed?
> > >>>> Will the device get physically unplugged?
> > >>>>
> > > Not always, IMO. If the device is getting reset, the HDM memory is going to be
> > > unavailable while the device is reset. Offlining the memory around the reset would
> >
> > Ouch, that speaks IMHO completely against exposing it as System RAM by
> > default.
> >

I should have clarified: the memory becomes unavailable on the new "CXL Reset" in
CXL 2.0. FLR does not make device memory unavailable, but there could be devices
that implement CXL Reset but not FLR, as FLR is optional.

> > > be sufficient, but depending on whether the driver had done the add_memory() in
> > > probe, it perhaps would be onerous to have to remove_memory() as well before reset,
> > > and then add it back after reset. I realize you're saying such a procedure
> > > would be abusing the hotplug framework, and we could perhaps require that memory
> > > be removed prior to reset, but it is not clear to me that it *must* be removed for
> > > correctness.
> > >
> > > Another use case for offlining without removing HDM could be around
> > > virtualization/passing an entire device with its memory to a VM. If the device was
> > > being used in the host kernel, and is then unbound and bound to vfio-pci
> > > (vfio-cxl?), would we expect vfio-pci to add_memory_driver_managed()?
> >
> > At least for passing through memory to VMs (via KVM), you don't actually
> > need struct pages / memory exposed to the buddy via
> > add_memory_driver_managed(). Actually, doing that sounds like the wrong
> > approach.
> >
> > E.g., you would "allocate" the memory via devdax/dax_hmat and directly
> > map the resulting device into guest address space. At least that's what
> > some people are doing with

How does memory_failure forwarding to guest work in that case? IIUC it doesn't
without a struct page in the host.
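Just so we are talking about the same flow when I say "in that case": my mental
model of the devdax-into-guest path is roughly the sketch below. This is only an
illustration, not how QEMU actually does it; the device path (/dev/dax0.0), the
size, and the guest physical address are all made up.

/*
 * Sketch: mmap a devdax device in the VMM and hand the mapping to KVM
 * as a guest RAM slot, without onlining the HDM as System RAM in the host.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/kvm.h>

int main(void)
{
	const size_t hdm_size = 2UL << 30;		/* pretend 2G of HDM */

	/* hypothetical devdax device created for the HDM range */
	int daxfd = open("/dev/dax0.0", O_RDWR);
	if (daxfd < 0) { perror("open dax"); return 1; }

	void *hdm = mmap(NULL, hdm_size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, daxfd, 0);
	if (hdm == MAP_FAILED) { perror("mmap dax"); return 1; }

	int kvm = open("/dev/kvm", O_RDWR);
	int vmfd = ioctl(kvm, KVM_CREATE_VM, 0);
	if (vmfd < 0) { perror("KVM_CREATE_VM"); return 1; }

	/* expose the host mapping as guest RAM at an arbitrary example GPA */
	struct kvm_userspace_memory_region region = {
		.slot = 1,
		.guest_phys_addr = 4UL << 30,
		.memory_size = hdm_size,
		.userspace_addr = (uintptr_t)hdm,
	};
	if (ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region) < 0) {
		perror("KVM_SET_USER_MEMORY_REGION");
		return 1;
	}

	/* ... create vCPUs and run the guest as usual ... */
	return 0;
}

Nothing in that path needs the HDM to be exposed to the buddy allocator in the host,
which I take to be the point.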
For normal memory, when a VM consumes poison, the host kernel signals userspace
with SIGBUS and an si_code that says Action Required (BUS_MCEERR_AR), which QEMU
then injects into the guest (roughly the flow sketched at the end of this mail).
IBM had done something like you suggest with coherent GPU memory, and IIUC
memory_failure forwarding to the guest VM does not work there.
kernel: https://lkml.org/lkml/2018/12/20/103
QEMU: https://patchwork.kernel.org/patch/10831455/
I would think we *do want* memory errors to be sent to a VM.

>
> ...and Joao is working to see if the host kernel can skip allocating
> 'struct page' or do it on demand if the guest ever requests host
> kernel services on its memory. Typically it does not, so host 'struct
> page' space for devdax memory ranges goes wasted.

Is memory_failure forwarded to and handled by the guest in that scheme?
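For reference, the host-side handling I describe above looks roughly like the
sketch below in the VMM. This is not actual QEMU code: forward_poison_to_guest()
is a made-up placeholder for the injection step, and a real VMM does more (e.g.
setting the prctl(PR_MCE_KILL) policy and translating the host VA back to a guest
physical address before injecting the error).

/*
 * Sketch: when a vCPU thread consumes poison, memory_failure() in the host
 * kernel delivers SIGBUS with BUS_MCEERR_AR to that thread; the VMM turns
 * it into an architectural memory error for the guest.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void forward_poison_to_guest(void *host_va)
{
	/* placeholder: translate host_va to a guest physical address and
	 * inject a machine check / SEA into the faulting vCPU */
	fprintf(stderr, "poison consumed at host VA %p, notifying guest\n", host_va);
}

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig; (void)ctx;

	switch (si->si_code) {
	case BUS_MCEERR_AR:	/* action required: the access just hit poison */
	case BUS_MCEERR_AO:	/* action optional: asynchronous notification */
		forward_poison_to_guest(si->si_addr);
		break;		/* a real VMM would not simply return for AR */
	default:
		abort();	/* ordinary SIGBUS, not a memory error */
	}
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction = sigbus_handler,
		.sa_flags = SA_SIGINFO,
	};
	sigaction(SIGBUS, &sa, NULL);

	/* ... start vCPU threads; on consumed poison they take the SIGBUS above ... */
	pause();
	return 0;
}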