Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd

Thanks Jason, your comments are very much appreciated. I replied to
some of them, and need to think more about the others.

On Mon, Jan 20, 2025 at 9:26 AM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> On Sat, Jan 18, 2025 at 11:15:46PM +0000, Jiaqi Yan wrote:
> > In the experimental case, all the setups are identical to the baseline
> > case, however 25% of the guest memory is split from THP to 4K pages due
> > to the memory failure recovery triggered by MADV_HWPOISON. I made some
> > minor changes in the kernel so that the MADV_HWPOISON-ed pages are
> > unpoisoned, and afterwards the in-guest MemCycle is still able to read
> > and write its data. The final aggregate rate is 16,355.11, which is
> > decreased by 5.06% compared to the baseline case. When 5% of the guest
> > memory is split after MADV_HWPOISON, the final aggregate rate is
> > 16,999.14, a drop of 1.20% compared to the baseline case.
>
> I think it was mentioned in one of the calls, but this is good data on
> the CPU side, but for VMs doing IO, the IO performance is impacted
> also. IOTLB miss on (random) IO performance, especially with two
> dimensional IO paging, tends to have a performance curve that drops
> off a cliff once the IOTLB is too small for the workload.
>
> Specifically, systems seem to be designed to require high IOTLB hit
> rate to maintain their target performance and IOTLB miss is much more
> expensive than CPU TLB miss.
>
> So, I would view MemCycle as something of a best case work load that
> is not as sensitive to TLB size. A worst case is a workload that just
> fits inside the TLB and reducing the page sizes pushes it to no longer
> fit.

I think guest IO benchmarking could be valuable (I wasn't able to
conduct it alongside the MemCycler experiment due to resource
constraints), so I have added it to my TODO list. If the numbers come
out really bad, that may affect the default behavior of MFR, or at
least people's preference.

Do you know if there is an existing benchmark like MemCycler that I
can run in a VM, and that exercises two-dimensional IO paging?

>
> > Per-memfd MFR Policy associates the userspace MFR policy with a memfd
> > instance. This approach is promising for the following reasons:
> > 1. Keeping memory with UE mapped to a process has risks if the process
> >    does not do its duty to prevent itself from repeatedly consuming UER.
> >    The MFR policy can be associated with a memfd to limit such risk to a
> >    particular memory space owned by a particular process that opts in
> >    the policy. This is much preferable to the Global MFR Policy
> >    proposed in the initial RFC, which provides no granularity
> >    whatsoever.
>
> Yes, very much agree
>
> > 3. Although MFR policy allows the userspace process to keep memory UE
> >    mapped, eventually these HWPoison-ed folios need to be dealt with by
> >    the kernel (e.g. split into smallest chunk and isolated from
> >    future allocation). For memfd once all references to it are dropped,
> >    it is automatically released from userspace, which is a perfect
> >    timing for the kernel to do its duties to HWPoison-ed folios if any.
> >    This is also a big advantage over the Global MFR Policy, which breaks
> >    the kernel’s protection of HWPoison-ed folios.
>
> iommufd will hold the memory pinned for the life of the VM, is that OK
> for this plan?

On its face, pinned memory (i.e. folio_maybe_dma_pinned() returning
true) by definition should not be offlined / reclaimed in many places,
including the handling of HWPoison at any stage.

>
> > 4. Given memfd’s anonymous semantic, we don’t need to worry about that
> >    different threads can have different and conflicting MFR policies. It
> >    allows a simpler implementation than the Per-VMA MFR Policy in the
> >    initial RFC [1].
>
> Your policy is per-memfd right?

Yes, basically per-memfd, no VMA involved.

>
> > However, the affected memory will be immediately protected and isolated
> > from future use by both kernel and userspace once the owning memfd is
> > gone or the memory is truncated. By default MFD_MF_KEEP_UE_MAPPED is not
> > set, and kernel hard offlines memory having UEs. Kernel immediately
> > poisons the folios for both cases.
>
> I'm reading this and thinking that today we don't have any callback
> into the iommu to force offline the memory either, so a guest can
> still do DMA to it.
>
> > Part2: When a AS_MF_KEEP_UE_MAPPED memfd is about to be released, or
> > when the userspace process truncates a range of memory pages belonging
> > to a AS_MF_KEEP_UE_MAPPED memfd:
> > * When the in-memory file system is evicting the inode corresponding to
> >   the memfd, it needs to prepare the HWPoison-ed folios that are easily
> >   identifiable with the PG_HWPOISON flag. This operation is implemented
> >   by populate_memfd_hwp_folios and is exported to file systems.
> > * After the file system removes all the folios, there is nothing else
> >   preventing MFR from dealing with HWPoison-ed folios, so the file
> >   system forwards them to MFR. This step is implemented by
> >   offline_memfd_hwp_folios and is exported to file systems.
>
> As above, iommu won't release its refcount after truncate or zap.

Due to the pinning behavior, or something else? Then I think not
offlining the HWPoison folio (i.e. a no-op) after truncate is the
expected behavior, or a return of EBUSY from offline_memfd_hwp_folios.
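If pinning is indeed the blocker, the eviction-time helper could skip
pinned folios and report busy. A rough kernel-side sketch, where
offline_memfd_hwp_folios is the helper proposed in this series and the
placement of the pin check is my assumption:

```c
/* Sketch only: dispose of HWPoison-ed folios collected at
 * truncate/evict time, skipping any folio a DMA pin keeps alive. */
static int offline_memfd_hwp_folios(struct list_head *hwp_folios)
{
	struct folio *folio, *next;
	int ret = 0;

	list_for_each_entry_safe(folio, next, hwp_folios, lru) {
		if (folio_maybe_dma_pinned(folio)) {
			/* e.g. iommufd/VFIO still holds the pin: leave
			 * the folio isolated and tell the caller. */
			ret = -EBUSY;
			continue;
		}
		list_del(&folio->lru);
		folio_put(folio); /* drop MFR's refcount; folio can free */
	}
	return ret;
}
```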

Taking HugeTLB as an example, I used to think the folio would be
dissolved into 4k pages after truncate. However, it seems today the
huge folio is just isolated from being freed and reused[*], as if the
hugepage were "leaked" (after truncation, nr_hugepages is not
decreased, but free_hugepages is reduced, as if the exiting process or
mm still held that hugepage).

[*] https://lore.kernel.org/linux-mm/20250119180608.2132296-3-jiaqiyan@xxxxxxxxxx/T/#m54be295de1144eead4ab73c3cc9077b6dd14050f

>
> > * MFR has been holding refcount(s) of each HWPoison-ed folio. After
> >   dropping the refcounts, a HWPoison-ed folio should become free and can
> >   be disposed of.
>
> So you have to deal with "should" being "won't" in cases where VFIO is
> being used...
>
> > In V2 I can probably offline each folio as it gets removed, instead of
> > doing this in batch. The advantage is we can get rid of
> > populate_memfd_hwp_folios and the linked list needed to store poisoned
> > folios. One way is to insert filemap_offline_hwpoison_folio into
> > somewhere in folio_batch_release, or into per file system's free_folio
> > handler.
>
> That sounds more workable given the above, though we keep getting into
> cases where people want to hook free_folio..
>
> > 2. In reaction to a later fault on any part of the HWPoison-ed folio,
> >    guest memfd returns KVM_PFN_ERR_HWPOISON, and KVM sends SIGBUS to the
> >    VMM. This is good enough for GFNs backed by actually corrupted PFNs,
> >    but not ideal for the healthy PFNs “offlined” together with the error
> >    PFNs. The userspace MFR policy can be useful if the VMM wants KVM to
> >    (1) keep these GFNs mapped in the stage-2 page table, and (2) on a
> >    later access to the actually corrupted part of the HWPoison-ed folio,
> >    there is going to be a (repeated) poison consumption event, and KVM
> >    returns KVM_PFN_ERR_HWPOISON for the actually poisoned PFN.
>
> I feel like the guestmemfd version of this is not about userspace
> mappings but about what is communicated to the secure world.
>
> If normal memfd would leave these pages mapped to the VMA then I'd
> think the guestmemfd version would be to leave the pages mapped to the
> secure world?
>
> Keep in mind that guestmemfd is more complex than kvm today as several
> of the secure world implementations are sharing the stage2/ept
> translation between CPU and IOMMU HW. So you can't just unmap 1G of
> memory without completely breaking the guest.
>
> > This RFC [4] proposes a MFR framework for VFIO device managed userspace
> > memory (i.e. memory regions mapped by remap_pfn_region). The userspace
> > MFR policy can instruct the device driver to keep all PFN mapped in a
> > VMA (i.e. don’t unmap_mapping_range).
>
> Ankit has some patches that cause the MFR framework to send the
> poison events for non-struct page memory to the device driver that
> owns the memory.

But it seems the driver itself is not yet in charge of deciding
whether to unmap; instead, the MFR framework makes the call. I think
it is probably fine for the MFR framework to continue making the call,
just with one piece of new info:
mapping_mf_keep_ue_mapped/AS_MF_KEEP_UE_MAPPED (in RFC PATCH 1/3).
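For instance, at the point where memory_failure() would normally
unmap, the framework could consult the new address_space flag.
mapping_mf_keep_ue_mapped is the helper from RFC PATCH 1/3; where
exactly the check sits, and the MF_IGNORED disposition, are my
assumptions in this sketch:

```c
/* Sketch: in the unmap stage of memory_failure(), skip
 * unmap_mapping_range() when the owning mapping opted in. */
struct address_space *mapping = folio_mapping(folio);

if (mapping && mapping_mf_keep_ue_mapped(mapping)) {
	/* Policy says keep the UE mapped: the folio is already marked
	 * HWPoison, so count the event but leave user mappings intact. */
	return MF_IGNORED;
}
unmap_mapping_range(mapping, /* offset, len, even_cows */ ...);
```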

>
> > * IOCTL to the VFIO Device File. The device driver usually expose a
> >   file-like uAPI to its managed device memory (e.g. PCI MMIO BAR)
> >   directly with the file to the VFIO device. AS_MF_KEEP_UE_MAPPED can be
> >   placed in the address_space of the file to the VFIO device. Device
> >   driver can implement a specific IOCTL to the VFIO device file for
> >   userspace to set AS_MF_KEEP_UE_MAPPED.
>
> I don't think address spaces are involved in the MFR path after Ankit's
> patch? The dispatch is done entirely on phys_addr_t.

I think, strictly speaking, Ankit's patch is built around
pfn_address_space[*], and the driver does include an address_space
when it registers with core mm. So if the driver wants to be the one
making the call on whether to unmap, it should still be able to
consult AS_MF_KEEP_UE_MAPPED.

[*] https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@xxxxxxxxxx/

>
> What happens will be up to the driver that owns the memory.
>
> You could have a VFIO feature that specifies one behavior or the

Do you mean a new one under the VFIO_DEVICE_FEATURE IOCTL?

> other, but perhaps VFIO just always keeps things mapped. I don't know.

I think proper VFIO guest benchmarking can help tell which behavior is better.

>
> Jason
