On Sat, Jan 18, 2025 at 11:15:46PM +0000, Jiaqi Yan wrote:
> In the experimental case, all the setups are identical to the baseline
> case; however, 25% of the guest memory is split from THP to 4K pages
> due to the memory failure recovery triggered by MADV_HWPOISON. I made
> some minor changes in the kernel so that the MADV_HWPOISON-ed pages are
> unpoisoned, and afterwards the in-guest MemCycle is still able to read
> and write its data. The final aggregate rate is 16,355.11, which is a
> decrease of 5.06% compared to the baseline case. When 5% of the guest
> memory is split after MADV_HWPOISON, the final aggregate rate is
> 16,999.14, a drop of 1.20% compared to the baseline case.

I think it was mentioned in one of the calls: this is good data on the
CPU side, but for VMs doing IO, the IO performance is impacted as well.
IOTLB misses on (random) IO, especially with two-dimensional IO paging,
tend to produce a performance curve that drops off a cliff once the
IOTLB is too small for the workload. Specifically, systems seem to be
designed to require a high IOTLB hit rate to maintain their target
performance, and an IOTLB miss is much more expensive than a CPU TLB
miss.

So I would view MemCycle as something of a best-case workload that is
not very sensitive to TLB size. A worst case is a workload that just
fits inside the TLB, where reducing the page sizes pushes it to no
longer fit.

> Per-memfd MFR Policy associates the userspace MFR policy with a memfd
> instance. This approach is promising for the following reasons:
>
> 1. Keeping memory with UE mapped to a process has risks if the process
>    does not do its duty to prevent itself from repeatedly consuming
>    UER. The MFR policy can be associated with a memfd to limit such
>    risk to a particular memory space owned by a particular process
>    that opts in to the policy. This is much preferable to the Global
>    MFR Policy proposed in the initial RFC, which provides no
>    granularity whatsoever.

Yes, very much agree.

> 3. Although MFR policy allows the userspace process to keep memory
>    with UE mapped, eventually these HWPoison-ed folios need to be
>    dealt with by the kernel (e.g. split into the smallest chunk and
>    isolated from future allocation). For memfd, once all references to
>    it are dropped, it is automatically released from userspace, which
>    is a perfect time for the kernel to do its duties to HWPoison-ed
>    folios, if any. This is also a big advantage over the Global MFR
>    Policy, which breaks the kernel's protection of HWPoison-ed folios.

iommufd will hold the memory pinned for the life of the VM; is that OK
for this plan?

> 4. Given memfd's anonymous semantics, we don't need to worry about
>    different threads having different and conflicting MFR policies. It
>    allows a simpler implementation than the Per-VMA MFR Policy in the
>    initial RFC [1].

Your policy is per-memfd, right?

> However, the affected memory will be immediately protected and
> isolated from future use by both kernel and userspace once the owning
> memfd is gone or the memory is truncated. By default
> MFD_MF_KEEP_UE_MAPPED is not set, and the kernel hard-offlines memory
> having UEs. The kernel immediately poisons the folios in both cases.

I'm reading this and thinking that today we don't have any callback
into the iommu to force offline the memory either, so a guest can still
do DMA to it.

> Part 2: When an AS_MF_KEEP_UE_MAPPED memfd is about to be released, or
> when the userspace process truncates a range of memory pages belonging
> to an AS_MF_KEEP_UE_MAPPED memfd:
>
> * When the in-memory file system is evicting the inode corresponding
>   to the memfd, it needs to prepare the HWPoison-ed folios, which are
>   easily identifiable with the PG_HWPOISON flag. This operation is
>   implemented by populate_memfd_hwp_folios and is exported to file
>   systems.
> * After the file system removes all the folios, there is nothing else
>   preventing MFR from dealing with HWPoison-ed folios, so the file
>   system forwards them to MFR.
>   This step is implemented by offline_memfd_hwp_folios and is exported
>   to file systems.

As above, the iommu won't release its refcount after truncate or zap.

> * MFR has been holding refcount(s) on each HWPoison-ed folio. After
>   dropping the refcounts, a HWPoison-ed folio should become free and
>   can be disposed of.

So you have to deal with "should" being "won't" in cases where VFIO is
being used...

> In V2 I can probably offline each folio as it gets removed, instead of
> doing this in batch. The advantage is we can get rid of
> populate_memfd_hwp_folios and the linked list needed to store poisoned
> folios. One way is to insert filemap_offline_hwpoison_folio somewhere
> into folio_batch_release, or into each file system's free_folio
> handler.

That sounds more workable given the above, though we keep getting into
cases where people want to hook free_folio..

> 2. In reaction to a later fault to any part of the HWPoison-ed folio,
>    guest memfd returns KVM_PFN_ERR_HWPOISON, and KVM sends SIGBUS to
>    the VMM. This is good enough for GFNs backed by actual hardware
>    corrupted PFNs, but not ideal for the healthy PFNs "offlined"
>    together with the error PFNs. The userspace MFR policy can be
>    useful if the VMM wants KVM to 1. keep these GFNs mapped in the
>    stage-2 page table; 2. in reaction to a later access to the actual
>    hardware corrupted part of the HWPoison-ed folio, there is going to
>    be a (repeated) poison consumption event, and KVM returns
>    KVM_PFN_ERR_HWPOISON for the actual poisoned PFN.

I feel like the guestmemfd version of this is not about userspace
mappings but about what is communicated to the secure world. If a
normal memfd would leave these pages mapped in the VMA, then I'd think
the guestmemfd version would be to leave the pages mapped to the secure
world?

Keep in mind that guestmemfd is more complex than KVM today, as several
of the secure world implementations share the stage2/EPT translation
between the CPU and IOMMU HW.
So you can't just unmap 1G of memory without completely breaking the
guest.

> This RFC [4] proposes a MFR framework for VFIO device managed
> userspace memory (i.e. memory regions mapped by remap_pfn_range). The
> userspace MFR policy can instruct the device driver to keep all PFNs
> mapped in a VMA (i.e. don't unmap_mapping_range).

Ankit has some patches that cause the MFR framework to send the poison
events for non-struct-page memory to the device driver that owns the
memory.

> * IOCTL to the VFIO Device File. The device driver usually exposes a
>   file-like uAPI to its managed device memory (e.g. a PCI MMIO BAR)
>   directly with the file to the VFIO device. AS_MF_KEEP_UE_MAPPED can
>   be placed in the address_space of the file to the VFIO device. The
>   device driver can implement a specific IOCTL on the VFIO device file
>   for userspace to set AS_MF_KEEP_UE_MAPPED.

I don't think address spaces are involved in the MFR path after Ankit's
patch? The dispatch is done entirely on phys_addr_t. What happens will
be up to the driver that owns the memory.

You could have a VFIO feature that specifies one behavior or the other,
but perhaps VFIO just always keeps things mapped. I don't know.

Jason