On 18 Jan 2025, at 18:15, Jiaqi Yan wrote:

<snip>

> MemCycler Benchmarking
> ======================
>
> To follow up on the question from Dave Hansen, “If one motivation for
> this is guest performance, then it would be great to have some data to
> back that up, even if it is worst-case data”, we ran MemCycler in the
> guest and compared its performance when there is an extremely large
> number of memory errors.
>
> The MemCycler benchmark cycles through memory with multiple threads.
> On each iteration, each thread reads the current value, validates it,
> and writes a counter value. The benchmark continuously outputs rates
> indicating the speed at which it is reading and writing 64-bit
> integers, and aggregates the reads and writes of all threads across
> multiple iterations into a single rate (unit: 64-bit words per
> microsecond).
>
> MemCycler runs inside a VM with 80 vCPUs and 640 GB of guest memory.
> The host platform uses Intel Emerald Rapids CPUs (120 physical cores
> in total) and 1.5 TB of DDR5 memory. MemCycler allocates its memory
> with 2M transparent hugepages in the guest, and our in-house VMM backs
> the guest memory with 2M transparent hugepages on the host. The final
> aggregate rate after 60 seconds of runtime is 17,204.69; this is
> referred to as the baseline case.
>
> In the experimental case, the setup is identical to the baseline case,
> except that 25% of the guest memory is split from THP into 4K pages by
> the memory failure recovery triggered by MADV_HWPOISON. I made some
> minor changes in the kernel so that the MADV_HWPOISON-ed pages are
> unpoisoned, and afterwards the in-guest MemCycler is still able to
> read and write its data. The final aggregate rate is 16,355.11, a drop
> of 4.94% compared to the baseline case. When 5% of the guest memory is
> split after MADV_HWPOISON, the final aggregate rate is 16,999.14, a
> drop of 1.20% compared to the baseline case.
>
<snip>
>
> Extensibility: THP SHMEM/TMPFS
> ==============================
>
> The current MFR behavior for THP SHMEM/TMPFS is to split the hugepage
> into raw pages and offline only the raw HWPoison-ed page. In most
> cases the THP is 2M and the raw page size is 4K, so userspace loses
> the “huge” property of the 2M region, but the actual data loss is
> only 4K.

I wonder if a buddy-allocator-like split[1] could help here: split the
THP into 1MB, 512KB, 256KB, ..., and two 4KB pages, so you still have
some mTHPs at the end.

[1] https://lore.kernel.org/linux-mm/20250116211042.741543-1-ziy@xxxxxxxxxx/

Best Regards,
Yan, Zi
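
For reference, a minimal user-space sketch of the read-validate-write
cycle described above. MemCycler itself is an internal benchmark, so
every name and constant here is illustrative, not the real tool:

/* memcycler-sketch.c -- illustrative only; the real MemCycler is an
 * internal benchmark.  Each thread cycles over its own slice of a
 * THP-backed buffer: read the 64-bit value written on the previous
 * pass, validate it, write the current pass's counter, and count
 * accesses so an aggregate rate (64-bit words/usec) can be derived.
 *
 * Build: cc -O2 -pthread memcycler-sketch.c
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>

#define NTHREADS 8
#define WORDS    ((64UL << 20) / sizeof(uint64_t))  /* 64 MB per thread */
#define SECONDS  10

struct worker {
	uint64_t *base;   /* this thread's slice */
	uint64_t ops;     /* 64-bit reads + writes performed */
};

static void *cycle(void *arg)
{
	struct worker *w = arg;
	uint64_t pass = 0;
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		pass++;
		for (uint64_t i = 0; i < WORDS; i++) {
			/* Read and validate the value from the previous pass. */
			if (w->base[i] != pass - 1)
				abort();            /* corruption detected */
			/* Overwrite it with the current counter value. */
			w->base[i] = pass;
		}
		w->ops += 2 * WORDS;        /* one read + one write per word */
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while (now.tv_sec - start.tv_sec < SECONDS);
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	struct worker w[NTHREADS];
	size_t len = NTHREADS * WORDS * sizeof(uint64_t);
	/* Anonymous, zero-filled mapping, hinted toward 2M THP as in the
	 * guest setup described above. */
	uint64_t *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	madvise(buf, len, MADV_HUGEPAGE);

	for (int i = 0; i < NTHREADS; i++) {
		w[i] = (struct worker){ .base = buf + i * WORDS };
		pthread_create(&tid[i], NULL, cycle, &w[i]);
	}
	uint64_t total = 0;
	for (int i = 0; i < NTHREADS; i++) {
		pthread_join(tid[i], NULL);
		total += w[i].ops;
	}
	printf("aggregate rate: %.2f 64-bit words/usec\n",
	       total / (SECONDS * 1e6));
	return 0;
}

The zero-filled anonymous mapping makes the first pass's validation
(expecting 0) come for free.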
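
The injection side of the experiment uses the standard madvise(2) test
interface. A minimal sketch of poisoning a single 4K page inside a 2M
THP (requires CAP_SYS_ADMIN and CONFIG_MEMORY_FAILURE; the unpoisoning
step relied on the custom kernel changes mentioned above and is not
shown):

/* hwpoison-sketch.c -- sketch of the MADV_HWPOISON injection used to
 * split guest THPs in the experiment above.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define THP_SIZE (2UL << 20)            /* 2M THP, as in the setup above */

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);  /* typically 4K */
	char *buf = mmap(NULL, THP_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	madvise(buf, THP_SIZE, MADV_HUGEPAGE);
	memset(buf, 0xaa, THP_SIZE);        /* fault in, ideally as one THP */

	/* Poison one 4K page in the middle of the 2M region.  Memory
	 * failure recovery splits the THP and offlines only that raw
	 * page; the other 511 pages stay mapped, now as 4K pages. */
	if (madvise(buf + THP_SIZE / 2, page, MADV_HWPOISON))
		perror("madvise(MADV_HWPOISON)");

	/* Any later access to the poisoned page raises SIGBUS. */
	return 0;
}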
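
The folio sizes that a buddy-allocator-like split would leave behind
are easy to enumerate. A small sketch of the arithmetic (not the
kernel implementation in [1]): each step halves the folio that still
contains the poisoned page and keeps the clean half as an mTHP.

/* split-sketch.c -- buddy-style split arithmetic for one bad 4K page
 * in a 2M THP.
 */
#include <stdio.h>

int main(void)
{
	unsigned long folio = 2UL << 20;    /* start from a 2M THP */
	unsigned long page  = 4UL << 10;    /* 4K raw page */

	while (folio / 2 > page) {
		folio /= 2;
		/* The clean half survives as an mTHP of this size. */
		printf("keep one %luK folio\n", folio >> 10);
	}
	/* The last 8K folio splits into two raw pages: offline the
	 * HWPoison-ed one, keep its clean buddy. */
	printf("offline one %luK page, keep one %luK page\n",
	       page >> 10, page >> 10);
	return 0;
}

Summing the output: 1M + 512K + 256K + ... + 8K + 4K = 2M - 4K of the
region survives in the largest possible folios, with only the single
poisoned 4K page offlined.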