On 18 Jan 2025, at 18:15, Jiaqi Yan wrote:

<snip>

> MemCycler Benchmarking
> ======================
>
> To follow up on the question from Dave Hansen, “If one motivation for
> this is guest performance, then it would be great to have some data to
> back that up, even if it is worst-case data”, we ran MemCycler in the
> guest and compared its performance when there is an extremely large
> number of memory errors.
>
> The MemCycler benchmark cycles through memory with multiple threads.
> On each iteration, each thread reads the current value, validates it,
> and writes a counter value. The benchmark continuously outputs rates
> indicating the speed at which it is reading and writing 64-bit
> integers, and aggregates the reads and writes of all threads across
> multiple iterations into a single rate (unit: 64-bit words per
> microsecond).
>
> MemCycler runs inside a VM with 80 vCPUs and 640 GB of guest memory.
> The host platform uses Intel Emerald Rapids CPUs (120 physical cores
> in total) and 1.5 TB of DDR5 memory. MemCycler allocates its memory
> with 2M transparent hugepages in the guest, and our in-house VMM backs
> the guest memory with 2M transparent hugepages on the host. The final
> aggregate rate after 60 seconds of runtime is 17,204.69; this is
> referred to as the baseline case.
>
> In the experimental case, the setup is identical to the baseline case,
> except that 25% of the guest memory is split from THP into 4K pages by
> the memory failure recovery triggered by MADV_HWPOISON. I made some
> minor changes in the kernel so that the MADV_HWPOISON-ed pages are
> unpoisoned, and afterwards the in-guest MemCycler is still able to
> read and write its data. The final aggregate rate is 16,355.11, a drop
> of 4.94% compared to the baseline case. When 5% of the guest memory is
> split after MADV_HWPOISON, the final aggregate rate is 16,999.14, a
> drop of 1.20% compared to the baseline case.
>
<snip>
>
> Extensibility: THP SHMEM/TMPFS
> ==============================
>
> The current MFR behavior for THP SHMEM/TMPFS is to split the hugepage
> into raw pages and offline only the raw HWPoison-ed page. In most
> cases the THP is 2M and the raw page size is 4K, so userspace loses
> the “huge” property of the 2M region, but the actual data loss is
> only 4K.

I wonder if a buddy-allocator-like split[1] could help here: split the
THP into 1MB, 512KB, 256KB, ..., and two 4KB pages, so you still have
some mTHPs at the end.

[1] https://lore.kernel.org/linux-mm/20250116211042.741543-1-ziy@xxxxxxxxxx/

Best Regards,
Yan, Zi
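
For reference, a minimal user-space sketch of the read-validate-write
cycle described above. MemCycler itself is an internal benchmark, so
every name and constant here is illustrative, not the real tool:

/* memcycler-sketch.c -- illustrative only; the real MemCycler is an
 * internal benchmark.  Each thread cycles over its own slice of a
 * THP-backed buffer: read the 64-bit value written on the previous
 * pass, validate it, write the current pass's counter, and count
 * accesses so an aggregate rate (64-bit words/usec) can be derived.
 *
 * Build: cc -O2 -pthread memcycler-sketch.c
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>

#define NTHREADS 8
#define WORDS    ((64UL << 20) / sizeof(uint64_t))  /* 64 MB per thread */
#define SECONDS  10

struct worker {
	uint64_t *base;   /* this thread's slice */
	uint64_t ops;     /* 64-bit reads + writes performed */
};

static void *cycle(void *arg)
{
	struct worker *w = arg;
	uint64_t pass = 0;
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		pass++;
		for (uint64_t i = 0; i < WORDS; i++) {
			/* Read and validate the value from the previous pass. */
			if (w->base[i] != pass - 1)
				abort();            /* corruption detected */
			/* Overwrite it with the current counter value. */
			w->base[i] = pass;
		}
		w->ops += 2 * WORDS;        /* one read + one write per word */
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while (now.tv_sec - start.tv_sec < SECONDS);
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	struct worker w[NTHREADS];
	size_t len = NTHREADS * WORDS * sizeof(uint64_t);
	/* Anonymous, zero-filled mapping, hinted toward 2M THP as in the
	 * guest setup described above. */
	uint64_t *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	madvise(buf, len, MADV_HUGEPAGE);

	for (int i = 0; i < NTHREADS; i++) {
		w[i] = (struct worker){ .base = buf + i * WORDS };
		pthread_create(&tid[i], NULL, cycle, &w[i]);
	}
	uint64_t total = 0;
	for (int i = 0; i < NTHREADS; i++) {
		pthread_join(tid[i], NULL);
		total += w[i].ops;
	}
	printf("aggregate rate: %.2f 64-bit words/usec\n",
	       total / (SECONDS * 1e6));
	return 0;
}

The zero-filled anonymous mapping makes the first pass's validation
(expecting 0) come for free.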
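
The injection side of the experiment uses the standard madvise(2) test
interface. A minimal sketch of poisoning a single 4K page inside a 2M
THP (requires CAP_SYS_ADMIN and CONFIG_MEMORY_FAILURE; the unpoisoning
step relied on the custom kernel changes mentioned above and is not
shown):

/* hwpoison-sketch.c -- sketch of the MADV_HWPOISON injection used to
 * split guest THPs in the experiment above.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define THP_SIZE (2UL << 20)            /* 2M THP, as in the setup above */

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);  /* typically 4K */
	char *buf = mmap(NULL, THP_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	madvise(buf, THP_SIZE, MADV_HUGEPAGE);
	memset(buf, 0xaa, THP_SIZE);        /* fault in, ideally as one THP */

	/* Poison one 4K page in the middle of the 2M region.  Memory
	 * failure recovery splits the THP and offlines only that raw
	 * page; the other 511 pages stay mapped, now as 4K pages. */
	if (madvise(buf + THP_SIZE / 2, page, MADV_HWPOISON))
		perror("madvise(MADV_HWPOISON)");

	/* Any later access to the poisoned page raises SIGBUS. */
	return 0;
}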
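
The folio sizes that a buddy-allocator-like split would leave behind
are easy to enumerate. A small sketch of the arithmetic (not the
kernel implementation in [1]): each step halves the folio that still
contains the poisoned page and keeps the clean half as an mTHP.

/* split-sketch.c -- buddy-style split arithmetic for one bad 4K page
 * in a 2M THP.
 */
#include <stdio.h>

int main(void)
{
	unsigned long folio = 2UL << 20;    /* start from a 2M THP */
	unsigned long page  = 4UL << 10;    /* 4K raw page */

	while (folio / 2 > page) {
		folio /= 2;
		/* The clean half survives as an mTHP of this size. */
		printf("keep one %luK folio\n", folio >> 10);
	}
	/* The last 8K folio splits into two raw pages: offline the
	 * HWPoison-ed one, keep its clean buddy. */
	printf("offline one %luK page, keep one %luK page\n",
	       page >> 10, page >> 10);
	return 0;
}

Summing the output: 1M + 512K + 256K + ... + 8K + 4K = 2M - 4K of the
region survives in the largest possible folios, with only the single
poisoned 4K page offlined.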