From: William Roche <william.roche@xxxxxxxxxx>

Hi David,

Here is an updated description of the patch set:

---
This set of patches fixes several problems with hardware memory errors
impacting hugetlbfs memory backed VMs. When using hugetlbfs large
pages, a HW memory error anywhere in a large page results in poisoning
the entire page, suddenly making a large chunk of the VM memory
unusable.

The main problem that currently exists in Qemu is the lack of backend
file repair before resetting the VM memory, which leaves the impacted
memory silently unusable even after a VM reboot.

In order to fix this issue, we track the page size of the impacted
memory block along with the associated poisoned page location.

Using the size information we also call ram_block_discard_range() to
regenerate the memory on VM reset when running qemu_ram_remap(), so
that poisoned memory backed by a hugetlbfs file is regenerated with a
hole punched in this file. A new page is loaded when the location is
first touched.

In case of a discard failure we fall back to unmapping/remapping the
memory location and resetting the memory settings.

We also have to honor the 'prealloc' attribute even after a successful
discard, so we reapply the memory settings in this case too.

This memory setting is performed by a new remap notification mechanism
calling the host_memory_backend_ram_remapped() function when a region
of a memory block is remapped.

We also issue a message describing the impact of a large page memory
loss. It is only reported once, when the page is poisoned.
---

v1 -> v2:
. I removed the tracking of the lsb size information provided by the
  kernel SIGBUS siginfo, relying only on the RAMBlock page_size
  instead.
. I adapted the 3 patches you pointed me to in order to implement the
  notification mechanism on remap. Thank you for this code! I left
  them authored by you. But I haven't tested whether the policy
  setting works as expected on VM reset, only that the replacement of
  physical memory works.
. I also removed the old memory setting that was kept in
  qemu_ram_remap(), but this small last fix could probably be merged
  with your last commit.

Yesterday I also got the recording of the mm-linux session about the
kernel modification on large page poisoning, and discussed this topic
with a colleague of mine who attended the meeting.

About the -mem-path question you asked me: we have communicated that
this option is deprecated and advise all users to use the following
options instead:

-object memory-backend-file,id=pc.ram,mem-path=/dev/hugepages,prealloc,size=XXX
-machine memory-backend=pc.ram

We could now also ask them to add the share=on attribute, to avoid the
additional message about dangerous discard situations.

This code is scripts/checkpatch.pl clean.
'make check' runs fine on both x86 and Arm.

David Hildenbrand (3):
  numa: Introduce and use ram_block_notify_remap()
  hostmem: Factor out applying settings
  hostmem: Handle remapping of RAM

William Roche (4):
  accel/kvm: Keep track of the HWPoisonPage page_size
  system/physmem: poisoned memory discard on reboot
  accel/kvm: Report the loss of a large memory page
  system/physmem: Memory settings applied on remap notification

 accel/kvm/kvm-all.c       |  17 +++-
 backends/hostmem.c        | 184 +++++++++++++++++++++++---------------
 hw/core/numa.c            |  11 +++
 include/exec/cpu-common.h |   1 +
 include/exec/ramlist.h    |   3 +
 include/sysemu/hostmem.h  |   1 +
 include/sysemu/kvm_int.h  |   4 +-
 system/physmem.c          |  62 ++++++++-----
 target/arm/kvm.c          |   2 +-
 target/i386/kvm/kvm.c     |   2 +-
 10 files changed, 189 insertions(+), 98 deletions(-)

-- 
2.43.5
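
For readers who want to see the "punch a hole, fall back to remap"
idea from the description above in isolation, here is a minimal
standalone sketch. It is not the code from this series: the helper
names discard_or_remap() and demo_discard() are hypothetical, and a
memfd is assumed as a stand-in for a hugetlbfs backing file so that it
can run without reserved huge pages.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* fallocate(), FALLOC_FL_* */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>   /* mmap(), munmap(), memfd_create() */
#include <unistd.h>

/*
 * Hypothetical helper (not a QEMU function): discard the backing pages
 * of [offset, offset + len) of a file mapped MAP_SHARED at 'base'.
 * First try to punch a hole in the file. If that fails, fall back to
 * re-establishing the mapping of the same file range at the same
 * address. Returns 0 on success, -1 on failure.
 */
int discard_or_remap(int fd, void *base, off_t offset, size_t len)
{
    assert(len > 0);

    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, (off_t)len) == 0) {
        return 0;   /* the next access faults in a new, zeroed page */
    }

    void *p = mmap((char *)base + offset, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, fd, offset);
    return p == MAP_FAILED ? -1 : 0;
}

/*
 * Demo on a 4-page memfd (assumption: stands in for the hugetlbfs file).
 * Returns 0 if the discarded page reads back as zero while its
 * neighbours keep their contents, 1 if not, -1 on setup failure.
 */
int demo_discard(void)
{
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    size_t size = 4 * psz;

    int fd = memfd_create("ramblock-demo", 0);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return -1;
    }

    unsigned char *ram = mmap(NULL, size, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (ram == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memset(ram, 0xAA, size);            /* populate every page */

    int ret = -1;
    if (discard_or_remap(fd, ram, (off_t)psz, psz) == 0) {
        /* Page 1 was discarded; pages 0 and 2 must be untouched. */
        int ok = ram[psz] == 0 && ram[0] == 0xAA && ram[2 * psz] == 0xAA;
        ret = ok ? 0 : 1;
    }

    munmap(ram, size);
    close(fd);
    return ret;
}
```

Calling demo_discard() from a small main() on Linux is enough to try
it. Note that the fallback path only re-establishes the mapping; for a
shared file mapping it does not by itself replace pages still present
in the file, which is why the hole punch is attempted first.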