From: William Roche <william.roche@xxxxxxxxxx>

Apologies for the noise; resending as I missed CC'ing the maintainers of
the changed files.

Hello,

This is a QEMU RFC to introduce the possibility to deal with hardware
memory errors impacting hugetlbfs memory backed VMs. When using
hugetlbfs large pages, any hardware memory error hitting a location
inside a large page results in the entire page being poisoned, suddenly
making a large chunk of the VM memory unusable.

The implemented proposal is simply a memory mapping change when a HW
error is reported to QEMU, to transform a hugetlbfs large page into a
set of standard sized pages. The failed large page is unmapped and a set
of standard sized pages is mapped in its place.
This mechanism is triggered when a SIGBUS/BUS_MCEERR_Ax signal is
received by QEMU and the reported location corresponds to a large page.

This gives the possibility to:

- Take advantage of newer hypervisor kernels that provide a way to
  retrieve the still valid data from the poisoned hugetlbfs large page.
  If the backend file is MAP_SHARED, we can copy the valid data into the
  set of standard sized pages. If an error is returned when accessing a
  location, we consider it poisoned and mark the corresponding standard
  sized memory page as poisoned with a madvise(MADV_HWPOISON) call.
  Hence, the VM can continue to use the valid data that could be
  retrieved.

- Adjust the poison address information.
  When accessing a poisoned location, an older kernel version may only
  provide the address of the beginning of the poisoned large page in
  the associated SIGBUS siginfo data. Pointing to the more accurate
  location that was actually touched allows the VM kernel to trigger
  the right memory error reaction.

A warning is given for hugetlbfs backed memory-regions that are mapped
without the 'share=on' option.
(This warning is also given when using the deprecated "-mem-path"
option.)

The hugetlbfs memory mapping option should look like this
(with XXX replaced with the actual size):
  -object memory-backend-file,id=pc.ram,mem-path=/dev/hugepages,prealloc=on,share=on,size=XXX
  -machine memory-backend=pc.ram

I'm introducing new system/hugetlbfs_ras.[ch] files to separate the
specific code for this feature. It is only compiled on Linux.

Note that we have to be able to mark a valid replacement standard sized
page as "poison". We currently do that by calling
madvise(..., MADV_HWPOISON), but this requires the QEMU process to have
the CAP_SYS_ADMIN privilege. Using userfaultfd instead of madvise() to
mark the pages as poison could remove this constraint, at the cost of
complicating the code with additional thread(s) to service the user
page faults.

It's also worth mentioning the case of IO memory and vfio configured
memory buffers. The QEMU memory remapping (if it succeeds) will not
reconfigure any device IO buffer locations (no dma unmap/remap is
performed), and if a hardware IO is supposed to access (read or write)
a poisoned hugetlbfs page, I would expect it to fail the same way as
before (as its location hasn't been updated to take the new mapping
into account). Can someone confirm this behavior, or tell me what
should be done to deal with this type of memory buffer?

Details:
--------
The following problems had to be considered:

. kvm dealing with memory faults:
 - Address space mapping changes can't be handled in a signal handler
   (mmap is not async-signal-safe, for example).
   We have a separate listener thread (only created when we use
   hugetlbfs) to deal with the mapping changes.
 - If memory is not mapped when accessed, kvm fails with
   (exit_reason: KVM_EXIT_UNKNOWN).
   To avoid that, I needed to prevent accesses to a memory region while
   it is being changed: pausing the VM is used to do so.
 - A fault on a poisoned hugetlbfs large page will report a hardcoded
   page size of 4k (see the kernel kvm_send_hwpoison_signal() function).
   When a SIGBUS is received with a page size indication of 4k, we have
   to check whether the impacted page is actually part of a hugetlbfs
   large page.
 - Asynchronous SIGBUS/BUS_MCEERR_AO signals provide the right page
   size, but the current QEMU code still needs to be changed to take
   this information into account.

. system/physmem needed fixes:
 - When recreating the memory mapping on VM reset, we have to consider
   the size of the impacted memory.
 - In the case of a mapped file, punching a hole is necessary to clean
   the poison.

. Implementation details:
 - A SIGBUS signal received for a large page will trigger the page
   modification, but in order to pause the VM, the signal handlers have
   to terminate. So we return from the SIGBUS signal handler(s) when a
   VM has to be stopped.
   A memory access that generated a SIGBUS/BUS_MCEERR_AR signal before
   the VM pause will be repeated when the VM resumes. If the memory is
   still not accessible (poisoned), the signal will be generated again
   by the hypervisor kernel.
   In the case of an asynchronous SIGBUS/BUS_MCEERR_AO signal, the
   signal is not repeated by the kernel, so it is recorded by QEMU in
   order to be replayed when the VM resumes.
 - Poisoning a memory page with MADV_HWPOISON can generate a SIGBUS when
   called. The listener thread taking care of the memory modification
   needs to deal with this case. To do so, it sets a thread specific
   variable that is recognized by the sigbus handler.

Some questions:
---------------
. Should we take extra care for IO memory and vfio configured memory
  buffers?

. My feature code is enclosed within "ifdef CONFIG_HUGETLBFS_RAS" and is
  only compiled on Linux.
  Should we have a configure option to prevent the introduction of this
  feature in the code (turning off CONFIG_HUGETLBFS_RAS)?

. Should I include the content of my system/hugetlbfs_ras.[ch] files
  into another existing file?

. Should we force 'sharing' when using the "-mem-path" option, instead
  of requiring -object memory-backend-file,share=on,...?

This prototype is scripts/checkpatch.pl clean (except for the
MAINTAINERS update for the 2 added files).
'make check' runs fine on both x86 and ARM.
Unit tests have been done on Intel, AMD and ARM platforms.

William Roche (6):
  accel/kvm: SIGBUS handler should also deal with si_addr_lsb
  accel/kvm: Keep track of the HWPoisonPage sizes
  system/physmem: Remap memory pages on reset based on the page size
  system: Introducing hugetlbfs largepage RAS feature
  system/hugetlb_ras: Handle madvise SIGBUS signal on listener
  system/hugetlb_ras: Replay lost BUS_MCEERR_AO signals on VM resume

 accel/kvm/kvm-all.c      |  24 +-
 accel/stubs/kvm-stub.c   |   4 +-
 include/qemu/osdep.h     |   5 +-
 include/sysemu/kvm.h     |   7 +-
 include/sysemu/kvm_int.h |   3 +-
 meson.build              |   2 +
 system/cpus.c            |  15 +-
 system/hugetlbfs_ras.c   | 645 +++++++++++++++++++++++++++++++++++++++
 system/hugetlbfs_ras.h   |   4 +
 system/meson.build       |   1 +
 system/physmem.c         |  30 ++
 target/arm/kvm.c         |  15 +-
 target/i386/kvm/kvm.c    |  15 +-
 util/oslib-posix.c       |   3 +
 14 files changed, 753 insertions(+), 20 deletions(-)
 create mode 100644 system/hugetlbfs_ras.c
 create mode 100644 system/hugetlbfs_ras.h

--
2.43.5
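For readers unfamiliar with the siginfo fields discussed above, here is
a minimal sketch (not code from this series) of how the granularity of
a memory error SIGBUS can be derived from si_addr and si_addr_lsb. The
PoisonRange type and all helper names are hypothetical and only serve
the illustration; the only real interfaces used are the Linux siginfo_t
fields and the BUS_MCEERR_AR / BUS_MCEERR_AO si_code values.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for this sketch: the memory range that a memory
 * error SIGBUS reports as poisoned. */
typedef struct PoisonRange {
    uintptr_t start;  /* fault address aligned down to the granularity */
    size_t size;      /* reported granularity, i.e. 1 << si_addr_lsb */
} PoisonRange;

/* True for the two si_code values the kernel uses for memory errors:
 * BUS_MCEERR_AR (action required) and BUS_MCEERR_AO (action optional). */
static inline int is_memory_error_sigbus(int signo, int si_code)
{
    return signo == SIGBUS &&
           (si_code == BUS_MCEERR_AR || si_code == BUS_MCEERR_AO);
}

/* si_addr_lsb is the least significant valid bit of si_addr, so the
 * affected range is (1 << lsb) bytes, aligned on that same size. */
static inline PoisonRange poison_range(uintptr_t addr, unsigned int lsb)
{
    PoisonRange r;

    assert(lsb < sizeof(uintptr_t) * 8);
    r.size = (size_t)1 << lsb;
    r.start = addr & ~((uintptr_t)r.size - 1);
    return r;
}

/* Convenience wrapper for use from an SA_SIGINFO handler. */
static inline PoisonRange poison_range_from_siginfo(const siginfo_t *si)
{
    return poison_range((uintptr_t)si->si_addr,
                        (unsigned int)si->si_addr_lsb);
}

/* Number of standard sized pages needed to cover the reported range. */
static inline size_t small_pages_in_range(PoisonRange r,
                                          size_t small_page_size)
{
    return r.size / small_page_size;
}
```

With a 2M hugetlbfs page, si_addr_lsb is 21 and the range covers 512
pages of 4k. As noted in the details above, a fault reported through
kvm carries a hardcoded 4k size, so a range computed this way would
still have to be checked against the page size of the backing memory
region before deciding how much memory to remap.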