Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 10.09.24 12:02, “William Roche wrote:
From: William Roche <william.roche@xxxxxxxxxx>


Hi,


Apologies for the noise; resending as I missed CC'ing the maintainers of the
changed files


Hello,

This is a Qemu RFC to introduce the possibility to deal with hardware
memory errors impacting hugetlbfs memory backed VMs. When using
hugetlbfs large pages, any large page location being impacted by an
HW memory error results in poisoning the entire page, suddenly making
a large chunk of the VM memory unusable.

The implemented proposal is simply a memory mapping change when an HW error
is reported to Qemu, to transform a hugetlbfs large page into a set of
standard sized pages. The failed large page is unmapped and a set of
standard sized pages are mapped in place.
This mechanism is triggered when a SIGBUS/MCE_MCEERR_Ax signal is received
by qemu and the reported location corresponds to a large page.

This gives the possibility to:
- Take advantage of newer hypervisor kernel providing a way to retrieve
still valid data on the impacted hugetlbfs poisoned large page.
If the backend file is MAP_SHARED, we can copy the valid data into the

How are you dealing with other consumers of the shared memory, such as vhost-user processes, vm migration whereby RAM is migrated using file content, vfio that might have these pages pinned?

In general, you cannot simply replace pages by private copies when somebody else might be relying on these pages to go to actual guest RAM.

It sounds very hacky and incomplete at first.


--
Cheers,

David / dhildenb





[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux