Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

William Roche <william.roche@xxxxxxxxxx> · Tue, 10 Sep 2024 18:24:11 +0200

On 9/10/24 13:36, David Hildenbrand wrote:

On 10.09.24 12:02, “William Roche wrote:
From: William Roche <william.roche@xxxxxxxxxx>

Hi,

Apologies for the noise; resending as I missed CC'ing the maintainers 
of the
changed files

Hello,

This is a Qemu RFC to introduce the possibility to deal with hardware
memory errors impacting hugetlbfs memory backed VMs. When using
hugetlbfs large pages, any large page location being impacted by an
HW memory error results in poisoning the entire page, suddenly making
a large chunk of the VM memory unusable.

The implemented proposal is simply a memory mapping change when an HW 
error
is reported to Qemu, to transform a hugetlbfs large page into a set of
standard sized pages. The failed large page is unmapped and a set of
standard sized pages are mapped in place.
This mechanism is triggered when a SIGBUS/MCE_MCEERR_Ax signal is 
received
by qemu and the reported location corresponds to a large page.

This gives the possibility to:
- Take advantage of newer hypervisor kernel providing a way to retrieve
still valid data on the impacted hugetlbfs poisoned large page.
If the backend file is MAP_SHARED, we can copy the valid data into the

Thank you David for this first reaction on this proposal.

How are you dealing with other consumers of the shared memory,
such as vhost-user processes,

In the current proposal, I don't deal with this aspect.
In fact, any other process sharing the changed memory will
continue to map the poisoned large page. So any access to
this page will generate a SIGBUS to this other process.

In this situation vhost-user processes should continue to receive
SIGBUS signals (and probably continue to die because of that).

So I do see a real problem if 2 qemu processes are sharing the
same hugetlbfs segment -- in this case, error recovery should not
occur on this piece of the memory. Maybe dealing with this situation
with "ivshmem" options is doable (marking the shared segment
"not eligible" to hugetlbfs recovery, just like not "share=on"
hugetlbfs entries are not eligible)
-- I need to think about this specific case.

Please let me know if there is a better way to deal with this
shared memory aspect and have a better system reaction.

vm migration whereby RAM is migrated using file content,

Migration doesn't currently work with memory poisoning.
You can give a look at the already integrated following commit:

06152b89db64 migration: prevent migration when VM has poisoned memory

This proposal doesn't change anything on this side.

vfio that might have these pages pinned?

AFAIK even pinned memory can be impacted by memory error and poisoned
by the kernel. Now as I said in the cover letter, I'd like to know if
we should take extra care for IO memory, vfio configured memory buffers...

In general, you cannot simply replace pages by private copies
when somebody else might be relying on these pages to go to
actual guest RAM.

This is correct, but the current proposal is dealing with a specific
shared memory type: poisoned large pages. So any other process mapping
this type of page can't access it without generating a SIGBUS.

It sounds very hacky and incomplete at first.

As you can see, RAS features need to be completed.
And if this proposal is incomplete, what other changes should be
done to complete it ?

I do hope we can discuss this RFC to adapt what is incorrect, or
find a better way to address this situation.

Thanks in advance for your feedback,
William.