Re: [RFC RESEND 0/6] hugetlbfs largepage RAS project

On 9/12/24 00:07, David Hildenbrand wrote:

Hi again,

This is a Qemu RFC introducing a way to deal with hardware memory
errors impacting hugetlbfs memory backed VMs. When using hugetlbfs
large pages, a hardware memory error anywhere in a large page results
in the kernel poisoning the entire page, suddenly making a large chunk
of the VM memory unusable.

The implemented proposal is simply a memory mapping change when a HW
error is reported to Qemu: a hugetlbfs large page is transformed into
a set of standard sized pages. The failed large page is unmapped and a
set of standard sized pages is mapped in its place.
This mechanism is triggered when a SIGBUS/MCE_MCEERR_Ax signal is
received by qemu and the reported location corresponds to a large page.

One clarifying question: you simply replace the hugetlb page by multiple small pages using mmap(MAP_FIXED).

That's right.

So you

(a) are not able to recover any memory of the original page (as of now)
Once poisoned by the kernel, the original large page is no longer
accessible at all, but the kernel can provide what remains of the
poisoned hugetlbfs page through the backend file (when this file was
mapped MAP_SHARED).

(b) no longer have a hugetlb page and, therefore, possibly a performance
    degradation, relevant in low-latency applications that really care
    about the usage of hugetlb pages.
This is correct.

(c) run into the described inconsistency issues
The inconsistency I agreed on is the case of two qemu processes
sharing a piece of memory (through the ivshmem mechanism), which can
be fixed by disabling recovery for the hugetlbfs segment associated
with ivshmem.

Why is what you propose beneficial over just fallocate(PUNCH_HOLE) the full page and get a fresh, non-poisoned page instead?

Sure, you have to reserve some pages if that ever happens, but what is the big selling point over PUNCH_HOLE + realloc? (sorry if I missed it and it was spelled out)
This project provides an essential capability that cannot be achieved
by replacing a failed large page with another large page: an
uncorrected memory error turns a piece of memory into a loss, and that
loss has to be visible to any user. The kernel granularity for that is
the entire page: it marks the page 'poisoned', making it inaccessible,
no matter what the page size or the size of the lost piece. So
recovering the still-valid area of a large page impacted by a memory
error has to keep track of the lost area, and there is no other way
but to lower the granularity: split the page into smaller pages so
that only the lost area can be marked 'poisoned'.

That's the reason why we can't replace a failed large page with another large page.

We need smaller pages.
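For reference, the PUNCH_HOLE + realloc alternative discussed above
would look roughly like this sketch (the function name is mine, and
this is not code from the series): the whole page becomes clean again,
but, as explained above, all record of which bytes were lost is gone.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

/* Drop the entire failed large page from the backend file; the next
 * access faults in a fresh, non-poisoned hugetlb page (assuming
 * enough free huge pages are reserved).  The trade-off: the guest
 * sees a fully "recovered" page with no trace of the lost area. */
static int punch_failed_hugepage(int backend_fd, off_t page_off,
                                 off_t page_size)
{
    return fallocate(backend_fd,
                     FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     page_off, page_size);
}
```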



This gives the possibility to:
- Take advantage of a newer hypervisor kernel providing a way to
retrieve still-valid data from the impacted, poisoned hugetlbfs large
page.

Reading that again, this shouldn't have to be hypervisor-specific. Really, extracting data from a poisoned hugetlb folio is not a hypervisor problem: the kernel should be able to know which regions are still accessible and could allow ways of reading them, one way or the other.

It could just be a fairly hugetlb-special feature that would replace the poisoned page with a fresh hugetlb page where as much page content as possible has been recovered from the old one.
I totally agree that it should be the kernel's role to split the
page and keep track of the valid and lost pieces. This was an aspect
of the high-granularity-mapping (HGM) project you are referring to.
But HGM is not there yet (and may never be), and currently the only
automatic memory split done by the kernel occurs when using
Transparent Huge Pages (THP). Unfortunately, THP doesn't offer (for
the moment) all the performance and memory optimisation possibilities
that hugetlbfs provides. And that is a large topic I'd prefer not to
get into.


How are you dealing with other consumers of the shared memory,
such as vhost-user processes,


In the current proposal, I don't deal with this aspect.
In fact, any other process sharing the changed memory will
continue to map the poisoned large page. So any access to
this page will generate a SIGBUS to this other process.

In this situation vhost-user processes should continue to receive
SIGBUS signals (and probably continue to die because of that).

That's ... suboptimal. :)
True.


Assume you have a 1 GiB page. The guest OS can happily allocate buffers in there so they can end up in vhost-user and crash that process. Without any warning.
I confess that I don't know how/when and where vhost-user processes get their
shared memory locations.
But I agree that a recovered large page is currently not usable to associate
new shared buffers between qemu and external processes.

Note that previously allocated buffers that could have been located on
this page are marked 'poisoned' (after a memory error) on the
vhost-user process side, the same way they were before this project.
The only difference is that, after a recovered memory error, qemu may
continue to see the recovered address space and use it, but the
receiving side (on vhost-user) will fail when accessing the location.

Can a vhost-user process fail without any warning being reported?
I hope not.

So I do see a real problem if two qemu processes are sharing the
same hugetlbfs segment -- in this case, error recovery should not
occur on this piece of memory. Maybe dealing with this situation
through the "ivshmem" options is doable (marking the shared segment
"not eligible" for hugetlbfs recovery, just as hugetlbfs entries
without "share=on" are not eligible)
-- I need to think about this specific case.

Please let me know if there is a better way to deal with this
shared memory aspect and have a better system reaction.

Not creating the inconsistency in the first place :)
Yes :)
Of course I don't want to introduce any inconsistency situation leading to
a memory corruption.
But if we consider that 'ivshmem' memory is not eligible for recovery,
the entire large page location stays poisoned and there would not be
any inconsistency for this memory component. Other hugetlbfs memory
components would still have the possibility of being partially
recovered, giving the VM a higher chance of not crashing immediately.

vm migration whereby RAM is migrated using file content,


Migration doesn't currently work with memory poisoning.
You can have a look at the following, already-integrated commit:

06152b89db64 migration: prevent migration when VM has poisoned memory

This proposal doesn't change anything on this side.

That commit is fairly fresh and likely missed the option to *not* migrate RAM by reading it, but instead by migrating it through a shared file. For example, VM life-upgrade (CPR) wants to use that (or is already using that), to avoid RAM migration completely.
When a memory error occurs on a dirty page of a mapped file,
the data is lost and the file synchronisation should fail with EIO.
You can't rely on the file content to reflect the latest memory
content, so in my opinion even a migration using such a file should be
avoided.
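The failure mode I describe can be illustrated with a small sketch
(illustrative only, not QEMU code): writeback of a file-backed RAM
region surfaces a poisoned dirty page as EIO.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Flush a MAP_SHARED file-backed RAM region to its file.  If a dirty
 * page was poisoned by a memory error, its data never reaches the
 * file and msync() is expected to fail, so the file content cannot be
 * trusted for a file-based migration. */
static int flush_ram_file(void *ram, size_t len)
{
    if (msync(ram, len, MS_SYNC) != 0) {
        return -errno; /* -EIO expected when a dirty page was lost */
    }
    return 0;
}
```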


vfio that might have these pages pinned?

AFAIK even pinned memory can be impacted by a memory error and
poisoned by the kernel. Now, as I said in the cover letter, I'd like
to know whether we should take extra care of I/O memory,
vfio-configured memory buffers, etc.

Assume your GPU has a hugetlb folio pinned via vfio. As soon as you make the guest RAM point at anything other than what VFIO is aware of, we end up in the same problem we had when we learned that balloon inflation (MADV_DONTNEED) has to be disabled as soon as VFIO pinned pages.

We'd have to inform VFIO that the mapping is now different. Otherwise it's really better to crash the VM than to have your GPU read/write different data than your CPU reads/writes.
Absolutely true, and fortunately this is not what would happen while
the poisoned large page is still in use by VFIO. After a successful
recovery, the CPU may still be able to read/write a location where we
had a vfio buffer, but the other side (the device, for example) will
fail to read or write any location of the poisoned large page.

In general, you cannot simply replace pages by private copies
when somebody else might be relying on these pages to go to
actual guest RAM.

This is correct, but the current proposal is dealing with a specific
shared memory type: poisoned large pages. So any other process mapping
this type of page can't access it without generating a SIGBUS.

Right, and that's the issue. Because, for example, how should the VM be aware that this memory is now special and must not be used for some purposes without leading to problems elsewhere?
That's an excellent question, and I don't have the full answer to it.
We are dealing here with a hardware fault situation; the hugetlbfs
backend file still has the poisoned large page, so any process mapping
it, whether before or after the error, will not be able to use that
segment. It doesn't mean that they get their own private copy of the
page. The only one getting a private copy (to recover what was still
valid on the faulted large page) is qemu. So if we imagine that
ivshmem segments (between two qemu processes) don't get this recovery,
I expect data exchanges on this shared memory to fail, just as they do
without the recovery mechanism. So I don't expect any established
communication to keep working, or any new segment using the recovered
area to be successfully created.

But of course I could be missing something here and be too optimistic...

So let's take a step back.

I guess these "sharing" questions would not relate to memory segments
that are not defined as 'share=on', am I right?

Do ivshmem, vhost-user processes, or even vfio only use 'share=on' memory segments?

If yes, we could also imagine only enabling recovery for hugetlbfs
segments that do not have the 'share=on' attribute, although we would
have to map them MAP_SHARED in the qemu address space anyway. This can
maybe create other kinds of problems (?), but if these inconsistency
questions do not appear with this approach, it would be easy to adapt,
and it would still enhance hugetlbfs use, for a first version of this
feature.

It sounds very hacky and incomplete at first.

As you can see, RAS features still need to be completed.
And if this proposal is incomplete, what other changes should be made
to complete it?

I do hope we can discuss this RFC to adapt what is incorrect, or
find a better way to address this situation.

One long-term goal people are working on is to allow remapping the hugetlb folios in smaller granularity, such that only a single affected PTE can be marked as poisoned. (used to be called high-granularity-mapping)
I look forward to seeing this implemented, but it seems it will take time to appear; in the meantime, enhancing hugetlbfs RAS for qemu would be very useful.

The day a kernel solution works, we can disable CONFIG_HUGETLBFS_RAS
and rely on the kernel to provide the appropriate information. The
first commits will still be necessary (dealing with the si_addr_lsb
value of the SIGBUS siginfo, tracking the page-size information in the
hwpoison_page_list, and the memory remap on reset with the missing
PUNCH_HOLE).
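On the si_addr_lsb point: the kernel reports the poisoned extent in
the SIGBUS siginfo as log2 of its size, which is how qemu can tell
that a whole large page, not just a small page, was lost. A minimal
illustration (the helper name is mine):

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>

/* si_addr_lsb carries log2 of the poisoned extent for BUS_MCEERR_AR /
 * BUS_MCEERR_AO signals: 12 for a 4 KiB page, 21 for a 2 MiB hugetlb
 * page, 30 for a 1 GiB one. */
static size_t poisoned_extent(const siginfo_t *si)
{
    return (size_t)1 << si->si_addr_lsb;
}
```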

However, at the same time, the focus seems to be shifting towards using guest_memfd instead of hugetlb, once it supports 1 GiB pages and shared memory. It will likely be easier to support mapping 1 GiB pages using PTEs that way, and there are ongoing discussions about how that can be achieved more easily.

There are also discussions [1] about not poisoning the mappings at all and handling it differently. But I haven't yet digested what exactly that could look like in reality.


[1] https://lkml.kernel.org/r/20240828234958.GE3773488@xxxxxxxxxx

Thank you very much for this pointer. I hope a kernel solution (this
one or another) can be implemented and widely adopted within the next
5 to 10 years ;)

In the meantime, we can try to enhance qemu's use of hugetlbfs for VM
memory, which is more and more widely deployed.

Best regards,
William.




