Re: [PATCH v2 3/7] accel/kvm: Report the loss of a large memory page

David Hildenbrand <david@xxxxxxxxxx> · Mon, 18 Nov 2024 10:45:25 +0100

Hm, I think we should definitely be including the size in the existing
one. That code was written without huge pages in mind.

Yes we can do that, and get the page size at this level to pass as a
'page_size' argument to kvm_hwpoison_page_add().

It would make the message longer as we will have the extra information
about the large page on all messages when an error impacts a large page.
We could change the messages only when we are dealing with a large page,
so that the standard (4k) case isn't modified.

Right. And likely we should call it "huge page" instead, which is the 
Linux term for anything larger than a single page.
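
As a rough illustration of the "only change the message for huge pages"
idea, here is a minimal sketch. It is not QEMU code: format_poison_msg(),
BASE_PAGE_SIZE and the message wording are all hypothetical, and the
actual formulation is still to be proposed in this thread.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define BASE_PAGE_SIZE 4096u /* assumed standard (4k) page size */

/*
 * Hypothetical helper, not an existing QEMU function: format the
 * hwpoison report. The size is mentioned only when the backing page is
 * larger than a standard page, so the 4k message stays unchanged.
 * Returns the snprintf() result (length of the formatted message).
 */
static int format_poison_msg(char *buf, size_t len,
                             uint64_t paddr, uint64_t page_size)
{
    assert(page_size >= BASE_PAGE_SIZE);

    if (page_size > BASE_PAGE_SIZE) {
        return snprintf(buf, len,
                        "Memory error at address 0x%" PRIx64
                        " (huge page of size %" PRIu64 " KiB)",
                        paddr, page_size / 1024);
    }
    return snprintf(buf, len, "Memory error at address 0x%" PRIx64, paddr);
}
```

The caller would obtain page_size from the memory backend and pass it
down; how that is plumbed into kvm_hwpoison_page_add() is left open here.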

[...]

With the "large page" hint you can highlight that this is special.

Right, we can do it that way. It also gives the impression that we
somehow inject errors on a large range of the memory. Which is not the
case. I'll send a proposal with a different formulation, so that you can
choose.

Makes sense.

On a related note... I think we have a problem. Assume we got a SIGBUS
on a huge page (e.g., somewhere in a 1 GiB page).

We will call kvm_mce_inject(cpu, paddr, code) /
acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)

But where is the size information? :// Won't the VM simply assume that
there was a MCE on a single 4k page starting at paddr?

This is absolutely right!
It's exactly what happens: The VM kernel received the information and
considers that only the impacted page has to be poisoned.
That's also the reason why Qemu repeats the error injections every time
the poisoned large page is accessed (for all other touched 4k pages
located on this "memory hole").
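
To make the mismatch concrete, a small sketch of the address arithmetic
involved. The helper names are hypothetical and the page size is assumed
to be a power of two; nothing here is taken from the QEMU sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: base address of the huge page that contains
 * addr. page_size must be a power of two (e.g. 2 MiB or 1 GiB).
 */
static uint64_t hugepage_base(uint64_t addr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~(page_size - 1);
}

/*
 * Hypothetical helper: true when two faulting addresses fall on the same
 * lost huge page. The guest is told about one 4k page per injection, so
 * each distinct 4k access inside the hole triggers another injection
 * even though, on the host side, it is the same poisoned page.
 */
static bool same_hugepage(uint64_t a, uint64_t b, uint64_t page_size)
{
    return hugepage_base(a, page_size) == hugepage_base(b, page_size);
}
```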

:/

So we always get from Linux the full 1Gig range and always report the 
first 4k page essentially, on any such access, right?

BTW, should we handle duplicates in our poison list?
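
One way duplicate handling could look, as a self-contained sketch only.
The structure, the fixed capacity and poison_list_add() are hypothetical
and do not describe the existing list in accel/kvm; the assumption is
that entries are keyed by the base address of the backing page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define POISON_MAX 64 /* arbitrary capacity for this sketch */

struct poison_list {
    uint64_t base[POISON_MAX]; /* base address of each poisoned page */
    size_t count;
};

/*
 * Hypothetical helper: record the page backing addr. The address is
 * first aligned down to page_size (a power of two), so repeated errors
 * anywhere inside the same huge page map to a single entry.
 * Returns true if a new entry was added, false on duplicate or overflow.
 */
static bool poison_list_add(struct poison_list *pl,
                            uint64_t addr, uint64_t page_size)
{
    uint64_t base;

    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    base = addr & ~(page_size - 1);

    for (size_t i = 0; i < pl->count; i++) {
        if (pl->base[i] == base) {
            return false; /* already recorded */
        }
    }
    if (pl->count == POISON_MAX) {
        return false; /* sketch only: no growth strategy */
    }
    pl->base[pl->count++] = base;
    return true;
}
```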

I'm not sure if we can inject ranges, or if we would have to issue one
MCE per page ... hm, what's your take on this?

I don't know of any size information about a memory error reported by
the hardware. The kernel doesn't seem to expect any such information.
It explains why there is no impact/blast size information provided when
an error is relayed to the VM.

We could take the "memory hole" size into account in Qemu, but repeating
error injections is not going to help a lot either: We'd need to give
the VM some time to deal with an error injection before producing a new
error for the next page etc... In the case (x86 only) where an
asynchronous error is relayed with BUS_MCEERR_AO, we would also have to
repeat the error for all the 4k pages located on the lost large page too.

I had the same thoughts.
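
For a sense of scale, a sketch of what "one injection per 4k page" would
amount to. inject_per_page() and the callback type are hypothetical; the
callback merely stands in for whatever injection path would be used, and
the sketch ignores the pacing problem mentioned above (the VM needs time
to handle each error).

```c
#include <assert.h>
#include <stdint.h>

#define GUEST_PAGE_SIZE 4096ULL /* assumed guest base page size */

/* Hypothetical callback standing in for a single error injection. */
typedef void (*inject_fn)(uint64_t paddr, void *opaque);

/*
 * Hypothetical helper: walk every 4k page of a lost huge page and call
 * inject() once per page. base must be aligned to page_size.
 * Returns the number of injections issued.
 */
static uint64_t inject_per_page(uint64_t base, uint64_t page_size,
                                inject_fn inject, void *opaque)
{
    uint64_t n = 0;

    assert(page_size >= GUEST_PAGE_SIZE);
    assert((base & (page_size - 1)) == 0);

    for (uint64_t off = 0; off < page_size; off += GUEST_PAGE_SIZE) {
        inject(base + off, opaque);
        n++;
    }
    return n;
}

/* Example callback that only counts how often it was called. */
static void count_only(uint64_t paddr, void *opaque)
{
    (void)paddr;
    (*(uint64_t *)opaque)++;
}
```

With a 1 GiB page this loop runs 262144 times, and 512 times for a 2 MiB
page, which is why repeated injections do not look like a practical way
to convey the size of the loss.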

We can see that the Linux kernel has some mechanisms to deal with a
occasional 4k page loss, but a larger blast is very likely to crash the VM
(which is fine).

Right, and that will inevitably happen when we get an MCE on a 1 GiB 
hugetlb page, correct? The whole thing will be inaccessible.

And as a significant part of the memory is no longer
accessible, dealing with the error itself can be impaired and we
increase the risk of losing data, even though most of the memory on the
large page could still be used.

Now if we can recover the 'still valid' memory of the impacted large
page, we can significantly reduce this blast and give a much better
chance to the VM to survive the incident or crash more gracefully.

Right. That cannot be sorted out in user space alone, unfortunately.

I've looked at the project you pointed me to, which is not ready to be
adopted:
https://lore.kernel.org/linux-mm/20240924043924.3562257-2-jiaqiyan@xxxxxxxxxx/T/

Yes, that goes into a better direction, though.

But we see that this large page enhancement is needed, sometimes just
to give a chance to the VM to survive a little longer before being
terminated or moved.
Injecting multiple MCEs or ACPI error records doesn't help, in my opinion.

I suspect that in most cases, when we get an MCE on a 1Gig page in the 
hypervisor, our running Linux guest will soon crash, because it really 
lost 1 Gig of contiguous memory. :(

--
Cheers,

David / dhildenb