On 12.11.24 19:17, William Roche wrote:
On 11/12/24 12:13, David Hildenbrand wrote:
On 07.11.24 11:21, William Roche wrote:
From: William Roche <william.roche@xxxxxxxxxx>
When an entire large page is impacted by an error (hugetlbfs case),
report the size and location of this large memory hole more clearly
by giving a warning message when this page is first hit:
Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST
addr Z
Hm, I wonder if we really want to special-case hugetlb here.
Why not make the warning independent of the underlying page size?
We already have a warning provided by QEMU (in kvm_arch_on_sigbus_vcpu()):
Guest MCE Memory Error at QEMU addr Y and GUEST addr Z of type
BUS_MCEERR_AR/_AO injected
The one I suggest is an additional message provided before the above
message.
Here is an example:
qemu-system-x86_64: warning: Memory error: Loosing a large page (size:
2097152) at QEMU addr 0x7fdd7d400000 and GUEST addr 0x11600000
qemu-system-x86_64: warning: Guest MCE Memory Error at QEMU addr
0x7fdd7d400000 and GUEST addr 0x11600000 of type BUS_MCEERR_AO injected
Hm, I think we should definitely be including the size in the existing
one. That code was written without huge pages in mind.
We should similarly warn in the arm implementation (where I don't see a
similar message yet).
In my opinion, this additional message for the large page case will
help to better understand the probable sudden proliferation of memory
errors that QEMU can report on the impacted range.
Not only will the machine administrator see more easily that a single
memory error had this large impact, it can also help us to better
measure the impact of fixing the large page memory error support in the
field (in the future).
What about extending the existing one to something like
warning: Guest MCE Memory Error at QEMU addr $ADDR and GUEST $PADDR of
type BUS_MCEERR_AO and size $SIZE (large page) injected
With the "large page" hint you can highlight that this is special.
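To illustrate the idea, here is a minimal standalone sketch of how such a message could be built. It is not QEMU code: format_mce_warning() and BASE_PAGE_SIZE are hypothetical names, the 4 KiB base page size is an assumption, and the wording uses "GUEST addr" as in the existing message.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: anything bigger than 4 KiB is reported as a "large page". */
#define BASE_PAGE_SIZE 4096u

/*
 * Hypothetical helper (not part of QEMU): build the extended warning text,
 * always including the size and adding a "(large page)" hint when the
 * backing page is larger than the base page size.
 * Returns the snprintf() result (length of the full message).
 */
int format_mce_warning(char *buf, size_t len,
                       uint64_t qemu_addr, uint64_t guest_paddr,
                       const char *type, uint64_t page_size)
{
    assert(buf != NULL && type != NULL);

    /* The hint is empty for base pages, so the message stays short. */
    const char *hint = page_size > BASE_PAGE_SIZE ? " (large page)" : "";

    return snprintf(buf, len,
                    "Guest MCE Memory Error at QEMU addr 0x%" PRIx64
                    " and GUEST addr 0x%" PRIx64
                    " of type %s and size %" PRIu64 "%s injected",
                    qemu_addr, guest_paddr, type, page_size, hint);
}
```

In real code the resulting string would presumably go through the existing warning path rather than a local buffer.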
On a related note ... I think we have a problem. Assume we got a SIGBUS
on a huge page (e.g., somewhere in a 1 GiB page).
We will call kvm_mce_inject(cpu, paddr, code) /
acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)
But where is the size information? :/ Won't the VM simply assume that
there was an MCE on a single 4k page starting at paddr?
I'm not sure if we can inject ranges, or if we would have to issue one
MCE per page ... hm, what's your take on this?
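To make the "one MCE per page" option concrete, here is a rough standalone sketch. Everything in it is hypothetical: inject_range(), the inject_fn callback and count_injection() are illustrative names, not QEMU APIs, and it assumes 4 KiB guest base pages and that the huge page is naturally aligned in guest physical address space.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the guest tracks poisoned memory in 4 KiB units. */
#define BASE_PAGE_SIZE 4096ULL

/* Hypothetical stand-in for a per-page injection routine. */
typedef void (*inject_fn)(uint64_t paddr, void *opaque);

/*
 * Round paddr down to the start of the backing page and call inject()
 * once per base page in that range. Returns the number of calls made.
 */
uint64_t inject_range(uint64_t paddr, uint64_t backing_page_size,
                      inject_fn inject, void *opaque)
{
    /* The backing page size must be a power of two >= the base page size. */
    assert(backing_page_size >= BASE_PAGE_SIZE);
    assert((backing_page_size & (backing_page_size - 1)) == 0);

    uint64_t start = paddr & ~(backing_page_size - 1);
    uint64_t count = 0;

    for (uint64_t off = 0; off < backing_page_size; off += BASE_PAGE_SIZE) {
        inject(start + off, opaque);
        count++;
    }
    return count;
}

/* Example callback: only records what would have been injected. */
struct inject_stats {
    uint64_t first;
    uint64_t last;
    uint64_t calls;
};

void count_injection(uint64_t paddr, void *opaque)
{
    struct inject_stats *st = opaque;

    if (st->calls == 0) {
        st->first = paddr;
    }
    st->last = paddr;
    st->calls++;
}
```

With these assumptions a 2 MiB page means 512 injections and a 1 GiB page means 262144, which is why being able to inject a range would be preferable if that is possible at all.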
--
Cheers,
David / dhildenb