Re: [PATCH v2 3/7] accel/kvm: Report the loss of a large memory page

William Roche <william.roche@xxxxxxxxxx> · Tue, 12 Nov 2024 19:17:36 +0100

On 11/12/24 12:13, David Hildenbrand wrote:
On 07.11.24 11:21, William Roche wrote:
From: William Roche <william.roche@xxxxxxxxxx>

When an entire large page is impacted by an error (hugetlbfs case),
report better the size and location of this large memory hole, so
give a warning message when this page is first hit:
Memory error: Loosing a large page (size: X) at QEMU addr Y and GUEST 
addr Z

Hm, I wonder if we really want to special-case hugetlb here.

Why not make the warning independent of the underlying page size?

QEMU already emits a warning (in kvm_arch_on_sigbus_vcpu()):

Guest MCE Memory Error at QEMU addr Y and GUEST addr Z of type 
BUS_MCEERR_AR/_AO injected

The message I am suggesting is an additional one, printed before the
message above.

Here is an example:
qemu-system-x86_64: warning: Memory error: Loosing a large page (size: 
2097152) at QEMU addr 0x7fdd7d400000 and GUEST addr 0x11600000
qemu-system-x86_64: warning: Guest MCE Memory Error at QEMU addr 
0x7fdd7d400000 and GUEST addr 0x11600000 of type BUS_MCEERR_AO injected

In my opinion, this additional large page message helps explain the
probable sudden proliferation of memory errors that QEMU can report on
the impacted range.
Not only will the machine administrator more easily see that a single
memory error had this large an impact, but the message can also help us
measure the impact of fixing large page memory error support in the
field (in the future).
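
To make the numbers in the example above concrete, here is a minimal
sketch (not the code from the patch) of the arithmetic behind the
message. The helper names page_start(), base_pages_lost() and
format_large_page_warning() are hypothetical, and the 4 KiB base page
size is an assumption that matches x86_64. With a 2 MiB hugetlbfs page,
one hardware error takes out 512 base pages, so the guest could report
an error on each of them as it touches the range.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: round an address down to the start of the
 * backing page. page_size must be a power of two.
 */
static uint64_t page_start(uint64_t addr, uint64_t page_size)
{
    return addr & ~(page_size - 1);
}

/*
 * Hypothetical helper: number of base pages (assumed 4 KiB on x86_64)
 * that become unavailable when one large page is lost.
 */
static uint64_t base_pages_lost(uint64_t page_size, uint64_t base_page_size)
{
    return page_size / base_page_size;
}

/*
 * Hypothetical helper: build the warning text in the format quoted in
 * this thread. Returns what snprintf() returns.
 */
static int format_large_page_warning(char *buf, size_t len,
                                     uint64_t page_size,
                                     uint64_t host_addr,
                                     uint64_t guest_addr)
{
    return snprintf(buf, len,
                    "Memory error: Loosing a large page (size: %" PRIu64
                    ") at QEMU addr 0x%" PRIx64
                    " and GUEST addr 0x%" PRIx64,
                    page_size, host_addr, guest_addr);
}
```

Calling format_large_page_warning() with a size of 2097152, host
address 0x7fdd7d400000 and guest address 0x11600000 reproduces the
text of the first warning in the example (without the
"qemu-system-x86_64: warning:" prefix).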

These are some of the reasons why I think this large page specific
message can be useful.