On 10/23/24 09:28, David Hildenbrand wrote:
On 22.10.24 23:35, William Roche wrote:
From: William Roche <william.roche@xxxxxxxxxx>
Add the page size information to the hwpoison_page_list elements.
As the kernel doesn't always report the actual poisoned page size,
we derive this size from the backend's real page size.
The recorded page size is then used to adjust the size
and location of the memory hole.
Signed-off-by: William Roche <william.roche@xxxxxxxxxx>
---
accel/kvm/kvm-all.c | 14 ++++++++++----
include/exec/cpu-common.h | 1 +
include/sysemu/kvm.h | 3 ++-
include/sysemu/kvm_int.h | 3 ++-
system/physmem.c | 20 ++++++++++++++++++++
target/arm/kvm.c | 8 ++++++--
target/i386/kvm/kvm.c | 8 ++++++--
7 files changed, 47 insertions(+), 10 deletions(-)
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 2adc4d9c24..40117eefa7 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
*/
typedef struct HWPoisonPage {
ram_addr_t ram_addr;
+ size_t page_size;
QLIST_ENTRY(HWPoisonPage) list;
} HWPoisonPage;
@@ -1278,15 +1279,18 @@ static void kvm_unpoison_all(void *param)
QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
QLIST_REMOVE(page, list);
- qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+ qemu_ram_remap(page->ram_addr, page->page_size);
Can't we just use the page size from the RAMBlock in qemu_ram_remap?
There we lookup the RAMBlock, and all pages in a RAMBlock have the same
size.
Yes, we could use the page size from the RAMBlock in qemu_ram_remap(),
which is called when the VM is resetting. But knowing the size of the
poisoned chunk of memory at the time the poison is created is useful to
trace what is going on, before possibly seeing other pages reported as
poisoned. That's the goal of the 4th patch: to provide this information
as soon as we get it.
It also helps to filter newly reported errors and only create an entry
in the hwpoison_page_list for new large pages.
We could delay the page size retrieval until reset time and present the
information post mortem, but I do think that having the information
earlier is better in this case.
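(For illustration only, a minimal sketch of the lookup suggested above, assuming the existing qemu_get_ram_block() and qemu_ram_pagesize() helpers in physmem.c; this is not the actual patch:)

/* Sketch: derive the backend page size inside qemu_ram_remap() itself,
 * rather than trusting a caller-provided length. */
void qemu_ram_remap(ram_addr_t addr)
{
    RAMBlock *block = qemu_get_ram_block(addr);   /* existing lookup helper */
    size_t page_size = qemu_ram_pagesize(block);  /* backend page size */
    ram_addr_t offset = addr - block->offset;

    /* Cover the whole backend page (e.g. a 2M/1G hugetlb page),
     * not just TARGET_PAGE_SIZE around the poisoned address. */
    offset = QEMU_ALIGN_DOWN(offset, page_size);

    /* ... remap or discard page_size bytes at block->host + offset ... */
}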
I'll note that qemu_ram_remap() is rather stupid and optimized only for
private memory (not shmem etc).
mmap(MAP_FIXED|MAP_SHARED, fd) will give you the same poisoned page from
the pagecache; you'd have to punch a hole instead.
It might be better to use ram_block_discard_range() in the long run.
Memory preallocation + page pinning is tricky, but we could simply bail
out in these cases (preallocation failing, ram discard being disabled).
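(To make the page-cache point concrete, a rough sketch with plain Linux calls; the helper name and the fd/file_offset/page_size parameters are placeholders:)

#define _GNU_SOURCE
#include <fcntl.h>   /* fallocate(), FALLOC_FL_* */
#include <stdio.h>   /* perror() */

/* Re-mmap()ing a MAP_SHARED file-backed range maps the same poisoned
 * page-cache page back in, so the page has to be punched out of the
 * backing file instead, letting the next access fault in a fresh page. */
static int drop_poisoned_file_page(int fd, off_t file_offset, size_t page_size)
{
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  file_offset, page_size) != 0) {
        perror("fallocate(FALLOC_FL_PUNCH_HOLE)");
        return -1;
    }
    return 0;
}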
I see that ram_block_discard_range() adds more control before discarding
the RAM region, and can also call madvise() in addition to the fallocate
punch-hole for standard-sized memory pages. Since the range is supposed
to be recreated, I'm not convinced that these madvise() calls are necessary.
But we can also notice that this function reports the following warning
in all cases of non-shared file backends:
"ram_block_discard_range: Discarding RAM in private file mappings is
possibly dangerous, because it will modify the underlying file and will
affect other users of the file"
Which means that hugetlbfs configurations would see this new cryptic
warning message on reboot if they are impacted by a memory poisoning.
So I would prefer to leave the fallocate call in the qemu_ram_remap()
function. Or would you prefer to enhance the ram_block_discard_range()
code to avoid the message in a reset situation (when called from
qemu_ram_remap())?
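(If the second option were preferred, here is a purely illustrative sketch of what such an enhancement could look like; the 'warn' parameter and the *_internal helper are made up for illustration:)

/* Hypothetical: split ram_block_discard_range() so a reset-time caller such
 * as qemu_ram_remap() can skip the "private file mappings" warning when the
 * range is about to be repopulated anyway. */
static int ram_block_discard_range_internal(RAMBlock *rb, uint64_t start,
                                            size_t length, bool warn)
{
    if (warn && rb->fd >= 0 && !qemu_ram_is_shared(rb)) {
        warn_report_once("ram_block_discard_range: Discarding RAM in "
                         "private file mappings is possibly dangerous...");
    }
    /* ... existing madvise()/fallocate() discard logic ... */
    return 0;
}

int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
{
    return ram_block_discard_range_internal(rb, start, length, true);
}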
qemu_ram_remap() might be problematic with page pinning (vfio) as is in
any way :(
I agree. If qemu_ram_remap() fails, QEMU ends with either abort() or
exit(1). Are you saying that memory pinning could be detected by
ram_block_discard_range(), or maybe by the mmap() call on the impacted
region, making one of them fail? That would be an additional reason to
call ram_block_discard_range() from qemu_ram_remap(). Is that what you
are suggesting?
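(For what it's worth, a rough sketch of the bail-out mentioned above, assuming a discard-based qemu_ram_remap(); ram_block_discard_is_disabled() is the existing check used e.g. when vfio pins memory:)

/* Sketch: refuse to recover if RAM discarding is disabled (e.g. pages are
 * pinned), otherwise discard the whole backend page so that it is faulted
 * back in on the next access. */
if (ram_block_discard_is_disabled()) {
    error_report("cannot recover poisoned memory: RAM discard is disabled "
                 "(pages may be pinned, e.g. by vfio)");
    exit(1);
}
if (ram_block_discard_range(block, offset, qemu_ram_pagesize(block)) != 0) {
    error_report("failed to discard poisoned page");
    exit(1);
}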