On 10/28/24 17:42, David Hildenbrand wrote:
On 26.10.24 01:27, William Roche wrote:
On 10/23/24 09:28, David Hildenbrand wrote:
On 22.10.24 23:35, William Roche wrote:
From: William Roche <william.roche@xxxxxxxxxx>
Add the page size information to the hwpoison_page_list elements.
As the kernel doesn't always report the actual poisoned page size,
we adjust this size from the backend's real page size.
We take the recorded page size into account to adjust the size
and location of the memory hole.
Signed-off-by: William Roche <william.roche@xxxxxxxxxx>
---
accel/kvm/kvm-all.c | 14 ++++++++++----
include/exec/cpu-common.h | 1 +
include/sysemu/kvm.h | 3 ++-
include/sysemu/kvm_int.h | 3 ++-
system/physmem.c | 20 ++++++++++++++++++++
target/arm/kvm.c | 8 ++++++--
target/i386/kvm/kvm.c | 8 ++++++--
7 files changed, 47 insertions(+), 10 deletions(-)
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 2adc4d9c24..40117eefa7 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
*/
typedef struct HWPoisonPage {
ram_addr_t ram_addr;
+ size_t page_size;
QLIST_ENTRY(HWPoisonPage) list;
} HWPoisonPage;
@@ -1278,15 +1279,18 @@ static void kvm_unpoison_all(void *param)
QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
QLIST_REMOVE(page, list);
- qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+ qemu_ram_remap(page->ram_addr, page->page_size);
Can't we just use the page size from the RAMBlock in qemu_ram_remap?
There we look up the RAMBlock, and all pages in a RAMBlock have the
same size.
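Untested sketch of what I have in mind, inside qemu_ram_remap() (the
lookup loop roughly exists there already; only the alignment to the
block page size would be new):

    RAMBLOCK_FOREACH(block) {
        offset = addr - block->offset;
        if (offset < block->max_length) {
            /* Align the poisoned range to the block's page size, so a
             * 4k report against a hugetlbfs-backed block discards the
             * whole huge page. */
            size_t page_size = qemu_ram_pagesize(block);
            offset = QEMU_ALIGN_DOWN(offset, page_size);
            length = MAX(length, page_size);
            /* ... existing mmap(MAP_FIXED) + flush code over
             *     [offset, offset + length) ... */
        }
    }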
Yes, we could use the page size from the RAMBlock in qemu_ram_remap(),
which is called when the VM is resetting. But I think that knowing the
size of the poisoned chunk of memory at the time the poison is created
is useful, to trace what is going on before possibly seeing other pages
reported as poisoned. That's the goal of the 4th patch: to give that
information as soon as we get it.
It also helps to filter newly reported errors and only create an
entry in the hwpoison_page_list for new large pages.
Now we could delay the page size retrieval until we are resetting and
present the information (post mortem). I do think that having the
information earlier is better in this case.
If it is not required for this patch, then please move the other stuff
to patch #4.
Here, we really only have to discard a large page, whose size we can
derive from the QEMU RAMBlock page size.
Ok, I can remove the first patch, which was created to track the
kernel-provided page size and pass it to kvm_hwpoison_page_add().
Instead, we can deal with the page size at the kvm_hwpoison_page_add()
level, relying not on the kernel-provided info but only on the RAMBlock
page size.
I'll send a new version with this modification.
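Something along these lines for kvm_hwpoison_page_add() (rough sketch
only; the qemu_ram_pagesize_from_addr() helper used to fetch the
backing RAMBlock page size is hypothetical and still to be written):

    void kvm_hwpoison_page_add(ram_addr_t ram_addr)
    {
        HWPoisonPage *page;
        /* hypothetical helper returning the backing RAMBlock page size */
        size_t page_size = qemu_ram_pagesize_from_addr(ram_addr);

        ram_addr = QEMU_ALIGN_DOWN(ram_addr, page_size);

        QLIST_FOREACH(page, &hwpoison_page_list, list) {
            if (page->ram_addr == ram_addr) {
                return; /* this (possibly large) page is already recorded */
            }
        }
        page = g_new(HWPoisonPage, 1);
        page->ram_addr = ram_addr;
        page->page_size = page_size;
        QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
    }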
I'll note that qemu_ram_remap() is rather stupid and optimized only
for private memory (not shmem etc).
mmap(MAP_FIXED|MAP_SHARED, fd) will give you the same poisoned page
from the pagecache; you'd have to punch a hole instead.
It might be better to use ram_block_discard_range() in the long run.
Memory preallocation + page pinning is tricky, but we could simply
bail out in these cases (preallocation failing, ram discard being
disabled).
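To illustrate the MAP_SHARED point, roughly (standalone sketch, names
and error handling illustrative only):

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/mman.h>

    /* With a shared file mapping, re-mmap()ing alone still hands back
     * the poisoned page-cache page; the file range has to be
     * hole-punched so the next fault allocates a fresh page. */
    static int discard_shared_range(int fd, void *host_addr,
                                    off_t file_offset, size_t length)
    {
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      file_offset, length) != 0) {
            return -errno;
        }
        if (mmap(host_addr, length, PROT_READ | PROT_WRITE,
                 MAP_FIXED | MAP_SHARED, fd, file_offset) == MAP_FAILED) {
            return -errno;
        }
        return 0;
    }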
I see that ram_block_discard_range() adds more control before
discarding the RAM region and can also call madvise() in addition to
the fallocate punch hole for standard-sized memory pages. Now, as the
range is supposed to be recreated, I'm not convinced that these
madvise calls are necessary.
They are the proper replacement for the mmap(MAP_FIXED) + fallocate.
That function handles all cases of properly discarding guest RAM.
In the case of hugetlbfs pages, ram_block_discard_range() does the
punch-hole fallocate call (and prints out the warning messages).
The madvise call is only done when (rb->page_size ==
qemu_real_host_page_size()), which isn't true for hugetlbfs.
So need_madvise is false and neither the QEMU_MADV_REMOVE nor the
QEMU_MADV_DONTNEED madvise call is performed.
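Schematically (paraphrasing the ram_block_discard_range() code in
system/physmem.c, not quoting it verbatim):

    need_madvise   = (rb->page_size == qemu_real_host_page_size());
    need_fallocate = (rb->fd != -1);

    if (need_fallocate) {
        /* hugetlbfs/shmem backed: punch a hole in the backing file;
         * this path is also where the warning below gets printed */
        ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                        start, length);
    }
    if (need_madvise) {
        /* never reached for hugetlbfs, whose page size differs from
         * the host page size */
        if (qemu_ram_is_shared(rb) && rb->fd < 0) {
            ret = madvise(host_startaddr, length, QEMU_MADV_REMOVE);
        } else {
            ret = madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
        }
    }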
But we can also notice that this function reports the following
warning for all non-shared file backends:
"ram_block_discard_range: Discarding RAM in private file mappings is
possibly dangerous, because it will modify the underlying file and
will affect other users of the file"
Yes, because it's a clear warning sign that something weird is
happening. You might be throwing away data that some other process might
be relying on.
How are you making QEMU consume hugetlbs?
A classical way to consume (non-shared) hugetlbfs pages is to create a
file that is opened and mmapped by the QEMU instance, and to delete the
filesystem entry so that the resources are released if the QEMU
instance dies. This file is usually not shared.
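For example (illustrative standalone sketch; path and size made up,
error checks omitted):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Classical non-shared hugetlbfs consumption: create a file on a
     * hugetlbfs mount, map it, then unlink it so the huge pages are
     * released even if the process dies unexpectedly. */
    static void *map_guest_ram(size_t ram_size)
    {
        int fd = open("/dev/hugepages/guest-ram", O_CREAT | O_RDWR, 0600);
        void *ram;

        ftruncate(fd, ram_size);
        ram = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE /* not shared */, fd, 0);
        unlink("/dev/hugepages/guest-ram");
        /* fd is kept open so the backing file can still be manipulated */
        return ram;
    }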
We could suppress these warnings, but let's first see how you are able
to trigger it.
The warning is always displayed when such a hugetlbfs-backed VM
impacted by a memory error is rebooted.
I understand the reason why we have this message, but in the classical
hugetlbfs use case this (new) message on reboot is probably too
worrying... But losing memory is already very worrying ;)
Which means that hugetlbfs configurations do see this new cryptic
warning message on reboot if they are impacted by memory poisoning.
So I would prefer to leave the fallocate call in the qemu_ram_remap()
function. Or would you prefer to enhance the ram_block_discard_range()
code to avoid the message in a reset situation (when called from
qemu_ram_remap())?
Please try reusing the mechanism to discard guest RAM instead of open-
coding this. We still have to use mmap(MAP_FIXED) as a backup, but
otherwise this function should mostly do+check what you need.
(minus the warnings, which we might want to report differently / suppress)
If you want, I can start a quick prototype of what it could look like
when using ram_block_discard_range() + ram_block_discard_is_disabled() +
fallback to the existing mmap(MAP_FIXED).
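Roughly this shape (untested; ram_block_discard_range() and
ram_block_discard_is_disabled() exist in QEMU today, the glue around
them is only illustrative):

    /* called for each recorded poisoned range at reset time */
    static void remap_poisoned_range(RAMBlock *block, uint64_t offset,
                                     size_t length)
    {
        if (!ram_block_discard_is_disabled() &&
            ram_block_discard_range(block, offset, length) == 0) {
            return; /* hole punched / range dropped, re-faulted on access */
        }
        /* backup path: the existing anonymous mmap(MAP_FIXED) remap */
        qemu_ram_remap(block->offset + offset, length);
    }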
I just want to note that the reason why need_madvise was used is that
"DONTNEED fails for hugepages but fallocate works on hugepages
and shmem". In fact, MADV_REMOVE support on hugetlbfs only appeared in
kernel v4.3, and MADV_DONTNEED support only appeared in 5.18.
Our QEMU code avoids calling these madvise operations for hugepages, as
it requires:
(rb->page_size == qemu_real_host_page_size())
That's a reason why we have to remap the "hole-punched" section of the
file when using hugepages.
qemu_ram_remap() might be problematic with page pinning (vfio) as it
is, in any case :(
I agree. If qemu_ram_remap() fails, QEMU terminates with either abort()
or exit(1). Are you saying that memory pinning could be detected by
ram_block_discard_range(), or maybe by the mmap call for the impacted
region, making one of them fail? This would be an additional reason to
call ram_block_discard_range() from qemu_ram_remap(). Is that what you
are suggesting?
ram_block_discard_is_disabled() might be the right test. If discarding
is disabled, then rebooting might create an inconsistency with
e.g., vfio, resulting in the issues we know from memory ballooning where
the state vfio sees will be different from the state the guest kernel
sees. It's tricky ... and we'd much rather quit the VM early instead of
corrupting data later :/
Alright. We can verify whether ram_block_discard_is_disabled() is true
and, in that case, exit QEMU with a message instead of trying to
recreate the memory area (as we do otherwise).
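Something like (sketch only, error message wording illustrative):

    if (ram_block_discard_is_disabled()) {
        /* e.g. vfio has pages pinned: discarding would desynchronize
         * the device view from the guest view, so quit early */
        error_report("a poisoned memory page cannot be discarded"
                     " (RAM discard is disabled), exiting");
        exit(1);
    }
    /* otherwise, discard and recreate the poisoned area as before */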