On 26.10.24 01:27, William Roche wrote:
On 10/23/24 09:30, David Hildenbrand wrote:
On 22.10.24 23:35, “William Roche wrote:
From: William Roche <william.roche@xxxxxxxxxx>
When the VM reboots, a memory reset is performed calling
qemu_ram_remap() on all hwpoisoned pages.
While we take into account the recorded page sizes to repair the
memory locations, a large page also needs to punch a hole in the
backend file to regenerate a usable memory, cleaning the HW
poisoned section. This is mandatory for hugetlbfs case for example.
Signed-off-by: William Roche <william.roche@xxxxxxxxxx>
---
system/physmem.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/system/physmem.c b/system/physmem.c
index 3757428336..3f6024a92d 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2211,6 +2211,14 @@ void qemu_ram_remap(ram_addr_t addr,
ram_addr_t length)
prot = PROT_READ;
prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
if (block->fd >= 0) {
+ if (length > TARGET_PAGE_SIZE &&
fallocate(block->fd,
+ FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ offset + block->fd_offset, length) != 0) {
+ error_report("Could not recreate the file
hole for "
+ "addr: " RAM_ADDR_FMT "@"
RAM_ADDR_FMT "",
+ length, addr);
+ exit(1);
+ }
area = mmap(vaddr, length, prot, flags, block->fd,
offset + block->fd_offset);
} else {
Ah! Just what I commented to patch #3; we should be using
ram_discard_range(). It might be better to avoid the mmap() completely
if ram_discard_range() worked.
Hi!
I think you are referring to ram_block_discard_range() here, as
ram_discard_range() seems to relate to VM migrations, maybe not a VM reset.
Please take a look at the users of ram_block_discard_range(), including
virtio-balloon to completely zap guest memory, so we will get fresh
memory on next access. It takes care of process-private and file-backed
(shared) memory.
Remapping the page is needed to get rid of the poison. So if we want to
avoid the mmap(), we have to shrink the memory address space -- which
can be a real problem if we imagine a VM with 1G large pages for
example. qemu_ram_remap() is used to regenerate the lost memory and the
mmap() call looks mandatory on the reset phase.
Why can't we use ram_block_discard_range() to zap the poisoned page
(unmap from page tables + conditionallydrop from the page cache)? Is
there anything important I am missing?
And as raised, there is the problem with memory preallocation (where
we should fail if it doesn't work) and ram discards being disabled
because something relies on long-term page pinning ...
Yes. Do you suggest that we add a call to qemu_prealloc_mem() for the
remapped area in case of a backend->prealloc being true ?
Yes. Otherwise, with hugetlb, you might run out of hugetlb pages at
runtime and SIGBUS QEMU :(
Or as we are running on posix machines for this piece of code (ifndef
_WIN32) maybe we could simply add a MAP_POPULATE flag to the mmap call
done in qemu_ram_remap() in the case where the backend requires a
'prealloc' ? Can you confirm if this flag could be used on all systems
running this code ?
Please use qemu_prealloc_mem(). MAP_POPULATE has no guarantees, it's
really weird :/ mmap() might succeed even though MAP_POPULATE didn't
work ... and it's problematic with NUMA policies because we essentially
lose (overwrite) them.
And the whole mmap(MAP_FIXED) is an ugly hack. For example, we wouldn't
reset the memory policy we apply in
host_memory_backend_memory_complete() ... that code really needs a
rewrite to do it properly.
Ideally, we'd do something high-level like
if (ram_block_discard_is_disabled()) {
/*
* We cannot safely discard RAM, ... for example we might have
* to remap all guest RAM into vfio after discarding the
* problematic pages ... TODO.
*/
exit(0);
}
/* Throw away the problematic (poisoned) page. *./
if (ram_block_discard_range()) {
/* Conditionally fallback to MAP_FIXED workaround */
...
}
/* If prealloction was requested, we really must re-preallcoate. */
if (prealloc && qemu_prealloc_mem()) {
/* Preallocation failed .... */
exit(0);
}
As you note the last part is tricky. See bwloe.
Unfortunately, I don't know how to get the MEMORY_BACKEND corresponding
to a given memory block. I'm not sure that MEMORY_BACKEND(block->mr) is
a valid way to retrieve the Backend object and its 'prealloc' property
here. Could you please give me a direction here ?
We could add a RAM_PREALLOC flag to hint that this memory has "prealloc"
semantics.
I once had an alternative approach: Similar to ram_block_notify_resize()
we would implement ram_block_notify_remap().
That's where the backend could register and re-apply mmap properties
like NUMA policies (in case we have to fallback to MAP_FIXED) and handle
the preallocation.
So one would implement a ram_block_notify_remap() and maybe indicate if
we had to do MAP_FIXED or if we only discarded the page.
I once had a prototype for that, let me dig ...
I can send a new version using ram_block_discard_range() as you
suggested to replace the direct call to fallocate(), if you think it
would be better.
Please let me know what other enhancement(s) you'd like to see in this
code change.
Something along the lines above. Please let me know if you see problems
with that approach that I am missing.
--
Cheers,
David / dhildenb