On 11/12/24 12:07, David Hildenbrand wrote:
On 07.11.24 11:21, William Roche wrote:
From: William Roche <william.roche@xxxxxxxxxx>
We take into account the recorded page sizes to repair the
memory locations, calling ram_block_discard_range() to punch a hole
in the backend file when necessary and regenerate usable memory.
Fall back to unmapping/remapping the memory location(s) if the kernel doesn't
support the madvise calls used by ram_block_discard_range().
Hugetlbfs poison case is also taken into account as a hole punch
with fallocate will reload a new page when first touched.
Signed-off-by: William Roche <william.roche@xxxxxxxxxx>
---
system/physmem.c | 50 +++++++++++++++++++++++++++++-------------------
1 file changed, 30 insertions(+), 20 deletions(-)
diff --git a/system/physmem.c b/system/physmem.c
index 750604d47d..dfea120cc5 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2197,27 +2197,37 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
} else if (xen_enabled()) {
abort();
} else {
- flags = MAP_FIXED;
- flags |= block->flags & RAM_SHARED ?
- MAP_SHARED : MAP_PRIVATE;
- flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
- prot = PROT_READ;
- prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
- if (block->fd >= 0) {
- area = mmap(vaddr, length, prot, flags, block->fd,
- offset + block->fd_offset);
- } else {
- flags |= MAP_ANONYMOUS;
- area = mmap(vaddr, length, prot, flags, -1, 0);
- }
- if (area != vaddr) {
- error_report("Could not remap addr: "
- RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
- length, addr);
- exit(1);
+ if (ram_block_discard_range(block, offset + block->fd_offset,
+ length) != 0) {
+ if (length > TARGET_PAGE_SIZE) {
+ /* punch hole is mandatory on hugetlbfs */
+ error_report("large page recovery failure addr: "
+ RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+ length, addr);
+ exit(1);
+ }
For shared memory we really need it.
Private file-backed is weird ... because we don't know if the shared or
the private page is problematic ... :(
I agree with you, and we have to decide when we should bail out if
ram_block_discard_range() doesn't work.
In my opinion, if discard doesn't work and we are dealing with
file-backed large pages (shared or not), we have to exit, because the
fallocate is mandatory. That is the case with hugetlbfs.
In the non-file-backed case, or the file-backed non-large-page private
case, I think we can trust the mmap() method to put everything
back in place for the VM reset to work as expected.
Are there aspects I'm missing for which mmap + the remap handler is
not sufficient, and where we should also bail out here?
Maybe we should just do:
if (block->fd >= 0) {
/* mmap(MAP_FIXED) cannot reliably zap our problematic page. */
error_report(...);
exit(-1);
}
Or alternatively
if (block->fd >= 0 && qemu_ram_is_shared(block)) {
/* mmap() cannot possibly zap our problematic page. */
error_report(...);
exit(-1);
} else if (block->fd >= 0) {
/*
* MAP_PRIVATE file-backed ... mmap() can only zap the private
* page, not the shared one ... we don't know which one is
* problematic.
*/
warn_report(...);
}
I also agree that any file-backed/shared case should bail out if discard
(fallocate) fails, no matter whether large or standard pages are used.
In the case of file-backed private standard pages, I think that a poison
on the private page can be fixed with a new mmap.
As I see it, there are 2 cases to consider: at the moment the poison
is seen, either the page was dirty (meaning it was a purely private
page), or the page was not dirty, in which case the poisoned non-dirty
page could be replaced with a new copy of the file content.
In both cases, I'd say that the remap should clean up the poison.
So the conditions when discard fails could be something like:
if (block->fd >= 0 && (qemu_ram_is_shared(block) ||
(length > TARGET_PAGE_SIZE))) {
/* punch hole is mandatory, mmap() cannot possibly zap our page */
error_report("%spage recovery failure addr: "
RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
(length > TARGET_PAGE_SIZE) ? "large " : "",
length, addr);
exit(1);
}
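To make the proposed condition easier to review, it can be isolated as a small pure function. This is only a sketch: discard_failure_policy() and its enum are hypothetical names, not QEMU APIs, and the fd, shared, length and target_page_size parameters stand in for block->fd, qemu_ram_is_shared(block), length and TARGET_PAGE_SIZE.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names, for illustration only (not part of QEMU). */
enum discard_failure_action {
    DISCARD_FAIL_FATAL, /* hole punch is mandatory: report and exit */
    DISCARD_FAIL_REMAP, /* fall back to mmap(MAP_FIXED) */
};

/*
 * Decide what to do once ram_block_discard_range() has failed.
 * File-backed memory that is shared, or that uses pages larger than
 * the target page size, cannot be repaired by a plain remap.
 */
enum discard_failure_action
discard_failure_policy(int fd, bool shared, size_t length,
                       size_t target_page_size)
{
    bool file_backed = fd >= 0;
    bool large_page = length > target_page_size;

    if (file_backed && (shared || large_page)) {
        return DISCARD_FAIL_FATAL;
    }
    return DISCARD_FAIL_REMAP;
}
```

With this split, only anonymous memory and file-backed private standard pages reach the mmap() fallback, which matches the two cases described above.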
+ flags = MAP_FIXED;
+ flags |= block->flags & RAM_SHARED ?
+ MAP_SHARED : MAP_PRIVATE;
+ flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
+ prot = PROT_READ;
+ prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
+ if (block->fd >= 0) {
+ area = mmap(vaddr, length, prot, flags, block->fd,
+ offset + block->fd_offset);
+ } else {
+ flags |= MAP_ANONYMOUS;
+ area = mmap(vaddr, length, prot, flags, -1, 0);
+ }
+ if (area != vaddr) {
+ error_report("Could not remap addr: "
+ RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+ length, addr);
+ exit(1);
+ }
+ memory_try_enable_merging(vaddr, length);
+ qemu_ram_setup_dump(vaddr, length);
Can we factor the mmap hack out into a separate helper function to clean
this up a bit?
Sure, I'll do that.
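
As a rough idea of what such a helper could look like, here is a standalone sketch. It is not the follow-up patch: remap_range_sketch() and its parameters are hypothetical, with the shared, noreserve and readonly booleans standing in for the RAM_SHARED, RAM_NORESERVE and RAM_READONLY block flags, and it returns an error instead of calling error_report()/exit().

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS / MAP_NORESERVE on glibc */
#endif
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Hypothetical helper (name and signature are assumptions, not QEMU code):
 * replace the mapping at [vaddr, vaddr + length) with a fresh one.
 * Returns 0 on success, -1 if mmap() did not hand back vaddr.
 */
int remap_range_sketch(void *vaddr, size_t length, int fd, off_t file_offset,
                       bool shared, bool noreserve, bool readonly)
{
    int flags = MAP_FIXED;
    int prot = PROT_READ;
    void *area;

    flags |= shared ? MAP_SHARED : MAP_PRIVATE;
    flags |= noreserve ? MAP_NORESERVE : 0;
    prot |= readonly ? 0 : PROT_WRITE;

    if (fd >= 0) {
        /* File-backed: map the same file range again. */
        area = mmap(vaddr, length, prot, flags, fd, file_offset);
    } else {
        /* Anonymous memory: the kernel provides a fresh zero page. */
        flags |= MAP_ANONYMOUS;
        area = mmap(vaddr, length, prot, flags, -1, 0);
    }
    return area == vaddr ? 0 : -1;
}
```

The caller in qemu_ram_remap() would then presumably keep the error_report()/exit(1) on failure, followed by memory_try_enable_merging() and qemu_ram_setup_dump() as in the patch.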