Remapping the page is needed to get rid of the poison. So if we want to
avoid the mmap(), we have to shrink the memory address space -- which
can be a real problem for a VM backed by 1G large pages, for example.
qemu_ram_remap() is used to regenerate the lost memory, and the mmap()
call looks mandatory in the reset phase.
Why can't we use ram_block_discard_range() to zap the poisoned page
(unmap from page tables + conditionally drop from the page cache)? Is
there anything important I am missing?
Or maybe _I'm_ missing something important, but what I understand is that:
need_madvise = (rb->page_size == qemu_real_host_page_size());
ensures that the madvise call in ram_block_discard_range() is not done
in the case of hugepages.
In that case, we need to call mmap() to remap the hugetlbfs large page.
Right, madvise(DONTNEED) works ever since commit 90e7e7f5ef3f ("mm:
enable MADV_DONTNEED for hugetlb mappings").
But as you note, in QEMU we never called madvise(DONTNEED) for hugetlb
as of today. But note that we always have an "fd" with hugetlb, because
we never use mmap(MAP_ANON|MAP_PRIVATE|MAP_HUGETLB) in QEMU.
The weird thing is that if you have a mmap(fd, MAP_PRIVATE) hugetlb
mapping, fallocate(fd, FALLOC_FL_PUNCH_HOLE) will *also* zap any private
pages. So in contrast to "ordinary" memory, the madvise(DONTNEED) is not
required.
(yes, it's very weird)
So the fallocate(fd, FALLOC_FL_PUNCH_HOLE) will zap the hugetlb page and
you will get a fresh one on next fault.
For all the glorious details, see:
https://lore.kernel.org/linux-mm/2ddd0a26-33fd-9cde-3501-f0584bbffefc@xxxxxxxxxx/
As I said in the previous email, recent kernels have started to
implement these calls for hugetlbfs, but I'm not sure that changing the
mechanism of this ram_block_discard_range() function now is appropriate.
Do you agree with that?
The key point is that it works for hugetlb without madvise(DONTNEED),
which is weird :)
Which is also why the introducing kernel change added "Do note that
there is no compelling use case for adding this support.
This was discussed in the RFC [1]. However, adding support makes sense
as it is fairly trivial and brings hugetlb functionality more in line
with 'normal' memory."
[...]
So one would implement a ram_block_notify_remap() and maybe indicate if
we had to do MAP_FIXED or if we only discarded the page.
I once had a prototype for that, let me dig ...
That would be great! Thanks.
Found them:
https://gitlab.com/virtio-mem/qemu/-/commit/f528c861897d1086ae84ea1bcd6a0be43e8fea7d
https://gitlab.com/virtio-mem/qemu/-/commit/c5b0328654def8f168497715409d6364096eb63f
https://gitlab.com/virtio-mem/qemu/-/commit/15e9737907835105c132091ad10f9d0c9c68ea64
But note that I didn't realize back then that the mmap(MAP_FIXED) is the
wrong way to do it, and that we actually have to DONTNEED/PUNCH_HOLE to
do it properly. But to get the preallocation performed by the backend,
it should still be valuable.
Note that I wonder if we can get rid of the mmap(MAP_FIXED) handling
completely: likely we only support Linux with MCE recovery, and
ram_block_discard_range() should do what we need under Linux.
That would make it a lot simpler.
I can send a new version using ram_block_discard_range() as you
suggested to replace the direct call to fallocate(), if you think it
would be better.
Please let me know what other enhancement(s) you'd like to see in this
code change.
Something along the lines above. Please let me know if you see problems
with that approach that I am missing.
Let me check the madvise use on hugetlbfs and if it works as expected,
I'll try to implement a V2 version of the fix proposal integrating a
modified ram_block_discard_range() function.
As discussed, it might all be working. If not, we would have to fix
ram_block_discard_range().
I'll also remove the page size information from the signal handlers
and only keep it in the kvm_hwpoison_page_add() function.
That's good. Especially because there was talk in the last bi-weekly MM
sync [1] about possibly indicating only the actually failed cachelines
in the future, not necessarily the full page.
So relying on that interface to return the actual pagesize would not be
future-proof.
That session was in general very interesting and very relevant for your
work; did you by any chance attend it? If not, we should find you the
recordings, because the idea is to be able to configure to
not-unmap-during-mce, and instead only inform the guest OS about the MCE
(forward it). Which avoids any HGM (high-granularity mapping) issues
completely.
Only during a reboot of the VM will we have to do exactly what is being
done in this series: zap the whole *page* so our fresh OS will see "all
non-faulty" memory.
[1]
https://lkml.kernel.org/r/9242f7cc-6b9d-b807-9079-db0ca81f3c6d@xxxxxxxxxx
I'll investigate how to keep track of the 'prealloc' attribute to
optionally use when remapping the hugepages (on older kernels).
And if you find the prototype code you talked about that would
definitely help :)
Right, the above should help getting that sorted out (but the code is 4
years old, so it won't "just apply").
--
Cheers,
David / dhildenb