On Thu, Sep 19, 2024 at 06:52:37PM +0200, William Roche wrote:
> Hello David,
>
> I hope my last week email answered your interrogations about:
>     - retrieving the valid data from the lost hugepage
>     - the need of smaller pages to replace a failed large page
>     - the interaction of memory error and VM migration
>     - the non-symmetrical access to a poisoned memory area after a recovery
>       Qemu would be able to continue to access the still valid data
>       location of the formerly poisoned hugepage, but any other entity
>       mapping the large page would not be allowed to use the location.
>
> I understand that this last item _is_ some kind of "inconsistency".
> So if I want to make sure that a "shared" memory region (used for vhost-user
> processes, vfio or ivshmem) is not recovered, how can I identify what
> region(s) of a guest memory could be used for such a shared location ?
> Is there a way for qemu to identify the memory locations that have been
> shared ?

When there's no vIOMMU, I think all guest pages need to be shared.  With a
vIOMMU it depends on what was mapped by the guest drivers, though in most
sane setups the pages can still always be shared, because the guest OS (if
Linux) should normally have iommu=pt to speed up kernel drivers.

>
> Could you please let me know if there is an entry point I should consider ?

IMHO it'll still be more reasonable for this issue to be tackled in the
kernel rather than in userspace, simply because it's a problem shared by
all userspaces rather than by the QEMU process alone.  With that, the
kernel should guarantee consistency across the different processes
accessing these pages, so logically all these complexities are better
handled in the kernel once and for all.

There are indeed difficulties in getting this into hugetlbfs with the mm
community, and this is also not the only effort trying to fix 1G page
poisoning with userspace workarounds, see:

https://lore.kernel.org/r/20240924043924.3562257-1-jiaqiyan@xxxxxxxxxx

My gut feeling is that either hugetlbfs needs to be fixed (with less hope)
or QEMU in general needs to move over to other file systems for consuming
huge pages.  Poisoning is not the only driving force; at the least we also
want to work out postcopy, which has a similar goal, as David said: being
able to map hugetlbfs pages differently.

You may want to have a look at the gmemfd 1G proposal, posted here:

https://lore.kernel.org/r/cover.1726009989.git.ackerleytng@xxxxxxxxxx

We probably need that in one way or another for CoCo, and chances are it
can ultimately support non-CoCo easily with the same interface.  Then 1G
hugetlbfs can be abandoned in QEMU.  It'll also need to tackle the same
challenges as here, whether page poisoning or postcopy, with or without a
QEMU-specific solution, because QEMU is also not the only userspace
hypervisor.

That said, the initial few small patches seem to be standalone small fixes
which may still be good.  So if you think that's the case, you can at least
consider sending them separately without the RFC tag.

Thanks,

-- 
Peter Xu