On 27.01.22 18:55, Mike Kravetz wrote:
> On 1/27/22 03:57, David Hildenbrand wrote:
>> On 13.01.22 19:03, Mike Kravetz wrote:
>>> Userfaultfd selftests for hugetlb do not perform UFFD_EVENT_REMAP
>>> testing. However, mremap support was recently added in commit
>>> 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed
>>> vma"). While attempting to enable mremap support in the test, it was
>>> discovered that the mremap test indirectly depends on MADV_DONTNEED.
>>>
>>> hugetlb does not support MADV_DONTNEED. However, the only thing
>>> preventing support is a check in can_madv_lru_vma(). Simply removing
>>> the check will enable support.
>>>
>>> This is sent as an RFC because there is no existing use case calling
>>> for hugetlb MADV_DONTNEED support except possibly the userfaultfd test.
>>> However, adding support makes sense as it is fairly trivial and brings
>>> hugetlb functionality more in line with 'normal' memory.
>>>
>>
>> Just a note:
>>
>> QEMU doesn't use huge anonymous memory directly (MAP_ANON | MAP_HUGE...)
>> but instead always goes either via hugetlbfs or via memfd.
>>
>> For MAP_PRIVATE hugetlb mappings, fallocate(FALLOC_FL_PUNCH_HOLE) seems
>> to get the job done (IOW: also discards private anon pages). See the
>> comments in the QEMU code below. I remember that this is somewhat
>> inconsistent. For ordinary MAP_PRIVATE mapped files I remember that we
>> always need fallocate(FALLOC_FL_PUNCH_HOLE) + madvise(QEMU_MADV_DONTNEED)
>> to make sure
>>
>> a) All file pages are removed
>> b) All private anon pages are removed
>>
>> IIRC hugetlbfs really is different in that regard, but maybe other fs
>> behave similarly.
>
> Yes, it is really different. And some might even consider that a bug?
> Imagine if those private anon (COW) pages contain important data. They
> could be unmapped/freed by some other process that has write access to
> the hugetlb file on which the private mapping is based.

Right, that's also what I once worried about in the QEMU code. But then
I realized that any kind of fallocate(FALLOC_FL_PUNCH_HOLE) on a file
shared by multiple parties mapped MAP_PRIVATE might be bogus already.

Assume you have a VM running with MAP_SHARED on a file. The file
contains the VM memory state. Assume you pause the VM and want to
convert it into 2 instances that will continue running independently
based on the captured file state. You'd have to MAP_PRIVATE the file
such that both VMs start with the original state and only see their own
modifications.

... but if one process then decides to fallocate(FALLOC_FL_PUNCH_HOLE)
a page that's still shared by both processes (!COW), for example due to
memory ballooning, you'd corrupt the other VM.

So my assumption is that MAP_PRIVATE in combination with
fallocate(FALLOC_FL_PUNCH_HOLE) on a file mapped by more than one
process is just bogus already.

> I believe this same issue exists for hugetlbfs ftruncate. When fallocate
> hole punch support was added, it was based on the ftruncate functionality.
>
> I am hesitant to change the behavior of hugetlb hole punch or truncate
> as people may be relying on that behavior today. Your QEMU example is
> one such case.

Yes, I assume we're stuck with that.

-- 
Thanks,

David / dhildenb
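
For illustration, a minimal userspace sketch (not from the thread) of
the call the RFC patch enables: madvise(MADV_DONTNEED) on an anonymous
hugetlb mapping. It assumes a kernel with the can_madv_lru_vma() check
removed and at least one free 2 MB huge page (e.g. reserved via
/proc/sys/vm/nr_hugepages); on an unpatched kernel the madvise() call
fails with EINVAL.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	const size_t len = 2 * 1024 * 1024;

	/* Anonymous hugetlb mapping; requires a free huge page. */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return EXIT_FAILURE;
	}

	memset(p, 0x55, len);	/* fault in the huge page */

	/* EINVAL without the patch; discards the huge page with it. */
	if (madvise(p, len, MADV_DONTNEED))
		perror("madvise(MADV_DONTNEED)");

	munmap(p, len);
	return EXIT_SUCCESS;
}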
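
And a sketch of the two-step discard David describes for ordinary (non
hugetlb) MAP_PRIVATE file mappings; the file path and size here are
made up. madvise(MADV_DONTNEED) drops only the private anon (COW)
pages, and fallocate(FALLOC_FL_PUNCH_HOLE) drops only the pages backing
the file, so both are needed to free everything. Per the thread,
hugetlbfs is the odd one out: its hole punch discards the private anon
pages as well, making the madvise() step unnecessary there.

#define _GNU_SOURCE		/* for fallocate() and FALLOC_FL_* */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 2 * 1024 * 1024;	/* arbitrary size */
	int fd = open("/tmp/backing-file", O_RDWR | O_CREAT, 0600);

	if (fd < 0 || ftruncate(fd, len)) {
		perror("open/ftruncate");
		return EXIT_FAILURE;
	}

	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	memset(p, 0xaa, len);	/* populate private anon (COW) pages */

	/* Step 1: drop the COW pages; reads fault back from the file. */
	if (madvise(p, len, MADV_DONTNEED))
		perror("madvise(MADV_DONTNEED)");

	/* Step 2: drop the pages backing the file itself. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      0, len))
		perror("fallocate(FALLOC_FL_PUNCH_HOLE)");

	munmap(p, len);
	close(fd);
	return EXIT_SUCCESS;
}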