On 13.01.22 19:03, Mike Kravetz wrote:
> Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP
> testing.  However, mremap support was recently added in commit
> 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed
> vma").  While attempting to enable mremap support in the test, it was
> discovered that the mremap test indirectly depends on MADV_DONTNEED.
>
> hugetlb does not support MADV_DONTNEED.  However, the only thing
> preventing support is a check in can_madv_lru_vma().  Simply removing
> the check will enable support.
>
> This is sent as a RFC because there is no existing use case calling
> for hugetlb MADV_DONTNEED support except possibly the userfaultfd test.
> However, adding support makes sense as it is fairly trivial and brings
> hugetlb functionality more in line with 'normal' memory.
>

Just a note:

QEMU doesn't use huge anonymous memory directly (MAP_ANON | MAP_HUGE...)
but instead always goes either via hugetlbfs or via memfd.

For MAP_PRIVATE hugetlb mappings, fallocate(FALLOC_FL_PUNCH_HOLE) seems to
get the job done (IOW: also discards private anon pages). See the comments
in the QEMU code below. I remember that that is somewhat inconsistent.

For ordinary MAP_PRIVATE mapped files I remember that we always need
fallocate(FALLOC_FL_PUNCH_HOLE) + madvise(QEMU_MADV_DONTNEED) to make sure

a) All file pages are removed
b) All private anon pages are removed

IIRC hugetlbfs really is different in that regard, but maybe other fs
behave similarly.

That's why QEMU was able to live for now without MADV_DONTNEED support for
hugetlbfs and most probably won't ever need it.

...
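To make the two-step discard for ordinary MAP_PRIVATE file mappings concrete, here is a minimal standalone sketch. It is not taken from QEMU: the helper names discard_private_file_range() and demo_discard() are made up for illustration, and a memfd is assumed as the backing file only because it is easy to create and supports hole punching. The QEMU excerpt itself follows afterwards.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not QEMU code): discard a range of a MAP_PRIVATE
 * file mapping. 'offset' is the file offset that 'addr' maps.
 * Returns 0 on success or a negative errno value.
 */
static int discard_private_file_range(int fd, void *addr, off_t offset,
                                      size_t len)
{
    /* a) Drop the file pages backing the range. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, (off_t)len) != 0) {
        return -errno;
    }
    /* b) Drop the private anon (COW) pages of this mapping. */
    if (madvise(addr, len, MADV_DONTNEED) != 0) {
        return -errno;
    }
    return 0;
}

/*
 * Illustrative demo: returns 0 if the range reads back as zeroes after
 * the discard, 1 if it does not, or a negative errno on syscall failure.
 */
static int demo_discard(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = MAP_FAILED;
    int ret = 0;
    int fd = memfd_create("discard-demo", 0); /* assumed backing file */

    if (fd < 0) {
        return -errno;
    }
    if (ftruncate(fd, (off_t)len) != 0) {
        ret = -errno;
        goto out;
    }
    /* Put some data into the file itself. */
    if (pwrite(fd, "file", 4, 0) != 4) {
        ret = -EIO;
        goto out;
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        ret = -errno;
        goto out;
    }
    /* Writing through the private mapping creates a private anon page. */
    memset(p, 0xBB, len);

    ret = discard_private_file_range(fd, p, 0, len);
    if (ret == 0 && (p[0] != 0 || p[len - 1] != 0)) {
        ret = 1; /* old contents are still visible */
    }
out:
    if (p != MAP_FAILED) {
        munmap(p, len);
    }
    close(fd);
    return ret;
}
```

Calling demo_discard() from a small main() should return 0 on a kernel where both calls succeed. Dropping either of the two steps in the helper is a quick way to check what each one discards on a given filesystem.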
        /* The logic here is messy;
         *    madvise DONTNEED fails for hugepages
         *    fallocate works on hugepages and shmem
         *    shared anonymous memory requires madvise REMOVE
         */
        need_madvise = (rb->page_size == qemu_host_page_size);
        need_fallocate = rb->fd != -1;
        if (need_fallocate) {
            /* For a file, this causes the area of the file to be zero'd
             * if read, and for hugetlbfs also causes it to be unmapped
             * so a userfault will trigger.
             */
#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
            ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                            start, length);
            if (ret) {
                ret = -errno;
                error_report("ram_block_discard_range: Failed to fallocate "
                             "%s:%" PRIx64 " +%zx (%d)",
                             rb->idstr, start, length, ret);
                goto err;
            }
#else
            ret = -ENOSYS;
            error_report("ram_block_discard_range: fallocate not available/file"
                         "%s:%" PRIx64 " +%zx (%d)",
                         rb->idstr, start, length, ret);
            goto err;
#endif
        }
        if (need_madvise) {
            /* For normal RAM this causes it to be unmapped,
             * for shared memory it causes the local mapping to disappear
             * and to fall back on the file contents (which we just
             * fallocate'd away).
             */
#if defined(CONFIG_MADVISE)
            if (qemu_ram_is_shared(rb) && rb->fd < 0) {
                ret = madvise(host_startaddr, length, QEMU_MADV_REMOVE);
            } else {
                ret = madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
            }
            if (ret) {
                ret = -errno;
                error_report("ram_block_discard_range: Failed to discard range "
                             "%s:%" PRIx64 " +%zx (%d)",
                             rb->idstr, start, length, ret);
                goto err;
            }
#else
...

-- 
Thanks,

David / dhildenb