On Thu, Jan 27, 2022 at 3:57 AM David Hildenbrand <david@xxxxxxxxxx> wrote: > > On 13.01.22 19:03, Mike Kravetz wrote: > > Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP > > testing. However, mremap support was recently added in commit > > 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed > > vma"). While attempting to enable mremap support in the test, it was > > discovered that the mremap test indirectly depends on MADV_DONTNEED. > > > > hugetlb does not support MADV_DONTNEED. However, the only thing > > preventing support is a check in can_madv_lru_vma(). Simply removing > > the check will enable support. > > > > This is sent as a RFC because there is no existing use case calling > > for hugetlb MADV_DONTNEED support except possibly the userfaultfd test. > > However, adding support makes sense as it is fairly trivial and brings > > hugetlb functionality more in line with 'normal' memory. > > > > Just a note: > > QEMU doesn't use huge anonymous memory directly (MAP_ANON | MAP_HUGE...) > but instead always goes either via hugetlbfs or via memfd. > > For MAP_PRIVATE hugetlb mappings, fallocate(FALLOC_FL_PUNCH_HOLE) seems > to get the job done (IOW: also discards private anon pages). See the > comments in the QEMU code below. I remember that that is somewhat > inconsistent. For ordinary MAP_PRIVATE mapped files I remember that we > always need fallocate(FALLOC_FL_PUNCH_HOLE) + madvise(QEMU_MADV_DONTNEED) > to make sure > > a) All file pages are removed > b) All private anon pages are removed > > IIRC hugetlbfs really is different in that regard, but maybe other fs > behave similarly. > > That's why QEMU was able to live for now without MADV_DONTNEED support > for hugetlbfs and most probably won't ever need it. Agreed, all of the production use cases I'm aware of use hugetlbfs, not MAP_HUGE... But, I would say this is convenient for testing purposes. It's slightly more convenient to not have to mount hugetlbfs / perform the associated setup for tests. Perhaps that's only a small motivation for enabling this, but then again Mike's patch to do so is likewise very small. :) > > > ... > /* The logic here is messy; > * madvise DONTNEED fails for hugepages > * fallocate works on hugepages and shmem > * shared anonymous memory requires madvise REMOVE > */ > need_madvise = (rb->page_size == qemu_host_page_size); > need_fallocate = rb->fd != -1; > if (need_fallocate) { > /* For a file, this causes the area of the file to be zero'd > * if read, and for hugetlbfs also causes it to be unmapped > * so a userfault will trigger. > */ > #ifdef CONFIG_FALLOCATE_PUNCH_HOLE > ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, > start, length); > if (ret) { > ret = -errno; > error_report("ram_block_discard_range: Failed to fallocate " > "%s:%" PRIx64 " +%zx (%d)", > rb->idstr, start, length, ret); > goto err; > } > #else > ret = -ENOSYS; > error_report("ram_block_discard_range: fallocate not available/file" > "%s:%" PRIx64 " +%zx (%d)", > rb->idstr, start, length, ret); > goto err; > #endif > } > if (need_madvise) { > /* For normal RAM this causes it to be unmapped, > * for shared memory it causes the local mapping to disappear > * and to fall back on the file contents (which we just > * fallocate'd away). > */ > #if defined(CONFIG_MADVISE) > if (qemu_ram_is_shared(rb) && rb->fd < 0) { > ret = madvise(host_startaddr, length, QEMU_MADV_REMOVE); > } else { > ret = madvise(host_startaddr, length, QEMU_MADV_DONTNEED); > } > if (ret) { > ret = -errno; > error_report("ram_block_discard_range: Failed to discard range " > "%s:%" PRIx64 " +%zx (%d)", > rb->idstr, start, length, ret); > goto err; > } > #else > ... > > -- > Thanks, > > David / dhildenb >