On Thu, Sep 29, 2022 at 09:29:54PM -0400, Peter Xu wrote:
> Hi, Mike,
>
> On Thu, Sep 29, 2022 at 04:33:53PM -0700, Mike Kravetz wrote:
> > I added some TLB flushing to hugetlb_mcopy_atomic_pte, but it did not make
> > any difference.  Suggestions would be appreciated as cache/tlb/??? flushing
> > issues take me a while to figure out.
>
> It seems the UFFDIO_COPY for hugetlb is the major issue here, in that for
> private mappings we don't inject the page cache.
>
> I think it makes sense in that e.g. we don't want to allow a private
> mapping to be able to write to the page cache.  But afaict that's not
> correct, because UFFDIO_COPY resolves page faults exactly at the page
> cache layer for file-backed memories.  So what we should do is inject the
> page cache but map the page RO, waiting for a CoW to come if needed.
>
> I'll attach one fix patch that starts to inject the page into the page
> cache for UFFDIO_COPY+hugetlb even if the mapping is private.  A second,
> test patch is also needed, because otherwise the private hugetlb selftest
> won't work after the fix is applied - in the selftest we used to use
> DONTNEED to drop the private mapping, but IMHO that's not enough: we need
> to drop the page cache too (after the fix).  That test patch is attached
> as well.
>
> Feel free to try with the two patches applied.  The current issue went
> away for me with them.
>
> I haven't posted them yet, because after applying the two patches I found
> other issues - the reserved pages are messed up and leaked.  I'll keep
> looking into the leak issue tomorrow, but please also let me know if you
> spot anything suspicious, as you're definitely much more fluent in the
> reservation code than I am.
>
> And that's not the only issue I found - shmem can have other issues
> regarding private mappings; shmem does it right on the page cache
> insertion but not the rest, I think..  I'll look into them one by one.
> It's quite interesting to dig multiple things out of the write check
> symptoms..

The patches..

-- 
Peter Xu
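For reference, the four-step sequence from the fix's commit log below can be
exercised from userspace with a short program along the following lines.
This is a hypothetical standalone sketch, not part of the series: it assumes
a kernel with hugetlb uffd-wp support, one reserved 2MB hugepage, and it
omits most error handling.  On an unfixed kernel the handler logs a second
MISSING fault for the final write (and the program may hang); on a fixed
kernel that write surfaces as a wp fault instead:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define HPAGE_SZ	(2UL << 20)	/* assumes 2MB hugepages */

static int uffd;
static char *src;

static void *handler(void *arg)
{
	struct uffd_msg msg;

	while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
		unsigned long addr;

		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;
		addr = msg.arg.pagefault.address & ~(HPAGE_SZ - 1);
		if (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) {
			/* Expected after the fix: un-protect so (4) can run */
			struct uffdio_writeprotect wp = {
				.range = { .start = addr, .len = HPAGE_SZ },
				.mode = 0,
			};
			printf("wp fault at %lx (fixed kernel)\n", addr);
			ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
		} else {
			/* (3) resolve the MISSING fault, keep wr-protected */
			struct uffdio_copy copy = {
				.dst = addr,
				.src = (unsigned long)src,
				.len = HPAGE_SZ,
				.mode = UFFDIO_COPY_MODE_WP,
			};
			printf("missing fault at %lx\n", addr);
			ioctl(uffd, UFFDIO_COPY, &copy);
		}
	}
	return NULL;
}

int main(void)
{
	struct uffdio_api api = { .api = UFFD_API };
	pthread_t thr;
	char *area;

	uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	ioctl(uffd, UFFDIO_API, &api);

	/* (1) map hugetlb privately, register with uffd missing+wp */
	area = mmap(NULL, HPAGE_SZ, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	src = mmap(NULL, HPAGE_SZ, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(src, 1, HPAGE_SZ);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = HPAGE_SZ },
		.mode = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);
	pthread_create(&thr, NULL, handler, NULL);

	(void)*(volatile char *)area;	/* (2) read: first MISSING fault */
	*area = 2;			/* (4) write: MISSING again if buggy */
	printf("write completed\n");
	return 0;
}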
From 1255fabde4f950bef4751ef337791b93f5373a1f Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@xxxxxxxxxx>
Date: Wed, 28 Sep 2022 17:44:00 -0400
Subject: [PATCH 1/2] mm/hugetlb: Insert page cache for UFFDIO_COPY even if private
Content-type: text/plain

UFFDIO_COPY resolves page faults at the page cache layer for file-backed
memories on shmem and hugetlbfs.  That means each UFFDIO_COPY should
inject the new page into the page cache, no matter whether the mapping is
private or shared.

We used to not do that, probably because for private mappings we should
not allow the page cache to be written by the privately mapped process.
However that can be guaranteed by dropping the write bit (as this patch
does), so that CoW will trigger properly on the page cache page.

Leaving the page cache empty can lead to the sequence below:

  (1) map hugetlb privately, register with uffd missing+wp
  (2) read page, trigger MISSING event with READ
  (3) UFFDIO_COPY(wp=1) resolve page fault, keep wr-protected
  (4) write page, trigger MISSING event again (because the page cache is
      missing!) with WRITE

This behavior has existed since the initial commit of hugetlb MISSING
mode support, which is commit 1c9e8def43a3 ("userfaultfd: hugetlbfs: add
UFFDIO_COPY support for shared mappings", 2017-02-22).

In most cases it should be fine as long as the hugetlb page/pte stays
stable (e.g., no wr-protect, no MADV_DONTNEED, ...).  However if a
further page fault is triggered for any reason, it can cause issues.
Recently, due to the newly introduced uffd-wp on hugetlbfs and also a
recent locking rework from Mike, the userfaultfd kselftest easily fails
with hugetlb private mappings.

A further step would be to do early CoW if the private mapping is
writable, but let's leave that for later.

Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Cc: Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
---
 mm/hugetlb.c | 28 ++++++++--------------------
 1 file changed, 8 insertions(+), 20 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9679fe519b90..a43fc6852f27 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5933,14 +5933,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	int ret = -ENOMEM;
 	struct page *page;
 	int writable;
-	bool page_in_pagecache = false;
 
 	if (is_continue) {
 		ret = -EFAULT;
 		page = find_lock_page(mapping, idx);
 		if (!page)
 			goto out;
-		page_in_pagecache = true;
 	} else if (!*pagep) {
 		/* If a page already exists, then it's UFFDIO_COPY for
 		 * a non-missing case. Return -EEXIST.
@@ -6014,8 +6012,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 */
 	__SetPageUptodate(page);
 
-	/* Add shared, newly allocated pages to the page cache. */
-	if (vm_shared && !is_continue) {
+	/* Add newly allocated pages to the page cache for UFFDIO_COPY. */
+	if (!is_continue) {
 		size = i_size_read(mapping->host) >> huge_page_shift(h);
 		ret = -EFAULT;
 		if (idx >= size)
@@ -6030,7 +6028,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		ret = hugetlb_add_to_page_cache(page, mapping, idx);
 		if (ret)
 			goto out_release_nounlock;
-		page_in_pagecache = true;
 	}
 
 	ptl = huge_pte_lock(h, dst_mm, dst_pte);
@@ -6044,18 +6041,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
 		goto out_release_unlock;
 
-	if (page_in_pagecache) {
-		page_dup_file_rmap(page, true);
-	} else {
-		ClearHPageRestoreReserve(page);
-		hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
-	}
+	page_dup_file_rmap(page, true);
 
 	/*
-	 * For either: (1) CONTINUE on a non-shared VMA, or (2) UFFDIO_COPY
-	 * with wp flag set, don't set pte write bit.
+	 * For either: (1) a non-shared VMA, or (2) UFFDIO_COPY with wp
+	 * flag set, don't set pte write bit.
 	 */
-	if (wp_copy || (is_continue && !vm_shared))
+	if (wp_copy || !vm_shared)
 		writable = 0;
 	else
 		writable = dst_vma->vm_flags & VM_WRITE;
@@ -6083,18 +6075,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	spin_unlock(ptl);
 	if (!is_continue)
 		SetHPageMigratable(page);
-	if (vm_shared || is_continue)
-		unlock_page(page);
+	unlock_page(page);
 	ret = 0;
 out:
 	return ret;
 out_release_unlock:
 	spin_unlock(ptl);
-	if (vm_shared || is_continue)
-		unlock_page(page);
+	unlock_page(page);
 out_release_nounlock:
-	if (!page_in_pagecache)
-		restore_reserve_on_error(h, dst_vma, dst_addr, page);
 	put_page(page);
 	goto out;
 }
-- 
2.32.0
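To make the effect of this fix concrete: with the patch applied, UFFDIO_COPY
on a private hugetlb mapping should populate the page cache with a read-only
mapping, so a shared alias of the same file sees the copied data while a
private write only CoWs the local copy.  A minimal check along these lines
(hypothetical sketch, assuming the fix above, a reserved 2MB hugepage,
glibc's memfd_create(), and no error handling) could look like:

#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/memfd.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define HPAGE_SZ	(2UL << 20)	/* assumes 2MB hugepages */

int main(void)
{
	struct uffdio_api api = { .api = UFFD_API };
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	int fd = memfd_create("huge", MFD_HUGETLB);
	char *priv, *alias, *buf;

	ioctl(uffd, UFFDIO_API, &api);
	ftruncate(fd, HPAGE_SZ);
	priv = mmap(NULL, HPAGE_SZ, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE, fd, 0);
	alias = mmap(NULL, HPAGE_SZ, PROT_READ | PROT_WRITE,
		     MAP_SHARED, fd, 0);
	buf = mmap(NULL, HPAGE_SZ, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(buf, 'A', HPAGE_SZ);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)priv, .len = HPAGE_SZ },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* Feed the private mapping proactively: no fault, no handler thread */
	struct uffdio_copy copy = {
		.dst = (unsigned long)priv,
		.src = (unsigned long)buf,
		.len = HPAGE_SZ,
	};
	ioctl(uffd, UFFDIO_COPY, &copy);

	assert(alias[0] == 'A');	/* the copy landed in the page cache */
	*priv = 'B';			/* RO pte: this write triggers CoW */
	assert(priv[0] == 'B');
	assert(alias[0] == 'A');	/* page cache untouched by private CoW */
	return 0;
}

Note that the UFFDIO_COPY here is issued proactively, before any fault,
which userfaultfd permits; that keeps the sketch single-threaded.  Without
the fix, the first assert on the alias should fail, since the shared view
would fault on an empty page cache and get a zero page instead.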
From 046ba2d1f5a74d86c6546a4f1bc8081f7c0fd3f8 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@xxxxxxxxxx>
Date: Thu, 29 Sep 2022 11:55:26 -0400
Subject: [PATCH 2/2] selftests/vm: Use memfd for hugetlb tests
Content-type: text/plain

We already use memfd for the shmem test; move hugetlb onto it too, so
that the user no longer needs to specify a hugetlbfs file path explicitly
when running the hugetlb shared tests.

More importantly, this patch starts to correctly release hugetlb pages
using fallocate(FALLOC_FL_PUNCH_HOLE) even for private mappings, because
for private mappings MADV_DONTNEED may not be enough to test UFFDIO_COPY:
it only zaps the pgtables but does not evict the page cache.

Unfortunately, we need to cache the ptr<->offset relationship for hugetlb
(by adding mem_fd_[src|dst]_offset) in order to use fallocate(), because
that's the only way to evict the page cache for private mappings.  IOW,
MADV_REMOVE doesn't work.

Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
---
 tools/testing/selftests/vm/userfaultfd.c | 98 +++++++++++++-----------
 1 file changed, 53 insertions(+), 45 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 74babdbc02e5..4561e9d443e4 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -93,10 +93,11 @@ static volatile bool test_uffdio_zeropage_eexist = true;
 static bool test_uffdio_wp = true;
 /* Whether to test uffd minor faults */
 static bool test_uffdio_minor = false;
-
 static bool map_shared;
-static int shm_fd;
-static int huge_fd;
+static int mem_fd;
+/* File offset for area_src/area_dst upon mem_fd. Needed for hole punching */
+static off_t mem_fd_src_offset;
+static off_t mem_fd_dst_offset;
 static unsigned long long *count_verify;
 static int uffd = -1;
 static int uffd_flags, finished, *pipefd;
@@ -247,48 +248,59 @@ static void noop_alias_mapping(__u64 *start, size_t len, unsigned long offset)
 {
 }
 
+static off_t ptr_to_offset(char *ptr)
+{
+	if (ptr == area_src)
+		return mem_fd_src_offset;
+	else
+		return mem_fd_dst_offset;
+}
+
 static void hugetlb_release_pages(char *rel_area)
 {
+	off_t size = nr_pages * page_size, offset = ptr_to_offset(rel_area);
+
 	if (!map_shared) {
 		if (madvise(rel_area, nr_pages * page_size, MADV_DONTNEED))
 			err("madvise(MADV_DONTNEED) failed");
-	} else {
-		if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE))
-			err("madvise(MADV_REMOVE) failed");
-	}
+
+		/* Evict page cache explicitly for private mappings */
+		if (fallocate(mem_fd,
			      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      offset, size))
+			err("fallocate(FALLOC_FL_PUNCH_HOLE) failed");
+	} else {
+		if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE))
+			err("madvise(MADV_REMOVE) failed");
+	}
 }
 
 static void hugetlb_allocate_area(void **alloc_area, bool is_src)
 {
+	off_t size = nr_pages * page_size;
+	off_t offset = is_src ? 0 : size;
 	void *area_alias = NULL;
 	char **alloc_area_alias;
 
-	if (!map_shared)
-		*alloc_area = mmap(NULL,
-				   nr_pages * page_size,
-				   PROT_READ | PROT_WRITE,
-				   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
-				   (is_src ? 0 : MAP_NORESERVE),
-				   -1,
-				   0);
-	else
-		*alloc_area = mmap(NULL,
-				   nr_pages * page_size,
-				   PROT_READ | PROT_WRITE,
-				   MAP_SHARED |
-				   (is_src ? 0 : MAP_NORESERVE),
-				   huge_fd,
-				   is_src ? 0 : nr_pages * page_size);
+	*alloc_area = mmap(NULL, size, PROT_READ | PROT_WRITE,
+			   (map_shared ? MAP_SHARED : MAP_PRIVATE) |
+			   (is_src ? 0 : MAP_NORESERVE),
+			   mem_fd, offset);
 	if (*alloc_area == MAP_FAILED)
 		err("mmap of hugetlbfs file failed");
 
+	/*
+	 * Only hugetlb needs to cache ptr<->offset because it needs to
+	 * fallocate(FALLOC_FL_PUNCH_HOLE).
+	 */
+	if (is_src)
+		mem_fd_src_offset = offset;
+	else
+		mem_fd_dst_offset = offset;
+
 	if (map_shared) {
-		area_alias = mmap(NULL,
-				  nr_pages * page_size,
-				  PROT_READ | PROT_WRITE,
-				  MAP_SHARED,
-				  huge_fd,
-				  is_src ? 0 : nr_pages * page_size);
+		area_alias = mmap(NULL, size, PROT_READ | PROT_WRITE,
+				  MAP_SHARED, mem_fd, offset);
 		if (area_alias == MAP_FAILED)
 			err("mmap of hugetlb file alias failed");
 	}
@@ -334,14 +346,14 @@ static void shmem_allocate_area(void **alloc_area, bool is_src)
 	}
 
 	*alloc_area = mmap(p, bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
-			   shm_fd, offset);
+			   mem_fd, offset);
 	if (*alloc_area == MAP_FAILED)
 		err("mmap of memfd failed");
 	if (test_collapse && *alloc_area != p)
 		err("mmap of memfd failed at %p", p);
 
 	area_alias = mmap(p_alias, bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
-			  shm_fd, offset);
+			  mem_fd, offset);
 	if (area_alias == MAP_FAILED)
 		err("mmap of memfd alias failed");
 	if (test_collapse && area_alias != p_alias)
@@ -1629,7 +1641,7 @@ static int userfaultfd_stress(void)
 
 		/* prepare next bounce */
 		swap(area_src, area_dst);
-
+		swap(mem_fd_src_offset, mem_fd_dst_offset);
 		swap(area_src_alias, area_dst_alias);
 
 		uffd_stats_report(uffd_stats, nr_cpus);
@@ -1821,21 +1833,17 @@ int main(int argc, char **argv)
 	}
 	nr_pages = nr_pages_per_cpu * nr_cpus;
 
-	if (test_type == TEST_HUGETLB && map_shared) {
-		if (argc < 5)
-			usage();
-		huge_fd = open(argv[4], O_CREAT | O_RDWR, 0755);
-		if (huge_fd < 0)
-			err("Open of %s failed", argv[4]);
-		if (ftruncate(huge_fd, 0))
-			err("ftruncate %s to size 0 failed", argv[4]);
-	} else if (test_type == TEST_SHMEM) {
-		shm_fd = memfd_create(argv[0], 0);
-		if (shm_fd < 0)
+	if (test_type == TEST_SHMEM || test_type == TEST_HUGETLB) {
+		unsigned int memfd_flags = 0;
+
+		if (test_type == TEST_HUGETLB)
+			memfd_flags = MFD_HUGETLB;
+		mem_fd = memfd_create(argv[0], memfd_flags);
+		if (mem_fd < 0)
 			err("memfd_create");
-		if (ftruncate(shm_fd, nr_pages * page_size * 2))
+		if (ftruncate(mem_fd, nr_pages * page_size * 2))
 			err("ftruncate");
-		if (fallocate(shm_fd,
+		if (fallocate(mem_fd,
 			      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0,
 			      nr_pages * page_size * 2))
 			err("fallocate");
-- 
2.32.0
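The MADV_DONTNEED-versus-fallocate() point in the commit log above can be
demonstrated in isolation.  A hedged sketch (assuming MFD_HUGETLB support,
hugetlb MADV_DONTNEED support from v5.18 onwards, a reserved 2MB hugepage,
and no error handling):

#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <linux/memfd.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SZ	(2UL << 20)	/* assumes 2MB hugepages */

int main(void)
{
	int fd = memfd_create("huge", MFD_HUGETLB);
	char *shr, *prv;

	ftruncate(fd, HPAGE_SZ);
	shr = mmap(NULL, HPAGE_SZ, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
	prv = mmap(NULL, HPAGE_SZ, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE, fd, 0);

	shr[0] = 'A';			/* populate the hugetlb page cache */
	assert(prv[0] == 'A');		/* private read maps the cache page */

	/* MADV_DONTNEED only zaps the pgtables... */
	madvise(prv, HPAGE_SZ, MADV_DONTNEED);
	assert(prv[0] == 'A');		/* ...so the next fault refills 'A' */

	/* ...while a hole punch on the fd really evicts the page cache */
	madvise(prv, HPAGE_SZ, MADV_DONTNEED);
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  0, HPAGE_SZ);
	assert(prv[0] == 0);		/* fresh zero page; a registered uffd
					   range would see MISSING here */
	return 0;
}

This is exactly why the selftest's hugetlb_release_pages() needs both calls
for private mappings: without the hole punch, a re-fault after
MADV_DONTNEED would be satisfied from the page cache and no MISSING event
would ever reach the monitor again.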