On Tue, Jan 07, 2025 at 03:39:56PM -0500, Peter Xu wrote: > Since commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a > process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed"), > avoid_reserve was introduced for a special case of CoW on hugetlb private > mappings, and only if the owner VMA is trying to allocate yet another > hugetlb folio that is not reserved within the private vma reserved map. > > Later on, in commit d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas > hole punched by fallocate"), alloc_huge_page() enforced to not consume any > global reservation as long as avoid_reserve=true. This operation doesn't > look correct, because even if it will enforce the allocation to not use > global reservation at all, it will still try to take one reservation from > the spool (if the subpool existed). Then since the spool reserved pages > take from global reservation, it'll also take one reservation globally. > > Logically it can cause global reservation to go wrong. > > I wrote a reproducer below, trigger this special path, and every run of > such program will cause global reservation count to increment by one, until > it hits the number of free pages: > > #define _GNU_SOURCE /* See feature_test_macros(7) */ > #include <stdio.h> > #include <fcntl.h> > #include <errno.h> > #include <unistd.h> > #include <stdlib.h> > #include <sys/mman.h> > > #define MSIZE (2UL << 20) > > int main(int argc, char *argv[]) > { > const char *path; > int *buf; > int fd, ret; > pid_t child; > > if (argc < 2) { > printf("usage: %s <hugetlb_file>\n", argv[0]); > return -1; > } > > path = argv[1]; > > fd = open(path, O_RDWR | O_CREAT, 0666); > if (fd < 0) { > perror("open failed"); > return -1; > } > > ret = fallocate(fd, 0, 0, MSIZE); > if (ret != 0) { > perror("fallocate"); > return -1; > } > > buf = mmap(NULL, MSIZE, PROT_READ|PROT_WRITE, > MAP_PRIVATE, fd, 0); > > if (buf == MAP_FAILED) { > perror("mmap() failed"); > return -1; > } > > /* Allocate a page */ > *buf = 1; > > child = fork(); > if (child == 0) { > /* child doesn't need to do anything */ > exit(0); > } > > /* Trigger CoW from owner */ > *buf = 2; > > munmap(buf, MSIZE); > close(fd); > unlink(path); > > return 0; > } > > It can only reproduce with a sub-mount when there're reserved pages on the > spool, like: > > # sysctl vm.nr_hugepages=128 > # mkdir ./hugetlb-pool > # mount -t hugetlbfs -o min_size=8M,pagesize=2M none ./hugetlb-pool > > Then run the reproducer on the mountpoint: > > # ./reproducer ./hugetlb-pool/test > > Fix it by taking the reservation from spool if available. In general, > avoid_reserve is IMHO more about "avoid vma resv map", not spool's. > > I copied stable, however I have no intention for backporting if it's not a > clean cherry-pick, because private hugetlb mapping, and then fork() on top > is too rare to hit. > > Cc: linux-stable <stable@xxxxxxxxxxxxxxx> > Fixes: d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate") > Reviewed-by: Ackerley Tng <ackerleytng@xxxxxxxxxx> > Tested-by: Ackerley Tng <ackerleytng@xxxxxxxxxx> > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx> Reviewed-by: Oscar Salvador <osalvador@xxxxxxx> -- Oscar Salvador SUSE Labs