Ackerley Tng <ackerleytng@xxxxxxxxxx> writes:

> <snip>
>
> I'll go over the rest of your patches and dig into the meaning of
> `avoid_reserve`.

Yes, after looking into this more deeply, I agree that avoid_reserve
means avoiding the reservations in the resv_map, rather than the
reservations in the subpool or hstate.

Here's more detail on what goes on in the reproducer that I wrote
while reviewing Peter's patch (a runnable sketch of it is included
further below):

1. On fallocate(), allocate page A.
2. On mmap(), set up a vma without VM_MAYSHARE, since MAP_PRIVATE was
   requested.
3. On faulting *buf = 1, allocate a new page B and copy A to B,
   because the mmap request was MAP_PRIVATE.
4. On fork(), prep for COW by marking the page read-only. Both parent
   and child share B.
5. On faulting *buf = 2 (a write fault), allocate page C and copy B
   to C.
   + B now belongs to the child; C is owned by the parent.
6. The child exits, and B is freed.
7. On munmap(), C is freed.
8. On unlink(), A is freed.

When C was allocated in the parent (the owner of the MAP_PRIVATE page,
doing a copy on write), spool->rsv_hpages was decreased but
h->resv_huge_pages was not. This is the root of the bug.

We should decrement h->resv_huge_pages if a reserved page from the
subpool was used, instead of deciding based on whether avoid_reserve
or vma_has_reserves() is set. If avoid_reserve is set, the subpool
shouldn't be checked for a reservation, so we won't be decrementing
h->resv_huge_pages anyway.

I agree with Peter's fix as a whole (the entire patch series).

Reviewed-by: Ackerley Tng <ackerleytng@xxxxxxxxxx>
Tested-by: Ackerley Tng <ackerleytng@xxxxxxxxxx>

---

Some definitions which might be helpful:

+ h->resv_huge_pages indicates the number of reserved pages globally.
  + This number increases when pages are reserved.
  + This number decreases when reserved pages are allocated, or when
    pages are unreserved.
+ spool->rsv_hpages indicates the number of reserved pages in this
  subpool.
  + This number increases when pages are reserved.
  + This number decreases when reserved pages are allocated, or when
    pages are unreserved.
+ h->resv_huge_pages should be the sum of all subpools'
  spool->rsv_hpages.

More details on the flow in alloc_hugetlb_folio() which might be
helpful:

hugepage_subpool_get_pages() returns "the number of pages by which the
global pools must be adjusted (upward)". This return value is never
negative other than on error (hugepage_subpool_get_pages() is always
called with a positive delta). Specifically in alloc_hugetlb_folio(),
the return value is either 0 or 1 (other than errors). If the return
value is 0, the subpool had enough reservations, so we should
decrement h->resv_huge_pages. If the return value is 1, this subpool
did not have any more reserved hugepages, and we need to get a page
from the global hstate.

dequeue_hugetlb_folio_vma() will get us a page that was already
allocated. In dequeue_hugetlb_folio_vma(), if the vma doesn't have
enough reserves for 1 page and there are no available_huge_pages()
left, we quit dequeueing, since we will need to allocate a new page.
If we want to avoid_reserve, that means we don't want to use the vma's
reserves in the resv_map, so we also check available_huge_pages(). If
there are available_huge_pages(), we go on to dequeue a page.

Then, we determine whether to decrement h->resv_huge_pages. We should
decrement if a reserved page from the subpool was used, not based on
whether avoid_reserve or vma_has_reserves() is set.
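To make the eight reproducer steps concrete, here is a minimal
userspace sketch of them. The hugetlbfs mount point and file name are
hypothetical, and it assumes a 2M huge page size with at least one
free huge page; the pipe only exists to keep the child alive until the
parent's write fault has allocated page C.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL << 20)	/* assumes 2M huge pages */

int main(void)
{
	/* hypothetical hugetlbfs mount point */
	const char *path = "/mnt/hugetlbfs/repro";
	int pipefd[2];
	pid_t pid;

	int fd = open(path, O_CREAT | O_RDWR, 0600);
	if (fd < 0) { perror("open"); return 1; }

	/* 1. fallocate() allocates page A */
	if (fallocate(fd, 0, 0, HPAGE_SIZE)) { perror("fallocate"); return 1; }

	/* 2. MAP_PRIVATE sets up a vma without VM_MAYSHARE */
	char *buf = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE, fd, 0);
	if (buf == MAP_FAILED) { perror("mmap"); return 1; }

	/* 3. first fault: allocate page B, copy A to B */
	*buf = 1;

	if (pipe(pipefd)) { perror("pipe"); return 1; }

	/* 4. fork() marks the page read-only; parent and child share B */
	pid = fork();
	if (pid == 0) {
		char c;
		/* keep B referenced in the child until the parent has COWed */
		read(pipefd[0], &c, 1);
		_exit(0);	/* 6. child exits, B is freed */
	}

	/* 5. write fault in the parent: allocate page C, copy B to C */
	*buf = 2;

	write(pipefd[1], "x", 1);
	waitpid(pid, NULL, 0);

	/* 7. munmap() frees C */
	munmap(buf, HPAGE_SIZE);

	/* 8. unlink() frees A */
	close(fd);
	unlink(path);
	return 0;
}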
In the case where a surplus page needs to be allocated, the surplus
page isn't associated with a subpool (and doesn't need to be), so no
subpool hugepage accounting updates are required. h->resv_huge_pages
still has to be updated, though... is this where h->resv_huge_pages
can go negative?
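To summarize the accounting rule described above, here is a small
userspace model. The struct and function names are simplified
stand-ins for the kernel's hstate, subpool, and
hugepage_subpool_get_pages(), not the actual implementation; it only
demonstrates that the global reserve count should be decremented
exactly when a subpool reservation was consumed (gbl_chg == 0).

#include <stdio.h>

/* simplified stand-ins for the kernel structures */
struct subpool { long rsv_hpages; };		/* models spool->rsv_hpages */
struct hstate { long resv_huge_pages; };	/* models h->resv_huge_pages */

/*
 * Models hugepage_subpool_get_pages(): returns the number of pages by
 * which the global pools must be adjusted upward -- 0 when the
 * subpool's reserves cover the request, delta when they do not.
 */
static long subpool_get_pages(struct subpool *spool, long delta)
{
	if (spool->rsv_hpages >= delta) {
		spool->rsv_hpages -= delta;
		return 0;	/* fully covered by subpool reserves */
	}
	return delta;		/* caller must charge the global pool */
}

static void alloc_one_page(struct hstate *h, struct subpool *spool)
{
	long gbl_chg = subpool_get_pages(spool, 1);

	/*
	 * The fix: decrement the global reserve count only when a
	 * subpool reservation was actually consumed, not based on
	 * avoid_reserve or vma_has_reserves().
	 */
	if (gbl_chg == 0)
		h->resv_huge_pages--;
}

int main(void)
{
	struct hstate h = { .resv_huge_pages = 1 };
	struct subpool spool = { .rsv_hpages = 1 };

	alloc_one_page(&h, &spool);	/* consumes the reservation */
	alloc_one_page(&h, &spool);	/* no reservation left: charge the
					 * global pool (or go surplus) */

	printf("spool->rsv_hpages=%ld h->resv_huge_pages=%ld\n",
	       spool.rsv_hpages, h.resv_huge_pages);
	/* prints: spool->rsv_hpages=0 h->resv_huge_pages=0 */
	return 0;
}

In this model, a surplus allocation corresponds to the gbl_chg == 1
path: spool->rsv_hpages is never touched, which matches the
observation above that a surplus page needs no subpool accounting.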