On 6/2/21 5:38 PM, Mina Almasry wrote:
> On Wed, Jun 2, 2021 at 4:50 PM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
>>
>> The routine restore_reserve_on_error is called to restore reservation
>> information when an error occurs after page allocation.  The routine
>> alloc_huge_page modifies the mapping reserve map and potentially the
>> reserve count during allocation.  If code calling alloc_huge_page
>> encounters an error after allocation and needs to free the page, the
>> reservation information needs to be adjusted.
>>
>> Currently, restore_reserve_on_error only takes action on pages for which
>> the reserve count was adjusted (HPageRestoreReserve flag).  There is
>> nothing wrong with these adjustments.  However, alloc_huge_page ALWAYS
>> modifies the reserve map during allocation even if the reserve count is
>> not adjusted.  This can cause issues as observed during development of
>> this patch [1].
>>
>> One specific series of operations causing an issue is:
>> - Create a shared hugetlb mapping
>>   Reservations for all pages created by default
>> - Fault in a page in the mapping
>>   Reservation exists so reservation count is decremented
>> - Punch a hole in the file/mapping at index previously faulted
>>   Reservation and any associated pages will be removed
>> - Allocate a page to fill the hole
>>   No reservation entry, so reserve count unmodified
>>   Reservation entry added to map by alloc_huge_page
>> - Error after allocation and before instantiating the page
>>   Reservation entry remains in map
>> - Allocate a page to fill the hole
>>   Reservation entry exists, so decrement reservation count
>> This will cause a reservation count underflow as the reservation count
>> was decremented twice for the same index.
>>
>> A user would observe a very large number for HugePages_Rsvd in
>> /proc/meminfo.  This would also likely cause subsequent allocations of
>> hugetlb pages to fail as it would 'appear' that all pages are reserved.
>>
>> This sequence of operations is unlikely to happen; however, it was
>> easily reproduced and observed using hacked up code as described in [1].
>>
>> Address the issue by having the routine restore_reserve_on_error take
>> action on pages where HPageRestoreReserve is not set.  In this case, we
>> need to remove any reserve map entry created by alloc_huge_page.  A new
>> helper routine vma_del_reservation assists with this operation.
>>
>> There are three callers of alloc_huge_page which do not currently call
>> restore_reserve_on_error before freeing a page on error paths.  Add
>> those missing calls.
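
(Side note for anyone wanting to reproduce this: the userspace half of
the sequence above is roughly the sketch below.  The hugetlbfs mount
point, file name and 2MB page size are just assumptions, and the error
itself still has to be injected with hacked up kernel code as described
in [1]; on an unmodified kernel the program simply runs to completion.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumes 2MB huge pages */

int main(void)
{
	/* assumes a hugetlbfs mount at /dev/hugepages */
	int fd = open("/dev/hugepages/resv-repro", O_CREAT | O_RDWR, 0600);
	char *addr;

	if (fd < 0)
		return 1;

	/* shared hugetlb mapping: reservation created by default */
	addr = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		return 1;

	/* fault in the page: reservation consumed, reserve count decremented */
	memset(addr, 1, HPAGE_SIZE);

	/* punch a hole at the faulted index: page and reservation removed */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      0, HPAGE_SIZE))
		return 1;

	/*
	 * Fault again to fill the hole.  With the injected error, the
	 * allocation for this index leaves a stale reserve map entry
	 * behind, and the next allocation at the same index decrements
	 * the reserve count a second time.  HugePages_Rsvd in
	 * /proc/meminfo then shows the underflow.
	 */
	memset(addr, 2, HPAGE_SIZE);

	munmap(addr, HPAGE_SIZE);
	close(fd);
	unlink("/dev/hugepages/resv-repro");
	return 0;
}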
>>
>> [1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@xxxxxxxxxx/
>> Fixes: 96b96a96ddee ("mm/hugetlb: fix huge page reservation leak in private mapping error paths")
>> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>> Cc: <stable@xxxxxxxxxxxxxxx>
>> ---
>>  fs/hugetlbfs/inode.c    |   1 +
>>  include/linux/hugetlb.h |   2 +
>>  mm/hugetlb.c            | 120 ++++++++++++++++++++++++++++++++--------
>>  3 files changed, 100 insertions(+), 23 deletions(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index 21f7a5c92e8a..926eeb9bf4eb 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -735,6 +735,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>>  		__SetPageUptodate(page);
>>  		error = huge_add_to_page_cache(page, mapping, index);
>>  		if (unlikely(error)) {
>> +			restore_reserve_on_error(h, &pseudo_vma, addr, page);
>>  			put_page(page);
>>  			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>>  			goto out;
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index 44721df20e89..b375b31f60c4 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -627,6 +627,8 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
>>  				unsigned long address);
>>  int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>>  			pgoff_t idx);
>> +void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
>> +				unsigned long address, struct page *page);
>>
>>  /* arch callback */
>>  int __init __alloc_bootmem_huge_page(struct hstate *h);
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 9a616b7a8563..36b691c3eabf 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -2259,12 +2259,18 @@ static void return_unused_surplus_pages(struct hstate *h,
>>   * be restored when a newly allocated huge page must be freed.  It is
>>   * to be called after calling vma_needs_reservation to determine if a
>>   * reservation exists.
>> + *
>> + * vma_del_reservation is used in error paths where an entry in the reserve
>> + * map was created during huge page allocation and must be removed.  It is to
>> + * be called after calling vma_needs_reservation to determine if a reservation
>> + * exists.
>>   */
>>  enum vma_resv_mode {
>>  	VMA_NEEDS_RESV,
>>  	VMA_COMMIT_RESV,
>>  	VMA_END_RESV,
>>  	VMA_ADD_RESV,
>> +	VMA_DEL_RESV,
>>  };
>>  static long __vma_reservation_common(struct hstate *h,
>>  				struct vm_area_struct *vma, unsigned long addr,
>> @@ -2308,11 +2314,21 @@ static long __vma_reservation_common(struct hstate *h,
>>  			ret = region_del(resv, idx, idx + 1);
>>  		}
>>  		break;
>> +	case VMA_DEL_RESV:
>> +		if (vma->vm_flags & VM_MAYSHARE) {
>> +			region_abort(resv, idx, idx + 1, 1);
>> +			ret = region_del(resv, idx, idx + 1);
>> +		} else {
>> +			ret = region_add(resv, idx, idx + 1, 1, NULL, NULL);
>> +			/* region_add calls of range 1 should never fail. */
>> +			VM_BUG_ON(ret < 0);
>> +		}
>> +		break;
>
> I haven't tested, but at first glance I don't think this quite works,
> no? The thing is that alloc_huge_page() does a
> vma_commit_reservation() which does add_region() regardless if
> vm_flags & VM_MAYSHARE or not, so to unroll that we need a function
> that does a region_del() regardless if vm_flags & VM_MAYSHARE or not.
> I wonder if the root of the bug is the unconditional region_add()
> vma_commit_reservation() does.

Do give it a test.  I believe I tested in the same way you tested, but
could be mistaken.

We may not want to always unroll.
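
To spell out what alloc_huge_page leaves behind, and what an error path
may or may not need to unroll, here is a rough sketch of its reserve
map/count handling.  This is greatly simplified and not the actual code:
subpool/cgroup accounting, locking, the buddy allocator fallback and the
commit race handling are omitted, and the function name below is made up
for the sketch.

/* Simplified sketch only -- not the real alloc_huge_page() */
static struct page *alloc_huge_page_sketch(struct hstate *h,
					   struct vm_area_struct *vma,
					   unsigned long addr,
					   int avoid_reserve)
{
	struct page *page;
	long map_chg;

	/* 0 => a reservation already exists for this index */
	map_chg = vma_needs_reservation(h, vma, addr);
	if (map_chg < 0)
		return ERR_PTR(-ENOMEM);

	/*
	 * If a reservation is consumed here, the reserve count is
	 * decremented and HPageRestoreReserve is set on the page.
	 */
	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, map_chg);
	if (!page) {
		/* the real code falls back to the buddy allocator here */
		return ERR_PTR(-ENOSPC);
	}

	/*
	 * The reserve map entry for this index is finalized here for
	 * BOTH shared and private mappings, whether or not
	 * HPageRestoreReserve was set above.  This is the entry that is
	 * left behind if the caller hits an error before instantiating
	 * the page.
	 */
	vma_commit_reservation(h, vma, addr);

	return page;
}

Note the commit is unconditional; that is the 'always modifies the
reserve map' part of the commit message.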
vma_del_reservation is only called in the path where HPageRestoreReserve
is not set.  So, no reservation entry was present in the reserve map
before the allocation.  alloc_huge_page likely added an entry via
vma_commit_reservation.  Since there was an error and we will be freeing
the page, we need to make sure no reservation exists.

As you know, making sure a reservation does not exist is different for
shared and private mappings.  Remember that the private mapping reserve
map has the opposite meaning: an entry in the map means the reservation
for that index has already been consumed.
- For shared mappings, we need to make sure there is not an entry in the
  reserve map.
- For private mappings, we need to make sure there is an entry in the
  reserve map.
That is why vma_del_reservation does a region_del for shared mappings
and a region_add for private mappings.

> Also, I believe hugetlb_unreserve_pages() calls region_del() directly,
> and doesn't go through the vma_*_reservation() interface. If you're
> adding a delete function it may be nice to convert that to use the new
> function as well.
>
> I'm happy to take this fix over and submit a v2, since I think I
> understand the problem and can readily reproduce the issue (I just add
> the warning and some prints and run the userfaultfd tests).

Do run your test.  I am interested to see if you experience issues.

--
Mike Kravetz