Re: [PATCH] mm/hugetlb: expand restore_reserve_on_error functionality

Mina Almasry <almasrymina@xxxxxxxxxx> · Wed, 2 Jun 2021 17:38:51 -0700

On Wed, Jun 2, 2021 at 4:50 PM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
>
> The routine restore_reserve_on_error is called to restore reservation
> information when an error occurs after page allocation.  The routine
> alloc_huge_page modifies the mapping reserve map and potentially the
> reserve count during allocation.  If code calling alloc_huge_page
> encounters an error after allocation and needs to free the page, the
> reservation information needs to be adjusted.
>
> Currently, restore_reserve_on_error only takes action on pages for which
> the reserve count was adjusted(HPageRestoreReserve flag).  There is
> nothing wrong with these adjustments.  However, alloc_huge_page ALWAYS
> modifies the reserve map during allocation even if the reserve count is
> not adjusted.  This can cause issues as observed during development of
> this patch [1].
>
> One specific series of operations causing an issue is:
> - Create a shared hugetlb mapping
>   Reservations for all pages created by default
> - Fault in a page in the mapping
>   Reservation exists so reservation count is decremented
> - Punch a hole in the file/mapping at index previously faulted
>   Reservation and any associated pages will be removed
> - Allocate a page to fill the hole
>   No reservation entry, so reserve count unmodified
>   Reservation entry added to map by alloc_huge_page
> - Error after allocation and before instantiating the page
>   Reservation entry remains in map
> - Allocate a page to fill the hole
>   Reservation entry exists, so decrement reservation count
> This will cause a reservation count underflow as the reservation count
> was decremented twice for the same index.
>
> A user would observe a very large number for HugePages_Rsvd in
> /proc/meminfo.  This would also likely cause subsequent allocations of
> hugetlb pages to fail as it would 'appear' that all pages are reserved.
>
> This sequence of operations is unlikely to happen, however they were
> easily reproduced and observed using hacked up code as described in [1].
>
> Address the issue by having the routine restore_reserve_on_error take
> action on pages where HPageRestoreReserve is not set.  In this case, we
> need to remove any reserve map entry created by alloc_huge_page.  A new
> helper routine vma_del_reservation assists with this operation.
>
> There are three callers of alloc_huge_page which do not currently call
> restore_reserve_on error before freeing a page on error paths.  Add
> those missing calls.
>
> [1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@xxxxxxxxxx/
> Fixes: 96b96a96ddee ("mm/hugetlb: fix huge page reservation leak in private mapping error paths"
> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>
> ---
>  fs/hugetlbfs/inode.c    |   1 +
>  include/linux/hugetlb.h |   2 +
>  mm/hugetlb.c            | 120 ++++++++++++++++++++++++++++++++--------
>  3 files changed, 100 insertions(+), 23 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 21f7a5c92e8a..926eeb9bf4eb 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -735,6 +735,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>                 __SetPageUptodate(page);
>                 error = huge_add_to_page_cache(page, mapping, index);
>                 if (unlikely(error)) {
> +                       restore_reserve_on_error(h, &pseudo_vma, addr, page);
>                         put_page(page);
>                         mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>                         goto out;
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 44721df20e89..b375b31f60c4 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -627,6 +627,8 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
>                                 unsigned long address);
>  int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>                         pgoff_t idx);
> +void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
> +                               unsigned long address, struct page *page);
>
>  /* arch callback */
>  int __init __alloc_bootmem_huge_page(struct hstate *h);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9a616b7a8563..36b691c3eabf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2259,12 +2259,18 @@ static void return_unused_surplus_pages(struct hstate *h,
>   * be restored when a newly allocated huge page must be freed.  It is
>   * to be called after calling vma_needs_reservation to determine if a
>   * reservation exists.
> + *
> + * vma_del_reservation is used in error paths where an entry in the reserve
> + * map was created during huge page allocation and must be removed.  It is to
> + * be called after calling vma_needs_reservation to determine if a reservation
> + * exists.
>   */
>  enum vma_resv_mode {
>         VMA_NEEDS_RESV,
>         VMA_COMMIT_RESV,
>         VMA_END_RESV,
>         VMA_ADD_RESV,
> +       VMA_DEL_RESV,
>  };
>  static long __vma_reservation_common(struct hstate *h,
>                                 struct vm_area_struct *vma, unsigned long addr,
> @@ -2308,11 +2314,21 @@ static long __vma_reservation_common(struct hstate *h,
>                         ret = region_del(resv, idx, idx + 1);
>                 }
>                 break;
> +       case VMA_DEL_RESV:
> +               if (vma->vm_flags & VM_MAYSHARE) {
> +                       region_abort(resv, idx, idx + 1, 1);
> +                       ret = region_del(resv, idx, idx + 1);
> +               } else {
> +                       ret = region_add(resv, idx, idx + 1, 1, NULL, NULL);
> +                       /* region_add calls of range 1 should never fail. */
> +                       VM_BUG_ON(ret < 0);
> +               }
> +               break;

I haven't tested, but at first glance I don't think this quite works,
no? The thing is that alloc_huge_page() does a
vma_commit_reservation() which does add_region() regardless if
vm_flags & VM_MAYSHARE or not, so to unroll that we need a function
that does a region_del() regardless if vm_flags & VM_MAYSHARE or not.
I wonder if the root of the bug is the unconditional region_add()
vma_commit_reservation() does.

Also, I believe hugetlb_unreserve_pages() calls region_del() directly,
and doesn't go through the vma_*_reservation() interface. If you're
adding a delete function it may be nice to convert that to use the new
function as well.

I'm happy to take this fix over and submit a v2, since I think I
understand the problem and can readily reproduce the issue (I just add
the warning and some prints and run the userfaultfd tests).

>         default:
>                 BUG();
>         }
>
> -       if (vma->vm_flags & VM_MAYSHARE)
> +       if (vma->vm_flags & VM_MAYSHARE || mode == VMA_DEL_RESV)
>                 return ret;
>         /*
>          * We know private mapping must have HPAGE_RESV_OWNER set.
> @@ -2360,25 +2376,39 @@ static long vma_add_reservation(struct hstate *h,
>         return __vma_reservation_common(h, vma, addr, VMA_ADD_RESV);
>  }
>
> +static long vma_del_reservation(struct hstate *h,
> +                       struct vm_area_struct *vma, unsigned long addr)
> +{
> +       return __vma_reservation_common(h, vma, addr, VMA_DEL_RESV);
> +}
> +
>  /*
> - * This routine is called to restore a reservation on error paths.  In the
> - * specific error paths, a huge page was allocated (via alloc_huge_page)
> - * and is about to be freed.  If a reservation for the page existed,
> - * alloc_huge_page would have consumed the reservation and set
> - * HPageRestoreReserve in the newly allocated page.  When the page is freed
> - * via free_huge_page, the global reservation count will be incremented if
> - * HPageRestoreReserve is set.  However, free_huge_page can not adjust the
> - * reserve map.  Adjust the reserve map here to be consistent with global
> - * reserve count adjustments to be made by free_huge_page.
> + * This routine is called to restore reservation information on error paths.
> + * It should ONLY be called for pages allocated via alloc_huge_page(), and
> + * the hugetlb mutex should remain held when calling this routine.
> + *
> + * It handles two specific cases:
> + * 1) A reservation was in place and page consumed the reservation.
> + *    HPageRestoreRsvCnt is set in the page.
> + * 2) No reservation was in place for the page, so HPageRestoreRsvCnt is
> + *    not set.  However, alloc_huge_page always updates the reserve map.
> + *
> + * In case 1, free_huge_page will increment the global reserve count.  But,
> + * free_huge_page does not have enough context to adjust the reservation map.
> + * This case deals primarily with private mappings.  Adjust the reserve map
> + * here to be consistent with global reserve count adjustments to be made
> + * by free_huge_page.  Make sure the reserve map indicates there is a
> + * reservation present.
> + *
> + * In case 2, simply undo reserve map modifications done by alloc_huge_page.
>   */
> -static void restore_reserve_on_error(struct hstate *h,
> -                       struct vm_area_struct *vma, unsigned long address,
> -                       struct page *page)
> +void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
> +                       unsigned long address, struct page *page)
>  {
> -       if (unlikely(HPageRestoreReserve(page))) {
> -               long rc = vma_needs_reservation(h, vma, address);
> +       long rc = vma_needs_reservation(h, vma, address);
>
> -               if (unlikely(rc < 0)) {
> +       if (HPageRestoreReserve(page)) {
> +               if (unlikely(rc < 0))
>                         /*
>                          * Rare out of memory condition in reserve map
>                          * manipulation.  Clear HPageRestoreReserve so that
> @@ -2391,16 +2421,57 @@ static void restore_reserve_on_error(struct hstate *h,
>                          * accounting of reserve counts.
>                          */
>                         ClearHPageRestoreReserve(page);
> -               } else if (rc) {
> -                       rc = vma_add_reservation(h, vma, address);
> -                       if (unlikely(rc < 0))
> +               else if (rc)
> +                       (void)vma_add_reservation(h, vma, address);
> +               else
> +                       vma_end_reservation(h, vma, address);
> +       } else {
> +               if (!rc) {
> +                       /*
> +                        * This indicates there is an entry in the reserve map
> +                        * added by alloc_huge_page.  We know it was added
> +                        * before the alloc_huge_page call, otherwise
> +                        * HPageRestoreReserve would be set on the page.
> +                        * Remove the entry so that a subsequent allocation
> +                        * does not consume a reservation.
> +                        */
> +                       rc = vma_del_reservation(h, vma, address);
> +                       if (rc < 0)
>                                 /*
> -                                * See above comment about rare out of
> -                                * memory condition.
> +                                * VERY rare out of memory condition.  Since
> +                                * we can not delete the entry, set
> +                                * HPageRestoreReserve so that the reserve
> +                                * count will be incremented when the page
> +                                * is freed.  This reserve will be consumed
> +                                * on a subsequent allocation.
>                                  */
> -                               ClearHPageRestoreReserve(page);
> +                               SetHPageRestoreReserve(page);
> +               } else if (rc < 0) {
> +                       /*
> +                        * Rare out of memory condition from
> +                        * vma_needs_reservation call.  Memory allocation is
> +                        * only attempted if a new entry is needed.  Therefore,
> +                        * this implies there is not an entry in the
> +                        * reserve map.
> +                        *
> +                        * For shared mappings, no entry in the map indicates
> +                        * no reservation.  We are done.
> +                        */
> +                       if (!(vma->vm_flags & VM_MAYSHARE))
> +                               /*
> +                                * For private mappings, no entry indicates
> +                                * a reservation is present.  Since we can
> +                                * not add an entry, set SetHPageRestoreReserve
> +                                * on the page so reserve count will be
> +                                * incremented when freed.  This reserve will
> +                                * be consumed on a subsequent allocation.
> +                                */
> +                               SetHPageRestoreReserve(page);
>                 } else
> -                       vma_end_reservation(h, vma, address);
> +                       /*
> +                        * No reservation present, do nothing
> +                        */
> +                        vma_end_reservation(h, vma, address);
>         }
>  }
>
> @@ -4176,6 +4247,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                                 spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>                                 entry = huge_ptep_get(src_pte);
>                                 if (!pte_same(src_pte_old, entry)) {
> +                                       restore_reserve_on_error(h, vma, addr,
> +                                                               new);
>                                         put_page(new);
>                                         /* dst_entry won't change as in child */
>                                         goto again;
> @@ -5174,6 +5247,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>         if (vm_shared || is_continue)
>                 unlock_page(page);
>  out_release_nounlock:
> +       restore_reserve_on_error(h, dst_vma, dst_addr, page);
>         put_page(page);
>         goto out;
>  }
> --
> 2.31.1
>