On Tue, Jun 25, 2024 at 8:38 PM Muchun Song <muchun.song@xxxxxxxxx> wrote:
>
> On 2024/6/22 05:37, Yu Zhao wrote:
> > While investigating HVO for THPs [1], it turns out that speculative
> > PFN walkers like compaction can race with vmemmap modifications,
> > e.g.,
> >
> >   CPU 1 (vmemmap modifier)         CPU 2 (speculative PFN walker)
> >   -----------------------------    ------------------------------
> >   Allocates an LRU folio page1
> >                                    Sees page1
> >   Frees page1
> >
> >   Allocates a hugeTLB folio page2
> >   (page1 being a tail of page2)
> >
> >   Updates vmemmap mapping page1
> >                                    get_page_unless_zero(page1)
> >
> > Even though page1 has a zero refcnt after HVO, get_page_unless_zero()
> > can still try to modify its read-only struct page, resulting in a
> > crash.
> >
> > An independent report [2] confirmed this race.
>
> Right. Thanks for your continuous focus on this race.
>
> > There are two discussed approaches to fix this race:
> > 1. Make the RO vmemmap RW so that get_page_unless_zero() can fail
> >    without triggering a PF.
> > 2. Use RCU to make sure get_page_unless_zero() either sees zero
> >    refcnts through the old vmemmap or non-zero refcnts through the
> >    new one.
> >
> > The second approach is preferred here because:
> > 1. It can prevent illegal modifications to a struct page[] that has
> >    been HVO'ed;
> > 2. It can be generalized, in a way similar to ZERO_PAGE(), to fix
> >    similar races in other places, e.g., arch_remove_memory() on x86
> >    [3], which frees the vmemmap mapping offlined struct page[].
> >
> > While adding synchronize_rcu(), the goal is to be surgical rather
> > than optimized. Specifically, the calls to synchronize_rcu() on the
> > error-handling paths could be coalesced, but that is not done for
> > the sake of simplicity: notably, this fix removes ~50% more lines
> > than it adds.
> >
> > [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@xxxxxxxxxx/
> > [2] https://lore.kernel.org/917FFC7F-0615-44DD-90EE-9F85F8EA9974@xxxxxxxxx/
> > [3] https://lore.kernel.org/be130a96-a27e-4240-ad78-776802f57cad@xxxxxxxxxx/
> >
> > Signed-off-by: Yu Zhao <yuzhao@xxxxxxxxxx>
> > ---
> >   include/linux/page_ref.h |  8 ++++++-
> >   mm/hugetlb.c             | 50 +++++-----------------------------------
> >   mm/hugetlb_vmemmap.c     | 16 +++++++++++++
> >   3 files changed, 29 insertions(+), 45 deletions(-)
> >
> > diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> > index 1acf5bac7f50..add92e8f31b2 100644
> > --- a/include/linux/page_ref.h
> > +++ b/include/linux/page_ref.h
> > @@ -230,7 +230,13 @@ static inline int folio_ref_dec_return(struct folio *folio)
> >
> >   static inline bool page_ref_add_unless(struct page *page, int nr, int u)
> >   {
> > -     bool ret = atomic_add_unless(&page->_refcount, nr, u);
> > +     bool ret = false;
> > +
> > +     rcu_read_lock();
> > +     /* avoid writing to the vmemmap area being remapped */
> > +     if (!page_is_fake_head(page) && page_ref_count(page) != u)
> > +             ret = atomic_add_unless(&page->_refcount, nr, u);
> > +     rcu_read_unlock();
> >
> >       if (page_ref_tracepoint_active(page_ref_mod_unless))
> >               __page_ref_mod_unless(page, nr, ret);
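(For context: the reader side above pairs with synchronize_rcu() calls
on the vmemmap-writer side in mm/hugetlb_vmemmap.c; that hunk is not
quoted in this reply. A rough sketch of the idea, with the exact call
sites assumed, is:

	/*
	 * Sketch only -- the placement within mm/hugetlb_vmemmap.c is
	 * assumed here. Before remapping (and later freeing) the
	 * vmemmap pages backing this folio's struct pages, wait out
	 * speculative PFN walkers that may still be inside the RCU
	 * read-side critical section in page_ref_add_unless().
	 */
	synchronize_rcu();

That way a walker either runs entirely against the old vmemmap, where
it sees a zero refcnt and fails, or starts after the grace period and
sees the new one.)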
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index f35abff8be60..271d83a7cde0 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1629,9 +1629,8 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio,
> >    *
> >    * Must be called with hugetlb lock held.
> >    */
> > -static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
> > -                                                     bool adjust_surplus,
> > -                                                     bool demote)
> > +static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
> > +                                                     bool adjust_surplus)
> >   {
> >       int nid = folio_nid(folio);
> >
> > @@ -1661,33 +1660,13 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
> >       if (!folio_test_hugetlb_vmemmap_optimized(folio))
> >               __folio_clear_hugetlb(folio);
> >
> > -     /*
> > -      * In the case of demote we do not ref count the page as it will soon
> > -      * be turned into a page of smaller size.
> > -      */
> > -     if (!demote)
> > -             folio_ref_unfreeze(folio, 1);
> > -
> >       h->nr_huge_pages--;
> >       h->nr_huge_pages_node[nid]--;
> >   }
> >
> > -static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
> > -                                                     bool adjust_surplus)
> > -{
> > -     __remove_hugetlb_folio(h, folio, adjust_surplus, false);
> > -}
> > -
> > -static void remove_hugetlb_folio_for_demote(struct hstate *h, struct folio *folio,
> > -                                                     bool adjust_surplus)
> > -{
> > -     __remove_hugetlb_folio(h, folio, adjust_surplus, true);
> > -}
> > -
> >   static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
> >                                 bool adjust_surplus)
> >   {
> > -     int zeroed;
> >       int nid = folio_nid(folio);
> >
> >       VM_BUG_ON_FOLIO(!folio_test_hugetlb_vmemmap_optimized(folio), folio);
> > @@ -1711,21 +1690,6 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
> >        */
> >       folio_set_hugetlb_vmemmap_optimized(folio);
> >
> > -     /*
> > -      * This folio is about to be managed by the hugetlb allocator and
> > -      * should have no users. Drop our reference, and check for others
> > -      * just in case.
> > -      */
> > -     zeroed = folio_put_testzero(folio);
> > -     if (unlikely(!zeroed))
> > -             /*
> > -              * It is VERY unlikely soneone else has taken a ref
> > -              * on the folio. In this case, we simply return as
> > -              * free_huge_folio() will be called when this other ref
> > -              * is dropped.
> > -              */
> > -             return;
> > -
> >       arch_clear_hugetlb_flags(folio);
> >       enqueue_hugetlb_folio(h, folio);
> >   }
> > @@ -1779,6 +1743,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> >               spin_unlock_irq(&hugetlb_lock);
> >       }
> >
> > +     folio_ref_unfreeze(folio, 1);
> > +
> >       /*
> >        * Non-gigantic pages demoted from CMA allocated gigantic pages
> >        * need to be given back to CMA in free_gigantic_folio.
> > @@ -3079,11 +3045,8 @@ static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
> >
> >   free_new:
> >       spin_unlock_irq(&hugetlb_lock);
> > -     if (new_folio) {
> > -             /* Folio has a zero ref count, but needs a ref to be freed */
> > -             folio_ref_unfreeze(new_folio, 1);
> > +     if (new_folio)
> >               update_and_free_hugetlb_folio(h, new_folio, false);
> > -     }
>
> Looking into this function, we have:
>
>     dissolve_free_huge_page
>     retry:
>         if (!folio_test_hugetlb(folio))
>             return;
>         if (!folio_ref_count(folio))
>             if (unlikely(!folio_test_hugetlb_freed(folio)))
>                 goto retry;
>             remove_hugetlb_folio(h, folio, false);
>
> Since you no longer raise the refcount in remove_hugetlb_folio(), a
> concurrent dissolve_free_huge_page() will dissolve this page again.
> Then, the statistics (like ->nr_huge_pages) will be wrong.

Thanks for pointing this out!

> A solution seems easy: we should call folio_clear_hugetlb_freed() in
> remove_hugetlb_folio().

Agreed.
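Something along these lines, I assume (sketch only; this relies on
folio_clear_hugetlb_freed() being the existing counterpart of the
folio_set_hugetlb_freed() call made when a folio is enqueued as free):

	static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
					 bool adjust_surplus)
	{
		...
		/*
		 * Clear HPG_freed so that a concurrent
		 * dissolve_free_huge_page() keeps retrying until
		 * folio_test_hugetlb() turns false, instead of
		 * dissolving the same folio twice and corrupting
		 * counters such as h->nr_huge_pages.
		 */
		folio_clear_hugetlb_freed(folio);
		...
	}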