Re: [PATCH v1] mm/hugetlb_vmemmap: remap head page to newly allocated page

On Wed, Aug 03, 2022 at 01:22:21PM +0100, Joao Martins wrote:
> 
> 
> On 8/3/22 11:44, Muchun Song wrote:
> > On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
> >> On 8/3/22 05:11, Muchun Song wrote:
> >>> On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
> >>>> Today with `hugetlb_free_vmemmap=on` the struct page memory that is
> >>>> freed back to the page allocator is as follows: for a 2M hugetlb page it
> >>>> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap
> >>>> pages, and for a 1G hugetlb page it will remap the remaining 4095 vmemmap
> >>>> pages. Essentially, that means that it breaks the first 4K of a
> >>>> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or
> >>>> 16M (for 1G hugetlb pages). For this reason the memory that is freed
> >>>> back to the page allocator cannot be used for hugetlb to allocate huge
> >>>> pages of the same size, but only huge pages of a smaller size:
> >>>>
> >>>
> >>> Hi Joao,
> >>>
> >>> Thanks for your work on this. I admit you are right. The current mechanism
> >>> prevents the freed vmemmap pages from being merged into a potentially
> >>> contiguous page. Allocating a new head page is a straightforward approach;
> >>> however, it is very dangerous at runtime after the system has booted up. Why
> >>> dangerous? Because you have to first 1) copy the content from the head vmemmap
> >>> page to the targeted (newly allocated) page, and then 2) change the PTE
> >>> entry to the new page. However, the content (especially the refcount) of the
> >>> old head vmemmap page could be changed elsewhere (e.g. by other modules)
> >>> between steps 1) and 2). Eventually, the newly allocated vmemmap page is
> >>> corrupted. Unfortunately, we don't have an easy approach to prevent that.
> >>>
> >> OK, I see what I missed. You mean the refcount (or any other data) on the
> >> struct pages preceding this head struct page, unrelated to the hugetlb
> >> page being remapped. Meaning when the first struct page in the old vmemmap
> >> page is *not* the head page we are trying to remap, right?
> >>
> >> See further below in your patch, but I wonder if we could actually check this
> >> from the hugetlb head va being aligned to PAGE_SIZE. Meaning that we would be
> >> checking that this head page is the first struct page in the vmemmap page,
> >> and hence that we are still safe to remap it? If so, the patch could be simpler,
> >> more like mine, without the special freeing path you added below.
> >>
> >> If I'm right, see at the end.
> > 
> > I am not sure we are on the same page (it seems that we are not, after seeing
> > your patch below).
> 
> Even though I misunderstood you, it might still be a possible scenario.
> 
> > So let me clarify it further.
> > 
> Thanks
> 
> > CPU0:                                            CPU1:
> > 
> > vmemmap_remap_free(start, end, reuse)
> >   // alloc a new page to be used as the head vmemmap page
> >   page = alloc_pages_node();
> > 
> >   memcpy(page_address(page), reuse, PAGE_SIZE);
> >                                                  // Now the @reuse address is mapped to the original
> >                                                  // page frame. So the change will be reflected on the
> >                                                  // original page frame.
> >                                                  get_page(reuse);
> >   vmemmap_remap_pte();
> >    // remap to the above new allocated page 
> >    set_pte_at();
> > 
> >   flush_tlb_kernel_range();
> 
> note-to-self: I totally missed changing the flush_tlb_kernel_range() call to include the full range.
>

Right. I have noticed that as well.
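
(As a rough sketch of what I mean, not taken from either patch and assuming
@start and @end bound the whole remapped vmemmap range, the flush has to cover
every rewritten PTE:

	/* one flush over the entire remapped range, after all PTEs are rewritten */
	flush_tlb_kernel_range(start, end);

rather than a flush of just [@reuse, @reuse + PAGE_SIZE).)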

> >                                                  // Now the @reuse address is mapped to the new allocated
> >                                                  // page frame. So the change will be reflected on the
> >                                                  // new page frame and it is corrupted.
> >                                                  put_page(reuse);
> > 
> > So we should make 1) the memcpy, 2) the remap and 3) the TLB flush atomic on
> > CPU0; however, that is not easy.
> > 
> OK, I understand what you mean now. However, I am trying to work out whether this race
> is actually possible. Note that, given your previous answer above, I am assuming in your
> race scenario that the vmemmap page only ever stores metadata (struct page) related to
> the hugetlb page currently being remapped. If this assumption is wrong, then your race
> would be possible (but it wouldn't come from a get_page() on the reuse_addr).
> 
> So, how would someone else get to do a get_page() on the head page that we are remapping
> (and its put_page() for that matter), considering we are at prep_new_huge_page() when we
> call vmemmap_remap_free()? We already got the page from the page allocator, but hugetlb
> hasn't yet finished initializing the head page metadata, hence it isn't yet visible for
> someone else to grab, e.g. via demote/page-fault/migration/etc.?
>

As far as I know, there are at least two places which could grab the refcount:
1) GUP and 2) memory failure.

For 1), you can refer to commit 7118fc2906e2925d7edb5ed9c8a57f2a5f23b849.
For 2), I think you can refer to the function __get_unpoison_page().

Both places could grab an extra refcount on the struct page of the HugeTLB page
being processed.
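
(For illustration only, a minimal sketch, not the actual kernel code, of why the
refcount of a page that hugetlb has not finished initializing can still change
under us: both of the paths above take a purely speculative reference, i.e. they
only need ->_refcount to be non-zero at that instant and do not coordinate with
hugetlb_vmemmap at all.

	#include <linux/page_ref.h>

	/*
	 * Hypothetical helper mirroring what get_page_unless_zero() boils
	 * down to: succeed whenever _refcount != 0, which is already true
	 * right after the buddy allocator handed the page to
	 * prep_new_huge_page(), long before the head page is fully set up.
	 */
	static inline bool speculative_get_ref(struct page *page)
	{
		return page_ref_add_unless(page, 1, 0);
	}

If such a bump lands on the old head page frame after the memcpy() but before
the PTE switch and TLB flush, the copy in the new page silently misses it.)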

Thanks.

> On the hugetlb freeing path (vmemmap_remap_alloc) we just keep the already mapped
> head page as is and will not be allocating a new one.
> 
> Perhaps I am missing something obvious.
> 
> > Thanks.
> > 
> >>
> >>> I also thought of solving this issue, but I didn't find any easy way to
> >>> solve it after the system has booted up. However, it is possible at system
> >>> boot time, because if the system is in a very early initialization stage,
> >>> no one should be accessing struct pages. I implemented this a month ago.
> >>> I didn't send it out since some additional preparation work is required.
> >>> The main preparation is to move the allocation of HugeTLB pages to a very
> >>> early initialization stage (I want to put it into pure_initcall). Because
> >>> the allocation of HugeTLB pages is parsed from the cmdline, whose logic is
> >>> very complex, I have sent a patch to clean it up [1]. After this cleanup, it
> >>> will be easy to move the allocation to an early stage. Sorry, I have been
> >>> busy lately and didn't have time to send a new version. But I will
> >>> send it ASAP.
> >>>
> >>> [1] https://lore.kernel.org/all/20220616071827.3480-1-songmuchun@xxxxxxxxxxxxx/
> >>>
> >>> The following diff is the main work to remap the head vmemmap page to
> >>> a newly allocated page.
> >>>
> >> I like how you use walk::reuse_addr to tell it's a head page.
> >> I didn't find my version, which passes a boolean to remap_pte(), as clean.
> >>
> >>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> >>> index 20f414c0379f..71f2d7335e6f 100644
> >>> --- a/mm/hugetlb_vmemmap.c
> >>> +++ b/mm/hugetlb_vmemmap.c
> >>> @@ -15,6 +15,7 @@
> >>>  #include <asm/pgalloc.h>
> >>>  #include <asm/tlbflush.h>
> >>>  #include "hugetlb_vmemmap.h"
> >>> +#include "internal.h"
> >>>
> >>>  /**
> >>>   * struct vmemmap_remap_walk - walk vmemmap page table
> >>> @@ -227,14 +228,37 @@ static inline void free_vmemmap_page(struct page *page)
> >>>  }
> >>>
> >>>  /* Free a list of the vmemmap pages */
> >>> -static void free_vmemmap_page_list(struct list_head *list)
> >>> +static void free_vmemmap_pages(struct list_head *list)
> >>>  {
> >>>         struct page *page, *next;
> >>>
> >>> +       list_for_each_entry_safe(page, next, list, lru)
> >>> +               free_vmemmap_page(page);
> >>> +}
> >>> +
> >>> +/*
> >>> + * Free a list of vmemmap pages but skip the per-cpu free lists of the
> >>> + * buddy allocator.
> >>> + */
> >>> +static void free_vmemmap_pages_nonpcp(struct list_head *list)
> >>> +{
> >>> +       struct zone *zone;
> >>> +       struct page *page, *next;
> >>> +
> >>> +       if (list_empty(list))
> >>> +               return;
> >>> +
> >>> +       zone = page_zone(list_first_entry(list, struct page, lru));
> >>> +       zone_pcp_disable(zone);
> >>>         list_for_each_entry_safe(page, next, list, lru) {
> >>> -               list_del(&page->lru);
> >>> +               if (zone != page_zone(page)) {
> >>> +                       zone_pcp_enable(zone);
> >>> +                       zone = page_zone(page);
> >>> +                       zone_pcp_disable(zone);
> >>> +               }
> >>>                 free_vmemmap_page(page);
> >>>         }
> >>> +       zone_pcp_enable(zone);
> >>>  }
> >>>
> >>>  static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> >>> @@ -244,12 +268,28 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> >>>          * Remap the tail pages as read-only to catch illegal write operation
> >>>          * to the tail pages.
> >>>          */
> >>> -       pgprot_t pgprot = PAGE_KERNEL_RO;
> >>> +       pgprot_t pgprot = addr == walk->reuse_addr ? PAGE_KERNEL : PAGE_KERNEL_RO;
> >>>         pte_t entry = mk_pte(walk->reuse_page, pgprot);
> >>>         struct page *page = pte_page(*pte);
> >>>
> >>>         list_add_tail(&page->lru, walk->vmemmap_pages);
> >>>         set_pte_at(&init_mm, addr, pte, entry);
> >>> +
> >>> +       if (unlikely(addr == walk->reuse_addr)) {
> >>> +               void *old = page_to_virt(page);
> >>> +
> >>> +               /* Remove it from the vmemmap_pages list to avoid being freed. */
> >>> +               list_del(&walk->reuse_page->lru);
> >>> +               flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> >>> +               /*
> >>> +                * If we reach here, the system is in a very early
> >>> +                * initialization stage and no one should be accessing struct
> >>> +                * pages. However, if something unexpected happens, the head
> >>> +                * struct page is the most likely to be written to (usually
> >>> +                * ->_refcount). Use BUG_ON() to catch this unexpected case.
> >>> +                */
> >>> +               BUG_ON(memcmp(old, (void *)addr, sizeof(struct page)));
> >>> +       }
> >>>  }
> >>>
> >>>  /*
> >>> @@ -298,7 +338,10 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> >>>   *             to remap.
> >>>   * @end:       end address of the vmemmap virtual address range that we want to
> >>>   *             remap.
> >>> - * @reuse:     reuse address.
> >>> + * @reuse:     reuse address. If @reuse is equal to @start, it means the page
> >>> + *             frame which the @reuse address is mapped to will be replaced with
> >>> + *             a new page frame and the previous page frame will be freed. This
> >>> + *             is to reduce memory fragmentation.
> >>>   *
> >>>   * Return: %0 on success, negative error code otherwise.
> >>>   */
> >>> @@ -319,14 +362,26 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> >>>          * (see more details from the vmemmap_pte_range()):
> >>>          *
> >>>          * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
> >>> -        *   should be continuous.
> >>> +        *   should be continuous, or @start should be equal to @reuse.
> >>>          * - The @reuse address is part of the range [@reuse, @end) that we are
> >>>          *   walking which is passed to vmemmap_remap_range().
> >>>          * - The @reuse address is the first in the complete range.
> >>>          *
> >>>          * So we need to make sure that @start and @reuse meet the above rules.
> >>>          */
> >>> -       BUG_ON(start - reuse != PAGE_SIZE);
> >>> +       BUG_ON(start - reuse != PAGE_SIZE && start != reuse);
> >>> +
> >>> +       if (unlikely(reuse == start)) {
> >>> +               int nid = page_to_nid((struct page *)start);
> >>> +               gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
> >>> +                                __GFP_NOWARN;
> >>> +
> >>> +               walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
> >>> +               if (walk.reuse_page) {
> >>> +                       copy_page(page_to_virt(walk.reuse_page), (void *)reuse);
> >>> +                       list_add(&walk.reuse_page->lru, &vmemmap_pages);
> >>> +               }
> >>> +       }
> >>>
> >>>         mmap_read_lock(&init_mm);
> >>>         ret = vmemmap_remap_range(reuse, end, &walk);
> >>> @@ -348,7 +403,10 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> >>>         }
> >>>         mmap_read_unlock(&init_mm);
> >>>
> >>> -       free_vmemmap_page_list(&vmemmap_pages);
> >>> +       if (unlikely(reuse == start))
> >>> +               free_vmemmap_pages_nonpcp(&vmemmap_pages);
> >>> +       else
> >>> +               free_vmemmap_pages(&vmemmap_pages);
> >>>
> >>>         return ret;
> >>>  }
> >>> @@ -512,6 +570,22 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h
> >>>         return true;
> >>>  }
> >>>
> >>> +/*
> >>> + * Control whether the page frame that the head vmemmap address of a HugeTLB
> >>> + * page is mapped to should be replaced with a new page. The vmemmap pages are
> >>> + * usually mapped with a huge PMD mapping; the head vmemmap page frame is best
> >>> + * freed to the buddy allocator at an early stage of system booting to reduce
> >>> + * memory fragmentation.
> >>> + */
> >>> +static bool vmemmap_remap_head __ro_after_init = true;
> >>> +
> >>> +static int __init vmemmap_remap_head_init(void)
> >>> +{
> >>> +       vmemmap_remap_head = false;
> >>> +       return 0;
> >>> +}
> >>> +core_initcall(vmemmap_remap_head_init);
> >>> +
> >>>  /**
> >>>   * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
> >>>   * @h:         struct hstate.
> >>> @@ -537,6 +611,19 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> >>>         vmemmap_start   += HUGETLB_VMEMMAP_RESERVE_SIZE;
> >>>
> >>>         /*
> >>> +        * The vmemmap pages are usually mapped with a huge PMD mapping. If the
> >>> +        * head vmemmap page is not freed to the buddy allocator, then those
> >>> +        * freed tail vmemmap pages cannot be merged into a higher-order chunk.
> >>> +        * If the head vmemmap page frame is replaced with a newly allocated
> >>> +        * page and freed to the buddy allocator, then those freed vmemmap
> >>> +        * pages have the opportunity to be merged into larger contiguous pages,
> >>> +        * reducing memory fragmentation. vmemmap_remap_free() will do this if
> >>> +        * @vmemmap_start is equal to @vmemmap_reuse.
> >>> +        */
> >>> +       if (unlikely(vmemmap_remap_head))
> >>> +               vmemmap_start = vmemmap_reuse;
> >>> +
> >>
> >> Maybe it doesn't need to be strictly early init vs late init.
> >>
> >> I wonder if we can still apply this trick at a late stage, but only when the head
> >> struct page is aligned to PAGE_SIZE / sizeof(struct page). When it is, it means that
> >> we are safe to replace the head vmemmap page, provided we would only be covering this
> >> hugetlb page's data? Unless I am missing something and the check wouldn't be enough?
> >>
> >> A quick check with this snippet added:
> >>
> >> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> >> index 2b97df8115fe..06e028734b1e 100644
> >> --- a/mm/hugetlb_vmemmap.c
> >> +++ b/mm/hugetlb_vmemmap.c
> >> @@ -331,7 +331,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> >>         };
> >>         gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
> >>         int nid = page_to_nid((struct page *)start);
> >> -       struct page *page;
> >> +       struct page *page = NULL;
> >>
> >>         /*
> >>          * Allocate a new head vmemmap page to avoid breaking a contiguous
> >> @@ -341,7 +341,8 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> >>          * more allocations of hugepages. Fallback to the currently
> >>          * mapped head page in case should it fail to allocate.
> >>          */
> >> -       page = alloc_pages_node(nid, gfp_mask, 0);
> >> +       if (IS_ALIGNED(start, PAGE_SIZE))
> >> +               page = alloc_pages_node(nid, gfp_mask, 0);
> >>         walk.head_page = page;
> >>
> >>         /*
> >>
> >>
> >> ... and it looks like it can still be effective, just not as much, more or less as expected.
> >>
> >> # with 2M hugepages
> >>
> >> Before:
> >>
> >> Node    0, zone   Normal, type      Movable     76     28     10      4      1      0      0      0      1      1  15568
> >>
> >> After (allocated 32400):
> >>
> >> Node    0, zone   Normal, type      Movable    135    328    219    198    155    106     72     41     23      0      0
> >>
> >> Compared to my original patch where there weren't any pages left in the order-0..order-2:
> >>
> >> Node    0, zone   Normal, type      Movable      0      1      0     70    106     91     78     48     17      0      0
> >>
> >> But still much better than without any of this:
> >>
> >> Node    0, zone   Normal, type      Movable  32174  31999  31642    104     58     24     16      4      2      0      0
> >>
> 



