On Wed, Aug 03, 2022 at 03:42:15PM -0700, Mike Kravetz wrote:
> On 08/03/22 18:44, Muchun Song wrote:
> > On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
> > > On 8/3/22 05:11, Muchun Song wrote:
> > > > On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
> > > >> Today with `hugetlb_free_vmemmap=on` the struct page memory that is
> > > >> freed back to the page allocator is as follows: for a 2M hugetlb page it
> > > >> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap
> > > >> pages, and for a 1G hugetlb page it will remap the remaining 4095 vmemmap
> > > >> pages. Essentially, that means that it breaks the first 4K of a
> > > >> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or
> > > >> 16M (for 1G hugetlb pages). For this reason the memory that is freed
> > > >> back to the page allocator cannot be used by hugetlb to allocate huge
> > > >> pages of the same size, but only of a smaller huge page size:
> > > >>
> > > >
> > > > Hi Joao,
> > > >
> > > > Thanks for your work on this. I admit you are right. The current mechanism
> > > > prevents the freed vmemmap pages from being merged into a potential
> > > > contiguous page. Allocating a new head page is a straightforward approach;
> > > > however, it is very dangerous at runtime after the system has booted. Why
> > > > dangerous? Because you should first 1) copy the content from the head vmemmap
> > > > page to the target (newly allocated) page, and then 2) change the PTE
> > > > entry to the new page. However, the content (especially the refcount) of the
> > > > old head vmemmap page could be changed elsewhere (e.g. by other modules)
> > > > between steps 1) and 2). Eventually, the newly allocated vmemmap page is
> > > > corrupted. Unfortunately, we don't have an easy way to prevent it.
> > > >
> > > OK, I see what I missed. You mean the refcount (or any other data) on the
> > > struct pages preceding this head struct page, unrelated to the hugetlb
> > > page being remapped. Meaning when the first struct page in the old vmemmap
> > > page is *not* the head page we are trying to remap, right?
> > >
> > > See further below in your patch, but I wonder if we could actually check
> > > this from the hugetlb head va being aligned to PAGE_SIZE, meaning we would
> > > be checking that this head page is the first struct page in the vmemmap
> > > page and thus that we are still safe to remap? If so, the patch could be
> > > simpler, more like mine, without the special freeing path you added below.
> > >
> > > If I'm right, see at the end.
> > >
> > I am not sure we are on the same page (it seems we are not, after seeing
> > your patch below). So let me clarify.
>
> Thanks Muchun!
> I told Joao that you would be the expert in this area, and I was correct.

Thanks for your trust.

> >
> > CPU0:                                   CPU1:
> >
> > vmemmap_remap_free(start, end, reuse)
> >   // alloc a new page to be used as the head vmemmap page
> >   page = alloc_pages_node();
> >
> >   memcpy(page_address(page), reuse, PAGE_SIZE);
> >                                         // Now the @reuse address is mapped to the
> >                                         // original page frame, so the change will be
> >                                         // reflected on the original page frame.
> >                                         get_page(reuse);
>
> Here is a thought.
>
> This code gets called early after allocating a new hugetlb page. This new
> compound page has a ref count of 1 on the head page and 0 on all the tail
> pages. If the ref count were 0 on the head page, get_page() would not succeed.
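
(The guarantee this relies on is the speculative-reference pattern: lockless
users such as GUP-fast and compaction only take a page reference via
get_page_unless_zero(), which fails when the refcount is zero. A minimal
sketch of that pattern, purely for illustration; get_page_unless_zero() and
put_page() are the real helpers from linux/mm.h, while the function around
them is made up:)

#include <linux/mm.h>

/*
 * Illustrative only: how lockless code takes a page reference.  A page
 * whose refcount is frozen at zero can never have its refcount bumped
 * by this path, which is what makes the freeze idea below work.
 */
static bool speculative_use_page(struct page *page)
{
	/* Atomically increment the refcount unless it is zero. */
	if (!get_page_unless_zero(page))
		return false;	/* zero: page is frozen or being freed */

	/* ... read or use the page while holding the reference ... */

	put_page(page);
	return true;
}
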
>
> I cannot think of a reason why we NEED to have a ref count of 1 on the
> head page. It does make it more convenient to simply call put_page() on
> the newly initialized hugetlb page and have the page be added to the hugetlb
> free lists, but this could easily be modified.
>
> Matthew Wilcox recently pointed me at this:
> https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@xxxxxxxxxxxxx/T/#m98fb9f9bd476155b4951339da51a0887b2377476
>
> That would only work for hugetlb pages allocated from buddy. For gigantic
> pages, we manually 'freeze' (zero the ref count of) tail pages and check for
> an unexpectedly increased ref count. We could do the same with the head page.
>
> Would having a zero ref count on the head page eliminate this race?

I think most races which try to grab an extra refcount could be avoided
in this case. However, I can think of one race related to memory failure,
namely poisoning a page in memory_failure(), which sets PageHWPoison
without checking whether the refcount is zero. If we can fix this easily, I
think this patch is a good direction.

Thanks Mike.

> --
> Mike Kravetz
>
> >
> >   vmemmap_remap_pte();
> >     // remap to the new page allocated above
> >     set_pte_at();
> >
> >   flush_tlb_kernel_range();
> >                                         // Now the @reuse address is mapped to the
> >                                         // newly allocated page frame, so the change
> >                                         // will be reflected on the new page frame,
> >                                         // which is thereby corrupted.
> >                                         put_page(reuse);
> >
> > So we should make the 1) memcpy, 2) remap and 3) TLB flush atomic on CPU0;
> > however, that is not easy.
>
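
To make the freeze idea concrete, a minimal sketch, assuming the head page's
refcount is exactly 1 as Mike describes above. page_ref_freeze() and
page_ref_unfreeze() are the existing helpers from include/linux/page_ref.h;
the function itself, its name, and its callers are hypothetical:

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/page_ref.h>
#include <linux/string.h>

/*
 * Illustrative only: freeze the hugetlb head page's refcount to zero
 * around the copy + remap, so the refcount cannot be bumped (and the
 * old vmemmap page cannot be dirtied through it) between the memcpy
 * and the PTE switch.  As Muchun notes above, writers that do not take
 * a reference first, e.g. memory_failure() setting PageHWPoison, are
 * not stopped by this and would need separate handling.
 */
static int remap_head_vmemmap_page(struct page *head, void *reuse,
				   struct page *new_page)
{
	/* Succeeds only if the refcount is exactly 1; otherwise back off. */
	if (!page_ref_freeze(head, 1))
		return -EBUSY;

	memcpy(page_address(new_page), reuse, PAGE_SIZE);

	/* ... vmemmap_remap_pte(), set_pte_at(), TLB flush go here ... */

	/* @head now resolves to the new page frame; restore the refcount. */
	page_ref_unfreeze(head, 1);
	return 0;
}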