Re: [PATCH v1] mm/hugetlb_vmemmap: remap head page to newly allocated page

On 08/04/22 16:05, Muchun Song wrote:
> On Wed, Aug 03, 2022 at 03:42:15PM -0700, Mike Kravetz wrote:
> > On 08/03/22 18:44, Muchun Song wrote:
> > > On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
> > > > On 8/3/22 05:11, Muchun Song wrote:
> > > > > On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
> > > I am not sure we are on the same page (it seems we are not, judging by your
> > > patch below). So let me clarify.
> > 
> > Thanks Muchun!
> > I told Joao that you would be the expert in this area, and I was correct.
> >
> 
> Thanks for your trust.
>  
> > > 
> > > CPU0:                                            CPU1:
> > > 
> > > vmemmap_remap_free(start, end, reuse)
> > >   // alloc a new page used to be the head vmemmap page
> > >   page = alloc_pages_node();
> > > 
> > >   memcpy(page_address(page), reuse, PAGE_SIZE);
> > >                                                  // Now the @reuse address is mapped to the original
> > >                                                  // page frame. So the change will be reflected on the
> > >                                                  // original page frame.
> > >                                                  get_page(reuse);
> > 
> > Here is a thought.
> > 
> > This code gets called early after allocating a new hugetlb page.  This new
> > compound page has a ref count of 1 on the head page and 0 on all the tail
> > pages.  If the ref count was 0 on the head page, get_page() would not succeed.
> > 
> > I can not think of a reason why we NEED to have a ref count of 1 on the
> > head page.  It does make it more convenient to simply call put_page() on
> > the newly initialized hugetlb page and have the page be added to the hugetlb
> > free lists, but this could easily be modified.
> > 
> > Matthew Wilcox recently pointed me at this:
> > https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@xxxxxxxxxxxxx/T/#m98fb9f9bd476155b4951339da51a0887b2377476
> > 
> > That would only work for hugetlb pages allocated from buddy.  For gigantic
> > pages, we manually 'freeze' (zero the ref count of) tail pages and check for
> > an unexpected increased ref count.  We could do the same with the head page.
> > 
> > Would having zero ref count on the head page eliminate this race?
> 
> I think most races which try to grab an extra refcount could be avoided
> in this case.
> 
> However, I can think of one race related to memory failure: memory_failure()
> poisons a page by setting PageHWPoison without checking whether the refcount
> is zero. If we can fix this easily, I think this patch is a good direction.

Adding Naoya.

Yes, I recall that discussion in the thread,
https://lore.kernel.org/linux-mm/3c542543-0965-ef60-4627-1a4116077a5b@xxxxxxxxxx/

I don't have any ideas about how to avoid that issue.

As Naoya notes in that thread, the issue is poisoning a generic compound page
that may later turn into a hugetlb page.  There may be a way to work around this.

However, IMO we may be trying too hard to cover ALL memory error handling races
with hugetlb.  We know that memory errors can happen at ANY time, and a page
could be marked poisoned at any time.  I suspect there are many paths where bad
things could happen if a memory error happens at 'just the right time'.  For
example, I believe a page could be poisoned at the very end of the page
allocation path and returned to the caller requesting the page.  Certainly,
not every caller of page allocation checks for and handles a poisoned page.
In general, we know that memory error handling will not be 100% perfect
and can not be handled in ALL code paths.  In some cases, if a memory error
happens at 'just the right time', bad things will happen.  It would be almost
impossible to handle a memory error at any time in all code paths.  Someone
please correct me if this is not an accepted/known situation.

It seems there has been much effort lately to try and catch all hugetlb races
with memory error handling.  That is OK, but I think we need to accept that
there may be races with code paths that are not worth the effort to try and
cover.  Just my opinion of course.

For this specific proposal by Joao, I think we can handle most of the races
if all sub-pages of the hugetlb (including head) have a ref count of zero.
I actually like moving in this direction as it means we could remove some
existing hugetlb code checking for increased ref counts.

I can start working on code to move in this direction.  We will need to
wait for the alloc_frozen_pages() interface and will need to make changes
to other allocators for gigantic pages.  Until then we can manually zero the
ref counts before calling the vmemmap freeing routines.  With this in place,
we could then do something like Joao proposes to free more contiguous
vmemmap.

Thoughts?

In any case, I am going to move forward with setting the ref count of all
'hugetlb pages' to zero before they officially become hugetlb pages.  As
mentioned, this should allow for removal of some code that checks for
inflated ref counts.
-- 
Mike Kravetz



