Re: [PATCH v3 03/12] mm/khugepaged: make hugepage allocation context-specific

Peter Xu <peterx@xxxxxxxxxx> · Thu, 28 Apr 2022 11:52:21 -0400

On Thu, Apr 28, 2022 at 08:37:06AM -0700, Zach O'Keefe wrote:
> On Thu, Apr 28, 2022 at 7:51 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > On Wed, Apr 27, 2022 at 05:51:53PM -0700, Zach O'Keefe wrote:
> > > > > @@ -1110,13 +1115,14 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > > >        */
> > > > >       mmap_read_unlock(mm);
> > > > >
> > > > > +     node = khugepaged_find_target_node(cc);
> > > > >       /* sched to specified node before huage page memory copy */
> > > > >       if (task_node(current) != node) {
> > > > >               cpumask = cpumask_of_node(node);
> > > > >               if (!cpumask_empty(cpumask))
> > > > >                       set_cpus_allowed_ptr(current, cpumask);
> > > > >       }
> > > > > -     new_page = khugepaged_alloc_page(hpage, gfp, node);
> > > > > +     new_page = cc->alloc_hpage(cc, gfp, node);
> > > >
> > > > AFAICT you removed all references of khugepaged_alloc_page() in this patch,
> > > > then you'd better drop the function for both NUMA and UMA.
> > > >
> > >
> > > Sorry, I'm not sure I follow. In khugepaged context, logic WRT
> > > khugepaged_alloc_page() is unchanged - it's just called indirectly
> > > through ->alloc_hpage().
> >
> > Ah you're right, sorry for the confusion.
> >
> > >
> > > > Said that, I think it's actually better to keep them, as they do things
> > > > useful.  For example,AFAICT this ->alloc_hpage() change can leak the hpage
> > > > alloacted for UMA already so that's why I think keeping them makes sense,
> > > > then iiuc the BUG_ON would trigger with UMA already.
> > > >
> > > > I saw that you've moved khugepaged_find_target_node() into this function
> > > > which looks definitely good, but if we could keep khugepaged_alloc_page()
> > > > and then IMHO we could even move khugepaged_find_target_node() into that
> > > > and drop the "node" parameter in khugepaged_alloc_page().
> > > >
> > >
> > > I actually had done this, until commit 4272063db38c ("mm/khugepaged:
> > > sched to numa node when collapse huge page") which forced me to keep
> > > "node" visible in this function.
> >
> > Right, I was looking at my local tree and that patch seems to be very
> > lately added into -next.
> >
> > I'm curious why it wasn't applying to file thps too if it is worthwhile,
> > since if so that's also a suitable candidate to be moved into the same
> > hook.  I've asked in that thread instead.
> >
> > Before that, feel free to keep your code as is.
> >
> 
> Thanks for checking on that. On second thought, it seems like we would
> actually want to move this sched logic into the khupaged hook impl
> since we probably don't want to be moving around the calling process
> if in madvise context.

Sounds correct, if it's moved over it must only be in the new helper (if
you're going to introduce it, like below :) that was only for khugepaged.

Actually I really start to question that idea more..  e.g. even for
khugepaged the target node can be changing rapidly depending on the owner
of pages planned to collapse.

Whether does it make sense to bounce khugepaged thread itself seems still
questionable to me, and definitely not a good idea for madvise() context.
The only proof in that patch was a whole testbench result but I'm not sure
how reliable that'll be when applied to real world scenarios.

> 
> > >
> > > > I'd even consider moving cc->gfp() all into it if possible, since the gfp
> > > > seems to be always with __GFP_THISNODE anyways.
> > > >
> > >
> > > I would have liked to do this, but the gfp flags are referenced again
> > > when calling mem_cgroup_charge(), so I couldn't quite get rid of them
> > > from here.
> >
> > Yeah, maybe we could move mem_cgroup_charge() into the hook too?  As below
> > codes are duplicated between file & anon and IMHO they're good candidate to
> > a new helper already anyway:
> >
> >         /* Only allocate from the target node */
> >         gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> >
> >         new_page = khugepaged_alloc_page(hpage, gfp, node);
> >         if (!new_page) {
> >                 result = SCAN_ALLOC_HUGE_PAGE_FAIL;
> >                 goto out;
> >         }
> >
> >         if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
> >                 result = SCAN_CGROUP_CHARGE_FAIL;
> >                 goto out;
> >         }
> >         count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
> >
> > If we want to generalize it maybe we want to return the "result" instead of
> > the new page though, so the newpage can be passed in as a pointer.
> >
> > There's one mmap_read_unlock(mm) for the anonymous code but I think that
> > can simply be moved above the chunk.
> >
> > No strong opinion here, please feel free to think about what's the best
> > approach for landing this series.
> >
> 
> Thanks for the suggestion. I'll play around with it a little and see
> what looks good.

Sounds good!

If the new helper would be welcomed then it can be a standalone
pre-requisite patch to clean things up first.  Let's also see how it goes
with the other patch, because there's chance you can also cleanup
khugepaged_find_target_node() in the same patch along with all above.

-- 
Peter Xu