Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)

On Fri, Aug 6, 2021 at 10:57 AM Yang Shi <shy828301@xxxxxxxxx> wrote:
>
> On Thu, Aug 5, 2021 at 10:43 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> >
> > On Thu, 5 Aug 2021, Yang Shi wrote:
> > >
> > > By rereading the code, I think you are correct. Both cases do work
> > > correctly without leaking. And the !CONFIG_NUMA case may carry the
> > > huge page indefinitely.
> > >
> > > I think it is because khugepaged may collapse memory for another NUMA
> > > node in the next loop, so it doesn't make too much sense to carry the
> > > huge page, but it may be an optimization for !CONFIG_NUMA case.
> >
> > Yes, that is its intention.
> >
> > >
> > > However, as I mentioned in an earlier email, the new pcp implementation
> > > can cache THPs now, so we might not need to keep this convoluted logic
> > > anymore. Just free the page if the collapse fails, then re-allocate a
> > > THP. The carried THP might improve the success rate a little, but I
> > > doubt it would be noticeable enough to be worth the extra complexity.
> >
> > It would be great if the new pcp implementation is good enough to
> > get rid of khugepaged's confusing NUMA=y/NUMA=n differences; and all
> > the *hpage stuff too, I hope.  That would be a welcome cleanup.
>
>  The other question is whether that optimization is still worth it
> nowadays. I bet few users build a NUMA=n kernel these days, even when
> the kernel is actually running on a non-NUMA machine. Some small
> devices may run a NUMA=n kernel, but I don't think they actually use
> THP. So such code complexity could be removed from this point of view
> too.
>
> >
> > > > > Collapse failure is not uncommon and leaking huge pages gets noticed.
> >
> > After writing that, I realized how I'm almost always testing a NUMA=y
> > kernel (though on non-NUMA machines), and seldom try the NUMA=n build.
> > So did so to check no leak, indeed; but was surprised, when comparing
> > vmstats, that the NUMA=n run had done 5 times as much thp_collapse_alloc
> > as the NUMA=y run.  I've merely made a note to look into that one day:
> > maybe it was just a one-off oddity, or maybe the incrementing of stats
> > is wrong down one path or the other.

I came up with a patch to remove the !CONFIG_NUMA case, and my testing
found the same problem: with the vanilla kernel, the NUMA=n run had done
5 times as much thp_collapse_alloc as the NUMA=y run, exactly as you
saw.

A quick look shows the huge page allocation timing differs between the
two cases. For NUMA=n, the huge page is allocated by
khugepaged_prealloc_page() before scanning the address space, which
means a huge page may be allocated even when there is no suitable range
to collapse. The page is then just freed once khugepaged has made
enough progress, and allocated again on the next pass. The problem
should be more noticeable with a shorter scan interval
(scan_sleep_millisecs); I set it to 100ms for my test.

We could carry the huge page across scan passes for NUMA=n, but that
would make the code more complicated. I don't think it is really worth
it, so just removing the NUMA=n special case sounds more reasonable to
me.

>
> Yeah, probably.
>
> >
> > Hugh


