Re: mm/khugepaged: collapse file/shmem compound pages

"Zach O'Keefe" <zokeefe@xxxxxxxxxx> · Fri, 27 May 2022 09:27:33 -0700

On Thu, May 26, 2022 at 8:47 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Thu, May 26, 2022 at 05:54:27PM -0700, Zach O'Keefe wrote:
> > On Wed, May 25, 2022 at 8:36 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > On Wed, May 25, 2022 at 06:23:52PM -0700, Zach O'Keefe wrote:
> > > > On Wed, May 25, 2022 at 12:07 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > > > Anyway, that meaning behind that comment is that the PageTransCompound()
> > > > > test is going to be true on any compound page (TransCompound doesn't
> > > > > check that the page is necessarily a THP).  So that particular test should
> > > > > be folio_test_pmd_mappable(), but there are probably other things which
> > > > > ought to be changed, including converting the entire file from dealing
> > > > > in pages to dealing in folios.
> > > >
> > > > Right, at this point, the page might be a pmd-mapped THP, or it could
> > > > be a pte-mapped compound page (I'm unsure if we can encounter compound
> > > > pages outside hugepages).
> > >
> > > Today, there is a way.  We can find a folio with an order between 0 and
> > > PMD_ORDER if the underlying filesystem supports large folios and the
> > > file is executable and we've enabled CONFIG_READ_ONLY_THP_FOR_FS.
> > > In this case, we'll simply skip over it because the code believes that
> > > means it's already a PMD.
> >
> > I think I'm missing something here - sorry. If the folio order is  <
> > HPAGE_PMD_ORDER, why does the code think it's a pmd?
>
> Because PageTransCompound() does not do what it says on the tin.
>
> static inline int PageTransCompound(struct page *page)
> {
>         return PageCompound(page);
> }
>
> So any compound page is treated as if it's a PMD-sized page.

Right - therein lies the problem :) I think I misattributed your
comment "we'll simply skip over it because the code believes that
means it's already a PMD" as a solution, not as the current state of
things. What we need to be able to do is:

1) If folio order == 0: do what we've been doing
2) If folio order == HPAGE_PMD_ORDER: check if it's _actually_
pmd-mapped. If it is, we're done. If not, continue to step (3)
3) Else (folio order > 0 and not pmd-mapped): new magic; hopefully
it's ~ same as step (1)

> > > > If we could tell it's already pmd-mapped, we're done :) IIUC,
> > > > folio_test_pmd_mappable() is a necessary but not sufficient condition
> > > > to determine this.
> > >
> > > It is necessary, but from khugepaged's point of view, it's sufficient
> > > because khugepaged's job is to create PMD-sized folios -- it's not up to
> > > khugepaged to ensure that PMD-sized folios are actually mapped using
> > > a PMD.
> >
> > I thought the point / benefit of khugepaged was precisely to try and
> > find places where we can collapse many pte entries into a single pmd
> > mapping?
>
> Ideally, yes.  But if a file is mapped at an address which isn't
> PMD-aligned, it can't.  Maybe it should just decline to operate in that
> case.

To make sure I'm not missing anything here: It's not actually
important that the file is mapped at a pmd-aligned address. All that
is important is that the region of memory being collapsed is
pmd-aligned. If we wanted to collapse memory mapped to the start of
the file, then sure, the file has to be mapped suitably.

> > > There may be some other component of the system (eg DAMON?)
> > > which has chosen to temporarily map the PMD-sized folio using PTEs
> > > in order to track whether the memory is all being used.  It may also
> > > be the case that (for file-based memory), the VMA is mis-aligned and
> > > despite creating a PMD-sized folio, it can't be mapped with a PMD.
> >
> > AFAIK DAMON doesn't do this pmd splitting to do subpage tracking for
> > THPs. Also, I believe retract_page_tables() does make the check to see
> > if the address is suitably hugepage aligned/sized.
>
> Maybe not DAMON itself, but it's something that various people are
> talkig about doing; trying to determine whether THPs are worth using or
> whether userspace has made the magic go-faster call without knowing
> whether the valuable 2MB page is being entirely used.

Got it - thanks for clarifying.

> > > shmem still expects folios to be of order either 0 or PMD_ORDER.
> > > That assumption extends into the swap code and I haven't had the heart
> > > to go and fix all those places yet.  Plus Neil was doing major surgery
> > > to the swap code in the most recent deveopment cycle and I didn't want
> > > to get in his way.
> > >
> > > So I am absolutely fine with khugepaged allocating a PMD-size folio for
> > > any inode that claims mapping_large_folio_support().  If any filesystems
> > > break, we'll fix them.
> >
> > Just for clarification, what is the equivalent code today that
> > enforces mapping_large_folio_support()? I.e. today, khugepaged can
> > successfully collapse file without checking if the inode supports it
> > (we only check that it's a regular file not opened for writing).
>
> Yeah, that's a dodgy hack which needs to go away.  But we need a lot
> more filesystems converted to supporting large folios before we can
> delete it.  Not your responsibility; I'm doing my best to encourage
> fs maintainers to do this part.

Got it. In the meantime, do we want to check the old conditions +
mapping_large_folio_support()?

> > Also, just to check, there isn't anything wrong with following
> > collapse_file()'s approach, even for folios of 0 < order <
> > HPAGE_PMD_ORDER? I.e this part:
> >
> >  * Basic scheme is simple, details are more complex:
> >  *  - allocate and lock a new huge page;
> >  *  - scan page cache replacing old pages with the new one
> >  *    + swap/gup in pages if necessary;
> >  *    + fill in gaps;
> >  *    + keep old pages around in case rollback is required;
> >  *  - if replacing succeeds:
> >  *    + copy data over;
> >  *    + free old pages;
> >  *    + unlock huge page;
> >  *  - if replacing failed;
> >  *    + put all pages back and unfreeze them;
> >  *    + restore gaps in the page cache;
> >  *    + unlock and free huge page;
> >  */
>
> Correct.  At least, as far as I know!  Working on folios has been quite
> the education for me ...

Great! Well, perhaps I'll run into a snafu here or there (and
hopefully learn something myself) but this gives me enough confidence
to naively give it a try and see what happens!

Again, thank you very much for your time, help and advice with this,

Best,
Zach