Re: mm/khugepaged: collapse file/shmem compound pages

"Zach O'Keefe" <zokeefe@xxxxxxxxxx> · Sun, 29 May 2022 14:36:29 -0700

On Fri, May 27, 2022 at 8:48 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Fri, May 27, 2022 at 09:27:33AM -0700, Zach O'Keefe wrote:
> > On Thu, May 26, 2022 at 8:47 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > Because PageTransCompound() does not do what it says on the tin.
> > >
> > > static inline int PageTransCompound(struct page *page)
> > > {
> > >         return PageCompound(page);
> > > }
> > >
> > > So any compound page is treated as if it's a PMD-sized page.
> >
> > Right - therein lies the problem :) I think I misattributed your
> > comment "we'll simply skip over it because the code believes that
> > means it's already a PMD" as a solution, not as the current state of
> > things. What we need to be able to do is:
> >
> > 1) If folio order == 0: do what we've been doing
> > 2) If folio order == HPAGE_PMD_ORDER: check if it's _actually_
> > pmd-mapped. If it is, we're done. If not, continue to step (3)
>
> I would not do that part.  Just leave it alone and assume everything's
> good.

Sorry if I keep pressing the issue here - but why not check? If the
goal of khugepaged (and certainly MADV_COLLAPSE) is to map eligible
memory at the pmd level, then these pte-mapped hugepages that we might
discover in step (2) are actually the cheapest memory to collapse
since we can do the collapse in-place.

> > 3) Else (folio order > 0 and not pmd-mapped): new magic; hopefully
> > it's ~ same as step (1)
>
> Yes, exactly this.
>
> > > > I thought the point / benefit of khugepaged was precisely to try and
> > > > find places where we can collapse many pte entries into a single pmd
> > > > mapping?
> > >
> > > Ideally, yes.  But if a file is mapped at an address which isn't
> > > PMD-aligned, it can't.  Maybe it should just decline to operate in that
> > > case.
> >
> > To make sure I'm not missing anything here: It's not actually
> > important that the file is mapped at a pmd-aligned address. All that
> > is important is that the region of memory being collapsed is
> > pmd-aligned. If we wanted to collapse memory mapped to the start of
> > the file, then sure, the file has to be mapped suitably.
>
> Ah, what you're probably missing is that for file pages/folios, they
> have to be naturally aligned.  The data structure underlying the
> page cache simply can't cope with askew pages.  (It kind of can under
> some circumstances that are so complicated that they shouldn't be
> explained, and it's far easier just to say "folios must be naturally
> aligned within the file")

I'm trying to understand what you mean by "naturally aligned" here.
I'm operating under the assumption that all file pages map to
page-sized offsets within a file (e.g. linear_page_address()) and that
files are mapped at a page-aligned address. In the event we want to
collapse file-backed memory, if the virtual address of said memory is
hugepage-aligned, I don't think it's necessary that the address maps
to a hugepage-sized offset in the file? I.e. on x86, the file could
itself be mapped to the start of the last page in a 2MiB region ,X,
and that wouldn't prevent us from collapsing the 2MiB region starting
at X+4KiB.

> > > > > shmem still expects folios to be of order either 0 or PMD_ORDER.
> > > > > That assumption extends into the swap code and I haven't had the heart
> > > > > to go and fix all those places yet.  Plus Neil was doing major surgery
> > > > > to the swap code in the most recent deveopment cycle and I didn't want
> > > > > to get in his way.
> > > > >
> > > > > So I am absolutely fine with khugepaged allocating a PMD-size folio for
> > > > > any inode that claims mapping_large_folio_support().  If any filesystems
> > > > > break, we'll fix them.
> > > >
> > > > Just for clarification, what is the equivalent code today that
> > > > enforces mapping_large_folio_support()? I.e. today, khugepaged can
> > > > successfully collapse file without checking if the inode supports it
> > > > (we only check that it's a regular file not opened for writing).
> > >
> > > Yeah, that's a dodgy hack which needs to go away.  But we need a lot
> > > more filesystems converted to supporting large folios before we can
> > > delete it.  Not your responsibility; I'm doing my best to encourage
> > > fs maintainers to do this part.
> >
> > Got it. In the meantime, do we want to check the old conditions +
> > mapping_large_folio_support()?
>
> Yes, that should work.  khugepaged should be free to create large
> folios if the underlying filesystem supports them OR (executable,
> read-only, CONFIG_THP_RO, etc, etc).

Thanks for confirming!

> > > > Also, just to check, there isn't anything wrong with following
> > > > collapse_file()'s approach, even for folios of 0 < order <
> > > > HPAGE_PMD_ORDER? I.e this part:
> > > >
> > > >  * Basic scheme is simple, details are more complex:
> > > >  *  - allocate and lock a new huge page;
> > > >  *  - scan page cache replacing old pages with the new one
> > > >  *    + swap/gup in pages if necessary;
> > > >  *    + fill in gaps;
> > > >  *    + keep old pages around in case rollback is required;
> > > >  *  - if replacing succeeds:
> > > >  *    + copy data over;
> > > >  *    + free old pages;
> > > >  *    + unlock huge page;
> > > >  *  - if replacing failed;
> > > >  *    + put all pages back and unfreeze them;
> > > >  *    + restore gaps in the page cache;
> > > >  *    + unlock and free huge page;
> > > >  */
> > >
> > > Correct.  At least, as far as I know!  Working on folios has been quite
> > > the education for me ...
> >
> > Great! Well, perhaps I'll run into a snafu here or there (and
> > hopefully learn something myself) but this gives me enough confidence
> > to naively give it a try and see what happens!
> >
> > Again, thank you very much for your time, help and advice with this,
>
> You're welcome!  Thanks for putting in some work on this project!

No problem! Hopefully this can benefit a bunch of existing users.

Thanks again,
Zach