Re: mm/khugepaged: collapse file/shmem compound pages

"Zach O'Keefe" <zokeefe@xxxxxxxxxx> · Thu, 26 May 2022 17:54:27 -0700

Hey Matthew,

Thanks for the response!

On Wed, May 25, 2022 at 8:36 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Wed, May 25, 2022 at 06:23:52PM -0700, Zach O'Keefe wrote:
> > On Wed, May 25, 2022 at 12:07 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > The khugepaged code (like much of the mm used to) assumes that memory
> > > comes in two sizes, PTE and PMD.  That's still true for anon and shmem
> > > for now, but hopefully we'll start managing both anon & shmem memory in
> > > larger chunks, without necessarily going as far as PMD.
> > >
> > > I think the purpose of khugepaged should continue to be to construct
> > > PMD-size pages; I don't see the point of it wandering through process VMs
> > > replacing order-2 pages with order-5 pages.  I may be wrong about that,
> > > of course, so feel free to argue with me.
> >
> > I'd agree here.
> >
> > > Anyway, that meaning behind that comment is that the PageTransCompound()
> > > test is going to be true on any compound page (TransCompound doesn't
> > > check that the page is necessarily a THP).  So that particular test should
> > > be folio_test_pmd_mappable(), but there are probably other things which
> > > ought to be changed, including converting the entire file from dealing
> > > in pages to dealing in folios.
> >
> > Right, at this point, the page might be a pmd-mapped THP, or it could
> > be a pte-mapped compound page (I'm unsure if we can encounter compound
> > pages outside hugepages).
>
> Today, there is a way.  We can find a folio with an order between 0 and
> PMD_ORDER if the underlying filesystem supports large folios and the
> file is executable and we've enabled CONFIG_READ_ONLY_THP_FOR_FS.
> In this case, we'll simply skip over it because the code believes that
> means it's already a PMD.

I think I'm missing something here - sorry. If the folio order is  <
HPAGE_PMD_ORDER, why does the code think it's a pmd?

> > If we could tell it's already pmd-mapped, we're done :) IIUC,
> > folio_test_pmd_mappable() is a necessary but not sufficient condition
> > to determine this.
>
> It is necessary, but from khugepaged's point of view, it's sufficient
> because khugepaged's job is to create PMD-sized folios -- it's not up to
> khugepaged to ensure that PMD-sized folios are actually mapped using
> a PMD.

I thought the point / benefit of khugepaged was precisely to try and
find places where we can collapse many pte entries into a single pmd
mapping?

> There may be some other component of the system (eg DAMON?)
> which has chosen to temporarily map the PMD-sized folio using PTEs
> in order to track whether the memory is all being used.  It may also
> be the case that (for file-based memory), the VMA is mis-aligned and
> despite creating a PMD-sized folio, it can't be mapped with a PMD.

AFAIK DAMON doesn't do this pmd splitting to do subpage tracking for
THPs. Also, I believe retract_page_tables() does make the check to see
if the address is suitably hugepage aligned/sized.

> > Else, if it's not, is it safe to try and continue? Suppose we find a
> > folio of 0 < order < HPAGE_PMD_ORDER. Are we safely able to try and
> > extend it, or will we break some filesystems that expect a certain
> > order folio?
>
> We're not giving filesystems the opportunity to request that ;-)
> Filesystems are expected to handle folios of arbitrary order (if they
> claim the ability to support large folios at all).  In practice, I've
> capped the folio creation size at PMD_ORDER (because I don't want to track
> down all the places that assume pmd_page() is necessarily a head page),
> but filesystems shouldn't rely on it.

Ok, that's good to hear :)

> shmem still expects folios to be of order either 0 or PMD_ORDER.
> That assumption extends into the swap code and I haven't had the heart
> to go and fix all those places yet.  Plus Neil was doing major surgery
> to the swap code in the most recent deveopment cycle and I didn't want
> to get in his way.
>
> So I am absolutely fine with khugepaged allocating a PMD-size folio for
> any inode that claims mapping_large_folio_support().  If any filesystems
> break, we'll fix them.

Just for clarification, what is the equivalent code today that
enforces mapping_large_folio_support()? I.e. today, khugepaged can
successfully collapse file without checking if the inode supports it
(we only check that it's a regular file not opened for writing).

Also, just to check, there isn't anything wrong with following
collapse_file()'s approach, even for folios of 0 < order <
HPAGE_PMD_ORDER? I.e this part:

 * Basic scheme is simple, details are more complex:
 *  - allocate and lock a new huge page;
 *  - scan page cache replacing old pages with the new one
 *    + swap/gup in pages if necessary;
 *    + fill in gaps;
 *    + keep old pages around in case rollback is required;
 *  - if replacing succeeds:
 *    + copy data over;
 *    + free old pages;
 *    + unlock huge page;
 *  - if replacing failed;
 *    + put all pages back and unfreeze them;
 *    + restore gaps in the page cache;
 *    + unlock and free huge page;
 */

> > > I actually have one patch which starts in that direction, but I haven't
> > > followed it up yet with all the other patches to that file which will
> > > be needed:
> >
> > Thanks for the head start! Not an expert here, but would you say
> > converting this file to use folios is a necessary first step?
>
> Not _necessary_, but I find it helps keep things clearer.  Plus it's
> something that needs to happen anyway.

Got it. Perhaps something I can help with then.

Thanks again for your time advising on this!

Best,
Zach