On 6/1/22 1:19 PM, Zach O'Keefe wrote:
On Sun, May 29, 2022 at 6:25 PM Rongwei Wang
<rongwei.wang@xxxxxxxxxxxxxxxxx> wrote:
On 5/30/22 5:36 AM, Zach O'Keefe wrote:
On Fri, May 27, 2022 at 8:48 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
On Fri, May 27, 2022 at 09:27:33AM -0700, Zach O'Keefe wrote:
On Thu, May 26, 2022 at 8:47 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
Because PageTransCompound() does not do what it says on the tin.
static inline int PageTransCompound(struct page *page)
{
return PageCompound(page);
}
So any compound page is treated as if it's a PMD-sized page.
Right - therein lies the problem :) I think I misattributed your
comment "we'll simply skip over it because the code believes that
means it's already a PMD" as a solution, not as the current state of
things. What we need to be able to do is:
1) If folio order == 0: do what we've been doing
2) If folio order == HPAGE_PMD_ORDER: check if it's _actually_
pmd-mapped. If it is, we're done. If not, continue to step (3)
I would not do that part. Just leave it alone and assume everything's
good.
Sorry if I keep pressing the issue here - but why not check? If the
goal of khugepaged (and certainly MADV_COLLAPSE) is to map eligible
memory at the pmd level, then these pte-mapped hugepages that we might
discover in step (2) are actually the cheapest memory to collapse
since we can do the collapse in-place.
3) Else (folio order > 0 and not pmd-mapped): new magic; hopefully
it's ~ same as step (1)
Yes, exactly this.
I thought the point / benefit of khugepaged was precisely to try and
find places where we can collapse many pte entries into a single pmd
mapping?
Ideally, yes. But if a file is mapped at an address which isn't
PMD-aligned, it can't. Maybe it should just decline to operate in that
case.
To make sure I'm not missing anything here: It's not actually
important that the file is mapped at a pmd-aligned address. All that
is important is that the region of memory being collapsed is
pmd-aligned. If we wanted to collapse memory mapped to the start of
the file, then sure, the file has to be mapped suitably.
Ah, what you're probably missing is that for file pages/folios, they
have to be naturally aligned. The data structure underlying the
page cache simply can't cope with askew pages. (It kind of can under
some circumstances that are so complicated that they shouldn't be
explained, and it's far easier just to say "folios must be naturally
aligned within the file")
I'm trying to understand what you mean by "naturally aligned" here.
I'm operating under the assumption that all file pages map to
page-sized offsets within a file (e.g. linear_page_index()) and that
files are mapped at a page-aligned address. In the event we want to
collapse file-backed memory, if the virtual address of said memory is
hugepage-aligned, I don't think it's necessary that the address maps
to a hugepage-sized offset in the file? I.e. on x86, the file could
Hi, Zach
I'm not sure I understand your question correctly. We submitted a patch
set to let file-backed THP be used transparently, like anon THP [1].
[1] https://lore.kernel.org/linux-mm/20211009092658.59665-2-rongwei.wang@xxxxxxxxxxxxxxxxx/
In that patch, I remember we need to check whether '(vma->vm_start >>
PAGE_SHIFT) - vma->vm_pgoff' is aligned with HPAGE_PMD_NR, like:
+static inline bool vma_is_hugetext(struct vm_area_struct *vma,
+				   unsigned long vm_flags)
+{
+	if (!(vm_flags & VM_EXEC))
+		return false;
+
+	if (vma->vm_file && !inode_is_open_for_write(vma->vm_file->f_inode))
+		return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
+				  HPAGE_PMD_NR);
+
+	return false;
+}
This is a little different from anon THP.
itself be mapped to the start of the last page in a 2MiB region, X,
Maybe it's related to the ELF 'Align' parameter on your current system?
If 'Align' is set to 2MB (check with 'readelf -l /path/exec'), it
probably meets the above alignment check.
The default Align value depends on the binutils version, and it can also
be set at link time with the '-z max-page-size=<align size>' option.
Hope it is helpful:)
-wrw
Hey Rongwei!
Thanks for the code / help. Took a little bit, but hughd has
enlightened me on the problem (thanks Hugh!). Likewise, apologies for
not understanding your previous comment regarding folio alignment,
Matthew.
Also, thanks for linking your patchset, and sorry for missing it
previously. It seems we're interested in the same problem! Hopefully
this work can be beneficial to your use case as well.
Hi Zach!
Thank you, too. Recently, we have been trying to use process_madvise() +
DAMON to find hot .text, especially on x86, and then collapse it into
huge pages. It seems that process_madvise(MADV_COLLAPSE) is feasible for
this. Anyway, thanks for your nice work.
-wrw
Thanks again for your time,
Zach
and that wouldn't prevent us from collapsing the 2MiB region starting
at X+4KiB.
shmem still expects folios to be of order either 0 or PMD_ORDER.
That assumption extends into the swap code and I haven't had the heart
to go and fix all those places yet. Plus Neil was doing major surgery
to the swap code in the most recent development cycle and I didn't want
to get in his way.
So I am absolutely fine with khugepaged allocating a PMD-size folio for
any inode that claims mapping_large_folio_support(). If any filesystems
break, we'll fix them.
Just for clarification, what is the equivalent code today that
enforces mapping_large_folio_support()? I.e. today, khugepaged can
successfully collapse a file without checking whether the inode supports
it (we only check that it's a regular file not opened for writing).
Yeah, that's a dodgy hack which needs to go away. But we need a lot
more filesystems converted to supporting large folios before we can
delete it. Not your responsibility; I'm doing my best to encourage
fs maintainers to do this part.
Got it. In the meantime, do we want to check the old conditions +
mapping_large_folio_support()?
Yes, that should work. khugepaged should be free to create large
folios if the underlying filesystem supports them OR (executable,
read-only, CONFIG_THP_RO, etc, etc).
Thanks for confirming!
Also, just to check, there isn't anything wrong with following
collapse_file()'s approach, even for folios of 0 < order <
HPAGE_PMD_ORDER? I.e. this part:
* Basic scheme is simple, details are more complex:
* - allocate and lock a new huge page;
* - scan page cache replacing old pages with the new one
* + swap/gup in pages if necessary;
* + fill in gaps;
* + keep old pages around in case rollback is required;
* - if replacing succeeds:
* + copy data over;
* + free old pages;
* + unlock huge page;
* - if replacing failed;
* + put all pages back and unfreeze them;
* + restore gaps in the page cache;
* + unlock and free huge page;
*/
Correct. At least, as far as I know! Working on folios has been quite
the education for me ...
Great! Well, perhaps I'll run into a snafu here or there (and
hopefully learn something myself) but this gives me enough confidence
to naively give it a try and see what happens!
Again, thank you very much for your time, help and advice with this,
You're welcome! Thanks for putting in some work on this project!
No problem! Hopefully this can benefit a bunch of existing users.
Thanks again,
Zach