Re: BUG: MADV_COLLAPSE doesn't work for XFS files]

"Zach O'Keefe" <zokeefe@xxxxxxxxxx> · Thu, 28 Sep 2023 15:48:25 -0700

On Thu, Sep 28, 2023 at 2:04 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> On Thu, Sep 28, 2023 at 12:43:57PM -0700, Zach O'Keefe wrote:
> > Hey Ryan,
> >
> > Thanks for bringing this up.
> >
> > On Thu, Sep 28, 2023 at 4:59 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> > >
> > > On 28/09/2023 11:54, Bagas Sanjaya wrote:
> > > > On Thu, Sep 28, 2023 at 10:55:17AM +0100, Ryan Roberts wrote:
> > > >> Hi all,
> > > >>
> > > >> I've just noticed that when applied to a file mapping for a file on xfs, MADV_COLLAPSE returns EINVAL. The same test case works fine if the file is on ext4.
> > > >>
> > > >> I think the root cause is that the implementation bails out if it finds a (non-PMD-sized) large folio in the page cache for any part of the file covered by the region. XFS does readahead into large folios so we hit this issue. See khugepaged.h:collapse_file():
> > > >>
> > > >>              if (PageTransCompound(page)) {
> > > >>                      struct page *head = compound_head(page);
> > > >>
> > > >>                      result = compound_order(head) == HPAGE_PMD_ORDER &&
> > > >>                                      head->index == start
> > > >>                                      /* Maybe PMD-mapped */
> > > >>                                      ? SCAN_PTE_MAPPED_HUGEPAGE
> > > >>                                      : SCAN_PAGE_COMPOUND;
> > > >>                      goto out_unlock;
> > > >>              }
> > > >
> >
> > Ya, non-PMD-sized THPs were just barely visible in my peripherals when
> > writing this, and I'm still woefully behind on your work on them now
> > (sorry!).
> >
> > I'd like to eventually make collapse (not just MADV_COLLAPSE, but
> > khugepaged too) support arbitrary-sized large folios in general, but
> > I'm very pressed for time right now. I think M. Wilcox is also
> > interested in this, given he left the TODO to support it :P
>
> Is the point of MADV_COLLAPSE to replace base pages with PMD-sized pages
> in the pagecache for faster lookups?  Or merely to replace them with
> something larger, even if it's not PMD sized?

Might depend on who you ask, but IMHO, the principle purpose of
collapse is saving TLB entries, with TLB coalescing complicating
things a little in terms of PMD-sized things or not. M. Wilcox's work
with descriptor-izing folios might make a nice case for memory savings
as well, down the line.

> As of 6.6, XFS asks for folios of size min(read/readahead/write_len,
> ondisk_mapping_length), so in theory the folio size should roughly
> follow the access patterns.  If the goal is merely larger folios, then
> we are done here and can move on to some other part of the collapse.
>
> OTOH if the goal is TLB savings, then I suppose you'd actually /want/ to
> select a large (but not PMD) folio for collapsing into a PMD sized
> folio, right?

I suppose it might make some operations easier / faster during
collapse if we have less folios to process.

> e.g.
>
>         if (PageTransCompound(page)) {
>                 struct page *head = compound_head(page);
>
>                 if (head->index != start) {
>                         /* not sure what _COMPOUND means here... */
>                         result = SCAN_PAGE_COMPOUND;
>                         goto out_unlock;
>                 }
>
>                 if (compound_order(head) == HPAGE_PMD_ORDER) {
>                         result = SCAN_PTE_MAPPED_HUGEPAGE;
>                         goto out_unlock;
>                 }
>
>                 /* result is still SCAN_SUCCEED, keep going */
>         }
>
> I /think/ that would work?  If the largefolio is dirty or not fully
> uptodate then collapse won't touch it; and I think fs/iomap handles this
> in a compatible way because it won't mark the folio uptodate until all
> the blocks have been read, and it marks the folio dirty if any of the
> blocks are dirty.
>
> (says me, who doesn't really understand this part of the code.)

I think there's a couple issues with this -- for example, the
head->index != start case is going to be the common-case for
non-PMD-sized large folios. Regardless, there is some more code in
hpage_collapse_scan_file() and her in collapse_file() that would need
to be updated. I'm taking a cursory look, and naively it doesn't look
too bad -- most things "should just work" in file/shmem collapse path.
ac492b9c70cac ("mm/khugepaged: skip shmem with userfaultfd" was merged
since the last I looked carefully at this path, so I would need to
spend more time understanding some changes there. So, from correctness
POV, maybe there's not anything drastic to be done for file/shmem.
Maybe this is a good place to start.

For anon, things are different, as we are coming at the pages from a
different angle, and are operating over the pmd directly. I'm not
immediately sure if it makes things easier or harder. Probably harder.
Can we even get non-PMD-sized large anon folios right now, without
Ryan's work?

>From a khugepaged policy POV, there are some questions to be
answered.. but I think these mostly boil down to: scale the
present/swap/none checks by (1 << order).

Anyways, this isn't to be taken with much weight as a thorough audit
is required to understand any subtleties lurking around.

Thanks,
Zach

> --D
>
> > Thank you for the reproducer though! I haven't run it, but I'll
> > probably come back here to steal it when the time comes.
> >
> > > > I don't see any hint to -EINVAL above. Am I missing something?
> > >
> > > The SCAN_PAGE_COMPOUND result ends up back at madvise_collapse() where it
> > > eventually gets converted to -EINVAL by madvise_collapse_errno().
> > >
> > > >
> > > >>
> > > >> I'm not sure if this is already a known issue? I don't have time to work on a fix for this right now, so thought I would highlight it at least. I might get around to it at some point in the future if nobody else tackles it.
> >
> > My guess is Q1 2024 is when I'd be able to look into this, at the
> > current level of urgency. It doesn't sound like it's blocking anything
> > for your work right now -- lmk if that changes though!
> >
> > Thanks,
> > Zach
> >
> >
> >
> > > >>
> > > >> Thanks,
> > > >> Ryan
> > > >>
> > > >>
> > > >> Test case I've been using:
> > > >>
> > > >> -->8--
> > > >>
> > > >> #include <stdio.h>
> > > >> #include <stdlib.h>
> > > >> #include <sys/mman.h>
> > > >> #include <sys/types.h>
> > > >> #include <sys/stat.h>
> > > >> #include <fcntl.h>
> > > >> #include <unistd.h>
> > > >>
> > > >> #ifndef MADV_COLLAPSE
> > > >> #define MADV_COLLAPSE                25
> > > >> #endif
> > > >>
> > > >> #define handle_error(msg)    do { perror(msg); exit(EXIT_FAILURE); } while (0)
> > > >>
> > > >> #define SZ_1K                        1024
> > > >> #define SZ_1M                        (SZ_1K * SZ_1K)
> > > >> #define ALIGN(val, align)    (((val) + ((align) - 1)) & ~((align) - 1))
> > > >>
> > > >> #if 1
> > > >> // ext4
> > > >> #define DATA_FILE            "/home/ubuntu/data.txt"
> > > >> #else
> > > >> // xfs
> > > >> #define DATA_FILE            "/boot/data.txt"
> > > >> #endif
> > > >>
> > > >> int main(void)
> > > >> {
> > > >>      int fd;
> > > >>      char *mem;
> > > >>      int ret;
> > > >>
> > > >>      fd = open(DATA_FILE, O_RDONLY);
> > > >>      if (fd == -1)
> > > >>              handle_error("open");
> > > >>
> > > >>      mem = mmap(NULL, SZ_1M * 4, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
> > > >>      close(fd);
> > > >>      if (mem == MAP_FAILED)
> > > >>              handle_error("mmap");
> > > >>
> > > >>      printf("1: pid=%d, mem=%p\n", getpid(), mem);
> > > >>      getchar();
> > > >>
> > > >>      mem = (char *)ALIGN((unsigned long)mem, SZ_1M * 2);
> > > >>      ret = madvise(mem, SZ_1M * 2, MADV_COLLAPSE);
> > > >>      if (ret)
> > > >>              handle_error("madvise");
> > > >>
> > > >>      printf("2: pid=%d, mem=%p\n", getpid(), mem);
> > > >>      getchar();
> > > >>
> > > >>      return 0;
> > > >> }
> > > >>
> > > >> -->8--
> > > >>
> > > >
> > > > Confused...
> > >
> > > This is a user space test case that shows the problem; data.txt needs to be at
> > > least 4MB and on a mounted ext4 and xfs filesystem. By toggling the '#if 1' to
> > > 0, you can see the different behaviours for ext4 and xfs -
> > > handle_error("madvise") fires with EINVAL in the xfs case. The getchar()s are
> > > leftovers from me looking at the smaps file.
> > >