On Thu, Sep 28, 2023 at 2:04 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote: > > On Thu, Sep 28, 2023 at 12:43:57PM -0700, Zach O'Keefe wrote: > > Hey Ryan, > > > > Thanks for bringing this up. > > > > On Thu, Sep 28, 2023 at 4:59 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote: > > > > > > On 28/09/2023 11:54, Bagas Sanjaya wrote: > > > > On Thu, Sep 28, 2023 at 10:55:17AM +0100, Ryan Roberts wrote: > > > >> Hi all, > > > >> > > > >> I've just noticed that when applied to a file mapping for a file on xfs, MADV_COLLAPSE returns EINVAL. The same test case works fine if the file is on ext4. > > > >> > > > >> I think the root cause is that the implementation bails out if it finds a (non-PMD-sized) large folio in the page cache for any part of the file covered by the region. XFS does readahead into large folios so we hit this issue. See khugepaged.h:collapse_file(): > > > >> > > > >> if (PageTransCompound(page)) { > > > >> struct page *head = compound_head(page); > > > >> > > > >> result = compound_order(head) == HPAGE_PMD_ORDER && > > > >> head->index == start > > > >> /* Maybe PMD-mapped */ > > > >> ? SCAN_PTE_MAPPED_HUGEPAGE > > > >> : SCAN_PAGE_COMPOUND; > > > >> goto out_unlock; > > > >> } > > > > > > > > Ya, non-PMD-sized THPs were just barely visible in my peripherals when > > writing this, and I'm still woefully behind on your work on them now > > (sorry!). > > > > I'd like to eventually make collapse (not just MADV_COLLAPSE, but > > khugepaged too) support arbitrary-sized large folios in general, but > > I'm very pressed for time right now. I think M. Wilcox is also > > interested in this, given he left the TODO to support it :P > > Is the point of MADV_COLLAPSE to replace base pages with PMD-sized pages > in the pagecache for faster lookups? Or merely to replace them with > something larger, even if it's not PMD sized? Might depend on who you ask, but IMHO, the principle purpose of collapse is saving TLB entries, with TLB coalescing complicating things a little in terms of PMD-sized things or not. M. Wilcox's work with descriptor-izing folios might make a nice case for memory savings as well, down the line. > As of 6.6, XFS asks for folios of size min(read/readahead/write_len, > ondisk_mapping_length), so in theory the folio size should roughly > follow the access patterns. If the goal is merely larger folios, then > we are done here and can move on to some other part of the collapse. > > OTOH if the goal is TLB savings, then I suppose you'd actually /want/ to > select a large (but not PMD) folio for collapsing into a PMD sized > folio, right? I suppose it might make some operations easier / faster during collapse if we have less folios to process. > e.g. > > if (PageTransCompound(page)) { > struct page *head = compound_head(page); > > if (head->index != start) { > /* not sure what _COMPOUND means here... */ > result = SCAN_PAGE_COMPOUND; > goto out_unlock; > } > > if (compound_order(head) == HPAGE_PMD_ORDER) { > result = SCAN_PTE_MAPPED_HUGEPAGE; > goto out_unlock; > } > > /* result is still SCAN_SUCCEED, keep going */ > } > > I /think/ that would work? If the largefolio is dirty or not fully > uptodate then collapse won't touch it; and I think fs/iomap handles this > in a compatible way because it won't mark the folio uptodate until all > the blocks have been read, and it marks the folio dirty if any of the > blocks are dirty. > > (says me, who doesn't really understand this part of the code.) I think there's a couple issues with this -- for example, the head->index != start case is going to be the common-case for non-PMD-sized large folios. Regardless, there is some more code in hpage_collapse_scan_file() and her in collapse_file() that would need to be updated. I'm taking a cursory look, and naively it doesn't look too bad -- most things "should just work" in file/shmem collapse path. ac492b9c70cac ("mm/khugepaged: skip shmem with userfaultfd" was merged since the last I looked carefully at this path, so I would need to spend more time understanding some changes there. So, from correctness POV, maybe there's not anything drastic to be done for file/shmem. Maybe this is a good place to start. For anon, things are different, as we are coming at the pages from a different angle, and are operating over the pmd directly. I'm not immediately sure if it makes things easier or harder. Probably harder. Can we even get non-PMD-sized large anon folios right now, without Ryan's work? >From a khugepaged policy POV, there are some questions to be answered.. but I think these mostly boil down to: scale the present/swap/none checks by (1 << order). Anyways, this isn't to be taken with much weight as a thorough audit is required to understand any subtleties lurking around. Thanks, Zach > --D > > > Thank you for the reproducer though! I haven't run it, but I'll > > probably come back here to steal it when the time comes. > > > > > > I don't see any hint to -EINVAL above. Am I missing something? > > > > > > The SCAN_PAGE_COMPOUND result ends up back at madvise_collapse() where it > > > eventually gets converted to -EINVAL by madvise_collapse_errno(). > > > > > > > > > > >> > > > >> I'm not sure if this is already a known issue? I don't have time to work on a fix for this right now, so thought I would highlight it at least. I might get around to it at some point in the future if nobody else tackles it. > > > > My guess is Q1 2024 is when I'd be able to look into this, at the > > current level of urgency. It doesn't sound like it's blocking anything > > for your work right now -- lmk if that changes though! > > > > Thanks, > > Zach > > > > > > > > > >> > > > >> Thanks, > > > >> Ryan > > > >> > > > >> > > > >> Test case I've been using: > > > >> > > > >> -->8-- > > > >> > > > >> #include <stdio.h> > > > >> #include <stdlib.h> > > > >> #include <sys/mman.h> > > > >> #include <sys/types.h> > > > >> #include <sys/stat.h> > > > >> #include <fcntl.h> > > > >> #include <unistd.h> > > > >> > > > >> #ifndef MADV_COLLAPSE > > > >> #define MADV_COLLAPSE 25 > > > >> #endif > > > >> > > > >> #define handle_error(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0) > > > >> > > > >> #define SZ_1K 1024 > > > >> #define SZ_1M (SZ_1K * SZ_1K) > > > >> #define ALIGN(val, align) (((val) + ((align) - 1)) & ~((align) - 1)) > > > >> > > > >> #if 1 > > > >> // ext4 > > > >> #define DATA_FILE "/home/ubuntu/data.txt" > > > >> #else > > > >> // xfs > > > >> #define DATA_FILE "/boot/data.txt" > > > >> #endif > > > >> > > > >> int main(void) > > > >> { > > > >> int fd; > > > >> char *mem; > > > >> int ret; > > > >> > > > >> fd = open(DATA_FILE, O_RDONLY); > > > >> if (fd == -1) > > > >> handle_error("open"); > > > >> > > > >> mem = mmap(NULL, SZ_1M * 4, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0); > > > >> close(fd); > > > >> if (mem == MAP_FAILED) > > > >> handle_error("mmap"); > > > >> > > > >> printf("1: pid=%d, mem=%p\n", getpid(), mem); > > > >> getchar(); > > > >> > > > >> mem = (char *)ALIGN((unsigned long)mem, SZ_1M * 2); > > > >> ret = madvise(mem, SZ_1M * 2, MADV_COLLAPSE); > > > >> if (ret) > > > >> handle_error("madvise"); > > > >> > > > >> printf("2: pid=%d, mem=%p\n", getpid(), mem); > > > >> getchar(); > > > >> > > > >> return 0; > > > >> } > > > >> > > > >> -->8-- > > > >> > > > > > > > > Confused... > > > > > > This is a user space test case that shows the problem; data.txt needs to be at > > > least 4MB and on a mounted ext4 and xfs filesystem. By toggling the '#if 1' to > > > 0, you can see the different behaviours for ext4 and xfs - > > > handle_error("madvise") fires with EINVAL in the xfs case. The getchar()s are > > > leftovers from me looking at the smaps file. > > >