On 22/01/2024 19:43, Yang Shi wrote:
> On Mon, Jan 22, 2024 at 3:37 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>
>> On 20/01/2024 16:39, Matthew Wilcox wrote:
>>> On Sat, Jan 20, 2024 at 12:04:27PM +0000, Ryan Roberts wrote:
>>>> However, after this patch, each allocation is in its own VMA, and there is a 2M
>>>> gap between each VMA. This causes 2 problems: 1) mmap becomes MUCH slower
>>>> because there are so many VMAs to check to find a new 1G gap. 2) It fails once
>>>> it hits the VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then
>>>> causes a subsequent calloc() to fail, which causes the test to fail.
>>>>
>>>> Looking at the code, I think the problem is that arm64 selects
>>>> ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area() allocates
>>>> len+2M then always aligns to the bottom of the discovered gap. That causes the
>>>> 2M hole. As far as I can see, x86 allocates bottom up, so you don't get a hole.
>>>
>>> As a quick hack, perhaps
>>> #ifdef ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
>>> take-the-top-half
>>> #else
>>> current-take-bottom-half-code
>>> #endif
>>>
>>> ?
>
> Thanks for the suggestion. It makes sense to me. Doing the alignment
> needs to take this into account.
>
>>
>> There is a general problem though that there is a trade-off between abutting
>> VMAs, and aligning them to PMD boundaries. This patch has decided that in
>> general the latter is preferable. The case I'm hitting is special though, in
>> that both requirements could be achieved but currently are not.
>>
>> The below fixes it, but I feel like there should be some bitwise magic that
>> would give the correct answer without the conditional - but my head is gone and
>> I can't see it. Any thoughts?
>
> Thanks Ryan for the patch. TBH I didn't see any bitwise magic without
> the conditional either.
>
>>
>> Beyond this, though, there is also a latent bug where the offset provided to
>> mmap() is carried all the way through to the get_unmapped_area()
>> implementation, even for MAP_ANONYMOUS - I'm pretty sure we should be
>> force-zeroing it for MAP_ANONYMOUS? Certainly before this change, for arches
>> that use the default get_unmapped_area(), any non-zero offset would not have
>> been used. But this change starts using it, which is incorrect. That said, there
>> are some arches that override the default get_unmapped_area() and do use the
>> offset. So I'm not sure if this is a bug or a feature that user space can pass
>> an arbitrary value to the implementation for anon memory??
>
> Thanks for noticing this. If I read the code correctly, the pgoff is used
> by some arches to work around VIPT caches, and it looks like it is for
> shared mappings only (just checked arm and mips). And I believe
> everybody assumes 0 should be used when doing anonymous mapping. The
> offset should have nothing to do with seeking a proper unmapped virtual
> area. But the pgoff does make sense for file THP due to the alignment
> requirements. I think it should be zero'ed for anonymous mappings,
> like:
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2ff79b1d1564..a9ed353ce627 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1830,6 +1830,7 @@ get_unmapped_area(struct file *file, unsigned
> long addr, unsigned long len,
>                 pgoff = 0;
>                 get_area = shmem_get_unmapped_area;
>         } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> +               pgoff = 0;
>                 /* Ensures that larger anonymous mappings are THP aligned. */
>                 get_area = thp_get_unmapped_area;
>         }

I think it would be cleaner to just zero pgoff if file==NULL, then it covers the
shared case, the THP case, and the non-THP case properly. I'll prepare a
separate patch for this.

>
>>
>> Finally, the second test failure I reported (ksm_tests) is actually caused by a
>> bug in the test code, but provoked by this change.
>> So I'll send out a fix for the test code separately.
>>
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 4f542444a91f..68ac54117c77 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -632,7 +632,7 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
>>  {
>>         loff_t off_end = off + len;
>>         loff_t off_align = round_up(off, size);
>> -       unsigned long len_pad, ret;
>> +       unsigned long len_pad, ret, off_sub;
>>
>>         if (off_end <= off_align || (off_end - off_align) < size)
>>                 return 0;
>> @@ -658,7 +658,13 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
>>         if (ret == addr)
>>                 return addr;
>>
>> -       ret += (off - ret) & (size - 1);
>> +       off_sub = (off - ret) & (size - 1);
>> +
>> +       if (current->mm->get_unmapped_area == arch_get_unmapped_area_topdown &&
>> +           !off_sub)
>> +               return ret + size;
>> +
>> +       ret += off_sub;
>>         return ret;
>>  }
>
> I didn't spot any problem, would you please come up with a formal patch?

Yeah, I'll aim to post today.
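
For anyone following along, here is a rough user-space model of the fix-up
arithmetic in the huge_memory.c hunk above. It is only a sketch: the function
name thp_align_model() and its parameters (gap_start, topdown) are made up for
illustration, and the boolean stands in for the kernel's comparison against
arch_get_unmapped_area_topdown. It assumes size is a power of two and that
gap_start is the start of a free gap of len + size bytes.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space model, not the kernel function.
 *
 * gap_start: start of a free gap of (len + size) bytes, as returned by
 *            the padded search.
 * off:       file offset in bytes (0 for anonymous memory).
 * size:      required alignment, a power of two (e.g. 2M for a PMD).
 * topdown:   true if the address space is searched top-down.
 */
uint64_t thp_align_model(uint64_t gap_start, uint64_t off,
			 uint64_t size, bool topdown)
{
	/* Bytes to skip so that the result is congruent to off mod size. */
	uint64_t off_sub = (off - gap_start) & (size - 1);

	/*
	 * The gap is already aligned and was found top-down: use the top
	 * of the padded gap so the mapping abuts the VMA above it instead
	 * of leaving a hole of 'size' bytes.
	 */
	if (topdown && off_sub == 0)
		return gap_start + size;

	return gap_start + off_sub;
}
```

As a worked case: with size = 2M, len = 1G and the VMA above starting at
0x80000000, a padded top-down search would find gap_start = 0x3fe00000. The
model then returns 0x40000000, so the new mapping ends exactly at 0x80000000.
Without the topdown branch it returns 0x3fe00000 and the mapping ends at
0x7fe00000, which is the 2M hole described at the top of the thread.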