On Wed, Dec 20, 2023 at 7:42 AM Christoph Lameter (Ampere) <cl@xxxxxxxxx> wrote:
>
> On Wed, 20 Dec 2023, Yin Fengwei wrote:
>
> >> Interesting, wasn't the same regression seen last time? And I'm a
> >> little bit confused about how pthread got regressed. I didn't see the
> >> pthread benchmark do any intensive memory alloc/free operations. Do
> >> the pthread APIs do any intensive memory operations? I saw the
> >> benchmark does allocate memory for thread stack, but it should be just
> >> 8K per thread, so it should not trigger what this patch does. With
> >> 1024 threads, the thread stacks may get merged into one single VMA (8M
> >> total), but it may do so even though the patch is not applied.
> > stress-ng.pthread test code is strange here:
> >
> > https://github.com/ColinIanKing/stress-ng/blob/master/stress-pthread.c#L573
> >
> > Even it allocates its own stack, but that attr is not passed
> > to pthread_create. So it's still glibc to allocate stack for
> > pthread which is 8M size. This is why this patch can impact
> > the stress-ng.pthread testing.
>
> Hmmm... The use of calloc() for 8M triggers an mmap I guess.
>
> Why is that memory slower if we align the adress to a 2M boundary? Because
> THP can act faster and creates more overhead?

glibc calls madvise() to free the unused stack, and that may have a higher
cost with THP (splitting the PMD, the deferred split queue, etc).

>
> > while this time, the hotspot is in (pmd_lock from do_madvise I suppose):
> >     - 55.02% zap_pmd_range.isra.0
> >        - 53.42% __split_huge_pmd
> >           - 51.74% _raw_spin_lock
> >              - 51.73% native_queued_spin_lock_slowpath
> >                 + 3.03% asm_sysvec_call_function
> >           - 1.67% __split_huge_pmd_locked
> >              - 0.87% pmdp_invalidate
> >                 + 0.86% flush_tlb_mm_range
> >        - 1.60% zap_pte_range
> >           - 1.04% page_remove_rmap
> >                0.55% __mod_lruvec_page_state
>
> Ok so we have 2M mappings and they are split because of some action on 4K
> segments? Guess because of the guard pages?

It should not be related to the guard pages; it is due to freeing the
unused stack, which may cover only a partial 2M range.

>
> >> More time spent in madvise and munmap. but I'm not sure whether this
> >> is caused by tearing down the address space when exiting the test. If
> >> so it should not count in the regression.
> > It's not for the whole address space tearing down. It's for pthread
> > stack tearing down when pthread exit (can be treated as address space
> > tearing down? I suppose so).
> >
> > https://github.com/lattera/glibc/blob/master/nptl/allocatestack.c#L384
> > https://github.com/lattera/glibc/blob/master/nptl/pthread_create.c#L576
> >
> > Another thing is whether it's worthy to make stack use THP? It may be
> > useful for some apps which need large stack size?
>
> No can do since a calloc is used to allocate the stack. How can the kernel
> distinguish the allocation?

Just by VM_GROWSDOWN | VM_GROWSUP. Userspace needs to tell the kernel that
an area is a stack by setting the proper flags. For example:

ffffca1df000-ffffca200000 rw-p 00000000 00:00 0                          [stack]
Size:                132 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                  60 kB
Pss:                  60 kB
Pss_Dirty:            60 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:        60 kB
Referenced:           60 kB
Anonymous:            60 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           0
VmFlags: rd wr mr mw me gd ac

The "gd" flag means GROWSDOWN. But it totally depends on how glibc treats
the area as a "stack".
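For reference, here is a rough (untested) sketch of my own, not taken
from the patch or from glibc: only a mapping created with MAP_GROWSDOWN
gets VM_GROWSDOWN and hence the "gd" VmFlag, while a plain anonymous
mapping (which is what a calloc()- or glibc-allocated thread stack ends
up being) does not:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 8UL << 20;		/* 8M, same as the default pthread stack */

	/* Marked as a stack: the VMA gets VM_GROWSDOWN ("gd" in smaps). */
	void *stack = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0);

	/* Plain anonymous memory: no "gd" flag, the kernel just sees anon pages. */
	void *anon = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	printf("stack %p anon %p -- compare VmFlags in /proc/self/smaps\n",
	       stack, anon);
	getchar();	/* keep the process alive so smaps can be inspected */
	return 0;
}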
So glibc just uses calloc() to allocate the stack area.
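As an aside, regarding the stress-ng oddity above: if the attr were
actually passed, glibc would use the caller-provided stack instead of
mmap()ing its own 8M one. A hypothetical sketch (not the stress-ng code),
just to show the API:

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static void *worker(void *arg)
{
	return arg;
}

int main(void)
{
	pthread_attr_t attr;
	pthread_t tid;
	long page = sysconf(_SC_PAGESIZE);
	size_t stack_size = 256 * 1024;	/* must be >= PTHREAD_STACK_MIN */
	void *stack = NULL;

	if (posix_memalign(&stack, page, stack_size))
		return 1;

	pthread_attr_init(&attr);
	/* Hand the caller-allocated, page-aligned stack to glibc; without
	 * this call glibc mmap()s its own 8M default stack, which is what
	 * the THP alignment change ends up touching. */
	pthread_attr_setstack(&attr, stack, stack_size);

	pthread_create(&tid, &attr, worker, NULL);
	pthread_join(tid, NULL);

	pthread_attr_destroy(&attr);
	free(stack);
	return 0;
}

(Compile with -pthread; the stack has to be at least PTHREAD_STACK_MIN
and suitably aligned.)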