On Wed, 20 Dec 2023, Yin Fengwei wrote:
Interesting, wasn't the same regression seen last time? And I'm a
little bit confused about how the pthread benchmark regressed. I didn't
see the pthread benchmark do any intensive memory alloc/free
operations. Do the pthread APIs do any intensive memory operations? I
saw the benchmark does allocate memory for the thread stacks, but that
should be just 8K per thread, so it should not trigger what this patch
does. With 1024 threads the thread stacks may get merged into one
single VMA (8M total), but that may happen even without the patch
applied.
The stress-ng.pthread test code is strange here:
https://github.com/ColinIanKing/stress-ng/blob/master/stress-pthread.c#L573
Even though it allocates its own stack, that attr is not passed to
pthread_create. So it is still glibc that allocates the stack for each
pthread, and that stack is 8M in size. This is why this patch can
impact the stress-ng.pthread test.
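Something like this minimal sketch shows the pattern (not stress-ng's
actual code, and the size is only illustrative): a stack is allocated
and put into an attr, but NULL is passed to pthread_create(), so glibc
still mmap()s its default ~8M stack and the hand-allocated one goes
unused:

#include <pthread.h>
#include <stdlib.h>

#define STACK_SIZE (8UL << 20)

static void *worker(void *arg)
{
	return NULL;
}

int main(void)
{
	pthread_attr_t attr;
	pthread_t tid;
	void *stack = calloc(1, STACK_SIZE);	/* app-allocated stack */

	pthread_attr_init(&attr);
	pthread_attr_setstack(&attr, stack, STACK_SIZE);

	/* The attr is never passed: glibc mmap()s its own ~8M stack for
	   the new thread, and `stack' above is simply wasted. */
	pthread_create(&tid, NULL, worker, NULL);
	pthread_join(tid, NULL);

	pthread_attr_destroy(&attr);
	free(stack);
	return 0;
}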
Hmmm... The use of calloc() for 8M triggers an mmap, I guess.
Why is that memory slower if we align the address to a 2M boundary?
Because THP kicks in faster and creates more overhead?
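(An 8M calloc() is far above glibc's default M_MMAP_THRESHOLD of 128K,
so it should be serviced by mmap() directly rather than from the heap;
with the alignment patch that mapping then starts on a 2M boundary and
is THP-eligible. A quick check of the mmap part, assuming glibc 2.33+
for mallinfo2():

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	struct mallinfo2 before = mallinfo2();
	void *p = calloc(1, 8UL << 20);	/* well above M_MMAP_THRESHOLD */
	struct mallinfo2 after = mallinfo2();

	/* hblks/hblkhd count chunks glibc serviced directly via mmap(). */
	printf("mmapped chunks %zu -> %zu, mmapped bytes %zu -> %zu\n",
	       before.hblks, after.hblks, before.hblkhd, after.hblkhd);

	free(p);
	return 0;
}
)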
While this time the hotspot is pmd_lock (taken from do_madvise, I suppose):
- 55.02% zap_pmd_range.isra.0
- 53.42% __split_huge_pmd
- 51.74% _raw_spin_lock
- 51.73% native_queued_spin_lock_slowpath
+ 3.03% asm_sysvec_call_function
- 1.67% __split_huge_pmd_locked
- 0.87% pmdp_invalidate
+ 0.86% flush_tlb_mm_range
- 1.60% zap_pte_range
- 1.04% page_remove_rmap
0.55% __mod_lruvec_page_state
OK, so we have 2M mappings and they are split because of some action on
4K segments? I guess because of the guard pages?
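If it is the guard pages, the mechanism would look roughly like this
sketch (an assumption about the cause, not something taken from the
profile): mark one 4K page of a THP-backed region PROT_NONE, the way a
stack guard page is set up, and the huge PMD covering it has to be
split:

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define SZ    (8UL << 20)	/* 8M, the default pthread stack size */
#define ALIGN (2UL << 20)	/* PMD / THP size on x86_64 */

int main(void)
{
	/* Over-allocate and align by hand so the region starts on a 2M
	   boundary (the patch under discussion does this automatically
	   for large anonymous mmap()s). */
	char *raw = mmap(NULL, SZ + ALIGN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *p = (char *)(((uintptr_t)raw + ALIGN - 1) & ~(ALIGN - 1));

	madvise(p, SZ, MADV_HUGEPAGE);	/* ask for THP backing */
	memset(p, 0, SZ);		/* fault the range in */

	/* A single 4K guard page at the low end: the 2M huge PMD covering
	   it can no longer be mapped as one unit, so the kernel has to
	   take the __split_huge_pmd path. */
	mprotect(p, 4096, PROT_NONE);

	munmap(raw, SZ + ALIGN);
	return 0;
}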
More time is spent in madvise and munmap, but I'm not sure whether this
is caused by tearing down the address space when the test exits. If so,
it should not count toward the regression.
It's not the whole address space being torn down. It's the pthread
stack being torn down when a pthread exits (which can be treated as
address space teardown, I suppose):
https://github.com/lattera/glibc/blob/master/nptl/allocatestack.c#L384
https://github.com/lattera/glibc/blob/master/nptl/pthread_create.c#L576
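The pthread_create.c line above is the thread-exit path where glibc
hands the unused part of the stack back with madvise(MADV_DONTNEED).
Roughly, it does something like the sketch below (simplified, not
glibc's exact code); on a THP-backed stack that range is generally not
2M-aligned, so zap_pmd_range() has to split the huge PMD first, which
is the __split_huge_pmd/pmd_lock contention in the profile:

#define _GNU_SOURCE
#include <limits.h>		/* PTHREAD_STACK_MIN */
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

/* Simplified version of what glibc does when a thread exits: everything
   below the current stack frame is handed back with MADV_DONTNEED. */
static void release_unused_stack(char *stackblock, size_t stackblock_size)
{
	size_t pagesize_m1 = (size_t) sysconf(_SC_PAGESIZE) - 1;
	char *sp = (char *) __builtin_frame_address(0);
	size_t freesize = (size_t) (sp - stackblock) & ~pagesize_m1;

	if (freesize > PTHREAD_STACK_MIN && freesize < stackblock_size)
		madvise(stackblock, freesize - PTHREAD_STACK_MIN,
			MADV_DONTNEED);
}

static void *worker(void *arg)
{
	pthread_attr_t attr;
	void *stack;
	size_t size;

	/* Find the glibc-allocated stack this thread is running on. */
	pthread_getattr_np(pthread_self(), &attr);
	pthread_attr_getstack(&attr, &stack, &size);
	pthread_attr_destroy(&attr);

	release_unused_stack(stack, size);
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, worker, NULL);
	pthread_join(tid, NULL);
	return 0;
}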
Another question is whether it's worthwhile to make the stack use THP.
It might be useful for some apps which need a large stack size?
No can do, since calloc() is used to allocate the stack. How can the
kernel distinguish that allocation from any other?
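From a calloc()'d buffer the kernel cannot tell a stack from any other
anonymous memory. An app that really wants a THP-backed stack would
have to opt in itself, roughly like this sketch (error handling
omitted; mmap() + MADV_HUGEPAGE + pthread_attr_setstack is just one
assumed way to do it, not anything the patch provides):

#define _GNU_SOURCE
#include <pthread.h>
#include <sys/mman.h>

#define STACK_SIZE (8UL << 20)		/* 8M */

static void *worker(void *arg)
{
	return NULL;
}

int main(void)
{
	pthread_attr_t attr;
	pthread_t tid;

	/* Anonymous mapping used as the thread stack (MAP_STACK is only a
	   hint).  Align to 2M by hand if full THP coverage is wanted. */
	void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

	/* Explicit opt-in: no need for the kernel to guess. */
	madvise(stack, STACK_SIZE, MADV_HUGEPAGE);

	pthread_attr_init(&attr);
	pthread_attr_setstack(&attr, stack, STACK_SIZE);
	pthread_create(&tid, &attr, worker, NULL);	/* attr actually used */
	pthread_join(tid, NULL);

	pthread_attr_destroy(&attr);
	munmap(stack, STACK_SIZE);
	return 0;
}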