Re: [linux-next:master] [mm] 1111d46b5c: stress-ng.pthread.ops_per_sec -84.3% regression

On 12/22/2023 2:14 AM, Matthew Wilcox wrote:
> On Thu, Dec 21, 2023 at 10:07:09AM -0800, Yang Shi wrote:
>> On Wed, Dec 20, 2023 at 8:49 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>>>
>>> On Thu, Dec 21, 2023 at 08:58:42AM +0800, Yin Fengwei wrote:
>>>> Yes. MAP_STACK is also mentioned in the mmap manpage. I did a test that
>>>> filters out the MAP_STACK mappings on top of this patch, and the regression
>>>> in stress-ng.pthread was gone. I suppose this is reasonably safe because the
>>>> madvise call is only applied to glibc-allocated stacks.
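
For illustration, a minimal userspace sketch of what excluding a MAP_STACK
mapping from THP amounts to, i.e. MADV_NOHUGEPAGE on a hand-rolled stack
mapping. This is not the kernel-side filter that was tested here, and the 8MB
size is just a common glibc default:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* Allocate a thread-stack style mapping roughly the way glibc does. */
    size_t sz = 8UL << 20;              /* illustrative size */
    void *stk = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

    if (stk == MAP_FAILED)
        return 1;
    /* Keep the huge-page fault path and khugepaged away from this VMA. */
    if (madvise(stk, sz, MADV_NOHUGEPAGE))
        perror("madvise");
    printf("stack mapping at %p, THP disabled for it\n", stk);
    munmap(stk, sz);
    return 0;
}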


>>>> But I am not sure whether such a change is worthwhile, since the regression
>>>> is only clearly visible in a micro-benchmark. There is no evidence that the
>>>> other regressions in this report are related to madvise, at least from the
>>>> perf statistics. The stream/ramspeed results need more investigation.

>>> FWIW, we had a customer report a significant performance problem when
>>> inadvertently using 2MB pages for stacks.  They were able to avoid it by
>>> using 2044KiB sized stacks ...
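
Presumably the workaround looked roughly like the following; this is a hedged
sketch rather than the customer's actual code. The 2044KiB figure comes from
the sentence above, the idea being that a stack just under 2MB cannot be
backed by a single PMD-sized huge page:

#include <pthread.h>

static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t t;

    pthread_attr_init(&attr);
    /* 2044KiB instead of the usual 2MB-multiple stack size. */
    pthread_attr_setstacksize(&attr, 2044 * 1024);
    if (pthread_create(&t, &attr, worker, NULL) == 0)
        pthread_join(&t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}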

>> Thanks for the report. This provides more justification for honoring
>> MAP_STACK on Linux. Some applications, pthread for example, just allocate a
>> fixed-size area for the stack. This confuses the kernel because the kernel
>> identifies a stack by VM_GROWSDOWN | VM_GROWSUP.
>>
>> But I'm still a little confused about why THP for the stack could result in
>> significant performance problems, unless the applications resize the stack
>> quite often.
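
To make the "fixed size area" point concrete, here is a small hypothetical
demo (not from this thread) that prints the VMA backing a pthread stack next
to the main [stack] VMA from /proc/self/maps; the thread stack shows up as an
ordinary fixed-size anonymous mapping with no [stack] annotation:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static pthread_barrier_t bar;
static uintptr_t stack_addr;

static void *worker(void *arg)
{
    int on_stack;

    (void)arg;
    stack_addr = (uintptr_t)&on_stack;  /* an address inside this thread's stack */
    pthread_barrier_wait(&bar);         /* address published */
    pthread_barrier_wait(&bar);         /* wait until main has scanned maps */
    return NULL;
}

int main(void)
{
    pthread_t t;
    char line[512];
    FILE *maps;

    pthread_barrier_init(&bar, NULL, 2);
    pthread_create(&t, NULL, worker, NULL);
    pthread_barrier_wait(&bar);

    maps = fopen("/proc/self/maps", "r");
    while (maps && fgets(line, sizeof(line), maps)) {
        unsigned long lo, hi;

        if (sscanf(line, "%lx-%lx", &lo, &hi) == 2 &&
            stack_addr >= lo && stack_addr < hi)
            printf("pthread stack: %s", line);   /* plain fixed-size anon mapping */
        if (strstr(line, "[stack]"))
            printf("main stack:    %s", line);   /* the only VMA labelled [stack] */
    }
    if (maps)
        fclose(maps);

    pthread_barrier_wait(&bar);                  /* let the worker exit */
    pthread_join(&t, NULL);
    return 0;
}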

> We didn't delve into what was causing the problem, only that it was
> happening.  The application had many threads, so it could have been as
> simple as consuming all the available THP and leaving fewer available
> for other uses.  Or it could have been a memory consumption problem;
> maybe the app would only have been using 16-32kB per thread but was
> now using 2MB per thread and if there were, say, 100 threads, that's an
> extra 199MB of memory in use.
One thing I do know is related to memory zeroing. This is from
the perf data in this report:

0.00 +16.7 16.69 ± 7% perf-profile.calltrace.cycles-pp.clear_page_erms.clear_huge_page.__do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault

Zeroing 2MB of memory costs much more CPU than zeroing 16-32KB, and that
cost multiplies when there are many threads.
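
A rough way to see that cost outside of stress-ng is a hypothetical
micro-benchmark that times the first write fault into a PMD-aligned
MADV_HUGEPAGE region (one 2MB clear) versus a small MADV_NOHUGEPAGE region
(one 4KB clear). It assumes THP is enabled in "madvise" or "always" mode, and
absolute numbers will vary:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define PMD_SIZE (2UL << 20)

static double first_touch_ns(size_t len, int advice)
{
    struct timespec a, b;
    /* Over-allocate so a PMD-aligned address exists inside the mapping. */
    char *p = mmap(NULL, len + PMD_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *q;

    if (p == MAP_FAILED)
        return 0;
    q = (char *)(((uintptr_t)p + PMD_SIZE - 1) & ~(PMD_SIZE - 1));
    madvise(p, len + PMD_SIZE, advice);
    clock_gettime(CLOCK_MONOTONIC, &a);
    q[0] = 1;                   /* first fault: kernel zeroes 4KB or 2MB here */
    clock_gettime(CLOCK_MONOTONIC, &b);
    munmap(p, len + PMD_SIZE);
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    printf("first fault, MADV_HUGEPAGE (2MB clear):   %.0f ns\n",
           first_touch_ns(PMD_SIZE, MADV_HUGEPAGE));
    printf("first fault, MADV_NOHUGEPAGE (4KB clear): %.0f ns\n",
           first_touch_ns(16UL << 10, MADV_NOHUGEPAGE));
    return 0;
}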


Regards
Yin, Fengwei



