On Tue, 12 Nov 2024 11:50:39 +0100
Petr Tesarik <ptesarik@xxxxxxxx> wrote:

> On Tue, 12 Nov 2024 10:19:34 +0000
> Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> 
> > On 12/11/2024 09:45, Petr Tesarik wrote:
> > > On Mon, 11 Nov 2024 12:25:35 +0000
> > > Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> > > 
> > >> Hi Petr,
> > >> 
> > >> On 11/11/2024 12:14, Petr Tesarik wrote:
> > >>> Hi Ryan,
> > >>> 
> > >>> On Thu, 17 Oct 2024 13:32:43 +0100
> > >>> Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> > >> [...]
> > >>> Third, a few micro-benchmarks saw a significant regression.
> > >>> 
> > >>> Most notably, the getenv and getenvT2 tests from libMicro were 18% and
> > >>> 20% slower with variable page size. I don't know why, but I'm looking
> > >>> into it. The system() library call was also about 18% slower, but that
> > >>> might be related.
> > >> 
> > >> OK, ouch. I think there are some things we can try to optimize the
> > >> implementation further. But I'll wait for your analysis before digging
> > >> myself.
> > > 
> > > This turned out to be a false positive. The way this microbenchmark was
> > > invoked did not get enough samples, so the result depended mostly on
> > > whether caches were hot or cold, and the timing on this specific system
> > > with the specific sequence of benchmarks in the suite happens to favour
> > > my baseline kernel.
> > > 
> > > After increasing the batch count, I'm getting pretty much the same
> > > performance for the 6.11 vanilla and patched kernels:
> > > 
> > >                    prc thr  usecs/call  samples  errors  cnt/samp
> > > getenv (baseline)    1   1     0.14975       99       0    100000
> > > getenv (patched)     1   1     0.14981       92       0    100000
> > 
> > Oh, that's good news! Does this account for all 3 of the above tests
> > (getenv, getenvT2 and system())?
> 
> It does for getenvT2 (a variant of the test with 2 threads), but not
> for system. Thanks for asking, I forgot about that one.
> I'm getting a substantial difference there (+29% on average over 100
> runs):
> 
>                    prc thr   usecs/call  samples  errors  cnt/samp  command
> system (baseline)    1   1   6937.18016      102       0       100  A=$$
> system (patched)     1   1   8959.48032      102       0       100  A=$$
> 
> So, yeah, this should in fact be my priority #1.

Further testing reveals that the workload is bimodal, that is to say the
distribution of results has two peaks. The first peak, around 3.2 ms,
covers 30% of the runs; the second peak, around 15.7 ms, covers 11%. Two
per cent of the runs are faster than the fast peak, 5% are slower than
the slow peak, and the rest are distributed almost evenly between the
two. 100 samples were not sufficient to see this distribution, and it
was mere bad luck that only the patched kernel originally reported bad
results. I can now see bad results even with the unpatched kernel.

In short, I don't think there is a difference in system() performance.

I will still have a look at dup() and VMA performance, but so far it all
looks good to me. Good job! ;-)

I will also try running a more complete set of benchmarks during next
week. That's SUSE Hack Week, and I want to make a PoC for the MM changes
I proposed at LPC24, so I won't need this Ampere system for interactive
use.

Petr T
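P.S. For what it's worth, the sampling effect described above is easy to
reproduce with a toy model. The sketch below is hypothetical illustration
code, not part of the benchmark: it draws from a bimodal mixture whose
parameters are taken loosely from the figures quoted above (fast peak
near 3.2 ms, slow peak near 15.7 ms, the remainder spread almost evenly
between them; the small tails outside the peaks are folded into the
uniform part for simplicity). It then shows that the mean of only 100
samples fluctuates by tens of percent between runs of the *same*
distribution, so a +29% gap between two kernels can arise from sampling
noise alone:

```python
import random
import statistics

def system_latency(rng):
    """One simulated system() latency in ms.

    Toy model of the measured distribution: ~30% of runs near a fast
    peak at 3.2 ms, ~11% near a slow peak at 15.7 ms, the rest spread
    almost evenly in between. Parameters are illustrative.
    """
    r = rng.random()
    if r < 0.30:
        return rng.gauss(3.2, 0.2)    # fast peak
    if r < 0.41:
        return rng.gauss(15.7, 0.5)   # slow peak
    return rng.uniform(3.2, 15.7)     # spread between the peaks

rng = random.Random(1)

# Mean of 100 samples, repeated for 1000 independent "benchmark runs",
# all drawn from the same distribution (i.e. the same kernel).
means = [statistics.mean(system_latency(rng) for _ in range(100))
         for _ in range(1000)]

# Worst-case apparent difference between two runs of the same workload.
spread = (max(means) - min(means)) / min(means) * 100
print(f"100-sample means differ across runs by up to {spread:.0f}%")
assert spread > 20  # tens of percent of pure sampling noise
```

With a much larger sample count the run-to-run spread of the mean shrinks
roughly with the square root of the count, which is consistent with the
getenv numbers above converging once the batch count was increased.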