On Mon, 11 Nov 2024 12:25:35 +0000
Ryan Roberts <ryan.roberts@xxxxxxx> wrote:

> Hi Petr,
>
> On 11/11/2024 12:14, Petr Tesarik wrote:
> > Hi Ryan,
> >
> > On Thu, 17 Oct 2024 13:32:43 +0100
> > Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>[...]
> > Third, a few micro-benchmarks saw a significant regression.
> >
> > Most notably, getenv and getenvT2 tests from libMicro were 18% and 20%
> > slower with variable page size. I don't know why, but I'm looking into
> > it. The system() library call was also about 18% slower, but that might
> > be related.
>
> OK, ouch. I think there are some things we can try to optimize the
> implementation further. But I'll wait for your analysis before digging myself.

This turned out to be a false positive. The way this microbenchmark was
invoked did not collect enough samples, so the result was mostly
determined by whether caches happened to be hot or cold, and the timing
on this specific system, with this specific sequence of benchmarks in
the suite, happened to favour my baseline kernel.

After increasing the batch count, I'm getting pretty much the same
performance for 6.11 vanilla and patched kernels:

                   prc thr   usecs/call   samples  errors  cnt/samp
getenv (baseline)    1   1      0.14975        99       0    100000
getenv (patched)     1   1      0.14981        92       0    100000

> You probably also saw the conversation with Catalin about the cost vs benefit of
> this series. Performance regressions will all need to be considered in the cost
> column, of course. So understanding the root cause and trying to reduce the
> regression as much as possible will increase chances of getting it accepted
> upstream.

Yes. Now that the biggest number is off the table, I'm going to:

- look into the dup() slowdown
- verify whether VMA split/merge operations are indeed slower

Petr T
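
P.S. For anyone who wants to sanity-check the effect without pulling in
the whole suite: it can be approximated with a trivial batched timing
loop like the one below. This is only a sketch of the methodology, not
libMicro's actual harness (there, batch size and minimum sample count
are controlled by the -B and -C options, IIRC); all names and defaults
here are made up for illustration.

/* Minimal sketch (not libMicro): time getenv() in batches to show why
 * too small a batch/sample count makes the result cache-sensitive.
 * Build: cc -O2 getenv_bench.c -o getenv_bench
 * Usage: ./getenv_bench [batch] [samples]
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
	/* batch = calls per sample; samples = number of measurements */
	long batch = argc > 1 ? atol(argv[1]) : 100000;
	int samples = argc > 2 ? atoi(argv[2]) : 100;

	for (int s = 0; s < samples; s++) {
		struct timespec t0, t1;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (long i = 0; i < batch; i++) {
			/* volatile sink so the call is not optimized away */
			volatile char *p = getenv("PATH");
			(void)p;
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);

		double ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
			    (t1.tv_nsec - t0.tv_nsec);
		printf("sample %3d: %.5f usecs/call\n",
		       s, ns / batch / 1e3);
	}
	return 0;
}

With a small batch the per-call numbers jump around depending on cache
state; with 100000 calls per sample they settle close to the ~0.15
usecs/call shown in the table above.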