On Wed, 13 Nov 2024 12:56:24 +0000
Ryan Roberts <ryan.roberts@xxxxxxx> wrote:

> On 13/11/2024 12:40, Petr Tesarik wrote:
> > On Tue, 12 Nov 2024 11:50:39 +0100
> > Petr Tesarik <ptesarik@xxxxxxxx> wrote:
> > 
> >> On Tue, 12 Nov 2024 10:19:34 +0000
> >> Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >> 
> >>> On 12/11/2024 09:45, Petr Tesarik wrote:
> >>>> On Mon, 11 Nov 2024 12:25:35 +0000
> >>>> Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>> 
> >>>>> Hi Petr,
> >>>>> 
> >>>>> On 11/11/2024 12:14, Petr Tesarik wrote:
> >>>>>> Hi Ryan,
> >>>>>> 
> >>>>>> On Thu, 17 Oct 2024 13:32:43 +0100
> >>>>>> Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>>> [...]
> >>>>>> Third, a few micro-benchmarks saw a significant regression.
> >>>>>> 
> >>>>>> Most notably, the getenv and getenvT2 tests from libMicro were 18% and
> >>>>>> 20% slower with variable page size. I don't know why, but I'm looking
> >>>>>> into it. The system() library call was also about 18% slower, but that
> >>>>>> might be related.
> >>>>> 
> >>>>> OK, ouch. I think there are some things we can try to optimize the
> >>>>> implementation further. But I'll wait for your analysis before digging
> >>>>> myself.
> >>>> 
> >>>> This turned out to be a false positive. The way this microbenchmark was
> >>>> invoked did not get enough samples, so it was mostly dependent on
> >>>> whether caches were hot or cold, and the timing on this specific system
> >>>> with the specific sequence of benchmarks in the suite happens to favour
> >>>> my baseline kernel.
> >>>> 
> >>>> After increasing the batch count, I'm getting pretty much the same
> >>>> performance for 6.11 vanilla and patched kernels:
> >>>> 
> >>>>                    prc thr  usecs/call  samples  errors  cnt/samp
> >>>> getenv (baseline)    1   1     0.14975       99       0    100000
> >>>> getenv (patched)     1   1     0.14981       92       0    100000
> >>> 
> >>> Oh that's good news! Does this account for all 3 of the above tests
> >>> (getenv, getenvT2 and system())?
> >> 
> >> It does for getenvT2 (a variant of the test with 2 threads), but not
> >> for system. Thanks for asking, I forgot about that one.
> >> 
> >> I'm getting a substantial difference there (+29% on average over 100 runs):
> >> 
> >>                    prc thr  usecs/call  samples  errors  cnt/samp  command
> >> system (baseline)    1   1  6937.18016      102       0       100  A=$$
> >> system (patched)     1   1  8959.48032      102       0       100  A=$$
> >> 
> >> So, yeah, this should in fact be my priority #1.
> > 
> > Further testing reveals that the workload is bimodal, that is to say the
> > distribution of results has two peaks. The first peak around 3.2 ms
> > covers 30% of the runs, the second peak around 15.7 ms covers 11%. Two
> > per cent are faster than the fast peak, 5% are slower than the slow
> > peak, and the rest are distributed almost evenly between them.
> 
> FWIW, one source of bimodality I've seen on Ampere systems with 2 NUMA
> nodes is placement of the kernel image vs placement of the running
> thread. If they are remote from each other, you'll see a slowdown. I've
> hacked this source away in the past by effectively using only a single
> NUMA node (with the help of the 'maxcpus' and 'mem' kernel cmdline
> options).

This system has only one NUMA node. But your comment leads in the right
direction: CPU placement does play a role here. I can consistently get
the fast results if I pin the benchmark process to a single CPU core,
or more generally to a CPU set which shares the L2 cache (as found on
eMAG). But the scheduler only considers the LLC, which (with
CONFIG_SCHED_CLUSTER=y) follows the complex affinity of the SLC.
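To illustrate what I mean by pinning, here is a minimal sketch of the
kind of affinity setup that gives me the fast results (the core number
is only an example; in practice you would add all cores that share one
L2 cluster on the given machine):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);       /* example core; add the rest of its L2 cluster */

        if (sched_setaffinity(0, sizeof(set), &set)) {
                perror("sched_setaffinity");
                return 1;
        }

        /* The affinity mask is inherited across fork()/execve(), so the
         * shell spawned by system() stays on the chosen core(s). */
        return system("A=$$");
}

The same effect can be had from the shell with taskset(1), e.g.
"taskset -c 0 <benchmark>".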
Long story short, without explicit affinity the scheduler may place a
forked child onto a CPU with a cold L2 cache, which harms short-lived
processes (like the ones created by this benchmark).

Now it all makes sense, and it is totally unrelated to dynamic page
size selection. :-)

Petr T