On Tue, 12 Nov 2024 11:50:39 +0100
Petr Tesarik <ptesarik@xxxxxxxx> wrote:

> On Tue, 12 Nov 2024 10:19:34 +0000
> Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> 
> > On 12/11/2024 09:45, Petr Tesarik wrote:
> > > On Mon, 11 Nov 2024 12:25:35 +0000
> > > Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> > > 
> > >> Hi Petr,
> > >> 
> > >> On 11/11/2024 12:14, Petr Tesarik wrote:
> > >>> Hi Ryan,
> > >>> 
> > >>> On Thu, 17 Oct 2024 13:32:43 +0100
> > >>> Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> > >> [...]
> > >>> Third, a few micro-benchmarks saw a significant regression.
> > >>> 
> > >>> Most notably, the getenv and getenvT2 tests from libMicro were 18% and
> > >>> 20% slower with variable page size. I don't know why, but I'm looking
> > >>> into it. The system() library call was also about 18% slower, but that
> > >>> might be related.
> > >> 
> > >> OK, ouch. I think there are some things we can try to optimize the
> > >> implementation further. But I'll wait for your analysis before digging
> > >> myself.
> > > 
> > > This turned out to be a false positive. The way this microbenchmark was
> > > invoked did not get enough samples, so the result depended mostly on
> > > whether caches were hot or cold, and the timing on this specific system
> > > with the specific sequence of benchmarks in the suite happens to favour
> > > my baseline kernel.
> > > 
> > > After increasing the batch count, I'm getting pretty much the same
> > > performance for the 6.11 vanilla and patched kernels:
> > > 
> > >                    prc thr  usecs/call  samples  errors  cnt/samp
> > > getenv (baseline)    1   1     0.14975       99       0    100000
> > > getenv (patched)     1   1     0.14981       92       0    100000
> > 
> > Oh, that's good news! Does this account for all 3 of the above tests
> > (getenv, getenvT2 and system())?
> 
> It does for getenvT2 (a variant of the test with 2 threads), but not
> for system. Thanks for asking, I forgot about that one.
> I'm getting a substantial difference there (+29% on average over 100
> runs):
> 
>                    prc thr   usecs/call  samples  errors  cnt/samp  command
> system (baseline)    1   1   6937.18016      102       0       100  A=$$
> system (patched)     1   1   8959.48032      102       0       100  A=$$
> 
> So, yeah, this should in fact be my priority #1.

Further testing reveals that the workload is bimodal, that is to say the
distribution of results has two peaks. The first peak, around 3.2 ms,
covers 30% of the runs; the second peak, around 15.7 ms, covers 11%. Two
per cent of the runs are faster than the fast peak, 5% are slower than
the slow peak, and the rest are distributed almost evenly between the
two. 100 samples were not sufficient to see this distribution, and it
was mere bad luck that only the patched kernel originally reported bad
results. I can now see bad results even with the unpatched kernel.

In short, I don't think there is a difference in system() performance.

I will still have a look at dup() and VMA performance, but so far it all
looks good to me. Good job! ;-)

I will also try running a more complete set of benchmarks during next
week. That's SUSE Hack Week, and I want to make a PoC for the MM changes
I proposed at LPC24, so I won't need this Ampere system for interactive
use.

Petr T
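P.S. For what it's worth, the sampling effect described above is easy to
reproduce with a toy model. The sketch below is hypothetical illustration
code, not part of the benchmark: it draws from a bimodal mixture whose
parameters are taken loosely from the figures quoted above (fast peak
near 3.2 ms, slow peak near 15.7 ms, the remainder spread almost evenly
between them; the small tails outside the peaks are folded into the
uniform part for simplicity). It then shows that the mean of only 100
samples fluctuates by tens of percent between runs of the *same*
distribution, so a +29% gap between two kernels can arise from sampling
noise alone:

```python
import random
import statistics

def system_latency(rng):
    """One simulated system() latency in ms.

    Toy model of the measured distribution: ~30% of runs near a fast
    peak at 3.2 ms, ~11% near a slow peak at 15.7 ms, the rest spread
    almost evenly in between. Parameters are illustrative.
    """
    r = rng.random()
    if r < 0.30:
        return rng.gauss(3.2, 0.2)    # fast peak
    if r < 0.41:
        return rng.gauss(15.7, 0.5)   # slow peak
    return rng.uniform(3.2, 15.7)     # spread between the peaks

rng = random.Random(1)

# Mean of 100 samples, repeated for 1000 independent "benchmark runs",
# all drawn from the same distribution (i.e. the same kernel).
means = [statistics.mean(system_latency(rng) for _ in range(100))
         for _ in range(1000)]

# Worst-case apparent difference between two runs of the same workload.
spread = (max(means) - min(means)) / min(means) * 100
print(f"100-sample means differ across runs by up to {spread:.0f}%")
assert spread > 20  # tens of percent of pure sampling noise
```

With a much larger sample count the run-to-run spread of the mean shrinks
roughly with the square root of the count, which is consistent with the
getenv numbers above converging once the batch count was increased.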