On Tue, Oct 24, 2023 at 11:09 PM Oliver Sang <oliver.sang@xxxxxxxxx> wrote:
>
> hi, Yosry Ahmed,
>
> On Tue, Oct 24, 2023 at 12:14:42AM -0700, Yosry Ahmed wrote:
> > On Mon, Oct 23, 2023 at 11:56 PM Oliver Sang <oliver.sang@xxxxxxxxx> wrote:
> > >
> > > hi, Yosry Ahmed,
> > >
> > > On Mon, Oct 23, 2023 at 07:13:50PM -0700, Yosry Ahmed wrote:
> > >
> > > ...
> > >
> > > >
> > > > I still could not run the benchmark, but I used a version of
> > > > fallocate1.c that does 1 million iterations. I ran 100 in parallel.
> > > > This showed ~13% regression with the patch, so not the same as the
> > > > will-it-scale version, but it could be an indicator.
> > > >
> > > > With that, I did not see any improvement with the fixlet above or
> > > > ___cacheline_aligned_in_smp. So you can scratch that.
> > > >
> > > > I did, however, see some improvement with reducing the indirection
> > > > layers by moving stats_updates directly into struct mem_cgroup. The
> > > > regression in my manual testing went down to 9%. Still not great, but
> > > > I am wondering how this reflects on the benchmark. If you're able to
> > > > test it that would be great, the diff is below. Meanwhile I am still
> > > > looking for other improvements that can be made.
> > >
> > > we applied the previous patch set as below:
> > >
> > > c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> > > ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> > > 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> > > 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> > > 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> > > 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything   <---- the base our tool picked for the patch set
> > >
> > > I tried to apply the patch below to either 51d74c18a9c61 or c5f50d8b23c79,
> > > but failed. could you guide us on how to apply this patch?
> > > Thanks
> > >
> >
> > Thanks for looking into this. I rebased the diff on top of
> > c5f50d8b23c79. Please find it attached.
>
> from our tests, this patch has little impact.
>
> it was applied as ac6a9444dec85 below:
>
> ac6a9444dec85 (linux-devel/fixup-c5f50d8b23c79) memcg: move stats_updates to struct mem_cgroup
> c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything
>
> for the first regression reported in the original report, data are very close
> for 51d74c18a9c61, c5f50d8b23c79 (patch-set tip, parent of ac6a9444dec85),
> and ac6a9444dec85.
> the full comparison is at [1]
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
>   gcc-12/performance/x86_64-rhel-8.3/thread/100%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
>
> 130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
> ---------------- --------------------------- --------------------------- ---------------------------
>          %stddev     %change %stddev     %change %stddev     %change %stddev
>              \          |                \          |                \          |                \
>      36509           -25.8%      27079           -25.2%      27305           -25.0%      27383        will-it-scale.per_thread_ops
>
> for the second regression reported in the original report, there seems to be
> a small impact from ac6a9444dec85.
> the full comparison is at [2]
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
>   gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
>
> 130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
> ---------------- --------------------------- --------------------------- ---------------------------
>          %stddev     %change %stddev     %change %stddev     %change %stddev
>              \          |                \          |                \          |                \
>      76580           -30.0%      53575           -28.9%      54415           -26.7%      56152        will-it-scale.per_thread_ops
>
> [1]

Thanks Oliver for running the numbers. If I understand correctly, the
will-it-scale.fallocate1 microbenchmark is the only one showing a significant
regression here, correct? In my runs, other more representative benchmarks
like netperf and will-it-scale.page_fault* show minimal regression. I would
expect practical workloads to have high concurrency of page faults or
networking, but maybe not fallocate/ftruncate. (For readers unfamiliar with
the test, its core loop is sketched at the end of this mail.)

Oliver, in your experience, how often does a regression in such a
microbenchmark translate to a real regression that people care about? (Or how
often do people dismiss it?)

I tried optimizing this further for the fallocate/ftruncate case, but without
luck. I even tried moving stats_updates into the cgroup core (struct
cgroup_rstat_cpu) to reuse the existing loop in cgroup_rstat_updated(), but it
somehow made things worse.

On the other hand, we do have some machines in production running this series,
together with a previous optimization for non-hierarchical stats [1], on an
older kernel, and we see a significant reduction in CPU time spent reading the
stats. Domenico did a similar experiment with only this series and reported
similar results [2].

Shakeel, Johannes (and other memcg folks), I personally think the benefits
here outweigh a regression in this particular benchmark, but I am obviously
biased. What do you think? (A simplified sketch of the flushing-threshold
change under discussion also follows at the end of this mail.)

[1] https://lore.kernel.org/lkml/20230726153223.821757-2-yosryahmed@xxxxxxxxxx/
[2] https://lore.kernel.org/lkml/CAFYChMv_kv_KXOMRkrmTN-7MrfgBHMcK3YXv0dPYEL7nK77e2A@xxxxxxxxxxxxxx/
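For anyone reading along without the will-it-scale source handy, the core of
the fallocate1 test is essentially the loop below. This is a paraphrase rather
than the exact harness code (the real benchmark wraps this in will-it-scale's
process/thread machinery and a shared iteration counter, and the allocation
size here is a guess), but it shows why this test is close to a worst case for
the stats update path: assuming /tmp is tmpfs as in the LKP image, every
iteration charges pages to the memcg on fallocate() and immediately uncharges
them on ftruncate(), with almost no other work in between:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define MEMSIZE (128 * 1024)	/* illustrative size, not the exact one */

int main(void)
{
	char tmpfile[] = "/tmp/willitscale.XXXXXX";
	int fd = mkstemp(tmpfile);

	if (fd < 0)
		exit(1);
	unlink(tmpfile);	/* the file lives only as long as the fd */

	for (;;) {
		/* charge pages to the memcg ... */
		if (fallocate(fd, 0, 0, MEMSIZE) < 0)
			exit(1);
		/* ... and immediately uncharge them again */
		if (ftruncate(fd, 0) < 0)
			exit(1);
	}
	return 0;
}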
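And for context on what the regressing patch actually changes, here is a
much-simplified sketch of the per-memcg flushing threshold idea from "mm:
memcg: make stats flushing threshold per-memcg". The field and helper names
are illustrative only (the real code batches the counts per-CPU and propagates
them differently), but the shape is: updaters maintain a pending-updates count
per memcg, and readers only flush a subtree that has accumulated enough
updates:

/* Illustrative sketch, not the actual memcontrol.c code. */
#define MEMCG_CHARGE_BATCH	64U

static void memcg_note_stat_update(struct mem_cgroup *memcg, int val)
{
	struct mem_cgroup *mc;

	/*
	 * Updater side: propagate the magnitude of every update up the
	 * hierarchy so each memcg knows its subtree's pending error.
	 * This walk plus the atomics is the new per-update cost that
	 * fallocate1 exposes.
	 */
	for (mc = memcg; mc; mc = parent_mem_cgroup(mc))
		atomic64_add(abs(val), &mc->stats_updates);
}

static void memcg_flush_stats(struct mem_cgroup *memcg)
{
	/*
	 * Reader side: skip the expensive rstat flush entirely unless
	 * this subtree's pending updates exceed the threshold, so
	 * readers of quiet cgroups no longer pay for flushing busy ones.
	 */
	if (atomic64_read(&memcg->stats_updates) >
	    MEMCG_CHARGE_BATCH * num_online_cpus()) {
		cgroup_rstat_flush(memcg->css.cgroup);
		atomic64_set(&memcg->stats_updates, 0);
	}
}

The ac6a9444dec85 fixlet only moves the stats_updates counter from behind the
memcg->vmstats pointer into struct mem_cgroup itself, presumably shaving one
dereference off the updater-side walk, which is consistent with the small
delta Oliver measured.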