On Mon, Jun 27, 2022 at 2:34 PM Feng Tang <feng.tang@xxxxxxxxx> wrote:
>
> On Mon, Jun 27, 2022 at 10:46:21AM +0200, Eric Dumazet wrote:
> > On Mon, Jun 27, 2022 at 4:38 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> [snip]
> > > > >
> > > > > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes
> > > > > in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making
> > > > > sure that the netperf test has already run?
> > > >
> > > > memory.kmem.tcp.max_usage_in_bytes:0
> > >
> > > Sorry, I made a mistake: in the original report from Oliver, it was
> > > 'cgroup v2' with a 'debian-11.1' rootfs.
> > >
> > > When you asked about the cgroup info, I tried the job on another tbox,
> > > and the original 'job.yaml' didn't work, so I kept the 'netperf' test
> > > parameters and started a new job, which somehow ran with a 'debian-10.4'
> > > rootfs and actually ran with cgroup v1.
> > >
> > > And as you mentioned, the cgroup version does make a big difference:
> > > with v1, the regression is reduced to 1% ~ 5% on different generations
> > > of test platforms. Eric mentioned they also got a regression report,
> > > but a much smaller one; maybe that's due to the cgroup version?
> >
> > This was using the current net-next tree.
> > The recipe used was something like:
> >
> > Make sure cgroup2 is mounted, or mount it by: mount -t cgroup2 none $MOUNT_POINT.
> > Enable the memory controller by: echo +memory > $MOUNT_POINT/cgroup.subtree_control.
> > Create a cgroup by: mkdir $MOUNT_POINT/job.
> > Jump into that cgroup by: echo $$ > $MOUNT_POINT/job/cgroup.procs.
> >
> > <Launch tests>
> >
> > The regression was smaller than 1%, so it was considered noise compared
> > to the benefits of the bug fix.
>
> Yes, 1% is just around the noise level for a microbenchmark.
>
> I went back and checked the original test data in Oliver's report: the
> tests were run for 6 rounds and the performance data is pretty stable
> (0Day's report will show any std deviation bigger than 2%).
>
> The test platform is a 4-socket 72C/144T machine, and I ran the same
> job (nr_tasks = 25% * nr_cpus) on one CascadeLake AP (4 nodes) and one
> Icelake 2-socket platform, and saw 75% and 53% regressions on them.
>
> In the first email, there is a file named 'reproduce'; it shows the
> basic test process:
>
> "
> use 'performance' cpufreq governor for all CPUs
>
> netserver -4 -D
> modprobe sctp
> netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K &
> netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K &
> netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K &
> (repeat 36 times in total)
> ...
> "
>
> This starts 36 (25% of nr_cpus) netperf clients. The number of clients
> also matters: when I increased it from 36 to 72 (50%), the regression
> changed from 69.4% to 73.7%.
>

This seems like a lot of opportunities for memcg folks :)

struct page_counter has poor field placement [1], and no per-cpu cache.

[1] "atomic_long_t usage" is sharing a cache line with read-mostly fields.

(struct mem_cgroup also has poor field placement, mainly because of
struct page_counter)

    28.69%  [kernel]  [k] copy_user_enhanced_fast_string
    16.13%  [kernel]  [k] intel_idle_irq
     6.46%  [kernel]  [k] page_counter_try_charge
     6.20%  [kernel]  [k] __sk_mem_reduce_allocated
     5.68%  [kernel]  [k] try_charge_memcg
     5.16%  [kernel]  [k] page_counter_cancel

> Thanks,
> Feng
> >
> > >
> > > Thanks,
> > > Feng
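
[Editor's addendum] To make the false-sharing point in [1] concrete, here is a
minimal userspace sketch. It is not the actual struct page_counter layout and
not a proposed kernel patch; the field names and the 64-byte cache-line size
are assumptions for illustration only. It just contrasts a layout where a hot
atomic counter shares a cache line with read-mostly fields against one where
the counter is given a line of its own.

/*
 * Illustrative sketch (userspace C11), not kernel code.
 * Build with: gcc -std=c11 -O2 false_sharing_sketch.c
 */
#include <stdalign.h>
#include <stdatomic.h>
#include <stdio.h>

#define CACHELINE_SIZE 64	/* assumed cache-line size */

/* Problematic layout: 'usage' shares its cache line with read-mostly fields,
 * so every charge/uncharge invalidates the line that readers need. */
struct counter_shared {
	atomic_long usage;	/* written on every charge/uncharge */
	long min;		/* read-mostly limit */
	long low;		/* read-mostly limit */
	long high;		/* read-mostly limit */
	long max;		/* read-mostly limit */
	void *parent;		/* read-mostly, walked on every charge */
};

/* Friendlier layout: the hot counter gets a cache line of its own, and the
 * read-mostly fields start on the next line. */
struct counter_padded {
	alignas(CACHELINE_SIZE) atomic_long usage;
	alignas(CACHELINE_SIZE) long min;
	long low;
	long high;
	long max;
	void *parent;
};

int main(void)
{
	/* The padded struct is larger, trading memory for fewer cache-line
	 * bounces between writers of 'usage' and readers of the limits. */
	printf("shared layout: %zu bytes, padded layout: %zu bytes\n",
	       sizeof(struct counter_shared), sizeof(struct counter_padded));
	return 0;
}

In kernel code the same separation is usually expressed with annotations such
as ____cacheline_aligned_in_smp rather than C11 alignas, but the trade-off is
the same: a little extra padding in exchange for keeping the write-hot counter
off the read-mostly cache line.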