On Thu, Aug 26, 2021 at 4:09 AM Lorenz Bauer <lmb@xxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> One of the tests for our XDP-based load balancer has gotten quite
> slow, so I dug in. Roughly, it simulates 1m distinct packets arriving
> at the load balancer by calling BPF_PROG_TEST_RUN a million times.
>
> distribution_test.go:40: 1000000 iterations
> distribution_test.go:99: Coefficient of variation: 0.52%
> --- PASS: TestLoadBalancerDistribution (0.00s)
> --- PASS: TestLoadBalancerDistribution/32_endpoints (22.04s)
>
> You can see that the test takes 20s. Running the same test with slight
> variations in three threads results in this:
>
> distribution_test.go:40: 1000000 iterations
> === CONT TestLoadBalancerDistribution/32_endpoints
> distribution_test.go:99: Coefficient of variation: 0.60%
> === CONT TestLoadBalancerDistribution/64_endpoints
> distribution_test.go:99: Coefficient of variation: 0.82%
> === CONT TestLoadBalancerDistribution/128_endpoints
> distribution_test.go:99: Coefficient of variation: 1.24%
> --- PASS: TestLoadBalancerDistribution (0.00s)
> --- PASS: TestLoadBalancerDistribution/32_endpoints (55.61s)
> --- PASS: TestLoadBalancerDistribution/64_endpoints (55.61s)
> --- PASS: TestLoadBalancerDistribution/128_endpoints (55.61s)
>
> It's pretty clear that something is serialising the threads. Digging
> around in perf reveals that the culprit is bpf_prog_change_xdp called
> from bpf_prog_test_run_xdp. The call was added in f23c4b3924d2 ("bpf:
> Start using the BPF dispatcher in BPF_TEST_RUN").
>
> Is there something we can do about this? Maybe only call into the
> dispatcher when repeat > 1?

Are you doing three parallel test_run commands with repeat=1 and doing
this syscall 1m times?
yeah, that would stress bpf_dispatcher_update() logic nicely :)
3m accesses to the same mutex and flip flop of a single page with
tlb flush and text_poke_bp.

Can your test harness use test_run with repeat = 1m instead?
Or is that not possible, since the input data is different every time?

I think avoiding the xdp dispatcher for repeat=1 makes sense.
Folks might be using this facility in a similar fashion, and paying
the dispatcher penalty for a single run is unnecessary.

While at it, it would be good to add a test_run-specific xdp dispatcher,
since right now all netdevs share a single global xdp dispatcher.
100 parallel xdp test_run threads will probably fail because
they will reach the BPF_DISPATCHER_MAX limit.

Bjorn, could you make such a change?

Other ideas?

Thanks!
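
To make the repeat=1 idea concrete, here is a rough userspace toy model, not kernel code. Every name in it (toy_dispatcher, toy_test_run, toy_count_updates, the skip_single flag) is made up for illustration, and it assumes that each test run registers the program with the dispatcher before running and unregisters it afterwards. It only counts how often the expensive update step would be hit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; all names are hypothetical, none is a kernel API. */
struct toy_dispatcher {
	unsigned long updates; /* simulated dispatcher updates (the costly step) */
};

/* Stand-in for one dispatcher update (mutex, page flip, text poke). */
static void toy_dispatcher_update(struct toy_dispatcher *d)
{
	d->updates++;
}

/*
 * Stand-in for one test_run call. With skip_single set, a call whose
 * repeat is 1 bypasses the dispatcher entirely. Returns how many times
 * the simulated program ran.
 */
static unsigned long toy_test_run(struct toy_dispatcher *d,
				  unsigned long repeat, bool skip_single)
{
	bool use_dispatcher;
	unsigned long runs = 0;

	assert(d != NULL);
	if (repeat == 0)
		repeat = 1; /* assumption: repeat = 0 means a single run */
	use_dispatcher = !(skip_single && repeat == 1);

	if (use_dispatcher)
		toy_dispatcher_update(d); /* assumed: register the program */
	for (unsigned long i = 0; i < repeat; i++)
		runs++; /* the program itself would run here */
	if (use_dispatcher)
		toy_dispatcher_update(d); /* assumed: unregister the program */
	return runs;
}

/* Issue `calls` test runs with the given repeat; return the update count. */
static unsigned long toy_count_updates(unsigned long calls,
				       unsigned long repeat, bool skip_single)
{
	struct toy_dispatcher d = { .updates = 0 };

	for (unsigned long i = 0; i < calls; i++)
		toy_test_run(&d, repeat, skip_single);
	return d.updates;
}
```

In this model, 1000 calls with repeat=1 cause 2000 updates without the skip and none with it, while a single call with repeat=1000 causes 2 updates either way. That is the whole argument: batching via repeat or skipping the dispatcher for single runs both take the update step off the hot path.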