On Mon, Aug 16, 2021 at 7:19 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote: > > On Mon, Aug 16, 2021 at 4:45 PM sunyucong@xxxxxxxxx <sunyucong@xxxxxxxxx> wrote: > > > > On Mon, Aug 16, 2021 at 4:28 PM Song Liu <song@xxxxxxxxxx> wrote: > > > > > > On Mon, Aug 16, 2021 at 10:54 AM Yucong Sun <fallentree@xxxxxx> wrote: > > > > > > > > Using a fixed delay of 1ms is proven flaky in slow CPU environment, eg. github > > > > action CI system. This patch adds exponential backoff with a cap of 50ms, to > > > > reduce the flakiness of the test. > > > > > > Do we have data showing how flaky the test is before and after this change? > > > > Before the change, on 2 CPU KVM on my laptop the test is perfectly > > fine, on Github action (2 emulated CPU) , it appeared to fail roughly > > 1 in 10 runs or even more frequently. > > After the change, it appears pretty robust both on my laptop and on > > github action, I ran the github action a couple times and it succeeded > > every time. > > > > > > > > > > > > > Signed-off-by: Yucong Sun <fallentree@xxxxxx> > > > > --- > > > > tools/testing/selftests/bpf/test_maps.c | 7 ++++++- > > > > 1 file changed, 6 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c > > > > index 14cea869235b..ed92d56c19cf 100644 > > > > --- a/tools/testing/selftests/bpf/test_maps.c > > > > +++ b/tools/testing/selftests/bpf/test_maps.c > > > > @@ -1400,11 +1400,16 @@ static void test_map_stress(void) > > > > static int map_update_retriable(int map_fd, const void *key, const void *value, > > > > int flags, int attempts) > > > > { > > > > + int delay = 1; > > > > + > > > > while (bpf_map_update_elem(map_fd, key, value, flags)) { > > > > if (!attempts || (errno != EAGAIN && errno != EBUSY)) > > > > return -errno; > > > > > > > > - usleep(1); > > > > + if (delay < 50) > > > > + delay *= 2; > > > > + > > > > + usleep(delay); > > > > > > It is a little weird that the delay times in microseconds are 2, 4, 8, > > > 16, 32, 64, 64, ... > > > Maybe just use rand()? > > > > map_update_retriable is called by test_map_update() , which is being > > parallel executed in 1024 threads, so the lock contention is > > intentional, I think if we introduce randomness in the delay it kind > > of defeats the purpose of the test. > > Not really, the purpose of the test is to test a lot of concurrent > updates with some collisions. It doesn't matter if there is one > collision or 20. We will still get an initial collision (we only > usleep() after the initial attempt fails with EAGAIN or EBUSY) even > with randomized initial sleep *after* the first collision, but after > that we should make sure that test eventually completes, so for that > random initial delay is a good thing. > > I reworked the code a bit, added initial random delay between [0ms, > 5ms), and then actually capped delay under 50ms (it can't go to > >50ms). Note also that your code is actually working with microseconds > (usleep() takes microseconds), while commit was actually talking about > milliseconds. > > Applied to bpf-next, thanks. Fun, two out of three test runs now fail with (see [0], for an example): error -16 16 test_maps: test_maps.c:1374: test_update_delete: Assertion `err == 0' failed. test_maps: test_maps.c:1312: __run_parallel: Assertion `status == 0' failed. ./libbpf/travis-ci/vmtest/run_selftests.sh: line 19: 1226 Aborted ./test_maps Seems like we need the same trick for bpf_map_delete_elem() calls?.. [0] https://github.com/kernel-patches/bpf/pull/1632/checks?check_run_id=3345916773 > > > My original proposal is to just increase the attempts to 10X , Andrii > > proposed to use an exponential back-off, which is what I ended up > > implementing. > > > > > > > > Thanks, > > > Song > > > > > > > attempts--; > > > > } > > > > > > > > -- > > > > 2.30.2 > > > >