On Mon, Aug 16, 2021 at 4:45 PM sunyucong@xxxxxxxxx <sunyucong@xxxxxxxxx> wrote: > > On Mon, Aug 16, 2021 at 4:28 PM Song Liu <song@xxxxxxxxxx> wrote: > > > > On Mon, Aug 16, 2021 at 10:54 AM Yucong Sun <fallentree@xxxxxx> wrote: > > > > > > Using a fixed delay of 1ms is proven flaky in slow CPU environment, eg. github > > > action CI system. This patch adds exponential backoff with a cap of 50ms, to > > > reduce the flakiness of the test. > > > > Do we have data showing how flaky the test is before and after this change? > > Before the change, on 2 CPU KVM on my laptop the test is perfectly > fine, on Github action (2 emulated CPU) , it appeared to fail roughly > 1 in 10 runs or even more frequently. > After the change, it appears pretty robust both on my laptop and on > github action, I ran the github action a couple times and it succeeded > every time. > > > > > > > > > Signed-off-by: Yucong Sun <fallentree@xxxxxx> > > > --- > > > tools/testing/selftests/bpf/test_maps.c | 7 ++++++- > > > 1 file changed, 6 insertions(+), 1 deletion(-) > > > > > > diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c > > > index 14cea869235b..ed92d56c19cf 100644 > > > --- a/tools/testing/selftests/bpf/test_maps.c > > > +++ b/tools/testing/selftests/bpf/test_maps.c > > > @@ -1400,11 +1400,16 @@ static void test_map_stress(void) > > > static int map_update_retriable(int map_fd, const void *key, const void *value, > > > int flags, int attempts) > > > { > > > + int delay = 1; > > > + > > > while (bpf_map_update_elem(map_fd, key, value, flags)) { > > > if (!attempts || (errno != EAGAIN && errno != EBUSY)) > > > return -errno; > > > > > > - usleep(1); > > > + if (delay < 50) > > > + delay *= 2; > > > + > > > + usleep(delay); > > > > It is a little weird that the delay times in microseconds are 2, 4, 8, > > 16, 32, 64, 64, ... > > Maybe just use rand()? > > map_update_retriable is called by test_map_update() , which is being > parallel executed in 1024 threads, so the lock contention is > intentional, I think if we introduce randomness in the delay it kind > of defeats the purpose of the test. Not really, the purpose of the test is to test a lot of concurrent updates with some collisions. It doesn't matter if there is one collision or 20. We will still get an initial collision (we only usleep() after the initial attempt fails with EAGAIN or EBUSY) even with randomized initial sleep *after* the first collision, but after that we should make sure that test eventually completes, so for that random initial delay is a good thing. I reworked the code a bit, added initial random delay between [0ms, 5ms), and then actually capped delay under 50ms (it can't go to >50ms). Note also that your code is actually working with microseconds (usleep() takes microseconds), while commit was actually talking about milliseconds. Applied to bpf-next, thanks. > My original proposal is to just increase the attempts to 10X , Andrii > proposed to use an exponential back-off, which is what I ended up > implementing. > > > > > Thanks, > > Song > > > > > attempts--; > > > } > > > > > > -- > > > 2.30.2 > > >