Re: [PATCH v1 bpf] selftests/bpf: Add exponential backoff to map_update_retriable in test_maps

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Mon, 16 Aug 2021 21:15:27 -0700

On Mon, Aug 16, 2021 at 7:19 PM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Mon, Aug 16, 2021 at 4:45 PM sunyucong@xxxxxxxxx <sunyucong@xxxxxxxxx> wrote:
> >
> > On Mon, Aug 16, 2021 at 4:28 PM Song Liu <song@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Aug 16, 2021 at 10:54 AM Yucong Sun <fallentree@xxxxxx> wrote:
> > > >
> > > > Using a fixed delay of 1ms is proven flaky in slow CPU environment, eg.  github
> > > > action CI system. This patch adds exponential backoff with a cap of 50ms, to
> > > > reduce the flakiness of the test.
> > >
> > > Do we have data showing how flaky the test is before and after this change?
> >
> > Before the change, on 2 CPU KVM on my laptop the test is perfectly
> > fine, on Github action (2 emulated CPU) , it appeared to fail roughly
> > 1 in 10 runs or even more frequently.
> > After the change, it appears pretty robust both on my laptop and on
> > github action, I ran the github action a couple times and it succeeded
> > every time.
> >
> > >
> > > >
> > > > Signed-off-by: Yucong Sun <fallentree@xxxxxx>
> > > > ---
> > > >  tools/testing/selftests/bpf/test_maps.c | 7 ++++++-
> > > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
> > > > index 14cea869235b..ed92d56c19cf 100644
> > > > --- a/tools/testing/selftests/bpf/test_maps.c
> > > > +++ b/tools/testing/selftests/bpf/test_maps.c
> > > > @@ -1400,11 +1400,16 @@ static void test_map_stress(void)
> > > >  static int map_update_retriable(int map_fd, const void *key, const void *value,
> > > >                                 int flags, int attempts)
> > > >  {
> > > > +       int delay = 1;
> > > > +
> > > >         while (bpf_map_update_elem(map_fd, key, value, flags)) {
> > > >                 if (!attempts || (errno != EAGAIN && errno != EBUSY))
> > > >                         return -errno;
> > > >
> > > > -               usleep(1);
> > > > +               if (delay < 50)
> > > > +                       delay *= 2;
> > > > +
> > > > +               usleep(delay);
> > >
> > > It is a little weird that the delay times in microseconds are 2, 4, 8,
> > > 16, 32, 64, 64, ...
> > > Maybe just use rand()?
> >
> > map_update_retriable is called by test_map_update() , which is being
> > parallel executed in 1024 threads, so the lock contention is
> > intentional, I think if we introduce randomness in the delay it kind
> > of defeats the purpose of the test.
>
> Not really, the purpose of the test is to test a lot of concurrent
> updates with some collisions. It doesn't matter if there is one
> collision or 20. We will still get an initial collision (we only
> usleep() after the initial attempt fails with EAGAIN or EBUSY) even
> with randomized initial sleep *after* the first collision, but after
> that we should make sure that test eventually completes, so for that
> random initial delay is a good thing.
>
> I reworked the code a bit, added initial random delay between [0ms,
> 5ms), and then actually capped delay under 50ms (it can't go to
> >50ms). Note also that your code is actually working with microseconds
> (usleep() takes microseconds), while commit was actually talking about
> milliseconds.
>
> Applied to bpf-next, thanks.

Fun, two out of three test runs now fail with (see [0], for an example):

  error -16 16
  test_maps: test_maps.c:1374: test_update_delete: Assertion `err == 0' failed.
  test_maps: test_maps.c:1312: __run_parallel: Assertion `status == 0' failed.
  ./libbpf/travis-ci/vmtest/run_selftests.sh: line 19:  1226 Aborted
              ./test_maps

Seems like we need the same trick for bpf_map_delete_elem() calls?..

  [0] https://github.com/kernel-patches/bpf/pull/1632/checks?check_run_id=3345916773

>
> > My original proposal is to just increase the attempts to 10X , Andrii
> > proposed to use an exponential back-off, which is what I ended up
> > implementing.
> >
> > >
> > > Thanks,
> > > Song
> > >
> > > >                 attempts--;
> > > >         }
> > > >
> > > > --
> > > > 2.30.2
> > > >