Re: [PATCH bpf-next] Improve BPF test stability (related to perf events and scheduling)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Feb 22, 2022 at 12:35 PM Mykola Lysenko <mykolal@xxxxxx> wrote:
>
> Thanks for the review Andrii!
>
> > On Feb 19, 2022, at 8:39 PM, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Fri, Feb 18, 2022 at 4:30 PM Mykola Lysenko <mykolal@xxxxxx> wrote:
> >>
> >> In send_signal, replace sleep with dummy cpu intensive computation
> >> to increase probability of child process being scheduled. Add few
> >> more asserts.
> >>
> >> In find_vma, reduce sample_freq as higher values may be rejected in
> >> some qemu setups, remove usleep and increase length of cpu intensive
> >> computation.
> >>
> >> In bpf_cookie, perf_link and perf_branches, reduce sample_freq as
> >> higher values may be rejected in some qemu setups
> >>
> >> Signed-off-by: Mykola Lysenko <mykolal@xxxxxx>
> >> ---
> >> .../testing/selftests/bpf/prog_tests/bpf_cookie.c  |  2 +-
> >> tools/testing/selftests/bpf/prog_tests/find_vma.c  |  5 ++---
> >> .../selftests/bpf/prog_tests/perf_branches.c       |  4 ++--
> >> tools/testing/selftests/bpf/prog_tests/perf_link.c |  2 +-
> >> .../testing/selftests/bpf/prog_tests/send_signal.c | 14 ++++++++++----
> >> 5 files changed, 16 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_cookie.c b/tools/testing/selftests/bpf/prog_tests/bpf_cookie.c
> >> index cd10df6cd0fc..0612e79a9281 100644
> >> --- a/tools/testing/selftests/bpf/prog_tests/bpf_cookie.c
> >> +++ b/tools/testing/selftests/bpf/prog_tests/bpf_cookie.c
> >> @@ -199,7 +199,7 @@ static void pe_subtest(struct test_bpf_cookie *skel)
> >>        attr.type = PERF_TYPE_SOFTWARE;
> >>        attr.config = PERF_COUNT_SW_CPU_CLOCK;
> >>        attr.freq = 1;
> >> -       attr.sample_freq = 4000;
> >> +       attr.sample_freq = 1000;
> >>        pfd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, PERF_FLAG_FD_CLOEXEC);
> >>        if (!ASSERT_GE(pfd, 0, "perf_fd"))
> >>                goto cleanup;
> >> diff --git a/tools/testing/selftests/bpf/prog_tests/find_vma.c b/tools/testing/selftests/bpf/prog_tests/find_vma.c
> >> index b74b3c0c555a..acc41223a112 100644
> >> --- a/tools/testing/selftests/bpf/prog_tests/find_vma.c
> >> +++ b/tools/testing/selftests/bpf/prog_tests/find_vma.c
> >> @@ -30,7 +30,7 @@ static int open_pe(void)
> >>        attr.type = PERF_TYPE_HARDWARE;
> >>        attr.config = PERF_COUNT_HW_CPU_CYCLES;
> >>        attr.freq = 1;
> >> -       attr.sample_freq = 4000;
> >> +       attr.sample_freq = 1000;
> >>        pfd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, PERF_FLAG_FD_CLOEXEC);
> >>
> >>        return pfd >= 0 ? pfd : -errno;
> >> @@ -57,7 +57,7 @@ static void test_find_vma_pe(struct find_vma *skel)
> >>        if (!ASSERT_OK_PTR(link, "attach_perf_event"))
> >>                goto cleanup;
> >>
> >> -       for (i = 0; i < 1000000; ++i)
> >> +       for (i = 0; i < 1000000000; ++i)
> >
> > 1bln seems excessive... maybe 10mln would be enough?
>
> See explanation for send_signal test case below
>
> >
> >>                ++j;
> >>
> >>        test_and_reset_skel(skel, -EBUSY /* in nmi, irq_work is busy */);
> >
> > [...]
> >
> >> diff --git a/tools/testing/selftests/bpf/prog_tests/send_signal.c b/tools/testing/selftests/bpf/prog_tests/send_signal.c
> >> index 776916b61c40..841217bd1df6 100644
> >> --- a/tools/testing/selftests/bpf/prog_tests/send_signal.c
> >> +++ b/tools/testing/selftests/bpf/prog_tests/send_signal.c
> >> @@ -4,11 +4,12 @@
> >> #include <sys/resource.h>
> >> #include "test_send_signal_kern.skel.h"
> >>
> >> -int sigusr1_received = 0;
> >> +int sigusr1_received;
> >> +volatile int volatile_variable;
> >
> > please make them static
>
> sure
>
> >
> >>
> >> static void sigusr1_handler(int signum)
> >> {
> >> -       sigusr1_received++;
> >> +       sigusr1_received = 1;
> >> }
> >>
> >> static void test_send_signal_common(struct perf_event_attr *attr,
> >> @@ -42,7 +43,9 @@ static void test_send_signal_common(struct perf_event_attr *attr,
> >>                int old_prio;
> >>
> >>                /* install signal handler and notify parent */
> >> +               errno = 0;
> >>                signal(SIGUSR1, sigusr1_handler);
> >> +               ASSERT_OK(errno, "signal");
> >
> > just ASSERT_OK(signal(...), "signal");
>
> I am fine to merge signal and ASSERT lines, but will substitute with condition "signal(SIGUSR1, sigusr1_handler) != SIG_ERR”, sounds good?
>

Ah, signal is a bit special with return values. Yeah,
ASSERT_NEQ(signal(...), SIG_ERR, "signal") sounds good.

> >
> >>
> >>                close(pipe_c2p[0]); /* close read */
> >>                close(pipe_p2c[1]); /* close write */
> >> @@ -63,9 +66,12 @@ static void test_send_signal_common(struct perf_event_attr *attr,
> >>                ASSERT_EQ(read(pipe_p2c[0], buf, 1), 1, "pipe_read");
> >>
> >>                /* wait a little for signal handler */
> >> -               sleep(1);
> >> +               for (int i = 0; i < 1000000000; i++)
> >
> > same about 1bln
>
> With 10mln and 100 test runs I got 86 failures
> 100mln - 63 failures
> 1bln - 0 failures on 100 runs
>
> Now, there is performance concern for this test. Running
>
> time sudo  ./test_progs -t send_signal/send_signal_nmi_thread
>
> With 1bln takes ~4s
> 100mln - 1s.
> Unchanged test with sleep(1); takes ~2s.
>
> On the other hand 300mln runs ~2s, and only fails 1 time per 100 runs. As 300mln does not regress performance comparing to the current “sleep(1)” implementation, I propose to go with it. What do you think?


I think if we need to burn multiple seconds of CPU to make the test
reliable, then we should either rework or disable/remove the test. In
CI those billions of iterations will be much slower. And even waiting
for 4 seconds for just one test is painful.

Yonghong, WDYT? Should we just drop thi test? It has caused us a bunch
of flakiness and maintenance burden without actually catching any
issues. Maybe it's better to just get rid of it?

>
> >
> >> +                       volatile_variable++;
> >>
> >>                buf[0] = sigusr1_received ? '2' : '0';
> >> +               ASSERT_EQ(sigusr1_received, 1, "sigusr1_received");
> >> +
> >>                ASSERT_EQ(write(pipe_c2p[1], buf, 1), 1, "pipe_write");
> >>
> >>                /* wait for parent notification and exit */
> >> @@ -110,9 +116,9 @@ static void test_send_signal_common(struct perf_event_attr *attr,
> >>        ASSERT_EQ(read(pipe_c2p[0], buf, 1), 1, "pipe_read");
> >>
> >>        /* trigger the bpf send_signal */
> >> +       skel->bss->signal_thread = signal_thread;
> >>        skel->bss->pid = pid;
> >>        skel->bss->sig = SIGUSR1;
> >> -       skel->bss->signal_thread = signal_thread;
> >>
> >>        /* notify child that bpf program can send_signal now */
> >>        ASSERT_EQ(write(pipe_p2c[1], buf, 1), 1, "pipe_write");
> >> --
> >> 2.30.2
>




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux