On Wed, Jan 24, 2024 at 4:48 AM David Vernet <void@xxxxxxxxxxxxx> wrote:
>
> On Tue, Jan 23, 2024 at 11:27:16PM +0800, Yafang Shao wrote:
> > Within the BPF program, we leverage the cgroup iterator to iterate through
> > percpu runqueue data, specifically the 'nr_running' metric. Subsequently
> > we expose this data to userspace by means of a sequence file.
> >
> > The CPU affinity for the cpumask is determined by the PID of a task:
> >
> > - PID of the init task (PID 1)
> >   We typically don't set CPU affinity for init task and thus we can iterate
> >   across all possible CPUs using the init task. However, in scenarios where
> >   you've set CPU affinity for the init task, you should set your
> >   current-task's cpu affinity to all possible CPUs and then proceed to
> >   iterate through all possible CPUs using the current task.
> > - PID of a task with defined CPU affinity
> >   The aim here is to iterate through a specific cpumask. This scenario
> >   aligns with tasks residing within a cpuset cgroup.
> > - Invalid PID (e.g., PID -1)
> >   No cpumask is available in this case.
> >
> > The result as follows,
> >   #65/1    cpumask_iter/init_pid:OK
> >   #65/2    cpumask_iter/invalid_pid:OK
> >   #65/3    cpumask_iter/self_pid_one_cpu:OK
> >   #65/4    cpumask_iter/self_pid_multi_cpus:OK
> >   #65      cpumask_iter:OK
> >   Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED
> >
> > CONFIG_PSI=y is required for this testcase.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > ---
> >  tools/testing/selftests/bpf/config            |   1 +
> >  .../selftests/bpf/prog_tests/cpumask_iter.c   | 130 ++++++++++++++++++
> >  .../selftests/bpf/progs/cpumask_common.h      |   3 +
> >  .../selftests/bpf/progs/test_cpumask_iter.c   |  56 ++++++++
> >  4 files changed, 190 insertions(+)
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/test_cpumask_iter.c
> >
> > diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
> > index c125c441abc7..9c42568ed376 100644
> > --- a/tools/testing/selftests/bpf/config
> > +++ b/tools/testing/selftests/bpf/config
> > @@ -78,6 +78,7 @@ CONFIG_NF_CONNTRACK_MARK=y
> >  CONFIG_NF_DEFRAG_IPV4=y
> >  CONFIG_NF_DEFRAG_IPV6=y
> >  CONFIG_NF_NAT=y
> > +CONFIG_PSI=y
> >  CONFIG_RC_CORE=y
> >  CONFIG_SECURITY=y
> >  CONFIG_SECURITYFS=y
> > diff --git a/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c b/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
> > new file mode 100644
> > index 000000000000..1db4efc57c5f
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/cpumask_iter.c
> > @@ -0,0 +1,130 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright (c) 2024 Yafang Shao <laoar.shao@xxxxxxxxx> */
> > +
> > +#define _GNU_SOURCE
> > +#include <sched.h>
> > +#include <stdio.h>
> > +#include <unistd.h>
> > +
> > +#include <test_progs.h>
> > +#include "cgroup_helpers.h"
> > +#include "test_cpumask_iter.skel.h"
> > +
> > +static void verify_percpu_data(struct bpf_link *link, int nr_cpu_exp, int nr_running_exp)
>
> In general it seems wrong for this to be void. If we fail to verify
> something, why would we want to keep chugging along in the caller? That
> seems like it would just add a bunch of test failure noise for all the
> stuff that fails after.

makes sense.
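Something like the below, perhaps (rough sketch, untested); the callers
would then bail out instead of carrying on:

static int verify_percpu_data(struct bpf_link *link, int nr_cpu_exp, int nr_running_exp)
{
        int iter_fd;

        iter_fd = bpf_iter_create(bpf_link__fd(link));
        if (!ASSERT_GE(iter_fd, 0, "iter_fd"))
                return -1;

        /* read the seq file and run the ASSERT_*() checks as before,
         * returning -1 as soon as one of them fails
         */

        close(iter_fd);
        return 0;
}

        /* in test_cpumask_iter() */
        if (verify_percpu_data(link, cnt, cnt == nr_possible ? 1 : 0))
                goto free_link;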
>
> > +{
> > +        int iter_fd, len, item, nr_running, psi_running, nr_cpus;
> > +        char buf[128];
> > +        size_t left;
> > +        char *p;
> > +
> > +        iter_fd = bpf_iter_create(bpf_link__fd(link));
> > +        if (!ASSERT_GE(iter_fd, 0, "iter_fd"))
> > +                return;
> > +
> > +        memset(buf, 0, sizeof(buf));
> > +        left = ARRAY_SIZE(buf);
> > +        p = buf;
> > +        while ((len = read(iter_fd, p, left)) > 0) {
> > +                p += len;
> > +                left -= len;
> > +        }
> > +
> > +        item = sscanf(buf, "nr_running %u nr_cpus %u psi_running %u\n",
> > +                      &nr_running, &nr_cpus, &psi_running);
>
> I don't think there's any reason we should need to do string parsing
> like this. We can set all of these variables from BPF and then check
> them in the skeleton from user space.

I agree that checking skeleton variables from user space would be
simpler. However, the seq file is intentional here: part of what this
selftest demonstrates is how to expose the data to userspace through a
seq file.

>
> > +        if (nr_cpu_exp == -1) {
> > +                ASSERT_EQ(item, -1, "seq_format");
> > +                goto out;
> > +        }
> > +
> > +        ASSERT_EQ(item, 3, "seq_format");
> > +        ASSERT_GE(nr_running, nr_running_exp, "nr_running");
> > +        ASSERT_GE(psi_running, nr_running_exp, "psi_running");
> > +        ASSERT_EQ(nr_cpus, nr_cpu_exp, "nr_cpus");
> > +
> > +out:
> > +        close(iter_fd);
> > +}
> > +
> > +void test_cpumask_iter(void)
>
> We should also have negative testcases that verify that we fail to load
> with bogus / unsafe inputs. See e.g. [0].

will add it.

>
> [0]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/cpumask_failure.c
>
> > +{
> > +        DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
> > +        int nr_possible, cgrp_fd, pid, err, cnt, i;
> > +        struct test_cpumask_iter *skel;
> > +        union bpf_iter_link_info linfo;
> > +        int cpu_ids[] = {1, 3, 4, 5};
> > +        struct bpf_link *link;
> > +        cpu_set_t set;
> > +
> > +        skel = test_cpumask_iter__open_and_load();
> > +        if (!ASSERT_OK_PTR(skel, "test_for_each_cpu__open_and_load"))
> > +                return;
> > +
> > +        if (setup_cgroup_environment())
> > +                goto destroy;
> > +
> > +        /* Utilize the cgroup iter */
> > +        cgrp_fd = get_root_cgroup();
> > +        if (!ASSERT_GE(cgrp_fd, 0, "create cgrp"))
> > +                goto cleanup;
> > +
> > +        memset(&linfo, 0, sizeof(linfo));
> > +        linfo.cgroup.cgroup_fd = cgrp_fd;
> > +        linfo.cgroup.order = BPF_CGROUP_ITER_SELF_ONLY;
> > +        opts.link_info = &linfo;
> > +        opts.link_info_len = sizeof(linfo);
> > +
> > +        link = bpf_program__attach_iter(skel->progs.cpu_cgroup, &opts);
> > +        if (!ASSERT_OK_PTR(link, "attach_iter"))
> > +                goto close_fd;
> > +
> > +        skel->bss->target_pid = 1;
> > +        /* In case init task is set CPU affinity */
> > +        err = sched_getaffinity(1, sizeof(set), &set);
> > +        if (!ASSERT_OK(err, "setaffinity"))
> > +                goto free_link;
> > +
> > +        cnt = CPU_COUNT(&set);
> > +        nr_possible = bpf_num_possible_cpus();
> > +        if (test__start_subtest("init_pid"))
>
> Can you instead please do what we do in [1] and have each testcase be
> self contained and separately invoked? Doing things the way you have it
> in this patch has a few drawbacks:
>
> 1. The testcases are now all interdependent. If one fails we continue
>    onto others even though we may or may not have any reason to believe
>    that the system state is sane or what we expect. If the later
>    testcases fail, what does it even mean? Or if they pass, is that
>    expected? We should be setting up and tearing down whatever state we
>    need between testcases to have confidence that the signal from each
>    testcase is independent of whatever signal was in the previous one.
>
> 2. This is also confusing because we're doing a bunch of validation and
>    testing even if test__start_subtest() returns false. It's unclear
>    what the expecations are in such a case, and I think it's a lot more
>    logical if you just don't do anything and skip the testcase.
>
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/cpumask.c#n8

Will check it. Appreciate your insights. Thank you.
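Probably something along these lines (structure sketch only, not tested;
attach_cgroup_iter() would be a new helper wrapping the cgroup-fd and
link_info boilerplate from this patch):

static void subtest_init_pid(void)
{
        struct test_cpumask_iter *skel;
        struct bpf_link *link;

        skel = test_cpumask_iter__open_and_load();
        if (!ASSERT_OK_PTR(skel, "open_and_load"))
                return;

        link = attach_cgroup_iter(skel);
        if (!ASSERT_OK_PTR(link, "attach_iter"))
                goto destroy;

        /* simplified: assumes the init task has no CPU affinity set */
        skel->bss->target_pid = 1;
        verify_percpu_data(link, bpf_num_possible_cpus(), 1);

        bpf_link__destroy(link);
destroy:
        test_cpumask_iter__destroy(skel);
}

void test_cpumask_iter(void)
{
        if (test__start_subtest("init_pid"))
                subtest_init_pid();
        /* likewise for invalid_pid, self_pid_one_cpu and self_pid_multi_cpus */
}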
>
> > +                /* current task is running. */
> > +                verify_percpu_data(link, cnt, cnt == nr_possible ? 1 : 0);
> > +
> > +        skel->bss->target_pid = -1;
> > +        if (test__start_subtest("invalid_pid"))
> > +                verify_percpu_data(link, -1, -1);
> > +
> > +        pid = getpid();
> > +        skel->bss->target_pid = pid;
> > +        CPU_ZERO(&set);
> > +        CPU_SET(0, &set);
> > +        err = sched_setaffinity(pid, sizeof(set), &set);
> > +        if (!ASSERT_OK(err, "setaffinity"))
> > +                goto free_link;
> > +
> > +        if (test__start_subtest("self_pid_one_cpu"))
> > +                verify_percpu_data(link, 1, 1);
> > +
> > +        /* Assume there are at least 8 CPUs on the testbed */
> > +        if (nr_possible < 8)
> > +                goto free_link;
> > +
> > +        CPU_ZERO(&set);
> > +        /* Set the CPU affinitiy: 1,3-5 */
> > +        for (i = 0; i < ARRAY_SIZE(cpu_ids); i++)
> > +                CPU_SET(cpu_ids[i], &set);
> > +        err = sched_setaffinity(pid, sizeof(set), &set);
> > +        if (!ASSERT_OK(err, "setaffinity"))
> > +                goto free_link;
> > +
> > +        if (test__start_subtest("self_pid_multi_cpus"))
> > +                verify_percpu_data(link, ARRAY_SIZE(cpu_ids), 1);
> > +
> > +free_link:
> > +        bpf_link__destroy(link);
> > +close_fd:
> > +        close(cgrp_fd);
> > +cleanup:
> > +        cleanup_cgroup_environment();
> > +destroy:
> > +        test_cpumask_iter__destroy(skel);
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/cpumask_common.h b/tools/testing/selftests/bpf/progs/cpumask_common.h
> > index 0cd4aebb97cf..cdb9dc95e9d9 100644
> > --- a/tools/testing/selftests/bpf/progs/cpumask_common.h
> > +++ b/tools/testing/selftests/bpf/progs/cpumask_common.h
> > @@ -55,6 +55,9 @@ void bpf_cpumask_copy(struct bpf_cpumask *dst, const struct cpumask *src) __ksym
> >  u32 bpf_cpumask_any_distribute(const struct cpumask *src) __ksym;
> >  u32 bpf_cpumask_any_and_distribute(const struct cpumask *src1, const struct cpumask *src2) __ksym;
> >  u32 bpf_cpumask_weight(const struct cpumask *cpumask) __ksym;
> > +int bpf_iter_cpumask_new(struct bpf_iter_cpumask *it, const struct cpumask *mask) __ksym;
> > +int *bpf_iter_cpumask_next(struct bpf_iter_cpumask *it) __ksym;
> > +void bpf_iter_cpumask_destroy(struct bpf_iter_cpumask *it) __ksym;
> >
> >  void bpf_rcu_read_lock(void) __ksym;
> >  void bpf_rcu_read_unlock(void) __ksym;
> > diff --git a/tools/testing/selftests/bpf/progs/test_cpumask_iter.c b/tools/testing/selftests/bpf/progs/test_cpumask_iter.c
> > new file mode 100644
> > index 000000000000..cb8b8359516b
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/test_cpumask_iter.c
> > @@ -0,0 +1,56 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright (c) 2024 Yafang Shao <laoar.shao@xxxxxxxxx> */
> > +
> > +#include "vmlinux.h"
> > +#include <bpf/bpf_helpers.h>
> > +#include <bpf/bpf_tracing.h>
> > +
> > +#include "task_kfunc_common.h"
> > +#include "cpumask_common.h"
> > +
> > +extern const struct psi_group_cpu system_group_pcpu __ksym __weak;
> > +extern const struct rq runqueues __ksym __weak;
> > +
> > +int target_pid;
> > +
> > +SEC("iter.s/cgroup")
> > +int BPF_PROG(cpu_cgroup, struct bpf_iter_meta *meta, struct cgroup *cgrp)
> > +{
> > +        u32 nr_running = 0, psi_nr_running = 0, nr_cpus = 0;
> > +        struct psi_group_cpu *groupc;
> > +        struct task_struct *p;
> > +        struct rq *rq;
> > +        int *cpu;
> > +
> > +        /* epilogue */
> > +        if (cgrp == NULL)
> > +                return 0;
> > +
> > +        bpf_rcu_read_lock();
>
> Why is this required? p->cpus_ptr should be trusted given that @p is
> trusted, no? Does this fail to load if you don't have all of this in an
> RCU read region?

This is a sleepable prog (iter.s), so we have to take the RCU read lock
explicitly; otherwise the program fails to load.

>
> > +        p = bpf_task_from_pid(target_pid);
> > +        if (!p) {
> > +                bpf_rcu_read_unlock();
> > +                return 1;
> > +        }
> > +
> > +        bpf_for_each(cpumask, cpu, p->cpus_ptr) {
> > +                rq = (struct rq *)bpf_per_cpu_ptr(&runqueues, *cpu);
> > +                if (!rq)
> > +                        continue;
>
> If this happens we should set an error code and check it in user space.

sure, will do it.
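Maybe something like this (sketch; 'nr_err' is just a name I'm making up):

In the BPF prog, count the failures in a new global,

u32 nr_err;

        rq = (struct rq *)bpf_per_cpu_ptr(&runqueues, *cpu);
        if (!rq) {
                nr_err += 1;
                continue;
        }

and then check it from the prog_tests side after reading the iter:

        ASSERT_EQ(skel->bss->nr_err, 0, "nr_err");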
>
> > +                nr_running += rq->nr_running;
> > +                nr_cpus += 1;
> > +
> > +                groupc = (struct psi_group_cpu *)bpf_per_cpu_ptr(&system_group_pcpu, *cpu);
> > +                if (!groupc)
> > +                        continue;
>
> Same here.

will do it.

>
> > +                psi_nr_running += groupc->tasks[NR_RUNNING];
> > +        }
> > +        BPF_SEQ_PRINTF(meta->seq, "nr_running %u nr_cpus %u psi_running %u\n",
> > +                       nr_running, nr_cpus, psi_nr_running);
> > +
> > +        bpf_task_release(p);
> > +        bpf_rcu_read_unlock();
> > +        return 0;
> > +}
> > +
> > +char _license[] SEC("license") = "GPL";
>
> Can you please move this to the top of the file?

will do it.

-- 
Regards
Yafang