Re: [PATCH bpf-next v3 2/8] samples: bpf: Add common infrastructure for XDP samples

Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> · Wed, 4 Aug 2021 05:02:30 +0530

On Wed, Aug 04, 2021 at 04:36:35AM IST, Daniel Borkmann wrote:
> On 7/28/21 6:55 PM, Kumar Kartikeya Dwivedi wrote:
> > This file implements some common helpers to consolidate differences in
> > features and functionality between the various XDP samples and give them
> > a consistent look, feel, and reporting capabilities.
> >
> > Some of the key features are:
> >   * A concise output format accompanied by helpful text explaining its
> >     fields.
> >   * An elaborate output format building upon the concise one, and folding
> >     out details in case of errors and staying out of view otherwise.
> >   * Extended reporting of redirect errors by capturing hits for each
> >     errno and displaying them inline (ENETDOWN, EINVAL, ENOSPC, etc.)
> >     to aid debugging.
> >   * Reporting of each xdp_exception action for all samples that use these
> >     helpers (XDP_ABORTED, etc.) to aid debugging.
> >   * Capturing per ifindex pair devmap_xmit counts for decomposing the
> >     total TX count per devmap redirection.
> >   * Ability to jump to source locations invoking tracepoints.
> >   * Faster retrieval of stats per polling interval using mmap'd eBPF
> >     array map (through .bss).
> >   * Printing driver names for devices redirecting packets.
> >   * Printing summarized total statistics for the entire session.
> >   * Ability to dynamically switch between concise and verbose mode, using
> >     SIGQUIT (Ctrl + \).
> >
> > The goal is sharing these helpers that most of the XDP samples implement
> > in some form but differently for each, lacking in some respect compared
> > to one another.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
>
> Overall I think it's okay to try to streamline the individual XDP tools, but I
> also tend to wonder whether we keep going beyond the original purpose of kernel
> samples where the main goal is to provide small, easy to hack & stand-alone code
> snippets (like in samples/seccomp ... no doubt we have it more complex in BPF
> land, but still); things people can take away and extend for their purpose. A big
> portion of the samples are still better off in selftests so they can be run in CI,
> and those that are not should generally be simplified for developers to rip out,
> modify, experiment, and build actual applications on top.
>
> > ---
> >   samples/bpf/Makefile            |    3 +-
> >   samples/bpf/xdp_sample_shared.h |   47 +
> >   samples/bpf/xdp_sample_user.c   | 1732 +++++++++++++++++++++++++++++++
> >   samples/bpf/xdp_sample_user.h   |   94 ++
>
> I would have wished that rather than a single patch to get all the boilerplate
> code in at once to have incremental improvements that refactor the individual
> sample tools along the way.
>

I see. I'll try to split these into smaller changes.

> >   4 files changed, 1875 insertions(+), 1 deletion(-)
> >   create mode 100644 samples/bpf/xdp_sample_shared.h
> >   create mode 100644 samples/bpf/xdp_sample_user.c
> >   create mode 100644 samples/bpf/xdp_sample_user.h
> >
> > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> > index 036998d11ded..d8fc3e6930f9 100644
> > --- a/samples/bpf/Makefile
> > +++ b/samples/bpf/Makefile
> > @@ -62,6 +62,7 @@ LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
> >   CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o
> >   TRACE_HELPERS := ../../tools/testing/selftests/bpf/trace_helpers.o
> > +XDP_SAMPLE := xdp_sample_user.o
> >   fds_example-objs := fds_example.o
> >   sockex1-objs := sockex1_user.o
> > @@ -210,7 +211,7 @@ TPROGS_CFLAGS += --sysroot=$(SYSROOT)
> >   TPROGS_LDFLAGS := -L$(SYSROOT)/usr/lib
> >   endif
> > -TPROGS_LDLIBS			+= $(LIBBPF) -lelf -lz
> > +TPROGS_LDLIBS			+= $(LIBBPF) -lelf -lz -lm
> >   TPROGLDLIBS_tracex4		+= -lrt
> >   TPROGLDLIBS_trace_output	+= -lrt
> >   TPROGLDLIBS_map_perf_test	+= -lrt
> > diff --git a/samples/bpf/xdp_sample_shared.h b/samples/bpf/xdp_sample_shared.h
> > new file mode 100644
> > index 000000000000..0eb3c2b526d4
> > --- /dev/null
> > +++ b/samples/bpf/xdp_sample_shared.h
> > @@ -0,0 +1,47 @@
> > +#ifndef _XDP_SAMPLE_SHARED_H
> > +#define _XDP_SAMPLE_SHARED_H
> > +
> > +/*
> > + * Relaxed load/store
> > + * __atomic_load_n/__atomic_store_n built-in is not supported for BPF target
> > + */
> > +#define NO_TEAR_LOAD(var) ({ (*(volatile typeof(var) *)&(var)); })
> > +#define NO_TEAR_STORE(var, val) ({ *((volatile typeof(var) *)&(var)) = (val); })
>
> Can't we reuse READ_ONCE() / WRITE_ONCE() from tools/include/linux/compiler.h instead
> of readding the same here again? We already rely on using the tooling include infra
> anyway for samples.
>

Ok, I'll use those.

> > +/* This does a load + store instead of the expensive atomic fetch add, but store
> > + * is still atomic so that userspace reading the value reads the old or the new
> > + * one, but not a partial store.
> > + */
> > +#define NO_TEAR_ADD(var, val)                                                  \
> > +	({                                                                     \
> > +		typeof(val) __val = (val);                                     \
> > +		if (__val)                                                     \
> > +			NO_TEAR_STORE((var), (var) + __val);                   \
> > +	})
> > +
> > +#define NO_TEAR_INC(var) NO_TEAR_ADD((var), 1)
>
> This could likely also go there.
>

Ok.

> > +#define MAX_CPUS 64
>
> Given this is a heavy rework of samples, can't we also fix up this assumption?
>

Ok, I'll use CONFIG_NR_CPUS.

> > +#define ELEMENTS_OF(x) (sizeof(x) / sizeof(*(x)))
>
> See ARRAY_SIZE() under tools/include/linux/kernel.h .
>

Ok.

> > +struct datarec {
> > +	size_t processed;
> > +	size_t dropped;
> > +	size_t issue;
> > +	union {
> > +		size_t xdp_pass;
> > +		size_t info;
> > +	};
> > +	size_t xdp_drop;
> > +	size_t xdp_redirect;
>
> Why all size_t, can't we just use __u64 for the stats?
>

READ_ONCE/WRITE_ONCE can tear for __u64 on 32-bit systems, which will lead to
printing potentially incorrect output. size_t/unsigned long shouldn't.

> > +} __attribute__((aligned(64)));
> > +
> > +struct sample_data {
> > +	struct datarec rx_cnt[MAX_CPUS];
> > +	struct datarec redirect_err_cnt[7 * MAX_CPUS];
> > +	struct datarec cpumap_enqueue_cnt[MAX_CPUS * MAX_CPUS];
> > +	struct datarec cpumap_kthread_cnt[MAX_CPUS];
> > +	struct datarec exception_cnt[6 * MAX_CPUS];
> > +	struct datarec devmap_xmit_cnt[MAX_CPUS];
> > +};
> > +
> > +#endif
> > diff --git a/samples/bpf/xdp_sample_user.c b/samples/bpf/xdp_sample_user.c
> > new file mode 100644
> > index 000000000000..06efc2bfd84e
> > --- /dev/null
> > +++ b/samples/bpf/xdp_sample_user.c
> > @@ -0,0 +1,1732 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +#define _GNU_SOURCE
> > +
> > +#include <errno.h>
> > +#include <signal.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <stdbool.h>
> > +#include <string.h>
> > +#include <unistd.h>
> > +#include <math.h>
> > +#include <locale.h>
> > +#include <sys/signalfd.h>
> > +#include <sys/resource.h>
> > +#include <sys/sysinfo.h>
> > +#include <sys/timerfd.h>
> > +#include <getopt.h>
> > +#include <net/if.h>
> > +#include <time.h>
> > +#include <linux/limits.h>
> > +#include <sys/ioctl.h>
> > +#include <net/if.h>
> > +#include <poll.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/sockios.h>
> > +#ifndef SIOCETHTOOL
> > +#define SIOCETHTOOL 0x8946
> > +#endif
> > +#include <fcntl.h>
> > +#include <arpa/inet.h>
> > +#include <linux/if_link.h>
> > +#include <sys/utsname.h>
> > +#include <bpf/bpf.h>
> > +#include <bpf/libbpf.h>
> > +#include "bpf_util.h"
> > +#include "xdp_sample_user.h"
>
> nit: Pls order the above a bit better.
>

Ok, will do.

> > +#define __sample_print(fmt, cond, printer, ...)                                \
> > +	({                                                                     \
> > +		if (cond)                                                      \
> > +			printer(fmt, ##__VA_ARGS__);                           \
> > +	})
> > +
> > +#define print_always(fmt, ...) __sample_print(fmt, 1, printf, ##__VA_ARGS__)
> > +#define print_default(fmt, ...)                                                \
> > +	__sample_print(fmt, sample.log_level & LL_DEFAULT, printf, ##__VA_ARGS__)
> > +#define __print_err(err, fmt, printer, ...)                                    \
> > +	({                                                                     \
> > +		__sample_print(fmt, err > 0 || sample.log_level & LL_DEFAULT,  \
> > +			       printer, ##__VA_ARGS__);                        \
> > +		sample.err_exp = sample.err_exp ? true : err > 0;              \
> > +	})
> > +#define print_err(err, fmt, ...) __print_err(err, fmt, printf, ##__VA_ARGS__)
> > +
> > +#define print_link_err(err, str, width, type)                                  \
> > +	__print_err(err, str, print_link, width, type)
>
> Couldn't this all be more generalized so other non-XDP samples can use it as well
> potentially?
>

It's a bit tied into what the log level means, but I'll try.

> > +#define __COLUMN(x) "%'10" x " %-13s"
> > +#define FMT_COLUMNf __COLUMN(".0f")
> > +#define FMT_COLUMNd __COLUMN("d")
> > +#define FMT_COLUMNl __COLUMN("llu")
> > +#define RX(rx) rx, "rx/s"
> > +#define PPS(pps) pps, "pkt/s"
> > +#define DROP(drop) drop, "drop/s"
> > +#define ERR(err) err, "error/s"
> > +#define HITS(hits) hits, "hit/s"
> > +#define XMIT(xmit) xmit, "xmit/s"
> > +#define PASS(pass) pass, "pass/s"
> > +#define REDIR(redir) redir, "redir/s"
> > +#define NANOSEC_PER_SEC 1000000000 /* 10^9 */
> > +
> > +#define XDP_UNKNOWN (XDP_REDIRECT + 1)
> > +#define XDP_ACTION_MAX (XDP_UNKNOWN + 1)
> > +#define XDP_REDIRECT_ERR_MAX	7
> > +
> > +enum tp_type {
> > +	TP_REDIRECT_CNT,
> > +	TP_REDIRECT_MAP_CNT,
> > +	TP_REDIRECT_ERR_CNT,
> > +	TP_REDIRECT_MAP_ERR_CNT,
> > +	TP_CPUMAP_ENQUEUE_CNT,
> > +	TP_CPUMAP_KTHREAD_CNT,
> > +	TP_EXCEPTION_CNT,
> > +	TP_DEVMAP_XMIT_CNT,
> > +	NUM_TP,
> > +};
> > +
> > +enum map_type {
> > +	MAP_DEVMAP_XMIT_MULTI,
> > +	NUM_MAP,
> > +};
> > +
> > +enum log_level {
> > +	LL_DEFAULT = 1U << 0,
> > +	LL_SIMPLE  = 1U << 1,
> > +	LL_DEBUG   = 1U << 2,
> > +};
> > +
> > +struct record {
> > +	__u64 timestamp;
> > +	struct datarec total;
> > +	struct datarec *cpu;
> > +};
> > +
> > +struct stats_record {
> > +	struct record rx_cnt;
> > +	struct record redir_err[XDP_REDIRECT_ERR_MAX];
> > +	struct record kthread;
> > +	struct record exception[XDP_ACTION_MAX];
> > +	struct record devmap_xmit;
> > +	struct hash_map *devmap_xmit_multi;
> > +	struct record enq[];
> > +};
> > +
> > +struct sample_output {
> > +	struct {
> > +		__u64 rx;
> > +		__u64 redir;
> > +		__u64 drop;
> > +		__u64 drop_xmit;
> > +		__u64 err;
> > +		__u64 xmit;
> > +	} totals;
> > +	struct {
> > +		__u64 pps;
> > +		__u64 drop;
> > +		__u64 err;
> > +	} rx_cnt;
> > +	struct {
> > +		__u64 suc;
> > +		__u64 err;
> > +	} redir_cnt;
> > +	struct {
> > +		__u64 hits;
> > +	} except_cnt;
> > +	struct {
> > +		__u64 pps;
> > +		__u64 drop;
> > +		__u64 err;
> > +		double bavg;
> > +	} xmit_cnt;
> > +};
> > +
> > +static struct {
> > +	enum log_level log_level;
> > +	struct sample_output out;
> > +	struct sample_data *data;
> > +	unsigned long interval;
> > +	int map_fd[NUM_MAP];
> > +	struct xdp_desc {
> > +		int ifindex;
> > +		__u32 prog_id;
> > +		int flags;
> > +	} xdp_progs[32];
> > +	bool err_exp;
> > +	int xdp_cnt;
> > +	int n_cpus;
> > +	int sig_fd;
> > +	int mask;
> > +} sample;
> > +
> > +static const char *xdp_redirect_err_names[XDP_REDIRECT_ERR_MAX] = {
> > +	/* Key=1 keeps unknown errors */
> > +	"Success", "Unknown", "EINVAL", "ENETDOWN", "EMSGSIZE",
> > +	"EOPNOTSUPP", "ENOSPC",
> > +};
> > +
> > +/* Keyed from Unknown */
> > +static const char *xdp_redirect_err_help[XDP_REDIRECT_ERR_MAX - 1] = {
> > +	"Unknown error",
> > +	"Invalid redirection",
> > +	"Device being redirected to is down",
> > +	"Packet length too large for device",
> > +	"Operation not supported",
> > +	"No space in ptr_ring of cpumap kthread",
> > +};
> > +
> > +static const char *xdp_action_names[XDP_ACTION_MAX] = {
> > +	[XDP_ABORTED]	= "XDP_ABORTED",
> > +	[XDP_DROP]	= "XDP_DROP",
> > +	[XDP_PASS]	= "XDP_PASS",
> > +	[XDP_TX]	= "XDP_TX",
> > +	[XDP_REDIRECT]	= "XDP_REDIRECT",
> > +	[XDP_UNKNOWN]	= "XDP_UNKNOWN",
> > +};
> > +
> > +static const char *elixir_search[NUM_TP] = {
> > +	[TP_REDIRECT_CNT] = "_trace_xdp_redirect",
> > +	[TP_REDIRECT_MAP_CNT] = "_trace_xdp_redirect_map",
> > +	[TP_REDIRECT_ERR_CNT] = "_trace_xdp_redirect_err",
> > +	[TP_REDIRECT_MAP_ERR_CNT] = "_trace_xdp_redirect_map_err",
> > +	[TP_CPUMAP_ENQUEUE_CNT] = "trace_xdp_cpumap_enqueue",
> > +	[TP_CPUMAP_KTHREAD_CNT] = "trace_xdp_cpumap_kthread",
> > +	[TP_EXCEPTION_CNT] = "trace_xdp_exception",
> > +	[TP_DEVMAP_XMIT_CNT] = "trace_xdp_devmap_xmit",
> > +};
>
> nit: inconsistent style wrt tabs on elixir_search vs xdp_action_names, generally
> like the latter is preferred.

Ok, I'll fix all instances.

> > +/* Dumb hashmap to store ifindex pairs datarec for devmap_xmit_cnt_multi */
> > +#define HASH_BUCKETS (1 << 5)
> > +#define HASH_MASK    (HASH_BUCKETS - 1)
> > +
> > +#define hash_map_for_each(entry, map)                                          \
> > +	for (int __i = 0; __i < HASH_BUCKETS; __i++)                           \
> > +		for (entry = map->buckets[__i]; entry; entry = entry->next)
> > +
> > +struct map_entry {
> > +	struct map_entry *next;
> > +	__u64 pair;
> > +	struct record *val;
> > +};
> > +
> > +struct hash_map {
> > +	struct map_entry *buckets[HASH_BUCKETS];
> > +};
> > +
> > +/* From tools/lib/bpf/hashmap.h */
>
> Should this also have same license then?
>
> > +static size_t hash_bits(size_t h, int bits)
> > +{
> > +	/* shuffle bits and return requested number of upper bits */
> > +	if (bits == 0)
> > +		return 0;
> > +
> > +#if (__SIZEOF_SIZE_T__ == __SIZEOF_LONG_LONG__)
> > +	/* LP64 case */
> > +	return (h * 11400714819323198485llu) >> (__SIZEOF_LONG_LONG__ * 8 - bits);
> > +#elif (__SIZEOF_SIZE_T__ <= __SIZEOF_LONG__)
> > +	return (h * 2654435769lu) >> (__SIZEOF_LONG__ * 8 - bits);
> > +#else
> > +#	error "Unsupported size_t size"
> > +#endif
> > +}
> > +
> > +static __u64 gettime(void)
> > +{
> > +	struct timespec t;
> > +	int res;
> > +
> > +	res = clock_gettime(CLOCK_MONOTONIC, &t);
> > +	if (res < 0) {
> > +		fprintf(stderr, "Error with gettimeofday! (%i)\n", res);
> > +		exit(EXIT_FAIL);
> > +	}
> > +	return (__u64) t.tv_sec * NANOSEC_PER_SEC + t.tv_nsec;
> > +}
> > +
> > +int sample_get_cpus(void)
> > +{
> > +	int cpus = get_nprocs_conf();
> > +	return cpus < MAX_CPUS ? cpus : MAX_CPUS;
> > +}
> > +
> > +static int hash_map_add(struct hash_map *map, __u64 pair, struct record *val)
> > +{
> > +	struct map_entry *e, **b;
> > +
> > +	e = calloc(1, sizeof(*e));
> > +	if (!e)
> > +		return -ENOMEM;
> > +
> > +	val->timestamp = gettime();
> > +
> > +	e->pair = pair;
> > +	e->val = val;
> > +
> > +	b = &map->buckets[hash_bits((size_t)pair, 5) & HASH_MASK];
> > +	if (*b)
> > +		e->next = *b;
> > +	*b = e;
> > +
> > +	return 0;
> > +}
> > +
> > +static struct map_entry *hash_map_find(struct hash_map *map, __u64 pair)
> > +{
> > +	struct map_entry *e;
> > +
> > +	e = map->buckets[hash_bits((size_t)pair, 5) & HASH_MASK];
> > +	for (; e; e = e->next) {
> > +		if (e->pair == pair)
> > +			break;
> > +	}
> > +
> > +	return e;
> > +}
> > +
> > +static void hash_map_destroy(struct hash_map *map)
> > +{
> > +	if (!map)
> > +		return;
> > +
> > +	for (int i = 0; i < HASH_BUCKETS; i++) {
> > +		struct map_entry *e, *f;
> > +
> > +		e = map->buckets[i];
> > +		while (e) {
> > +			f = e;
> > +			e = e->next;
> > +			free(f->val->cpu);
> > +			free(f->val);
> > +			free(f);
> > +		}
> > +	}
> > +}
>
> There's also an existing hashtable implementation under
> tools/include/linux/hashtable.h, can't we reuse that one?
>

I didn't notice this. I'll switch to using it.

> [...]
> > +static const char *make_url(enum tp_type i)
> > +{
> > +	const char *key = elixir_search[i];
> > +	static struct utsname uts = {};
> > +	static char url[128];
> > +	static bool uts_init;
> > +	int maj, min;
> > +	char c[2];
> > +
> > +	if (!uts_init) {
> > +		if (uname(&uts) < 0)
> > +			return NULL;
> > +		uts_init = true;
> > +	}
> > +
> > +	if (!key || sscanf(uts.release, "%d.%d%1s", &maj, &min, c) != 3)
> > +		return NULL;
> > +
> > +	snprintf(url, sizeof(url), "https://elixir.bootlin.com/linux/v%d.%d/C/ident/%s";,
> > +		 maj, min, key);
> > +
> > +	return url;
> > +}
>
> I don't think this whole make_url() belongs in here with the link construction to
> elixir.bootlin.com, it's just confusing when you have downstream trees with custom
> backports.
>

Right, it only considers the upstream kernel. I guess I'll just drop it.

[...]

I'll also address everything else, thanks!

--
Kartikeya