On Wed, Aug 04, 2021 at 04:36:35AM IST, Daniel Borkmann wrote: > On 7/28/21 6:55 PM, Kumar Kartikeya Dwivedi wrote: > > This file implements some common helpers to consolidate differences in > > features and functionality between the various XDP samples and give them > > a consistent look, feel, and reporting capabilities. > > > > Some of the key features are: > > * A concise output format accompanied by helpful text explaining its > > fields. > > * An elaborate output format building upon the concise one, and folding > > out details in case of errors and staying out of view otherwise. > > * Extended reporting of redirect errors by capturing hits for each > > errno and displaying them inline (ENETDOWN, EINVAL, ENOSPC, etc.) > > to aid debugging. > > * Reporting of each xdp_exception action for all samples that use these > > helpers (XDP_ABORTED, etc.) to aid debugging. > > * Capturing per ifindex pair devmap_xmit counts for decomposing the > > total TX count per devmap redirection. > > * Ability to jump to source locations invoking tracepoints. > > * Faster retrieval of stats per polling interval using mmap'd eBPF > > array map (through .bss). > > * Printing driver names for devices redirecting packets. > > * Printing summarized total statistics for the entire session. > > * Ability to dynamically switch between concise and verbose mode, using > > SIGQUIT (Ctrl + \). > > > > The goal is sharing these helpers that most of the XDP samples implement > > in some form but differently for each, lacking in some respect compared > > to one another. > > > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> > > Overall I think it's okay to try to streamline the individual XDP tools, but I > also tend to wonder whether we keep going beyond the original purpose of kernel > samples where the main goal is to provide small, easy to hack & stand-alone code > snippets (like in samples/seccomp ... no doubt we have it more complex in BPF > land, but still); things people can take away and extend for their purpose. A big > portion of the samples are still better off in selftests so they can be run in CI, > and those that are not should generally be simplified for developers to rip out, > modify, experiment, and build actual applications on top. > > > --- > > samples/bpf/Makefile | 3 +- > > samples/bpf/xdp_sample_shared.h | 47 + > > samples/bpf/xdp_sample_user.c | 1732 +++++++++++++++++++++++++++++++ > > samples/bpf/xdp_sample_user.h | 94 ++ > > I would have wished that rather than a single patch to get all the boilerplate > code in at once to have incremental improvements that refactor the individual > sample tools along the way. > I see. I'll try to split these into smaller changes. > > 4 files changed, 1875 insertions(+), 1 deletion(-) > > create mode 100644 samples/bpf/xdp_sample_shared.h > > create mode 100644 samples/bpf/xdp_sample_user.c > > create mode 100644 samples/bpf/xdp_sample_user.h > > > > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile > > index 036998d11ded..d8fc3e6930f9 100644 > > --- a/samples/bpf/Makefile > > +++ b/samples/bpf/Makefile > > @@ -62,6 +62,7 @@ LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a > > CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o > > TRACE_HELPERS := ../../tools/testing/selftests/bpf/trace_helpers.o > > +XDP_SAMPLE := xdp_sample_user.o > > fds_example-objs := fds_example.o > > sockex1-objs := sockex1_user.o > > @@ -210,7 +211,7 @@ TPROGS_CFLAGS += --sysroot=$(SYSROOT) > > TPROGS_LDFLAGS := -L$(SYSROOT)/usr/lib > > endif > > -TPROGS_LDLIBS += $(LIBBPF) -lelf -lz > > +TPROGS_LDLIBS += $(LIBBPF) -lelf -lz -lm > > TPROGLDLIBS_tracex4 += -lrt > > TPROGLDLIBS_trace_output += -lrt > > TPROGLDLIBS_map_perf_test += -lrt > > diff --git a/samples/bpf/xdp_sample_shared.h b/samples/bpf/xdp_sample_shared.h > > new file mode 100644 > > index 000000000000..0eb3c2b526d4 > > --- /dev/null > > +++ b/samples/bpf/xdp_sample_shared.h > > @@ -0,0 +1,47 @@ > > +#ifndef _XDP_SAMPLE_SHARED_H > > +#define _XDP_SAMPLE_SHARED_H > > + > > +/* > > + * Relaxed load/store > > + * __atomic_load_n/__atomic_store_n built-in is not supported for BPF target > > + */ > > +#define NO_TEAR_LOAD(var) ({ (*(volatile typeof(var) *)&(var)); }) > > +#define NO_TEAR_STORE(var, val) ({ *((volatile typeof(var) *)&(var)) = (val); }) > > Can't we reuse READ_ONCE() / WRITE_ONCE() from tools/include/linux/compiler.h instead > of readding the same here again? We already rely on using the tooling include infra > anyway for samples. > Ok, I'll use those. > > +/* This does a load + store instead of the expensive atomic fetch add, but store > > + * is still atomic so that userspace reading the value reads the old or the new > > + * one, but not a partial store. > > + */ > > +#define NO_TEAR_ADD(var, val) \ > > + ({ \ > > + typeof(val) __val = (val); \ > > + if (__val) \ > > + NO_TEAR_STORE((var), (var) + __val); \ > > + }) > > + > > +#define NO_TEAR_INC(var) NO_TEAR_ADD((var), 1) > > This could likely also go there. > Ok. > > +#define MAX_CPUS 64 > > Given this is a heavy rework of samples, can't we also fix up this assumption? > Ok, I'll use CONFIG_NR_CPUS. > > +#define ELEMENTS_OF(x) (sizeof(x) / sizeof(*(x))) > > See ARRAY_SIZE() under tools/include/linux/kernel.h . > Ok. > > +struct datarec { > > + size_t processed; > > + size_t dropped; > > + size_t issue; > > + union { > > + size_t xdp_pass; > > + size_t info; > > + }; > > + size_t xdp_drop; > > + size_t xdp_redirect; > > Why all size_t, can't we just use __u64 for the stats? > READ_ONCE/WRITE_ONCE can tear for __u64 on 32-bit systems, which will lead to printing potentially incorrect output. size_t/unsigned long shouldn't. > > +} __attribute__((aligned(64))); > > + > > +struct sample_data { > > + struct datarec rx_cnt[MAX_CPUS]; > > + struct datarec redirect_err_cnt[7 * MAX_CPUS]; > > + struct datarec cpumap_enqueue_cnt[MAX_CPUS * MAX_CPUS]; > > + struct datarec cpumap_kthread_cnt[MAX_CPUS]; > > + struct datarec exception_cnt[6 * MAX_CPUS]; > > + struct datarec devmap_xmit_cnt[MAX_CPUS]; > > +}; > > + > > +#endif > > diff --git a/samples/bpf/xdp_sample_user.c b/samples/bpf/xdp_sample_user.c > > new file mode 100644 > > index 000000000000..06efc2bfd84e > > --- /dev/null > > +++ b/samples/bpf/xdp_sample_user.c > > @@ -0,0 +1,1732 @@ > > +// SPDX-License-Identifier: GPL-2.0-only > > +#define _GNU_SOURCE > > + > > +#include <errno.h> > > +#include <signal.h> > > +#include <stdio.h> > > +#include <stdlib.h> > > +#include <stdbool.h> > > +#include <string.h> > > +#include <unistd.h> > > +#include <math.h> > > +#include <locale.h> > > +#include <sys/signalfd.h> > > +#include <sys/resource.h> > > +#include <sys/sysinfo.h> > > +#include <sys/timerfd.h> > > +#include <getopt.h> > > +#include <net/if.h> > > +#include <time.h> > > +#include <linux/limits.h> > > +#include <sys/ioctl.h> > > +#include <net/if.h> > > +#include <poll.h> > > +#include <linux/ethtool.h> > > +#include <linux/sockios.h> > > +#ifndef SIOCETHTOOL > > +#define SIOCETHTOOL 0x8946 > > +#endif > > +#include <fcntl.h> > > +#include <arpa/inet.h> > > +#include <linux/if_link.h> > > +#include <sys/utsname.h> > > +#include <bpf/bpf.h> > > +#include <bpf/libbpf.h> > > +#include "bpf_util.h" > > +#include "xdp_sample_user.h" > > nit: Pls order the above a bit better. > Ok, will do. > > +#define __sample_print(fmt, cond, printer, ...) \ > > + ({ \ > > + if (cond) \ > > + printer(fmt, ##__VA_ARGS__); \ > > + }) > > + > > +#define print_always(fmt, ...) __sample_print(fmt, 1, printf, ##__VA_ARGS__) > > +#define print_default(fmt, ...) \ > > + __sample_print(fmt, sample.log_level & LL_DEFAULT, printf, ##__VA_ARGS__) > > +#define __print_err(err, fmt, printer, ...) \ > > + ({ \ > > + __sample_print(fmt, err > 0 || sample.log_level & LL_DEFAULT, \ > > + printer, ##__VA_ARGS__); \ > > + sample.err_exp = sample.err_exp ? true : err > 0; \ > > + }) > > +#define print_err(err, fmt, ...) __print_err(err, fmt, printf, ##__VA_ARGS__) > > + > > +#define print_link_err(err, str, width, type) \ > > + __print_err(err, str, print_link, width, type) > > Couldn't this all be more generalized so other non-XDP samples can use it as well > potentially? > It's a bit tied into what the log level means, but I'll try. > > +#define __COLUMN(x) "%'10" x " %-13s" > > +#define FMT_COLUMNf __COLUMN(".0f") > > +#define FMT_COLUMNd __COLUMN("d") > > +#define FMT_COLUMNl __COLUMN("llu") > > +#define RX(rx) rx, "rx/s" > > +#define PPS(pps) pps, "pkt/s" > > +#define DROP(drop) drop, "drop/s" > > +#define ERR(err) err, "error/s" > > +#define HITS(hits) hits, "hit/s" > > +#define XMIT(xmit) xmit, "xmit/s" > > +#define PASS(pass) pass, "pass/s" > > +#define REDIR(redir) redir, "redir/s" > > +#define NANOSEC_PER_SEC 1000000000 /* 10^9 */ > > + > > +#define XDP_UNKNOWN (XDP_REDIRECT + 1) > > +#define XDP_ACTION_MAX (XDP_UNKNOWN + 1) > > +#define XDP_REDIRECT_ERR_MAX 7 > > + > > +enum tp_type { > > + TP_REDIRECT_CNT, > > + TP_REDIRECT_MAP_CNT, > > + TP_REDIRECT_ERR_CNT, > > + TP_REDIRECT_MAP_ERR_CNT, > > + TP_CPUMAP_ENQUEUE_CNT, > > + TP_CPUMAP_KTHREAD_CNT, > > + TP_EXCEPTION_CNT, > > + TP_DEVMAP_XMIT_CNT, > > + NUM_TP, > > +}; > > + > > +enum map_type { > > + MAP_DEVMAP_XMIT_MULTI, > > + NUM_MAP, > > +}; > > + > > +enum log_level { > > + LL_DEFAULT = 1U << 0, > > + LL_SIMPLE = 1U << 1, > > + LL_DEBUG = 1U << 2, > > +}; > > + > > +struct record { > > + __u64 timestamp; > > + struct datarec total; > > + struct datarec *cpu; > > +}; > > + > > +struct stats_record { > > + struct record rx_cnt; > > + struct record redir_err[XDP_REDIRECT_ERR_MAX]; > > + struct record kthread; > > + struct record exception[XDP_ACTION_MAX]; > > + struct record devmap_xmit; > > + struct hash_map *devmap_xmit_multi; > > + struct record enq[]; > > +}; > > + > > +struct sample_output { > > + struct { > > + __u64 rx; > > + __u64 redir; > > + __u64 drop; > > + __u64 drop_xmit; > > + __u64 err; > > + __u64 xmit; > > + } totals; > > + struct { > > + __u64 pps; > > + __u64 drop; > > + __u64 err; > > + } rx_cnt; > > + struct { > > + __u64 suc; > > + __u64 err; > > + } redir_cnt; > > + struct { > > + __u64 hits; > > + } except_cnt; > > + struct { > > + __u64 pps; > > + __u64 drop; > > + __u64 err; > > + double bavg; > > + } xmit_cnt; > > +}; > > + > > +static struct { > > + enum log_level log_level; > > + struct sample_output out; > > + struct sample_data *data; > > + unsigned long interval; > > + int map_fd[NUM_MAP]; > > + struct xdp_desc { > > + int ifindex; > > + __u32 prog_id; > > + int flags; > > + } xdp_progs[32]; > > + bool err_exp; > > + int xdp_cnt; > > + int n_cpus; > > + int sig_fd; > > + int mask; > > +} sample; > > + > > +static const char *xdp_redirect_err_names[XDP_REDIRECT_ERR_MAX] = { > > + /* Key=1 keeps unknown errors */ > > + "Success", "Unknown", "EINVAL", "ENETDOWN", "EMSGSIZE", > > + "EOPNOTSUPP", "ENOSPC", > > +}; > > + > > +/* Keyed from Unknown */ > > +static const char *xdp_redirect_err_help[XDP_REDIRECT_ERR_MAX - 1] = { > > + "Unknown error", > > + "Invalid redirection", > > + "Device being redirected to is down", > > + "Packet length too large for device", > > + "Operation not supported", > > + "No space in ptr_ring of cpumap kthread", > > +}; > > + > > +static const char *xdp_action_names[XDP_ACTION_MAX] = { > > + [XDP_ABORTED] = "XDP_ABORTED", > > + [XDP_DROP] = "XDP_DROP", > > + [XDP_PASS] = "XDP_PASS", > > + [XDP_TX] = "XDP_TX", > > + [XDP_REDIRECT] = "XDP_REDIRECT", > > + [XDP_UNKNOWN] = "XDP_UNKNOWN", > > +}; > > + > > +static const char *elixir_search[NUM_TP] = { > > + [TP_REDIRECT_CNT] = "_trace_xdp_redirect", > > + [TP_REDIRECT_MAP_CNT] = "_trace_xdp_redirect_map", > > + [TP_REDIRECT_ERR_CNT] = "_trace_xdp_redirect_err", > > + [TP_REDIRECT_MAP_ERR_CNT] = "_trace_xdp_redirect_map_err", > > + [TP_CPUMAP_ENQUEUE_CNT] = "trace_xdp_cpumap_enqueue", > > + [TP_CPUMAP_KTHREAD_CNT] = "trace_xdp_cpumap_kthread", > > + [TP_EXCEPTION_CNT] = "trace_xdp_exception", > > + [TP_DEVMAP_XMIT_CNT] = "trace_xdp_devmap_xmit", > > +}; > > nit: inconsistent style wrt tabs on elixir_search vs xdp_action_names, generally > like the latter is preferred. Ok, I'll fix all instances. > > +/* Dumb hashmap to store ifindex pairs datarec for devmap_xmit_cnt_multi */ > > +#define HASH_BUCKETS (1 << 5) > > +#define HASH_MASK (HASH_BUCKETS - 1) > > + > > +#define hash_map_for_each(entry, map) \ > > + for (int __i = 0; __i < HASH_BUCKETS; __i++) \ > > + for (entry = map->buckets[__i]; entry; entry = entry->next) > > + > > +struct map_entry { > > + struct map_entry *next; > > + __u64 pair; > > + struct record *val; > > +}; > > + > > +struct hash_map { > > + struct map_entry *buckets[HASH_BUCKETS]; > > +}; > > + > > +/* From tools/lib/bpf/hashmap.h */ > > Should this also have same license then? > > > +static size_t hash_bits(size_t h, int bits) > > +{ > > + /* shuffle bits and return requested number of upper bits */ > > + if (bits == 0) > > + return 0; > > + > > +#if (__SIZEOF_SIZE_T__ == __SIZEOF_LONG_LONG__) > > + /* LP64 case */ > > + return (h * 11400714819323198485llu) >> (__SIZEOF_LONG_LONG__ * 8 - bits); > > +#elif (__SIZEOF_SIZE_T__ <= __SIZEOF_LONG__) > > + return (h * 2654435769lu) >> (__SIZEOF_LONG__ * 8 - bits); > > +#else > > +# error "Unsupported size_t size" > > +#endif > > +} > > + > > +static __u64 gettime(void) > > +{ > > + struct timespec t; > > + int res; > > + > > + res = clock_gettime(CLOCK_MONOTONIC, &t); > > + if (res < 0) { > > + fprintf(stderr, "Error with gettimeofday! (%i)\n", res); > > + exit(EXIT_FAIL); > > + } > > + return (__u64) t.tv_sec * NANOSEC_PER_SEC + t.tv_nsec; > > +} > > + > > +int sample_get_cpus(void) > > +{ > > + int cpus = get_nprocs_conf(); > > + return cpus < MAX_CPUS ? cpus : MAX_CPUS; > > +} > > + > > +static int hash_map_add(struct hash_map *map, __u64 pair, struct record *val) > > +{ > > + struct map_entry *e, **b; > > + > > + e = calloc(1, sizeof(*e)); > > + if (!e) > > + return -ENOMEM; > > + > > + val->timestamp = gettime(); > > + > > + e->pair = pair; > > + e->val = val; > > + > > + b = &map->buckets[hash_bits((size_t)pair, 5) & HASH_MASK]; > > + if (*b) > > + e->next = *b; > > + *b = e; > > + > > + return 0; > > +} > > + > > +static struct map_entry *hash_map_find(struct hash_map *map, __u64 pair) > > +{ > > + struct map_entry *e; > > + > > + e = map->buckets[hash_bits((size_t)pair, 5) & HASH_MASK]; > > + for (; e; e = e->next) { > > + if (e->pair == pair) > > + break; > > + } > > + > > + return e; > > +} > > + > > +static void hash_map_destroy(struct hash_map *map) > > +{ > > + if (!map) > > + return; > > + > > + for (int i = 0; i < HASH_BUCKETS; i++) { > > + struct map_entry *e, *f; > > + > > + e = map->buckets[i]; > > + while (e) { > > + f = e; > > + e = e->next; > > + free(f->val->cpu); > > + free(f->val); > > + free(f); > > + } > > + } > > +} > > There's also an existing hashtable implementation under > tools/include/linux/hashtable.h, can't we reuse that one? > I didn't notice this. I'll switch to using it. > [...] > > +static const char *make_url(enum tp_type i) > > +{ > > + const char *key = elixir_search[i]; > > + static struct utsname uts = {}; > > + static char url[128]; > > + static bool uts_init; > > + int maj, min; > > + char c[2]; > > + > > + if (!uts_init) { > > + if (uname(&uts) < 0) > > + return NULL; > > + uts_init = true; > > + } > > + > > + if (!key || sscanf(uts.release, "%d.%d%1s", &maj, &min, c) != 3) > > + return NULL; > > + > > + snprintf(url, sizeof(url), "https://elixir.bootlin.com/linux/v%d.%d/C/ident/%s", > > + maj, min, key); > > + > > + return url; > > +} > > I don't think this whole make_url() belongs in here with the link construction to > elixir.bootlin.com, it's just confusing when you have downstream trees with custom > backports. > Right, it only considers the upstream kernel. I guess I'll just drop it. [...] I'll also address everything else, thanks! -- Kartikeya