On Fri, May 13, 2022 at 05:43:51PM +0200, Uladzislau Rezki wrote: > > On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@xxxxxxxxx> wrote: > > > > > > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@xxxxxxxxx> wrote: > > > > > > > > > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote: > > > > > > > > Never mind. I port it into 5.10 > > > > > > > > > > > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for > > > > > > > 5.10 , although that does not have the kfree changes, everything else is > > > > > > > ditto. > > > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4 > > > > > > > > > > > > > No problem. kfree_rcu two patches are not so important in this series. > > > > > > So i have backported them into my 5.10 kernel because the latest kernel > > > > > > is not so easy to up and run on my device :) > > > > > > > > > > Actually I was going to write here, apparently some tests are showing > > > > > kfree_rcu()->call_rcu_lazy() causing possible regression. So it is > > > > > good to drop those for your initial testing! > > > > > > > > > Yep, i dropped both. The one that make use of call_rcu_lazy() seems not > > > > so important for kfree_rcu() because we do batch requests there anyway. > > > > One thing that i would like to improve in kfree_rcu() is a better utilization > > > > of page slots. > > > > > > > > I will share my results either tomorrow or on Monday. I hope that is fine. > > > > > > > > > > Here we go with some data on our Android handset that runs 5.10 kernel. The test > > > case i have checked was a "static image" use case. Condition is: screen ON with > > > disabled all connectivity. > > > > > > 1. > > > First data i took is how many wakeups cause an RCU subsystem during this test case > > > when everything is pretty idling. Duration is 360 seconds: > > > > > > <snip> > > > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu > > > > Nice! Do you mind sharing this script? I was just talking to Rushikesh > > that we want something like this during testing. Appreciate it. Also, > > if we dump timer wakeup reasons/callbacks that would also be awesome. > > > Please find in attachment. I wrote it once upon a time and make use of > to parse "perf script" output, i.e. raw data. The file name is: perf_script_parser.c > so just compile it. > > How to use it: > 1. run perf: './perf sched record -a -- sleep "how much in sec you want to collect data"' > 2. ./perf script -i ./perf.data > foo.script > 3. ./perf_script_parser ./foo.script Thanks a lot for sharing this. I think it will be quite useful. FWIW, I also use "perf sched record" and "perf sched report --sort latency" to get wakeup latencies. > > FWIW, I wrote a BPF tool that periodically dumps callbacks and can > > share that with you on request as well. That is probably not in a > > shape for mainline though (Makefile missing and such). > > > Yep, please share! Sure, check out my bcc repo from here: https://github.com/joelagnel/bcc Build this project, then cd libbpf-tools and run make. This should produce a static binary 'rcutop' which you can push to Android. You have to build for ARM which bcc should have instructions for. I have also included the rcuptop diff at the end of this file for reference. > > > 2. > > > Please find in attachment two power plots. The same test case. One is related to a > > > regular use of call_rcu() and second one is "lazy" usage. There is light a difference > > > in power, it is ~2mA. Event though it is rather small but it is detectable and solid > > > what is also important, thus it proofs the concept. Please note it might be more power > > > efficient for other arches and platforms. Because of different HW design that is related > > > to C-states of CPU and energy that is needed to in/out of those deep power states. > > > > Nice! I wonder if you still have other frequent callbacks on your > > system that are getting queued during the tests. Could you dump the > > rcu_callbacks trace event and see if you have any CBs frequently > > called that the series did not address? > > > I have pretty much like this: > <snip> > rcuop/2-33 [002] d..1 6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/1-26 [001] d..1 6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/0-15 [001] d..1 6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/3-40 [003] d..1 6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/0-15 [001] d..1 6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13 > rcuop/1-26 [000] d..1 6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/2-33 [001] d..1 6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/3-40 [002] d..1 6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/0-15 [002] d..1 6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/0-15 [003] d..1 6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10 > rcuop/1-26 [003] d..1 6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/1-26 [003] d..1 6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10 > rcuop/0-15 [003] d..1 6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/0-15 [003] d..1 6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10 > rcuop/3-40 [001] d..1 6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/2-33 [001] d..1 6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/2-33 [002] d..1 6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15 > rcuop/1-26 [003] d..1 6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > rcuop/1-26 [003] d..1 6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10 > <snip> > > so almost everything is batched. Nice, glad to know this is happening even without the kfree_rcu() changes. > > Also, one more thing I was curious about is - do you see savings when > > you pin the rcu threads to the LITTLE CPUs of the system? The theory > > being, not disturbing the BIG CPUs which are more power hungry may let > > them go into a deeper idle state and save power (due to leakage > > current and so forth). > > > I did some experimenting to pin nocbs to a little cluster. For idle use > cases i did not see any power gain. For heavy one i see that "big" CPUs > are also invoking and busy with it quite often. Probably i should think > of some use case where i can detect the power difference. If you have > something please let me know. Yeah, probably screen off + audio playback might be a good one, because it lightly loads the CPUs. > > > So a front-lazy-batching is something worth to have, IMHO :) > > > > Exciting! Being lazy pays off some times ;-) ;-). If you are Ok with > > it, we can add your data to the LPC slides as well about your > > investigation (with attribution to you). > > > No problem, since we will give a talk on LPC, more data we have more > convincing we are :) I forget, you did mention you are Ok with presenting with us right? It would be great if you present your data when we come to Android, if you are OK with it. I'll start a common slide deck soon and share so you, Rushikesh and me can add slides to it and present together. thanks, - Joel ---8<----------------------- From: Joel Fernandes <joelaf@xxxxxxxxxx> Subject: [PATCH] rcutop Signed-off-by: Joel Fernandes <joelaf@xxxxxxxxxx> --- libbpf-tools/Makefile | 1 + libbpf-tools/rcutop.bpf.c | 56 ++++++++ libbpf-tools/rcutop.c | 288 ++++++++++++++++++++++++++++++++++++++ libbpf-tools/rcutop.h | 8 ++ 4 files changed, 353 insertions(+) create mode 100644 libbpf-tools/rcutop.bpf.c create mode 100644 libbpf-tools/rcutop.c create mode 100644 libbpf-tools/rcutop.h diff --git a/libbpf-tools/Makefile b/libbpf-tools/Makefile index e60ec409..0d4cdff2 100644 --- a/libbpf-tools/Makefile +++ b/libbpf-tools/Makefile @@ -42,6 +42,7 @@ APPS = \ klockstat \ ksnoop \ llcstat \ + rcutop \ mountsnoop \ numamove \ offcputime \ diff --git a/libbpf-tools/rcutop.bpf.c b/libbpf-tools/rcutop.bpf.c new file mode 100644 index 00000000..8287bbe2 --- /dev/null +++ b/libbpf-tools/rcutop.bpf.c @@ -0,0 +1,56 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ +/* Copyright (c) 2021 Hengqi Chen */ +#include <vmlinux.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_core_read.h> +#include <bpf/bpf_tracing.h> +#include "rcutop.h" +#include "maps.bpf.h" + +#define MAX_ENTRIES 10240 + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, MAX_ENTRIES); + __type(key, void *); + __type(value, int); +} cbs_queued SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, MAX_ENTRIES); + __type(key, void *); + __type(value, int); +} cbs_executed SEC(".maps"); + +SEC("tracepoint/rcu/rcu_callback") +int tracepoint_rcu_callback(struct trace_event_raw_rcu_callback* ctx) +{ + void *key = ctx->func; + int *val = NULL; + static const int zero; + + val = bpf_map_lookup_or_try_init(&cbs_queued, &key, &zero); + if (val) { + __sync_fetch_and_add(val, 1); + } + + return 0; +} + +SEC("tracepoint/rcu/rcu_invoke_callback") +int tracepoint_rcu_invoke_callback(struct trace_event_raw_rcu_invoke_callback* ctx) +{ + void *key = ctx->func; + int *val; + int zero = 0; + + val = bpf_map_lookup_or_try_init(&cbs_executed, (void *)&key, (void *)&zero); + if (val) { + __sync_fetch_and_add(val, 1); + } + + return 0; +} + +char LICENSE[] SEC("license") = "Dual BSD/GPL"; diff --git a/libbpf-tools/rcutop.c b/libbpf-tools/rcutop.c new file mode 100644 index 00000000..35795875 --- /dev/null +++ b/libbpf-tools/rcutop.c @@ -0,0 +1,288 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ + +/* + * rcutop + * Copyright (c) 2022 Joel Fernandes + * + * 05-May-2022 Joel Fernandes Created this. + */ +#include <argp.h> +#include <errno.h> +#include <signal.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <time.h> +#include <unistd.h> + +#include <bpf/libbpf.h> +#include <bpf/bpf.h> +#include "rcutop.h" +#include "rcutop.skel.h" +#include "btf_helpers.h" +#include "trace_helpers.h" + +#define warn(...) fprintf(stderr, __VA_ARGS__) +#define OUTPUT_ROWS_LIMIT 10240 + +static volatile sig_atomic_t exiting = 0; + +static bool clear_screen = true; +static int output_rows = 20; +static int interval = 1; +static int count = 99999999; +static bool verbose = false; + +const char *argp_program_version = "rcutop 0.1"; +const char *argp_program_bug_address = +"https://github.com/iovisor/bcc/tree/master/libbpf-tools"; +const char argp_program_doc[] = +"Show RCU callback queuing and execution stats.\n" +"\n" +"USAGE: rcutop [-h] [interval] [count]\n" +"\n" +"EXAMPLES:\n" +" rcutop # rcu activity top, refresh every 1s\n" +" rcutop 5 10 # 5s summaries, 10 times\n"; + +static const struct argp_option opts[] = { + { "noclear", 'C', NULL, 0, "Don't clear the screen" }, + { "rows", 'r', "ROWS", 0, "Maximum rows to print, default 20" }, + { "verbose", 'v', NULL, 0, "Verbose debug output" }, + { NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" }, + {}, +}; + +static error_t parse_arg(int key, char *arg, struct argp_state *state) +{ + long rows; + static int pos_args; + + switch (key) { + case 'C': + clear_screen = false; + break; + case 'v': + verbose = true; + break; + case 'h': + argp_state_help(state, stderr, ARGP_HELP_STD_HELP); + break; + case 'r': + errno = 0; + rows = strtol(arg, NULL, 10); + if (errno || rows <= 0) { + warn("invalid rows: %s\n", arg); + argp_usage(state); + } + output_rows = rows; + if (output_rows > OUTPUT_ROWS_LIMIT) + output_rows = OUTPUT_ROWS_LIMIT; + break; + case ARGP_KEY_ARG: + errno = 0; + if (pos_args == 0) { + interval = strtol(arg, NULL, 10); + if (errno || interval <= 0) { + warn("invalid interval\n"); + argp_usage(state); + } + } else if (pos_args == 1) { + count = strtol(arg, NULL, 10); + if (errno || count <= 0) { + warn("invalid count\n"); + argp_usage(state); + } + } else { + warn("unrecognized positional argument: %s\n", arg); + argp_usage(state); + } + pos_args++; + break; + default: + return ARGP_ERR_UNKNOWN; + } + return 0; +} + +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) +{ + if (level == LIBBPF_DEBUG && !verbose) + return 0; + return vfprintf(stderr, format, args); +} + +static void sig_int(int signo) +{ + exiting = 1; +} + +static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache, + struct rcutop_bpf *obj) +{ + void *key, **prev_key = NULL; + int n, err = 0; + int qfd = bpf_map__fd(obj->maps.cbs_queued); + int efd = bpf_map__fd(obj->maps.cbs_executed); + const struct ksym *ksym; + FILE *f; + time_t t; + struct tm *tm; + char ts[16], buf[256]; + + f = fopen("/proc/loadavg", "r"); + if (f) { + time(&t); + tm = localtime(&t); + strftime(ts, sizeof(ts), "%H:%M:%S", tm); + memset(buf, 0, sizeof(buf)); + n = fread(buf, 1, sizeof(buf), f); + if (n) + printf("%8s loadavg: %s\n", ts, buf); + fclose(f); + } + + printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed"); + + while (1) { + int qcount = 0, ecount = 0; + + err = bpf_map_get_next_key(qfd, prev_key, &key); + if (err) { + if (errno == ENOENT) { + err = 0; + break; + } + warn("bpf_map_get_next_key failed: %s\n", strerror(errno)); + return err; + } + + err = bpf_map_lookup_elem(qfd, &key, &qcount); + if (err) { + warn("bpf_map_lookup_elem failed: %s\n", strerror(errno)); + return err; + } + prev_key = &key; + + bpf_map_lookup_elem(efd, &key, &ecount); + + ksym = ksyms__map_addr(ksyms, (unsigned long)key); + printf("%-32s %-6d %-6d\n", + ksym ? ksym->name : "Unknown", + qcount, ecount); + } + printf("\n"); + prev_key = NULL; + while (1) { + err = bpf_map_get_next_key(qfd, prev_key, &key); + if (err) { + if (errno == ENOENT) { + err = 0; + break; + } + warn("bpf_map_get_next_key failed: %s\n", strerror(errno)); + return err; + } + err = bpf_map_delete_elem(qfd, &key); + if (err) { + if (errno == ENOENT) { + err = 0; + continue; + } + warn("bpf_map_delete_elem failed: %s\n", strerror(errno)); + return err; + } + + bpf_map_delete_elem(efd, &key); + prev_key = &key; + } + + return err; +} + +int main(int argc, char **argv) +{ + LIBBPF_OPTS(bpf_object_open_opts, open_opts); + static const struct argp argp = { + .options = opts, + .parser = parse_arg, + .doc = argp_program_doc, + }; + struct rcutop_bpf *obj; + int err; + struct syms_cache *syms_cache = NULL; + struct ksyms *ksyms = NULL; + + err = argp_parse(&argp, argc, argv, 0, NULL, NULL); + if (err) + return err; + + libbpf_set_strict_mode(LIBBPF_STRICT_ALL); + libbpf_set_print(libbpf_print_fn); + + err = ensure_core_btf(&open_opts); + if (err) { + fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err)); + return 1; + } + + obj = rcutop_bpf__open_opts(&open_opts); + if (!obj) { + warn("failed to open BPF object\n"); + return 1; + } + + err = rcutop_bpf__load(obj); + if (err) { + warn("failed to load BPF object: %d\n", err); + goto cleanup; + } + + err = rcutop_bpf__attach(obj); + if (err) { + warn("failed to attach BPF programs: %d\n", err); + goto cleanup; + } + + ksyms = ksyms__load(); + if (!ksyms) { + fprintf(stderr, "failed to load kallsyms\n"); + goto cleanup; + } + + syms_cache = syms_cache__new(0); + if (!syms_cache) { + fprintf(stderr, "failed to create syms_cache\n"); + goto cleanup; + } + + if (signal(SIGINT, sig_int) == SIG_ERR) { + warn("can't set signal handler: %s\n", strerror(errno)); + err = 1; + goto cleanup; + } + + while (1) { + sleep(interval); + + if (clear_screen) { + err = system("clear"); + if (err) + goto cleanup; + } + + err = print_stat(ksyms, syms_cache, obj); + if (err) + goto cleanup; + + count--; + if (exiting || !count) + goto cleanup; + } + +cleanup: + rcutop_bpf__destroy(obj); + cleanup_core_btf(&open_opts); + + return err != 0; +} diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h new file mode 100644 index 00000000..cb2a3557 --- /dev/null +++ b/libbpf-tools/rcutop.h @@ -0,0 +1,8 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ +#ifndef __RCUTOP_H +#define __RCUTOP_H + +#define PATH_MAX 4096 +#define TASK_COMM_LEN 16 + +#endif /* __RCUTOP_H */ -- 2.36.0.550.gb090851708-goog