On Sat, May 14, 2022 at 10:25 AM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote: > > On Fri, May 13, 2022 at 05:43:51PM +0200, Uladzislau Rezki wrote: > > > On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@xxxxxxxxx> wrote: > > > > > > > > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@xxxxxxxxx> wrote: > > > > > > > > > > > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote: > > > > > > > > > Never mind. I port it into 5.10 > > > > > > > > > > > > > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for > > > > > > > > 5.10 , although that does not have the kfree changes, everything else is > > > > > > > > ditto. > > > > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4 > > > > > > > > > > > > > > > No problem. kfree_rcu two patches are not so important in this series. > > > > > > > So i have backported them into my 5.10 kernel because the latest kernel > > > > > > > is not so easy to up and run on my device :) > > > > > > > > > > > > Actually I was going to write here, apparently some tests are showing > > > > > > kfree_rcu()->call_rcu_lazy() causing possible regression. So it is > > > > > > good to drop those for your initial testing! > > > > > > > > > > > Yep, i dropped both. The one that make use of call_rcu_lazy() seems not > > > > > so important for kfree_rcu() because we do batch requests there anyway. > > > > > One thing that i would like to improve in kfree_rcu() is a better utilization > > > > > of page slots. > > > > > > > > > > I will share my results either tomorrow or on Monday. I hope that is fine. > > > > > > > > > > > > > Here we go with some data on our Android handset that runs 5.10 kernel. The test > > > > case i have checked was a "static image" use case. Condition is: screen ON with > > > > disabled all connectivity. > > > > > > > > 1. > > > > First data i took is how many wakeups cause an RCU subsystem during this test case > > > > when everything is pretty idling. Duration is 360 seconds: > > > > > > > > <snip> > > > > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu > > > > > > Nice! Do you mind sharing this script? I was just talking to Rushikesh > > > that we want something like this during testing. Appreciate it. Also, > > > if we dump timer wakeup reasons/callbacks that would also be awesome. > > > > > Please find in attachment. I wrote it once upon a time and make use of > > to parse "perf script" output, i.e. raw data. The file name is: perf_script_parser.c > > so just compile it. > > > > How to use it: > > 1. run perf: './perf sched record -a -- sleep "how much in sec you want to collect data"' > > 2. ./perf script -i ./perf.data > foo.script > > 3. ./perf_script_parser ./foo.script > > Thanks a lot for sharing this. I think it will be quite useful. FWIW, I also > use "perf sched record" and "perf sched report --sort latency" to get wakeup > latencies. > > > > FWIW, I wrote a BPF tool that periodically dumps callbacks and can > > > share that with you on request as well. That is probably not in a > > > shape for mainline though (Makefile missing and such). > > > > > Yep, please share! > > Sure, check out my bcc repo from here: > https://github.com/joelagnel/bcc Kind of random, but this is why I love sharing my tools with others. I actually forgot that I checked in the code for my rcutop here, and this email is how I found it, lol :) Did you happen to try it out too btw? :) Thanks, Joel > > Build this project, then cd libbpf-tools and run make. This should produce a > static binary 'rcutop' which you can push to Android. You have to build for > ARM which bcc should have instructions for. I have also included the rcuptop > diff at the end of this file for reference. > > > > > 2. > > > > Please find in attachment two power plots. The same test case. One is related to a > > > > regular use of call_rcu() and second one is "lazy" usage. There is light a difference > > > > in power, it is ~2mA. Event though it is rather small but it is detectable and solid > > > > what is also important, thus it proofs the concept. Please note it might be more power > > > > efficient for other arches and platforms. Because of different HW design that is related > > > > to C-states of CPU and energy that is needed to in/out of those deep power states. > > > > > > Nice! I wonder if you still have other frequent callbacks on your > > > system that are getting queued during the tests. Could you dump the > > > rcu_callbacks trace event and see if you have any CBs frequently > > > called that the series did not address? > > > > > I have pretty much like this: > > <snip> > > rcuop/2-33 [002] d..1 6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/1-26 [001] d..1 6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [001] d..1 6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/3-40 [003] d..1 6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [001] d..1 6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13 > > rcuop/1-26 [000] d..1 6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/2-33 [001] d..1 6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/3-40 [002] d..1 6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [002] d..1 6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [003] d..1 6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10 > > rcuop/1-26 [003] d..1 6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/1-26 [003] d..1 6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10 > > rcuop/0-15 [003] d..1 6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [003] d..1 6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10 > > rcuop/3-40 [001] d..1 6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/2-33 [001] d..1 6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/2-33 [002] d..1 6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15 > > rcuop/1-26 [003] d..1 6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/1-26 [003] d..1 6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10 > > <snip> > > > > so almost everything is batched. > > Nice, glad to know this is happening even without the kfree_rcu() changes. > > > > Also, one more thing I was curious about is - do you see savings when > > > you pin the rcu threads to the LITTLE CPUs of the system? The theory > > > being, not disturbing the BIG CPUs which are more power hungry may let > > > them go into a deeper idle state and save power (due to leakage > > > current and so forth). > > > > > I did some experimenting to pin nocbs to a little cluster. For idle use > > cases i did not see any power gain. For heavy one i see that "big" CPUs > > are also invoking and busy with it quite often. Probably i should think > > of some use case where i can detect the power difference. If you have > > something please let me know. > > Yeah, probably screen off + audio playback might be a good one, because it > lightly loads the CPUs. > > > > > So a front-lazy-batching is something worth to have, IMHO :) > > > > > > Exciting! Being lazy pays off some times ;-) ;-). If you are Ok with > > > it, we can add your data to the LPC slides as well about your > > > investigation (with attribution to you). > > > > > No problem, since we will give a talk on LPC, more data we have more > > convincing we are :) > > I forget, you did mention you are Ok with presenting with us right? It would > be great if you present your data when we come to Android, if you are OK with > it. I'll start a common slide deck soon and share so you, Rushikesh and me > can add slides to it and present together. > > thanks, > > - Joel > > ---8<----------------------- > > From: Joel Fernandes <joelaf@xxxxxxxxxx> > Subject: [PATCH] rcutop > > Signed-off-by: Joel Fernandes <joelaf@xxxxxxxxxx> > --- > libbpf-tools/Makefile | 1 + > libbpf-tools/rcutop.bpf.c | 56 ++++++++ > libbpf-tools/rcutop.c | 288 ++++++++++++++++++++++++++++++++++++++ > libbpf-tools/rcutop.h | 8 ++ > 4 files changed, 353 insertions(+) > create mode 100644 libbpf-tools/rcutop.bpf.c > create mode 100644 libbpf-tools/rcutop.c > create mode 100644 libbpf-tools/rcutop.h > > diff --git a/libbpf-tools/Makefile b/libbpf-tools/Makefile > index e60ec409..0d4cdff2 100644 > --- a/libbpf-tools/Makefile > +++ b/libbpf-tools/Makefile > @@ -42,6 +42,7 @@ APPS = \ > klockstat \ > ksnoop \ > llcstat \ > + rcutop \ > mountsnoop \ > numamove \ > offcputime \ > diff --git a/libbpf-tools/rcutop.bpf.c b/libbpf-tools/rcutop.bpf.c > new file mode 100644 > index 00000000..8287bbe2 > --- /dev/null > +++ b/libbpf-tools/rcutop.bpf.c > @@ -0,0 +1,56 @@ > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ > +/* Copyright (c) 2021 Hengqi Chen */ > +#include <vmlinux.h> > +#include <bpf/bpf_helpers.h> > +#include <bpf/bpf_core_read.h> > +#include <bpf/bpf_tracing.h> > +#include "rcutop.h" > +#include "maps.bpf.h" > + > +#define MAX_ENTRIES 10240 > + > +struct { > + __uint(type, BPF_MAP_TYPE_HASH); > + __uint(max_entries, MAX_ENTRIES); > + __type(key, void *); > + __type(value, int); > +} cbs_queued SEC(".maps"); > + > +struct { > + __uint(type, BPF_MAP_TYPE_HASH); > + __uint(max_entries, MAX_ENTRIES); > + __type(key, void *); > + __type(value, int); > +} cbs_executed SEC(".maps"); > + > +SEC("tracepoint/rcu/rcu_callback") > +int tracepoint_rcu_callback(struct trace_event_raw_rcu_callback* ctx) > +{ > + void *key = ctx->func; > + int *val = NULL; > + static const int zero; > + > + val = bpf_map_lookup_or_try_init(&cbs_queued, &key, &zero); > + if (val) { > + __sync_fetch_and_add(val, 1); > + } > + > + return 0; > +} > + > +SEC("tracepoint/rcu/rcu_invoke_callback") > +int tracepoint_rcu_invoke_callback(struct trace_event_raw_rcu_invoke_callback* ctx) > +{ > + void *key = ctx->func; > + int *val; > + int zero = 0; > + > + val = bpf_map_lookup_or_try_init(&cbs_executed, (void *)&key, (void *)&zero); > + if (val) { > + __sync_fetch_and_add(val, 1); > + } > + > + return 0; > +} > + > +char LICENSE[] SEC("license") = "Dual BSD/GPL"; > diff --git a/libbpf-tools/rcutop.c b/libbpf-tools/rcutop.c > new file mode 100644 > index 00000000..35795875 > --- /dev/null > +++ b/libbpf-tools/rcutop.c > @@ -0,0 +1,288 @@ > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ > + > +/* > + * rcutop > + * Copyright (c) 2022 Joel Fernandes > + * > + * 05-May-2022 Joel Fernandes Created this. > + */ > +#include <argp.h> > +#include <errno.h> > +#include <signal.h> > +#include <stdio.h> > +#include <stdlib.h> > +#include <string.h> > +#include <time.h> > +#include <unistd.h> > + > +#include <bpf/libbpf.h> > +#include <bpf/bpf.h> > +#include "rcutop.h" > +#include "rcutop.skel.h" > +#include "btf_helpers.h" > +#include "trace_helpers.h" > + > +#define warn(...) fprintf(stderr, __VA_ARGS__) > +#define OUTPUT_ROWS_LIMIT 10240 > + > +static volatile sig_atomic_t exiting = 0; > + > +static bool clear_screen = true; > +static int output_rows = 20; > +static int interval = 1; > +static int count = 99999999; > +static bool verbose = false; > + > +const char *argp_program_version = "rcutop 0.1"; > +const char *argp_program_bug_address = > +"https://github.com/iovisor/bcc/tree/master/libbpf-tools"; > +const char argp_program_doc[] = > +"Show RCU callback queuing and execution stats.\n" > +"\n" > +"USAGE: rcutop [-h] [interval] [count]\n" > +"\n" > +"EXAMPLES:\n" > +" rcutop # rcu activity top, refresh every 1s\n" > +" rcutop 5 10 # 5s summaries, 10 times\n"; > + > +static const struct argp_option opts[] = { > + { "noclear", 'C', NULL, 0, "Don't clear the screen" }, > + { "rows", 'r', "ROWS", 0, "Maximum rows to print, default 20" }, > + { "verbose", 'v', NULL, 0, "Verbose debug output" }, > + { NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" }, > + {}, > +}; > + > +static error_t parse_arg(int key, char *arg, struct argp_state *state) > +{ > + long rows; > + static int pos_args; > + > + switch (key) { > + case 'C': > + clear_screen = false; > + break; > + case 'v': > + verbose = true; > + break; > + case 'h': > + argp_state_help(state, stderr, ARGP_HELP_STD_HELP); > + break; > + case 'r': > + errno = 0; > + rows = strtol(arg, NULL, 10); > + if (errno || rows <= 0) { > + warn("invalid rows: %s\n", arg); > + argp_usage(state); > + } > + output_rows = rows; > + if (output_rows > OUTPUT_ROWS_LIMIT) > + output_rows = OUTPUT_ROWS_LIMIT; > + break; > + case ARGP_KEY_ARG: > + errno = 0; > + if (pos_args == 0) { > + interval = strtol(arg, NULL, 10); > + if (errno || interval <= 0) { > + warn("invalid interval\n"); > + argp_usage(state); > + } > + } else if (pos_args == 1) { > + count = strtol(arg, NULL, 10); > + if (errno || count <= 0) { > + warn("invalid count\n"); > + argp_usage(state); > + } > + } else { > + warn("unrecognized positional argument: %s\n", arg); > + argp_usage(state); > + } > + pos_args++; > + break; > + default: > + return ARGP_ERR_UNKNOWN; > + } > + return 0; > +} > + > +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) > +{ > + if (level == LIBBPF_DEBUG && !verbose) > + return 0; > + return vfprintf(stderr, format, args); > +} > + > +static void sig_int(int signo) > +{ > + exiting = 1; > +} > + > +static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache, > + struct rcutop_bpf *obj) > +{ > + void *key, **prev_key = NULL; > + int n, err = 0; > + int qfd = bpf_map__fd(obj->maps.cbs_queued); > + int efd = bpf_map__fd(obj->maps.cbs_executed); > + const struct ksym *ksym; > + FILE *f; > + time_t t; > + struct tm *tm; > + char ts[16], buf[256]; > + > + f = fopen("/proc/loadavg", "r"); > + if (f) { > + time(&t); > + tm = localtime(&t); > + strftime(ts, sizeof(ts), "%H:%M:%S", tm); > + memset(buf, 0, sizeof(buf)); > + n = fread(buf, 1, sizeof(buf), f); > + if (n) > + printf("%8s loadavg: %s\n", ts, buf); > + fclose(f); > + } > + > + printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed"); > + > + while (1) { > + int qcount = 0, ecount = 0; > + > + err = bpf_map_get_next_key(qfd, prev_key, &key); > + if (err) { > + if (errno == ENOENT) { > + err = 0; > + break; > + } > + warn("bpf_map_get_next_key failed: %s\n", strerror(errno)); > + return err; > + } > + > + err = bpf_map_lookup_elem(qfd, &key, &qcount); > + if (err) { > + warn("bpf_map_lookup_elem failed: %s\n", strerror(errno)); > + return err; > + } > + prev_key = &key; > + > + bpf_map_lookup_elem(efd, &key, &ecount); > + > + ksym = ksyms__map_addr(ksyms, (unsigned long)key); > + printf("%-32s %-6d %-6d\n", > + ksym ? ksym->name : "Unknown", > + qcount, ecount); > + } > + printf("\n"); > + prev_key = NULL; > + while (1) { > + err = bpf_map_get_next_key(qfd, prev_key, &key); > + if (err) { > + if (errno == ENOENT) { > + err = 0; > + break; > + } > + warn("bpf_map_get_next_key failed: %s\n", strerror(errno)); > + return err; > + } > + err = bpf_map_delete_elem(qfd, &key); > + if (err) { > + if (errno == ENOENT) { > + err = 0; > + continue; > + } > + warn("bpf_map_delete_elem failed: %s\n", strerror(errno)); > + return err; > + } > + > + bpf_map_delete_elem(efd, &key); > + prev_key = &key; > + } > + > + return err; > +} > + > +int main(int argc, char **argv) > +{ > + LIBBPF_OPTS(bpf_object_open_opts, open_opts); > + static const struct argp argp = { > + .options = opts, > + .parser = parse_arg, > + .doc = argp_program_doc, > + }; > + struct rcutop_bpf *obj; > + int err; > + struct syms_cache *syms_cache = NULL; > + struct ksyms *ksyms = NULL; > + > + err = argp_parse(&argp, argc, argv, 0, NULL, NULL); > + if (err) > + return err; > + > + libbpf_set_strict_mode(LIBBPF_STRICT_ALL); > + libbpf_set_print(libbpf_print_fn); > + > + err = ensure_core_btf(&open_opts); > + if (err) { > + fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err)); > + return 1; > + } > + > + obj = rcutop_bpf__open_opts(&open_opts); > + if (!obj) { > + warn("failed to open BPF object\n"); > + return 1; > + } > + > + err = rcutop_bpf__load(obj); > + if (err) { > + warn("failed to load BPF object: %d\n", err); > + goto cleanup; > + } > + > + err = rcutop_bpf__attach(obj); > + if (err) { > + warn("failed to attach BPF programs: %d\n", err); > + goto cleanup; > + } > + > + ksyms = ksyms__load(); > + if (!ksyms) { > + fprintf(stderr, "failed to load kallsyms\n"); > + goto cleanup; > + } > + > + syms_cache = syms_cache__new(0); > + if (!syms_cache) { > + fprintf(stderr, "failed to create syms_cache\n"); > + goto cleanup; > + } > + > + if (signal(SIGINT, sig_int) == SIG_ERR) { > + warn("can't set signal handler: %s\n", strerror(errno)); > + err = 1; > + goto cleanup; > + } > + > + while (1) { > + sleep(interval); > + > + if (clear_screen) { > + err = system("clear"); > + if (err) > + goto cleanup; > + } > + > + err = print_stat(ksyms, syms_cache, obj); > + if (err) > + goto cleanup; > + > + count--; > + if (exiting || !count) > + goto cleanup; > + } > + > +cleanup: > + rcutop_bpf__destroy(obj); > + cleanup_core_btf(&open_opts); > + > + return err != 0; > +} > diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h > new file mode 100644 > index 00000000..cb2a3557 > --- /dev/null > +++ b/libbpf-tools/rcutop.h > @@ -0,0 +1,8 @@ > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ > +#ifndef __RCUTOP_H > +#define __RCUTOP_H > + > +#define PATH_MAX 4096 > +#define TASK_COMM_LEN 16 > + > +#endif /* __RCUTOP_H */ > -- > 2.36.0.550.gb090851708-goog > On Sat, May 14, 2022 at 10:25 AM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote: > > On Fri, May 13, 2022 at 05:43:51PM +0200, Uladzislau Rezki wrote: > > > On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@xxxxxxxxx> wrote: > > > > > > > > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@xxxxxxxxx> wrote: > > > > > > > > > > > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote: > > > > > > > > > Never mind. I port it into 5.10 > > > > > > > > > > > > > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for > > > > > > > > 5.10 , although that does not have the kfree changes, everything else is > > > > > > > > ditto. > > > > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4 > > > > > > > > > > > > > > > No problem. kfree_rcu two patches are not so important in this series. > > > > > > > So i have backported them into my 5.10 kernel because the latest kernel > > > > > > > is not so easy to up and run on my device :) > > > > > > > > > > > > Actually I was going to write here, apparently some tests are showing > > > > > > kfree_rcu()->call_rcu_lazy() causing possible regression. So it is > > > > > > good to drop those for your initial testing! > > > > > > > > > > > Yep, i dropped both. The one that make use of call_rcu_lazy() seems not > > > > > so important for kfree_rcu() because we do batch requests there anyway. > > > > > One thing that i would like to improve in kfree_rcu() is a better utilization > > > > > of page slots. > > > > > > > > > > I will share my results either tomorrow or on Monday. I hope that is fine. > > > > > > > > > > > > > Here we go with some data on our Android handset that runs 5.10 kernel. The test > > > > case i have checked was a "static image" use case. Condition is: screen ON with > > > > disabled all connectivity. > > > > > > > > 1. > > > > First data i took is how many wakeups cause an RCU subsystem during this test case > > > > when everything is pretty idling. Duration is 360 seconds: > > > > > > > > <snip> > > > > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu > > > > > > Nice! Do you mind sharing this script? I was just talking to Rushikesh > > > that we want something like this during testing. Appreciate it. Also, > > > if we dump timer wakeup reasons/callbacks that would also be awesome. > > > > > Please find in attachment. I wrote it once upon a time and make use of > > to parse "perf script" output, i.e. raw data. The file name is: perf_script_parser.c > > so just compile it. > > > > How to use it: > > 1. run perf: './perf sched record -a -- sleep "how much in sec you want to collect data"' > > 2. ./perf script -i ./perf.data > foo.script > > 3. ./perf_script_parser ./foo.script > > Thanks a lot for sharing this. I think it will be quite useful. FWIW, I also > use "perf sched record" and "perf sched report --sort latency" to get wakeup > latencies. > > > > FWIW, I wrote a BPF tool that periodically dumps callbacks and can > > > share that with you on request as well. That is probably not in a > > > shape for mainline though (Makefile missing and such). > > > > > Yep, please share! > > Sure, check out my bcc repo from here: > https://github.com/joelagnel/bcc > > Build this project, then cd libbpf-tools and run make. This should produce a > static binary 'rcutop' which you can push to Android. You have to build for > ARM which bcc should have instructions for. I have also included the rcuptop > diff at the end of this file for reference. > > > > > 2. > > > > Please find in attachment two power plots. The same test case. One is related to a > > > > regular use of call_rcu() and second one is "lazy" usage. There is light a difference > > > > in power, it is ~2mA. Event though it is rather small but it is detectable and solid > > > > what is also important, thus it proofs the concept. Please note it might be more power > > > > efficient for other arches and platforms. Because of different HW design that is related > > > > to C-states of CPU and energy that is needed to in/out of those deep power states. > > > > > > Nice! I wonder if you still have other frequent callbacks on your > > > system that are getting queued during the tests. Could you dump the > > > rcu_callbacks trace event and see if you have any CBs frequently > > > called that the series did not address? > > > > > I have pretty much like this: > > <snip> > > rcuop/2-33 [002] d..1 6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/1-26 [001] d..1 6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [001] d..1 6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/3-40 [003] d..1 6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [001] d..1 6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13 > > rcuop/1-26 [000] d..1 6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/2-33 [001] d..1 6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/3-40 [002] d..1 6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [002] d..1 6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [003] d..1 6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10 > > rcuop/1-26 [003] d..1 6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/1-26 [003] d..1 6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10 > > rcuop/0-15 [003] d..1 6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/0-15 [003] d..1 6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10 > > rcuop/3-40 [001] d..1 6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/2-33 [001] d..1 6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/2-33 [002] d..1 6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15 > > rcuop/1-26 [003] d..1 6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16 > > rcuop/1-26 [003] d..1 6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10 > > <snip> > > > > so almost everything is batched. > > Nice, glad to know this is happening even without the kfree_rcu() changes. > > > > Also, one more thing I was curious about is - do you see savings when > > > you pin the rcu threads to the LITTLE CPUs of the system? The theory > > > being, not disturbing the BIG CPUs which are more power hungry may let > > > them go into a deeper idle state and save power (due to leakage > > > current and so forth). > > > > > I did some experimenting to pin nocbs to a little cluster. For idle use > > cases i did not see any power gain. For heavy one i see that "big" CPUs > > are also invoking and busy with it quite often. Probably i should think > > of some use case where i can detect the power difference. If you have > > something please let me know. > > Yeah, probably screen off + audio playback might be a good one, because it > lightly loads the CPUs. > > > > > So a front-lazy-batching is something worth to have, IMHO :) > > > > > > Exciting! Being lazy pays off some times ;-) ;-). If you are Ok with > > > it, we can add your data to the LPC slides as well about your > > > investigation (with attribution to you). > > > > > No problem, since we will give a talk on LPC, more data we have more > > convincing we are :) > > I forget, you did mention you are Ok with presenting with us right? It would > be great if you present your data when we come to Android, if you are OK with > it. I'll start a common slide deck soon and share so you, Rushikesh and me > can add slides to it and present together. > > thanks, > > - Joel > > ---8<----------------------- > > From: Joel Fernandes <joelaf@xxxxxxxxxx> > Subject: [PATCH] rcutop > > Signed-off-by: Joel Fernandes <joelaf@xxxxxxxxxx> > --- > libbpf-tools/Makefile | 1 + > libbpf-tools/rcutop.bpf.c | 56 ++++++++ > libbpf-tools/rcutop.c | 288 ++++++++++++++++++++++++++++++++++++++ > libbpf-tools/rcutop.h | 8 ++ > 4 files changed, 353 insertions(+) > create mode 100644 libbpf-tools/rcutop.bpf.c > create mode 100644 libbpf-tools/rcutop.c > create mode 100644 libbpf-tools/rcutop.h > > diff --git a/libbpf-tools/Makefile b/libbpf-tools/Makefile > index e60ec409..0d4cdff2 100644 > --- a/libbpf-tools/Makefile > +++ b/libbpf-tools/Makefile > @@ -42,6 +42,7 @@ APPS = \ > klockstat \ > ksnoop \ > llcstat \ > + rcutop \ > mountsnoop \ > numamove \ > offcputime \ > diff --git a/libbpf-tools/rcutop.bpf.c b/libbpf-tools/rcutop.bpf.c > new file mode 100644 > index 00000000..8287bbe2 > --- /dev/null > +++ b/libbpf-tools/rcutop.bpf.c > @@ -0,0 +1,56 @@ > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ > +/* Copyright (c) 2021 Hengqi Chen */ > +#include <vmlinux.h> > +#include <bpf/bpf_helpers.h> > +#include <bpf/bpf_core_read.h> > +#include <bpf/bpf_tracing.h> > +#include "rcutop.h" > +#include "maps.bpf.h" > + > +#define MAX_ENTRIES 10240 > + > +struct { > + __uint(type, BPF_MAP_TYPE_HASH); > + __uint(max_entries, MAX_ENTRIES); > + __type(key, void *); > + __type(value, int); > +} cbs_queued SEC(".maps"); > + > +struct { > + __uint(type, BPF_MAP_TYPE_HASH); > + __uint(max_entries, MAX_ENTRIES); > + __type(key, void *); > + __type(value, int); > +} cbs_executed SEC(".maps"); > + > +SEC("tracepoint/rcu/rcu_callback") > +int tracepoint_rcu_callback(struct trace_event_raw_rcu_callback* ctx) > +{ > + void *key = ctx->func; > + int *val = NULL; > + static const int zero; > + > + val = bpf_map_lookup_or_try_init(&cbs_queued, &key, &zero); > + if (val) { > + __sync_fetch_and_add(val, 1); > + } > + > + return 0; > +} > + > +SEC("tracepoint/rcu/rcu_invoke_callback") > +int tracepoint_rcu_invoke_callback(struct trace_event_raw_rcu_invoke_callback* ctx) > +{ > + void *key = ctx->func; > + int *val; > + int zero = 0; > + > + val = bpf_map_lookup_or_try_init(&cbs_executed, (void *)&key, (void *)&zero); > + if (val) { > + __sync_fetch_and_add(val, 1); > + } > + > + return 0; > +} > + > +char LICENSE[] SEC("license") = "Dual BSD/GPL"; > diff --git a/libbpf-tools/rcutop.c b/libbpf-tools/rcutop.c > new file mode 100644 > index 00000000..35795875 > --- /dev/null > +++ b/libbpf-tools/rcutop.c > @@ -0,0 +1,288 @@ > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ > + > +/* > + * rcutop > + * Copyright (c) 2022 Joel Fernandes > + * > + * 05-May-2022 Joel Fernandes Created this. > + */ > +#include <argp.h> > +#include <errno.h> > +#include <signal.h> > +#include <stdio.h> > +#include <stdlib.h> > +#include <string.h> > +#include <time.h> > +#include <unistd.h> > + > +#include <bpf/libbpf.h> > +#include <bpf/bpf.h> > +#include "rcutop.h" > +#include "rcutop.skel.h" > +#include "btf_helpers.h" > +#include "trace_helpers.h" > + > +#define warn(...) fprintf(stderr, __VA_ARGS__) > +#define OUTPUT_ROWS_LIMIT 10240 > + > +static volatile sig_atomic_t exiting = 0; > + > +static bool clear_screen = true; > +static int output_rows = 20; > +static int interval = 1; > +static int count = 99999999; > +static bool verbose = false; > + > +const char *argp_program_version = "rcutop 0.1"; > +const char *argp_program_bug_address = > +"https://github.com/iovisor/bcc/tree/master/libbpf-tools"; > +const char argp_program_doc[] = > +"Show RCU callback queuing and execution stats.\n" > +"\n" > +"USAGE: rcutop [-h] [interval] [count]\n" > +"\n" > +"EXAMPLES:\n" > +" rcutop # rcu activity top, refresh every 1s\n" > +" rcutop 5 10 # 5s summaries, 10 times\n"; > + > +static const struct argp_option opts[] = { > + { "noclear", 'C', NULL, 0, "Don't clear the screen" }, > + { "rows", 'r', "ROWS", 0, "Maximum rows to print, default 20" }, > + { "verbose", 'v', NULL, 0, "Verbose debug output" }, > + { NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" }, > + {}, > +}; > + > +static error_t parse_arg(int key, char *arg, struct argp_state *state) > +{ > + long rows; > + static int pos_args; > + > + switch (key) { > + case 'C': > + clear_screen = false; > + break; > + case 'v': > + verbose = true; > + break; > + case 'h': > + argp_state_help(state, stderr, ARGP_HELP_STD_HELP); > + break; > + case 'r': > + errno = 0; > + rows = strtol(arg, NULL, 10); > + if (errno || rows <= 0) { > + warn("invalid rows: %s\n", arg); > + argp_usage(state); > + } > + output_rows = rows; > + if (output_rows > OUTPUT_ROWS_LIMIT) > + output_rows = OUTPUT_ROWS_LIMIT; > + break; > + case ARGP_KEY_ARG: > + errno = 0; > + if (pos_args == 0) { > + interval = strtol(arg, NULL, 10); > + if (errno || interval <= 0) { > + warn("invalid interval\n"); > + argp_usage(state); > + } > + } else if (pos_args == 1) { > + count = strtol(arg, NULL, 10); > + if (errno || count <= 0) { > + warn("invalid count\n"); > + argp_usage(state); > + } > + } else { > + warn("unrecognized positional argument: %s\n", arg); > + argp_usage(state); > + } > + pos_args++; > + break; > + default: > + return ARGP_ERR_UNKNOWN; > + } > + return 0; > +} > + > +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) > +{ > + if (level == LIBBPF_DEBUG && !verbose) > + return 0; > + return vfprintf(stderr, format, args); > +} > + > +static void sig_int(int signo) > +{ > + exiting = 1; > +} > + > +static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache, > + struct rcutop_bpf *obj) > +{ > + void *key, **prev_key = NULL; > + int n, err = 0; > + int qfd = bpf_map__fd(obj->maps.cbs_queued); > + int efd = bpf_map__fd(obj->maps.cbs_executed); > + const struct ksym *ksym; > + FILE *f; > + time_t t; > + struct tm *tm; > + char ts[16], buf[256]; > + > + f = fopen("/proc/loadavg", "r"); > + if (f) { > + time(&t); > + tm = localtime(&t); > + strftime(ts, sizeof(ts), "%H:%M:%S", tm); > + memset(buf, 0, sizeof(buf)); > + n = fread(buf, 1, sizeof(buf), f); > + if (n) > + printf("%8s loadavg: %s\n", ts, buf); > + fclose(f); > + } > + > + printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed"); > + > + while (1) { > + int qcount = 0, ecount = 0; > + > + err = bpf_map_get_next_key(qfd, prev_key, &key); > + if (err) { > + if (errno == ENOENT) { > + err = 0; > + break; > + } > + warn("bpf_map_get_next_key failed: %s\n", strerror(errno)); > + return err; > + } > + > + err = bpf_map_lookup_elem(qfd, &key, &qcount); > + if (err) { > + warn("bpf_map_lookup_elem failed: %s\n", strerror(errno)); > + return err; > + } > + prev_key = &key; > + > + bpf_map_lookup_elem(efd, &key, &ecount); > + > + ksym = ksyms__map_addr(ksyms, (unsigned long)key); > + printf("%-32s %-6d %-6d\n", > + ksym ? ksym->name : "Unknown", > + qcount, ecount); > + } > + printf("\n"); > + prev_key = NULL; > + while (1) { > + err = bpf_map_get_next_key(qfd, prev_key, &key); > + if (err) { > + if (errno == ENOENT) { > + err = 0; > + break; > + } > + warn("bpf_map_get_next_key failed: %s\n", strerror(errno)); > + return err; > + } > + err = bpf_map_delete_elem(qfd, &key); > + if (err) { > + if (errno == ENOENT) { > + err = 0; > + continue; > + } > + warn("bpf_map_delete_elem failed: %s\n", strerror(errno)); > + return err; > + } > + > + bpf_map_delete_elem(efd, &key); > + prev_key = &key; > + } > + > + return err; > +} > + > +int main(int argc, char **argv) > +{ > + LIBBPF_OPTS(bpf_object_open_opts, open_opts); > + static const struct argp argp = { > + .options = opts, > + .parser = parse_arg, > + .doc = argp_program_doc, > + }; > + struct rcutop_bpf *obj; > + int err; > + struct syms_cache *syms_cache = NULL; > + struct ksyms *ksyms = NULL; > + > + err = argp_parse(&argp, argc, argv, 0, NULL, NULL); > + if (err) > + return err; > + > + libbpf_set_strict_mode(LIBBPF_STRICT_ALL); > + libbpf_set_print(libbpf_print_fn); > + > + err = ensure_core_btf(&open_opts); > + if (err) { > + fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err)); > + return 1; > + } > + > + obj = rcutop_bpf__open_opts(&open_opts); > + if (!obj) { > + warn("failed to open BPF object\n"); > + return 1; > + } > + > + err = rcutop_bpf__load(obj); > + if (err) { > + warn("failed to load BPF object: %d\n", err); > + goto cleanup; > + } > + > + err = rcutop_bpf__attach(obj); > + if (err) { > + warn("failed to attach BPF programs: %d\n", err); > + goto cleanup; > + } > + > + ksyms = ksyms__load(); > + if (!ksyms) { > + fprintf(stderr, "failed to load kallsyms\n"); > + goto cleanup; > + } > + > + syms_cache = syms_cache__new(0); > + if (!syms_cache) { > + fprintf(stderr, "failed to create syms_cache\n"); > + goto cleanup; > + } > + > + if (signal(SIGINT, sig_int) == SIG_ERR) { > + warn("can't set signal handler: %s\n", strerror(errno)); > + err = 1; > + goto cleanup; > + } > + > + while (1) { > + sleep(interval); > + > + if (clear_screen) { > + err = system("clear"); > + if (err) > + goto cleanup; > + } > + > + err = print_stat(ksyms, syms_cache, obj); > + if (err) > + goto cleanup; > + > + count--; > + if (exiting || !count) > + goto cleanup; > + } > + > +cleanup: > + rcutop_bpf__destroy(obj); > + cleanup_core_btf(&open_opts); > + > + return err != 0; > +} > diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h > new file mode 100644 > index 00000000..cb2a3557 > --- /dev/null > +++ b/libbpf-tools/rcutop.h > @@ -0,0 +1,8 @@ > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ > +#ifndef __RCUTOP_H > +#define __RCUTOP_H > + > +#define PATH_MAX 4096 > +#define TASK_COMM_LEN 16 > + > +#endif /* __RCUTOP_H */ > -- > 2.36.0.550.gb090851708-goog >