On Mon, Jan 11, 2021 at 7:24 PM Yonghong Song <yhs@xxxxxx> wrote:
>
> On 1/11/21 2:54 PM, Song Liu wrote:
> >
> >> On Jan 11, 2021, at 9:49 AM, Yonghong Song <yhs@xxxxxx> wrote:
> >>
> >> On 1/8/21 3:19 PM, Song Liu wrote:
> >>> Replace hashtab with task local storage in runqslower. This improves the
> >>> performance of these BPF programs. The following table summarizes average
> >>> runtime of these programs, in nanoseconds:
> >>>
> >>>                             task-local   hash-prealloc   hash-no-prealloc
> >>> handle__sched_wakeup               125             340               3124
> >>> handle__sched_wakeup_new          2812            1510               2998
> >>> handle__sched_switch               151             208                991
> >>>
> >>> Note that task local storage gives better performance than hashtab for
> >>> handle__sched_wakeup and handle__sched_switch. On the other hand, for
> >>> handle__sched_wakeup_new, task local storage is slower than hashtab with
> >>> prealloc. This is because handle__sched_wakeup_new accesses the data for
> >>> the first time, so it has to allocate the data for task local storage.
> >>> Once the initial allocation is done, subsequent accesses, such as those in
> >>> handle__sched_wakeup, are much faster with task local storage. If we
> >>> disable hashtab prealloc, task local storage is much faster for all 3
> >>> functions.
> >>>
> >>> Signed-off-by: Song Liu <songliubraving@xxxxxx>
> >>> ---
> >>>  tools/bpf/runqslower/runqslower.bpf.c | 26 +++++++++++++++-----------
> >>>  1 file changed, 15 insertions(+), 11 deletions(-)
> >>>
> >>> diff --git a/tools/bpf/runqslower/runqslower.bpf.c b/tools/bpf/runqslower/runqslower.bpf.c
> >>> index 1f18a409f0443..c4de4179a0a17 100644
> >>> --- a/tools/bpf/runqslower/runqslower.bpf.c
> >>> +++ b/tools/bpf/runqslower/runqslower.bpf.c
> >>> @@ -11,9 +11,9 @@ const volatile __u64 min_us = 0;
> >>>  const volatile pid_t targ_pid = 0;
> >>>
> >>>  struct {
> >>> -	__uint(type, BPF_MAP_TYPE_HASH);
> >>> -	__uint(max_entries, 10240);
> >>> -	__type(key, u32);
> >>> +	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> >>> +	__uint(map_flags, BPF_F_NO_PREALLOC);
> >>> +	__type(key, int);
> >>>  	__type(value, u64);
> >>>  } start SEC(".maps");
> >>>
> >>> @@ -25,15 +25,19 @@ struct {
> >>>
> >>>  /* record enqueue timestamp */
> >>>  __always_inline
> >>> -static int trace_enqueue(u32 tgid, u32 pid)
> >>> +static int trace_enqueue(struct task_struct *t)
> >>>  {
> >>> -	u64 ts;
> >>> +	u32 pid = t->pid;
> >>> +	u64 ts, *ptr;
> >>>
> >>>  	if (!pid || (targ_pid && targ_pid != pid))
> >>>  		return 0;
> >>>
> >>>  	ts = bpf_ktime_get_ns();
> >>> -	bpf_map_update_elem(&start, &pid, &ts, 0);
> >>> +	ptr = bpf_task_storage_get(&start, t, 0,
> >>> +				   BPF_LOCAL_STORAGE_GET_F_CREATE);
> >>> +	if (ptr)
> >>> +		*ptr = ts;
> >>>  	return 0;
> >>>  }
> >>>
> >>> @@ -43,7 +47,7 @@ int handle__sched_wakeup(u64 *ctx)
> >>>  	/* TP_PROTO(struct task_struct *p) */
> >>>  	struct task_struct *p = (void *)ctx[0];
> >>>
> >>> -	return trace_enqueue(p->tgid, p->pid);
> >>> +	return trace_enqueue(p);
> >>>  }
> >>>
> >>>  SEC("tp_btf/sched_wakeup_new")
> >>> @@ -52,7 +56,7 @@ int handle__sched_wakeup_new(u64 *ctx)
> >>>  	/* TP_PROTO(struct task_struct *p) */
> >>>  	struct task_struct *p = (void *)ctx[0];
> >>>
> >>> -	return trace_enqueue(p->tgid, p->pid);
> >>> +	return trace_enqueue(p);
> >>>  }
> >>>
> >>>  SEC("tp_btf/sched_switch")
> >>> @@ -70,12 +74,12 @@ int handle__sched_switch(u64 *ctx)
> >>>
> >>>  	/* ivcsw: treat like an enqueue event and store timestamp */
> >>>  	if (prev->state == TASK_RUNNING)
> >>> -		trace_enqueue(prev->tgid, prev->pid);
> >>> +		trace_enqueue(prev);
> >>>
> >>>  	pid = next->pid;
> >>>
> >>>  	/* fetch timestamp and calculate delta */
> >>> -	tsp = bpf_map_lookup_elem(&start, &pid);
> >>> +	tsp = bpf_task_storage_get(&start, next, 0, 0);
> >>>  	if (!tsp)
> >>>  		return 0;   /* missed enqueue */
> >>
> >> Previously, the hash table could overflow, so we may have missed an
> >> enqueue. Here with task local storage, is it possible to add additional
> >> pid filtering at the beginning of handle__sched_switch, such that a
> >> missed enqueue here can be treated as an error?
> >
> > IIUC, hashtab overflow is not the only reason for a missed enqueue. If
> > the wakeup (which calls trace_enqueue) happens before runqslower starts,
> > we may still get a missed enqueue in sched_switch, no?
>
> The wakeup won't happen before runqslower starts, since runqslower needs
> to do the attachment first before trace_enqueue() can run.

I think Song is right. Given that wakeup and sched_switch need to be
matched, depending on exactly when we attach the BPF programs, we can end
up missing the wakeup but not the sched_switch, no? So it's not an error.

> For the current implementation, trace_enqueue() will happen for any
> non-zero pid before targ_pid is set, and for any non-zero pid matching
> targ_pid once it is set, so this should be okay if we do the filtering
> in handle__sched_switch. Maybe you can do an experiment to verify
> whether my point is correct.
>
> >
> > Thanks,
> > Song
> >
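
For concreteness, here is a rough sketch of the extra pid filter under
discussion. It is not part of the patch above, and the placement after the
trace_enqueue(prev) call is an assumption (chosen so prev's ivcsw timestamp
is still recorded even when next is filtered out); the trailing delta
computation and event output are elided since they are unchanged:

SEC("tp_btf/sched_switch")
int handle__sched_switch(u64 *ctx)
{
	/* TP_PROTO(bool preempt, struct task_struct *prev,
	 *	    struct task_struct *next) */
	struct task_struct *prev = (struct task_struct *)ctx[1];
	struct task_struct *next = (struct task_struct *)ctx[2];
	u64 *tsp;
	u32 pid;

	/* ivcsw: treat like an enqueue event and store timestamp */
	if (prev->state == TASK_RUNNING)
		trace_enqueue(prev);

	pid = next->pid;

	/*
	 * Hypothetical early filter: skip tasks runqslower never traces,
	 * mirroring the check in trace_enqueue(). Note that a storage
	 * miss below can still be a benign attach-time race (the wakeup
	 * fired before the programs were attached), so per the discussion
	 * above it is not necessarily an error.
	 */
	if (!pid || (targ_pid && targ_pid != pid))
		return 0;

	/* fetch timestamp and calculate delta */
	tsp = bpf_task_storage_get(&start, next, 0, 0);
	if (!tsp)
		return 0;	/* missed enqueue */

	/* ... delta computation and event output unchanged ... */
	return 0;
}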