Re: [RFC bpf-next 0/2] sharing bpf runtime stats with /dev/bpf_stats

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Sat, 14 Mar 2020 08:57:03 -0700

On Sat, Mar 14, 2020 at 03:47:50AM +0000, Song Liu wrote:
> 
> 
> > On Mar 13, 2020, at 7:43 PM, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> > 
> > On Fri, Mar 13, 2020 at 05:35:16PM -0700, Song Liu wrote:
> >> Motivation (copied from 2/2):
> >> 
> >> ======================= 8< =======================
> >> Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats.
> >> Typical userspace tools use kernel.bpf_stats_enabled as follows:
> >> 
> >>  1. Enable kernel.bpf_stats_enabled;
> >>  2. Check program run_time_ns;
> >>  3. Sleep for the monitoring period;
> >>  4. Check program run_time_ns again, calculate the difference;
> >>  5. Disable kernel.bpf_stats_enabled.
> >> 
> >> The problem with this approach is that only one userspace tool can toggle
> >> this sysctl. If multiple tools toggle the sysctl at the same time, the
> >> measurement may be inaccurate.
> >> 
> >> To fix this problem while keep backward compatibility, introduce
> >> /dev/bpf_stats. sysctl kernel.bpf_stats_enabled will only change the
> >> lowest bit of the static key. /dev/bpf_stats, on the other hand, adds 2
> >> to the static key for each open fd. The runtime stats is enabled when
> >> kernel.bpf_stats_enabled == 1 or there is open fd to /dev/bpf_stats.
> >> 
> >> With /dev/bpf_stats, user space tool would have the following flow:
> >> 
> >>  1. Open a fd to /dev/bpf_stats;
> >>  2. Check program run_time_ns;
> >>  3. Sleep for the monitoring period;
> >>  4. Check program run_time_ns again, calculate the difference;
> >>  5. Close the fd.
> >> ======================= 8< =======================
> >> 
> >> 1/2 adds a few new API to jump_label.
> >> 2/2 adds the /dev/bpf_stats and adjust kernel.bpf_stats_enabled handler.
> >> 
> >> Please share your comments.
> > 
> > Conceptually makes sense to me. Few comments:
> > 1. I don't understand why +2 logic is necessary.
> > Just do +1 for every FD and change proc_do_static_key() from doing
> > explicit enable/disable to do +1/-1 as well on transition from 0->1 and 1->0.
> > The handler would need to check that 1->1 and 0->0 is a nop.
> 
> With the +2/-2 logic, we use the lowest bit of the counter to remember 
> the value of the sysctl. Otherwise, we cannot tell whether we are making
> 0->1 transition or 1->1 transition. 

that can be another static int var in the handler.
and no need for patch 1.

> > 
> > 2. /dev is kinda awkward. May be introduce a new bpf command that returns fd?
> 
> Yeah, I also feel /dev is awkward. fd from bpf command sounds great. 
> 
> > 
> > 3. Instead of 1 and 2 tweak sysctl to do ++/-- unconditionally?
> > Like repeated sysctl kernel.bpf_stats_enabled=1 will keep incrementing it
> > and would need equal amount of sysctl kernel.bpf_stats_enabled=0 to get
> > it back to zero where it will stay zero even if users keep spamming
> > sysctl kernel.bpf_stats_enabled=0.
> > This way current services that use sysctl will keep working as-is.
> > Multiple services that currently collide on sysctl will magically start
> > working without any changes to them. It is still backwards compatible.
> 
> I think this is not fully backwards compatible. With current logic, the 
> following sequence disables stats eventually. 
> 
>   sysctl kernel.bpf_stats_enabled=1
>   sysctl kernel.bpf_stats_enabled=1
>   sysctl kernel.bpf_stats_enabled=0
> 
> The same sequence will not disable stats with the ++/-- sysctl. 

sure, but if a process holding an fd 'sysctl kernel.bpf_stats_enabled=0'
won't disable stats either. So it's also not backwards compatible. imo it's a
change in behavior whichever way, but either approach doesn't break user space.
An advantage of not doing an fd is that some user that really wants to have
stats disabled for performance benchmarking can do
'sysctl kernel.bpf_stats_enabled=0' few times and the stats will be off.
We can also make 'sysctl kernel.bpf_stats_enabled' to return current counter,
so humans can see how many daemons are doing stats collection at that very
moment.
We can also do both new fd via bpf syscall and ++/-- via sysctl, but imo
++/-- via sysctl is enough to address the issue of multiple stats collecting
daemons. The patch would be small enough that we can push it via bpf tree
and into older kernels as arguable 'fix'.