On Tue, Feb 26, 2019 at 04:29:45PM +0100, Daniel Borkmann wrote:
> On 02/26/2019 05:27 AM, Alexei Starovoitov wrote:
> > On 2/25/19 2:36 AM, Daniel Borkmann wrote:
> >>
> >> Not through the stack, but was more thinking something like low-overhead
> >> kprobes-style extension for a BPF prog where such sequence would be added
> >> 'inline' at beginning / exit of BPF prog invocation with normal ctx access
> >> and helpers as the native program (e.g. time-stamping plus read/write into
> >> mark as one example which kprobes couldn't do). But might go much beyond
> >> context of this stats collection.
> >
> > see below.
> >
> >> That's actually an interesting thought, given the original prog->bpf_func
> >> is a known address at that time, this could be templated where an inner
> >> dummy bpf_func call to the normal BPF prog gets later replaced with the
> >> actual prog->bpf_func address to have no indirect call. This would also
> >> allow to keep the actual prog->bpf_func entry call-sites small and simple.
> >
> > My first thought was that we indeed can have a template of the stats
> > collecting wrapper in .text, then at the 'switch stats on' event
> > vmalloc an exec region, copy the wrapper code into it, and manually
> > patch the 'call foo' x86 insn with a direct jump. That is still quite
> > involved from the coding side. It's similar to what kprobe/ftrace is
> > doing, but probably an overkill for stats due to potential security
> > concerns with new executable code. But it won't work due to tail_calls.
>
> You mean for the scenario to measure the _individual_ progs that
> are part of the given tail call processing such that each have
> their individual inc in count and time-spent attribution. If so,
> yeah, agree. Or for the case you have right now where everything
> gets attributed to the "entry" program that triggers the tail
> call processing? Latter would be similar as we have now in this
> patch, imho.

There are two other issues with the 'epilogue' approach:

1. We call progs as (*(prog)->bpf_func)(ctx, (prog)->insnsi), and from
R2 we can get back to the prog and from there to prog->aux->stats, but
there is no room on the stack to save this R2. So we'd need an extra
16 bytes for start_ns and the saved R2 that the epilogue can use to
update stats in the 'entry' prog (see the sketch below). With the
static_key approach gcc was smart enough to use callee-saved registers,
and no extra stack was used in the caller (at least for the one call
site of BPF_PROG_RUN where I looked at the asm).

2. When stats are flipped on and off, we need to make sure that all
bpf progs in a prog_array have stats either on or off; we cannot mix
and match, otherwise this 16-byte stack mismatch will cause issues.
And that is impossible to do atomically for the whole prog_array.
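
To illustrate issue 1, here is a rough C sketch of what such an
epilogue wrapper would have to do. The function name is made up, and
the real thing would be a patched template rather than C:

/* Hypothetical sketch of the 'epilogue' wrapper from issue 1 above.
 * The two C arguments correspond to R1 (ctx) and R2 (insnsi) of the
 * BPF calling convention.
 */
static unsigned int bpf_stats_epilogue_run(const void *ctx,
                                           const struct bpf_insn *insn)
{
        /* insn == prog->insnsi, so R2 gets us back to the prog ... */
        struct bpf_prog *prog =
                container_of(insn, struct bpf_prog, insnsi[0]);
        struct bpf_prog_stats *stats;
        /* ... but 'start' and 'prog' must survive the call below:
         * that's the extra 16 bytes of stack.
         */
        u64 start = sched_clock();
        unsigned int ret;

        /* inner dummy call, later patched to the actual
         * prog->bpf_func address to avoid an indirect call
         */
        ret = prog->bpf_func(ctx, insn);

        /* epilogue: everything is attributed to the 'entry' prog */
        stats = this_cpu_ptr(prog->aux->stats);
        u64_stats_update_begin(&stats->syncp);
        stats->cnt++;
        stats->nsecs += sched_clock() - start;
        u64_stats_update_end(&stats->syncp);
        return ret;
}
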
> > With static_key approach the main prog accounts the time
> > of all progs called via tail_call.
> > We can argue whether it's ideal semantics, but looks like
> > it's the only thing we can do.
>
> Yeah true, it's more about measuring how long the prog 'chain'
> in the whole hook is taking to complete, which may be interesting
> in itself, just perhaps a bit confusing when looking at individual
> progs' share.

yep. It would be good to come up with a way to measure individual
progs, but I couldn't figure it out yet.

> It's simplest, yep. I guess if we get to the point of removing
> indirect call for BPF_PROG_RUN itself, we might need to revisit
> then and move it into bpf_func at that point. Kind of implementation
> detail given we only expose count and time spent data to uapi,
> though the direct switch at runtime may be a bit problematic in
> future. Anyway, let's see how it goes when we get there.

yep
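
fwiw, at the BPF_PROG_RUN call site the static_key approach looks
roughly like this (simplified sketch; see the patch itself for the
exact details):

/* When bpf_stats_enabled_key is off, static_branch_unlikely() is a
 * single nop in the caller, and gcc can keep everything it needs in
 * callee-saved registers.
 */
DEFINE_STATIC_KEY_FALSE(bpf_stats_enabled_key);

#define BPF_PROG_RUN(prog, ctx) ({                                     \
        u32 __ret;                                                      \
        if (static_branch_unlikely(&bpf_stats_enabled_key)) {           \
                struct bpf_prog_stats *__stats;                         \
                u64 __start = sched_clock();                            \
                __ret = (*(prog)->bpf_func)(ctx, (prog)->insnsi);       \
                __stats = this_cpu_ptr((prog)->aux->stats);             \
                u64_stats_update_begin(&__stats->syncp);                \
                __stats->cnt++;                                         \
                __stats->nsecs += sched_clock() - __start;              \
                u64_stats_update_end(&__stats->syncp);                  \
        } else {                                                        \
                __ret = (*(prog)->bpf_func)(ctx, (prog)->insnsi);       \
        }                                                               \
        __ret; })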