On Tue, Feb 26, 2019 at 04:29:45PM +0100, Daniel Borkmann wrote:
> On 02/26/2019 05:27 AM, Alexei Starovoitov wrote:
> > On 2/25/19 2:36 AM, Daniel Borkmann wrote:
> >>
> >> Not through the stack, but was more thinking something like low-overhead
> >> kprobes-style extension for a BPF prog where such sequence would be added
> >> 'inline' at beginning / exit of BPF prog invocation with normal ctx access
> >> and helpers as the native program (e.g. time-stamping plus read/write into
> >> mark as one example which kprobes couldn't do). But might go much beyond
> >> context of this stats collection.
> >
> > see below.
> >
> >> That's actually an interesting thought, given the original prog->bpf_func
> >> is a known address at that time, this could be templated where an inner
> >> dummy bpf_func call to the normal BPF prog gets later replaced with the
> >> actual prog->bpf_func address to have no indirect call. This would also
> >> allow to keep the actual prog->bpf_func entry call-sites small and simple.
> >
> > My first thought was that we indeed can have a template of the stats
> > collecting wrapper in .text, then at the 'switch stats on' event
> > vmalloc an exec region, copy the wrapper code into it, and manually
> > patch the 'call foo' x86 insn with a direct jump. That is still quite
> > involved from the coding side. It's similar to what kprobe/ftrace is
> > doing, but probably an overkill for stats due to potential security
> > concerns with new executable code. But it won't work due to tail_calls.
>
> You mean for the scenario to measure the _individual_ progs that
> are part of the given tail call processing such that each have
> their individual inc in count and time-spent attribution. If so,
> yeah, agree. Or for the case you have right now where everything
> gets attributed to the "entry" program that triggers the tail
> call processing? Latter would be similar as we have now in this
> patch, imho.

There are two other issues with the 'epilogue' approach:

1. We call progs as (*(prog)->bpf_func)(ctx, (prog)->insnsi), and from
R2 we can get back to the prog and from there to prog->aux->stats, but
there is no room on the stack to save this R2. So we'd need an extra
16 bytes for start_ns and the saved R2 that the epilogue can use to
update stats in the 'entry' prog (see the sketch below). With the
static_key approach gcc was smart enough to use callee-saved registers,
and no extra stack was used in the caller (at least for the one call
site of BPF_PROG_RUN where I looked at the asm).

2. When stats are flipped on and off, we need to make sure that all
bpf progs in a prog_array have stats either on or off; we cannot mix
and match, otherwise this 16-byte stack mismatch will cause issues.
And that is impossible to do atomically for the whole prog_array.
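
To illustrate issue 1, here is a rough C sketch of what such an
epilogue wrapper would have to do. The function name is made up, and
the real thing would be a patched template rather than C:

/* Hypothetical sketch of the 'epilogue' wrapper from issue 1 above.
 * The two C arguments correspond to R1 (ctx) and R2 (insnsi) of the
 * BPF calling convention.
 */
static unsigned int bpf_stats_epilogue_run(const void *ctx,
                                           const struct bpf_insn *insn)
{
        /* insn == prog->insnsi, so R2 gets us back to the prog ... */
        struct bpf_prog *prog =
                container_of(insn, struct bpf_prog, insnsi[0]);
        struct bpf_prog_stats *stats;
        /* ... but 'start' and 'prog' must survive the call below:
         * that's the extra 16 bytes of stack.
         */
        u64 start = sched_clock();
        unsigned int ret;

        /* inner dummy call, later patched to the actual
         * prog->bpf_func address to avoid an indirect call
         */
        ret = prog->bpf_func(ctx, insn);

        /* epilogue: everything is attributed to the 'entry' prog */
        stats = this_cpu_ptr(prog->aux->stats);
        u64_stats_update_begin(&stats->syncp);
        stats->cnt++;
        stats->nsecs += sched_clock() - start;
        u64_stats_update_end(&stats->syncp);
        return ret;
}
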
> > With static_key approach the main prog accounts the time
> > of all progs called via tail_call.
> > We can argue whether it's ideal semantics, but looks like
> > it's the only thing we can do.
>
> Yeah true, it's more about measuring how long the prog 'chain'
> in the whole hook is taking to complete, which may be interesting
> in itself, just perhaps a bit confusing when looking at individual
> progs' share.

yep. It would be good to come up with a way to measure individual
progs, but I couldn't figure it out yet.

> It's simplest, yep. I guess if we get to the point of removing
> indirect call for BPF_PROG_RUN itself, we might need to revisit
> then and move it into bpf_func at that point. Kind of implementation
> detail given we only expose count and time spent data to uapi,
> though the direct switch at runtime may be a bit problematic in
> future. Anyway, let's see how it goes when we get there.

yep
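
fwiw, at the BPF_PROG_RUN call site the static_key approach looks
roughly like this (simplified sketch; see the patch itself for the
exact details):

/* When bpf_stats_enabled_key is off, static_branch_unlikely() is a
 * single nop in the caller, and gcc can keep everything it needs in
 * callee-saved registers.
 */
DEFINE_STATIC_KEY_FALSE(bpf_stats_enabled_key);

#define BPF_PROG_RUN(prog, ctx) ({                                     \
        u32 __ret;                                                      \
        if (static_branch_unlikely(&bpf_stats_enabled_key)) {           \
                struct bpf_prog_stats *__stats;                         \
                u64 __start = sched_clock();                            \
                __ret = (*(prog)->bpf_func)(ctx, (prog)->insnsi);       \
                __stats = this_cpu_ptr((prog)->aux->stats);             \
                u64_stats_update_begin(&__stats->syncp);                \
                __stats->cnt++;                                         \
                __stats->nsecs += sched_clock() - __start;              \
                u64_stats_update_end(&__stats->syncp);                  \
        } else {                                                        \
                __ret = (*(prog)->bpf_func)(ctx, (prog)->insnsi);       \
        }                                                               \
        __ret; })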