On Fri, Jan 16, 2015 at 7:02 AM, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> On Thu, 15 Jan 2015 20:16:01 -0800
> Alexei Starovoitov <ast@xxxxxxxxxxxx> wrote:
>
>> Hi Ingo, Steven,
>>
>> This patch set is based on tip/master.
>
> Note, the tracing code isn't maintained in tip/master, but perf code is.

I know. I can rebase against the linux-trace tree, but I wanted to go
through tip to take advantage of the tip-bot ;)

> Do you have a git repo somewhere that I can look at? It makes it easier
> than loading in 9 patches ;-)

https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/

> For syscalls this is fine as the parameters are usually set. But
> there's a lot of tracepoints that we need to know the result of the
> copied data to decide to filter or not, where the result happens at the
> TP_fast_assign() part which requires allocating the buffers.

...

> Again, for syscalls it may not be a problem, but for other tracepoints,
> I'm not sure we can do that. How do you handle sched_switch for
> example? The tracepoint only gets two pointers to task structs, you
> need to then dereference them to get the pid, prio, state and other
> data.

Exactly. In this patch set the user can use the bpf_fetch_*() helpers
to dereference task_struct and get to pid, prio, anything (a rough
sketch of such a program is at the end of this mail). The user is not
limited to what the tracepoint hard-coded as part of TP_fast_assign.
In the future, when the generic type/field infra is ready, the bpf
program will have a faster way to access such fields without going
through the bpf_fetch_*() helpers.

In other words, the program is a superset of the existing
TP_fast_assign + TP_printk + filters + triggers + stack dumps. All of
it is done from the program on demand. Some programs may just count
the number of events without accessing the arguments at all.

> Maybe we should have a way to do the program before and/or after the
> buffering depending on what to filter on. There's no way to know what
> the parameters of the tracepoint are without looking at the source.

Right now most tracepoints copy as many fields as possible as part of
TP_fast_assign, since there is no way to know what will be needed
later. And despite copying a lot, it's often not enough for analytics.
With a program attached before the copy, the overhead is much lower.
The program looks at whatever fields are necessary, may do some
operations on them, and can even store these fields into maps for
further processing.

>> - dump_stack
>> - trace_printk
>> The last two are mainly to debug the programs and to print data for user
>> space consumptions.
>
> I have to look at the code, but currently trace_printk() isn't made to
> be used in production systems.

Other than allocating a bunch of per_cpu pages, what concerns do you
have? Some printk-like facility callable from the program is needed.
I'd rather fix whatever is necessary in trace_printk instead of
inventing another way of printing. Anyway, I think I'll drop these two
helpers for now.

> One last thing. If the ebpf is used for anything but filtering, it
> should go into the trigger file. The filtering is only a way to say if
> the event should be recorded or not. But the trigger could do something
> else (a printk, a stacktrace, etc).

It does way more than just filtering, but invoking the program as a
trigger is too slow. When the program is called as soon as the
tracepoint fires, it can fetch other fields, evaluate them, printk
some of them, optionally dump the stack, and aggregate into maps. We
can let it call triggers too, so that the user program will be able to
enable/disable other events.
I'm not against invoking programs as a trigger, but I don't see a use
case for it. It's just too slow for production analytics that need to
act on a huge number of events per second. We must minimize the
overhead between the tracepoint firing and the program executing, so
that programs can be used on events like packet receive, which will be
in the millions per second. Every nsec counts.

For example:
- raw dd if=/dev/zero of=/dev/null does 760 MB/s (on my debug kernel)
- echo 1 > events/syscalls/sys_enter_write/enable
  drops it to 400 MB/s
- echo 'count == 123' > events/syscalls/sys_enter_write/filter
  drops it even further, down to 388 MB/s.
  This slowdown is too high for this to be used on a live system.
- tracex4, which computes a histogram of sys_write sizes and stores
  log2(count) into a map, does 580 MB/s (a sketch of such a program is
  at the end of this mail).
  This is still not great, but this slowdown is now usable and we can
  work further on minimizing the overhead.
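
Since the sched_switch question came up, here is roughly what such a
program could look like on top of this patch set. Treat it as a
sketch: the exact bpf_fetch_*() signature (assumed here to read kernel
memory and return the value), the raw tracepoint context layout, and
the section naming are illustrative, not the final API.

#include <uapi/linux/bpf.h>
#include <linux/sched.h>        /* for the task_struct field layout */
#include "bpf_helpers.h"        /* map/helper stubs as in samples/bpf */

/* illustrative context: the raw tracepoint arguments, before
 * TP_fast_assign has copied anything */
struct sched_switch_args {
	struct task_struct *prev;
	struct task_struct *next;
};

/* per-pid count of context switches */
struct bpf_map_def SEC("maps") switch_count = {
	.type = BPF_MAP_TYPE_HASH,
	.key_size = sizeof(u32),
	.value_size = sizeof(u64),
	.max_entries = 1024,
};

SEC("events/sched/sched_switch")
int bpf_prog(struct sched_switch_args *ctx)
{
	/* dereference the task pointer with a fetch helper instead of
	 * relying on fields copied by TP_fast_assign */
	u32 pid = bpf_fetch_u32(&ctx->prev->pid);
	u64 one = 1, *cnt;

	cnt = bpf_map_lookup_elem(&switch_count, &pid);
	if (cnt)
		__sync_fetch_and_add(cnt, 1);
	else
		bpf_map_update_elem(&switch_count, &pid, &one, BPF_ANY);
	return 0;
}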
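
And the per-event work in tracex4 is conceptually along the lines
below (again a sketch, not the sample source itself; the syscall
tracepoint context layout is an assumption, and the includes are the
same as in the previous sketch). The branchless log2 is there because
the verifier rejects loops:

/* 64-slot histogram of write() sizes, bucketed by log2 */
struct bpf_map_def SEC("maps") write_hist = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(u32),
	.value_size = sizeof(u64),
	.max_entries = 64,
};

/* illustrative layout of the sys_enter_write tracepoint context */
struct sys_enter_write_args {
	unsigned long syscall_nr;
	unsigned long fd;
	unsigned long buf;
	unsigned long count;
};

/* branchless log2 so the program passes the verifier (no loops) */
static unsigned int log2(unsigned int v)
{
	unsigned int r, shift;

	r = (v > 0xffff) << 4; v >>= r;
	shift = (v > 0xff) << 3; v >>= shift; r |= shift;
	shift = (v > 0xf) << 2; v >>= shift; r |= shift;
	shift = (v > 0x3) << 1; v >>= shift; r |= shift;
	r |= (v > 1);
	return r;
}

SEC("events/syscalls/sys_enter_write")
int bpf_prog(struct sys_enter_write_args *ctx)
{
	u32 slot = log2(ctx->count);
	u64 *cnt;

	if (slot > 63)
		slot = 63;
	cnt = bpf_map_lookup_elem(&write_hist, &slot);
	if (cnt)
		__sync_fetch_and_add(cnt, 1);   /* per-bucket counter */
	return 0;
}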