On Wed, Mar 30, 2022 at 01:39:49PM -0700, Alexei Starovoitov wrote:
> On Wed, Mar 30, 2022 at 12:15 PM Beau Belgrave
> <beaub@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Wed, Mar 30, 2022 at 11:22:32AM -0700, Alexei Starovoitov wrote:
> > > On Wed, Mar 30, 2022 at 9:34 AM Beau Belgrave <beaub@xxxxxxxxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > But you are fine with uprobe costs? uprobes appear to be much more costly
> > > > > > than a syscall approach on the hardware I've run on.
> > >
> > > Care to share the numbers?
> > > uprobe over USDT is a single trap.
> > > Not much slower compared to syscall with kpti.
> > >
> >
> > Sure, these are the numbers we have from a production device.
> > They are captured via perf via PERF_COUNT_HW_CPU_CYCLES.
> > It's running a 20K loop emitting 4 bytes of data out.
> > Each 4 byte event time is recorded via perf.
> > At the end we have the total time and the max seen.
> >
> > null numbers represent a 20K loop with just perf start/stop ioctl costs.
> >
> > null: min=2863, avg=2953, max=30815
> > uprobe: min=10994, avg=11376, max=146682
>
> I suspect it's a 3 trap case of uprobe.
> USDT is a nop. It's a 1 trap case.
>
> > uevent: min=7043, avg=7320, max=95396
> > lttng: min=6270, avg=6508, max=41951
> >
> > These costs include the data getting into a buffer, so they represent
> > what we would see in production vs the trap cost alone. For uprobe this
> > means we created a uprobe and attached it via tracefs to get the above
> > numbers.
> >
> > There also seems to be some thinking around this as well from Song Liu.
> > Link: https://lore.kernel.org/lkml/20200801084721.1812607-1-songliubraving@xxxxxx/
> >
> > From the link:
> > 1. User programs are faster. The new selftest added in 5/5, shows that a
> > simple uprobe program takes 1400 nanoseconds, while user program only
> > takes 300 nanoseconds.
>
> Take a look at Song's code. It's 2 trap case.
> The USDT is a half of that. ~700ns.
> Compared to 300ns of syscall that difference
> could be acceptable.
>
> > > > >
> > > > > Can we achieve the same/similar performance with sys_bpf(BPF_PROG_RUN)?
> > > >
> > > > I think so, the tough part is how do you let the user-space know which
> > > > program is attached to run? In the current code this is done by the BPF
> > > > program attaching to the event via perf and we run the one there if
> > > > any when data is emitted out via write calls.
> > > >
> > > > I would want to make sure that operators can decide where the user-space
> > > > data goes (perf/ftrace/eBPF) after the code has been written. With the
> > > > current code this is done via the tracepoint callbacks that perf/ftrace
> > > > hook up when operators enable recording via perf, tracefs, libbpf, etc.
> > > >
> > > > We have managed code (C#/Java) where we cannot utilize stubs or traps
> > > > easily due to code movement. So we are limited in how we can approach
> > > > this problem. Having the interface be mmap/write has enabled this
> > > > for us, since it's easy to interact with in most languages and gives us
> > > > lifetime management of the trace objects between user-space and the
> > > > kernel.
> > >
> > > Then you should probably invest into making USDT work inside
> > > java applications instead of reinventing the wheel.
> > >
> > > As an alternative you can do a dummy write or any other syscall
> > > and attach bpf on the kernel side.
> > > No kernel changes are necessary.
> >
> > We only want syscall/tracing overheads for the specific events that are
> > hooked.
> > I don't see how we could hook up a dummy write that is unique
> > per-event without having a way to know when the event is being traced.
>
> You're adding writev-s to user apps. Keep that writev without
> any user_events on the kernel side and pass -1 as FD.
> Hook bpf prog to sys_writev and filter by pid.

I see. That would have all events incur a syscall cost regardless of
whether a BPF program is attached. We are typically monitoring all
processes, so we would not want that overhead on each writev
invocation.

We would also have to decode each writev payload to determine if it's
the event we are interested in.

The mmap part of user_events solves that for us: the byte/bits get set
to non-zero when the writev cost is worth it.

Thanks,
-Beau
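
P.S. To make the mmap check concrete, below is a rough sketch of the
fast path on the user side. The registration ioctl and the mmap setup
of the status page are omitted, and the variable names are illustrative
rather than the exact ABI; the point is only that the branch on the
status byte is the sole per-event cost until something attaches.

#include <stddef.h>
#include <sys/uio.h>

/* Filled in once at event registration time (details elided): */
static volatile unsigned char *status_page; /* mmap'd status bytes */
static int status_idx;  /* per-event index into the status page */
static int write_idx;   /* per-event write index for the payload */
static int data_fd;     /* fd the event payloads are written to */

static void emit_event(const void *payload, size_t len)
{
	struct iovec io[2];

	/*
	 * Cheap check: the kernel flips this byte to non-zero only
	 * when perf/ftrace/eBPF has attached to this event.
	 */
	if (!status_page[status_idx])
		return; /* nobody listening, skip the syscall entirely */

	io[0].iov_base = &write_idx; /* tells the kernel which event */
	io[0].iov_len = sizeof(write_idx);
	io[1].iov_base = (void *)payload;
	io[1].iov_len = len;

	writev(data_fd, io, 2);
}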