Re: [PATCH] tracing/user_events: Add eBPF interface for user_event created events

Beau Belgrave <beaub@xxxxxxxxxxxxxxxxxxx> · Wed, 30 Mar 2022 14:24:02 -0700

On Wed, Mar 30, 2022 at 03:57:26PM -0400, Mathieu Desnoyers wrote:
> ----- On Mar 30, 2022, at 3:15 PM, Beau Belgrave beaub@xxxxxxxxxxxxxxxxxxx wrote:
> 
> > On Wed, Mar 30, 2022 at 11:22:32AM -0700, Alexei Starovoitov wrote:
> >> On Wed, Mar 30, 2022 at 9:34 AM Beau Belgrave <beaub@xxxxxxxxxxxxxxxxxxx> wrote:
> >> > > >
> >> > > > But you are fine with uprobe costs? uprobes appear to be much more costly
> >> > > > than a syscall approach on the hardware I've run on.
> >> 
> >> Care to share the numbers?
> >> uprobe over USDT is a single trap.
> >> Not much slower compared to syscall with kpti.
> >> 
> > 
> > Sure, these are the numbers we have from a production device.
> > 
> > They are captured via perf via PERF_COUNT_HW_CPU_CYCLES.
> > It's running a 20K loop emitting 4 bytes of data out.
> > Each 4 byte event time is recorded via perf.
> > At the end we have the total time and the max seen.
> > 
> > null numbers represent a 20K loop with just perf start/stop ioctl costs.
> > 
> > null: min=2863, avg=2953, max=30815
> > uprobe: min=10994, avg=11376, max=146682
> > uevent: min=7043, avg=7320, max=95396
> > lttng: min=6270, avg=6508, max=41951
> > 
> > These costs include the data getting into a buffer, so they represent
> > what we would see in production vs the trap cost alone. For uprobe this
> > means we created a uprobe and attached it via tracefs to get the above
> > numbers.
> 
> [...]
> 
> I assume here that by "lttng" you specifically refer to lttng-ust (LTTng's
> user-space tracer), am I correct ?
> 

Yes, this is ust.

> By removing the "null" baseline overhead, my rough calculations are that the
> average overhead for lttng-ust in your results is (in cpu cycles):
> 
> 6270-2863 = 3555
> 
> So I'm unsure what is the frequency of your CPU, but guessing around 3.5GHz
> this is in the area of 1 microsecond. On an Intel CPU, this is much larger
> than what I would expect. 
> 
> Can you share your test program, hardware characteristics, kernel version,
> glibc version, and whether the program is compiled as a 32-bit or 64-bit
> binary ?
> 

This is how we are measuring:

#define PERF_START() \
do \
{ \
        ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0); \
	ioctl(perf_fd, PERF_EVENT_IOC_ENABLE, 0); \
} while (0)

#define PERF_END(__cycles) \
do \
{ \
	ioctl(perf_fd, PERF_EVENT_IOC_DISABLE, 0); \
	read(perf_fd, &__cycles, sizeof(__cycles)); \
} while (0)

struct perf_event_attr pe;
long long min, max, total;
int i, perf_fd;

memset(&pe, 0, sizeof(pe));

pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(pe);
pe.config = PERF_COUNT_HW_CPU_CYCLES;
pe.disabled = 1;
pe.exclude_hv = 1;

perf_fd = perf_event_open(&pe, 0, -1, -1, 0); 

min = max = total = 0;

for (i = 0; i < 20000; ++i)
{
	long long cycles;

	PERF_START();
	probe();
	PERF_END(cycles);

	if (i == 0 || cycles < min)
		min = cycles;

	if (cycles > max)
		max = cycles;

	total += cycles;
}

probe() here could be a call to writev or to the lttng trace call.

> Can you confirm that lttng-ust is not calling one getcpu system call per
> event ? This might be the case if run a 32-bit x86 binary and have a
> glibc < 2.35, or a kernel too old to provide CONFIG_RSEQ or don't have
> CONFIG_RSEQ=y in your kernel configuration. You can validate this by
> running your lttng-ust test program with a system call tracer.
> 

We don't have CONFIG_RSEQ, so that is likely the cause. LTTng is always
going to be the fastest thing out there. It's pure user mode :)

> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

Thanks,
-Beau