On Wed, Mar 30, 2022 at 03:57:26PM -0400, Mathieu Desnoyers wrote: > ----- On Mar 30, 2022, at 3:15 PM, Beau Belgrave beaub@xxxxxxxxxxxxxxxxxxx wrote: > > > On Wed, Mar 30, 2022 at 11:22:32AM -0700, Alexei Starovoitov wrote: > >> On Wed, Mar 30, 2022 at 9:34 AM Beau Belgrave <beaub@xxxxxxxxxxxxxxxxxxx> wrote: > >> > > > > >> > > > But you are fine with uprobe costs? uprobes appear to be much more costly > >> > > > than a syscall approach on the hardware I've run on. > >> > >> Care to share the numbers? > >> uprobe over USDT is a single trap. > >> Not much slower compared to syscall with kpti. > >> > > > > Sure, these are the numbers we have from a production device. > > > > They are captured via perf via PERF_COUNT_HW_CPU_CYCLES. > > It's running a 20K loop emitting 4 bytes of data out. > > Each 4 byte event time is recorded via perf. > > At the end we have the total time and the max seen. > > > > null numbers represent a 20K loop with just perf start/stop ioctl costs. > > > > null: min=2863, avg=2953, max=30815 > > uprobe: min=10994, avg=11376, max=146682 > > uevent: min=7043, avg=7320, max=95396 > > lttng: min=6270, avg=6508, max=41951 > > > > These costs include the data getting into a buffer, so they represent > > what we would see in production vs the trap cost alone. For uprobe this > > means we created a uprobe and attached it via tracefs to get the above > > numbers. > > [...] > > I assume here that by "lttng" you specifically refer to lttng-ust (LTTng's > user-space tracer), am I correct ? > Yes, this is ust. > By removing the "null" baseline overhead, my rough calculations are that the > average overhead for lttng-ust in your results is (in cpu cycles): > > 6270-2863 = 3555 > > So I'm unsure what is the frequency of your CPU, but guessing around 3.5GHz > this is in the area of 1 microsecond. On an Intel CPU, this is much larger > than what I would expect. > > Can you share your test program, hardware characteristics, kernel version, > glibc version, and whether the program is compiled as a 32-bit or 64-bit > binary ? > This is how we are measuring: #define PERF_START() \ do \ { \ ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0); \ ioctl(perf_fd, PERF_EVENT_IOC_ENABLE, 0); \ } while (0) #define PERF_END(__cycles) \ do \ { \ ioctl(perf_fd, PERF_EVENT_IOC_DISABLE, 0); \ read(perf_fd, &__cycles, sizeof(__cycles)); \ } while (0) struct perf_event_attr pe; long long min, max, total; int i, perf_fd; memset(&pe, 0, sizeof(pe)); pe.type = PERF_TYPE_HARDWARE; pe.size = sizeof(pe); pe.config = PERF_COUNT_HW_CPU_CYCLES; pe.disabled = 1; pe.exclude_hv = 1; perf_fd = perf_event_open(&pe, 0, -1, -1, 0); min = max = total = 0; for (i = 0; i < 20000; ++i) { long long cycles; PERF_START(); probe(); PERF_END(cycles); if (i == 0 || cycles < min) min = cycles; if (cycles > max) max = cycles; total += cycles; } probe() here could be a call to writev or to the lttng trace call. > Can you confirm that lttng-ust is not calling one getcpu system call per > event ? This might be the case if run a 32-bit x86 binary and have a > glibc < 2.35, or a kernel too old to provide CONFIG_RSEQ or don't have > CONFIG_RSEQ=y in your kernel configuration. You can validate this by > running your lttng-ust test program with a system call tracer. > We don't have CONFIG_RSEQ, so that is likely the cause. LTTng is always going to be the fastest thing out there. It's pure user mode :) > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > http://www.efficios.com Thanks, -Beau