On 2024-10-04 03:04, Steven Rostedt wrote:
On Thu, 3 Oct 2024 20:26:29 -0400
Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
{
	struct trace_array *tr = data;
	struct trace_event_file *trace_file;
	struct syscall_trace_enter *entry;
	struct syscall_metadata *sys_data;
	struct trace_event_buffer fbuffer;
	unsigned long args[6];
	int syscall_nr;
	int size;

	syscall_nr = trace_get_syscall_nr(current, regs);
	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
		return;

	/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
	trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
^^^^ this explicitly states that preemption needs to be disabled by the
tracepoint: rcu_dereference_sched() is only valid with preemption off.
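(The dependency comes from the tracepoint glue itself. A simplified sketch
of the relevant part of __DO_TRACE as it stands today, with the checks
elided; see include/linux/tracepoint.h for the real macro:)

#define __DO_TRACE(name, args, cond, rcuidle)				\
	do {								\
		/* [...] cond/rcuidle checks elided [...] */		\
		/* keep srcu and sched-rcu usage consistent */		\
		preempt_disable_notrace();				\
		__DO_TRACE_CALL(name, TP_ARGS(args));			\
		preempt_enable_notrace();				\
	} while (0)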
Ah, I should have known it was the syscall portion. I don't care for this
hidden dependency. I'd rather add a preempt disable here and not expect it to
be disabled when called.
Which is exactly what this patch is doing (see the sketch further below).
	if (!trace_file)
		return;

	if (trace_trigger_soft_disabled(trace_file))
		return;

	sys_data = syscall_nr_to_meta(syscall_nr);
	if (!sys_data)
		return;

	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;

	entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
^^^^ it reserves space in the ring buffer without disabling preemption explicitly.
And also:
void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
				 struct trace_event_file *trace_file,
				 unsigned long len)
{
	struct trace_event_call *event_call = trace_file->event_call;

	if ((trace_file->flags & EVENT_FILE_FL_PID_FILTER) &&
	    trace_event_ignore_this_pid(trace_file))
		return NULL;

	/*
	 * If CONFIG_PREEMPTION is enabled, then the tracepoint itself disables
	 * preemption (adding one to the preempt_count). Since we are
	 * interested in the preempt_count at the time the tracepoint was
	 * hit, we need to subtract one to offset the increment.
	 */
^^^ This function also explicitly expects preemption to be disabled.
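(For reference, the line this comment documents -- quoted from mainline from
memory, so treat it as a sketch -- captures the trace context with one
preempt_count level subtracted:)

	/*
	 * tracing_gen_ctx_dec() records the irq/preempt context with one
	 * preempt_count level subtracted, to undo the preempt_disable()
	 * done by the tracepoint wrapper. If this probe were instead
	 * called with preemption enabled, the recorded count would be
	 * off by one in the other direction.
	 */
	fbuffer->trace_ctx = tracing_gen_ctx_dec();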
So I rest my case. The change I'm introducing for tracepoints
doesn't make any assumptions about whether or not each tracer requires
preempt off: it keeps the behavior the _same_ as it was before.
Then it's up to each tracer's developers to change the behavior of their
own callbacks as they see fit. But I'm not introducing regressions in
tracers with the "big switch" change of making syscall tracepoints
faultable. That belongs in changes specific to each tracer.
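For illustration, the tracer-specific change for the ftrace syscall probe
could look something like the following sketch (assuming the guard()-based
preemption helper; the exact hunk in the series may differ, and the
unchanged body is elided since it is quoted in full above):

static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
{
	/* [...] declarations elided [...] */

	syscall_nr = trace_get_syscall_nr(current, regs);
	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
		return;

	/*
	 * The faultable syscall tracepoint now calls its probes with
	 * preemption enabled; the ring buffer and the
	 * rcu_dereference_sched() below still require preemption to be
	 * disabled, so disable it explicitly for the rest of the probe.
	 */
	guard(preempt_notrace)();

	trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
	/* [...] */
}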
I'd rather remove these dependencies at the source. So, IMHO, these places
should be "fixed" first.
At least for the ftrace users. But I think the same can be done for the
other users as well. BPF already stated it just needs "migrate_disable()".
Let's see what perf has.
We can then audit all the tracepoint users to make sure they do not need
preemption disabled.
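(A minimal sketch of what that would mean on the BPF side -- the probe name
and body are hypothetical, only migrate_disable()/migrate_enable() are the
real primitives:)

static void bpf_syscall_enter_probe(void *data, struct pt_regs *regs, long id)
{
	/*
	 * BPF programs need per-CPU data to stay stable, which only
	 * requires disabling migration, not preemption: the probe can
	 * otherwise stay preemptible (and thus be allowed to fault).
	 */
	migrate_disable();
	/* [...] run the attached BPF program(s) [...] */
	migrate_enable();
}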
Why does it need to be a broad refactoring of the entire world? What is
wrong with the simple approach of introducing this faultable syscall
tracepoint support as a no-op from the tracer's perspective?
Then we can build on top and figure out if we want to relax things
on a tracer-per-tracer basis.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com