Re: [PATCH] net: raise RCU qs after each threaded NAPI poll

On Tue, 5 Mar 2024 09:53:42 -0800
"Paul E. McKenney" <paulmck@xxxxxxxxxx> wrote:

> > Case 1: For !CONFIG_DYNAMIC_FTRACE update of ftrace_trace_function
> > 
> > This config is itself expected to be slow. However seeing what it does, it is
> > trying to make sure the global function pointer "ftrace_trace_function" is
> > updated and any readers of that pointers would have finished reading it. I don't
> > personally think preemption has to be disabled across the entirety of the
> > section that calls into this function. So sensitivity to preempt disabling
> > should not be relevant for this case IMO, but lets see if ftrace folks disagree
> > (on CC). It has more to do with, any callers of this function pointer are no
> > longer calling into the old function.  
> 
> Assuming the loads from the function pointer aren't torn by the compiler,
> they will be loaded by a single instruction, which as you say cannot
> be preempted.  Might be good to have READ_ONCE() if they aren't already
> in place.

See my previous reply about the READ_ONCE() logic, but going back to the
reason for synchronize_rcu_tasks_rude(), we have this:

from kernel/trace/ftrace.c: update_ftrace_function()

> 	ftrace_trace_function = ftrace_ops_list_func;

The reason for switching to ftrace_ops_list_func is that the registered
ftrace_ops is what gets passed to the function callback. For non-dynamic
ftrace, if we switched ftrace_trace_function directly to the new function,
it would be called with the ftrace_ops of the old function, and that could
be disastrous if the callback expects to get its own ftrace_ops.

The ftrace_ops_list_func is an iterator of all the registered callbacks,
and will pass each callback its own ftrace_ops. So by switching to the
list_func, we guarantee that the callback will get its own ftrace_ops.
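
To make that concrete, here's a very stripped-down sketch of that guarantee
(illustration only; the real ftrace_ops_list_func() also does recursion
protection, the RCU check and per-ops filtering, as quoted further down):

static void sketch_ops_list_func(unsigned long ip, unsigned long parent_ip,
				 struct ftrace_ops *ignored_op,
				 struct ftrace_regs *fregs)
{
	struct ftrace_ops *op;

	/* Walk every registered ops and hand each callback its *own* ops */
	do_for_each_ftrace_op(op, ftrace_ops_list) {
		op->func(ip, parent_ip, op, fregs);	/* never ignored_op */
	} while_for_each_ftrace_op(op);
}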

> 	/*
> 	 * Make sure all CPUs see this. Yes this is slow, but static
> 	 * tracing is slow and nasty to have enabled.
> 	 */
> 	synchronize_rcu_tasks_rude();

Now perhaps we don't need rude here. We just need to make sure that there's
no function that's about to call the old ftrace_trace_function.

> 	/* Now all cpus are using the list ops. */
> 	function_trace_op = set_function_trace_op;

We update the ftrace_ops that gets passed to whatever is registered to
ftrace_trace_function (but ftrace_ops_list_func ignores this parameter, as
it's just a helper function to call other callbacks).

> 	/* Make sure the function_trace_op is visible on all CPUs */
> 	smp_wmb();
> 	/* Nasty way to force a rmb on all cpus */
> 	smp_call_function(ftrace_sync_ipi, NULL, 1);

Needs to make sure everything will see the new function_trace_op.

> 	/* OK, we are all set to update the ftrace_trace_function now! */
> #endif /* !CONFIG_DYNAMIC_FTRACE */
> 
> 	ftrace_trace_function = func;

Finally, we update what will be called by the trampoline, as func expects
to get the set_function_trace_op here. And it should.
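
Conceptually, the static caller side looks something like the sketch below
(the real caller is arch-specific, and passing NULL for fregs is just to keep
the sketch short); Paul's READ_ONCE() suggestion would apply to these two
loads:

static void sketch_static_caller(unsigned long ip, unsigned long parent_ip)
{
	ftrace_func_t func = READ_ONCE(ftrace_trace_function);
	struct ftrace_ops *op = READ_ONCE(function_trace_op);

	/* By the time func is the final callback, op is set_function_trace_op */
	func(ip, parent_ip, op, NULL);
}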

> 
> > Case 2: Trampoline structures accessing
> > 
> > For this there is a code comment that says preemption will disabled so
> > it should not be dependent on any of the preemptiblity modes, because
> > preempt_disable() should disable preempt with PREEMPT_AUTO.

If the ftrace_ops itself was dynamically allocated, then by the time
ftrace_shutdown() returns there had better not be anything still calling the
callback that references that ftrace_ops, because after ftrace_shutdown()
returns it is OK to free the ftrace_ops.
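
In other words, the life cycle is roughly the below (illustrative fragment,
error handling elided; my_callback is a made-up name):

struct ftrace_ops *ops = kzalloc(sizeof(*ops), GFP_KERNEL);

ops->func = my_callback;		/* made-up callback */
register_ftrace_function(ops);
[...]
unregister_ftrace_function(ops);	/* ends up in ftrace_shutdown() */

/*
 * ftrace_shutdown() must already have waited for anything that could
 * still be referencing ops (including tasks preempted in a trampoline),
 * because this is now legal:
 */
kfree(ops);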

Thus, the trampoline does not call the ftrace_ops callback directly, but
instead once again calls ftrace_ops_list_func! That's because
ftrace_ops_list_func has:

> 	bit = trace_test_and_set_recursion(ip, parent_ip, TRACE_LIST_START);

Where the above disables preemption.

> 	if (bit < 0)
> 		return;
> 
> 	do_for_each_ftrace_op(op, ftrace_ops_list) {
> 		/* Stub functions don't need to be called nor tested */
> 		if (op->flags & FTRACE_OPS_FL_STUB)
> 			continue;
> 		/*
> 		 * Check the following for each ops before calling their func:
> 		 *  if RCU flag is set, then rcu_is_watching() must be true
> 		 *  Otherwise test if the ip matches the ops filter
> 		 *
> 		 * If any of the above fails then the op->func() is not executed.
> 		 */
> 		if ((!(op->flags & FTRACE_OPS_FL_RCU) || rcu_is_watching()) &&
> 		    ftrace_ops_test(op, ip, regs)) {

If the callback expects RCU to be watching, then it is skipped if RCU is
not watching. If we have fixed all the places that function tracing can
happen to make sure RCU is always watching, we could possibly get rid of
this code.

This may be the case, because trace_test_and_set_recursion() has:

#ifdef CONFIG_ARCH_WANTS_NO_INSTR
# define trace_warn_on_no_rcu(ip)					\
	({								\
		bool __ret = !rcu_is_watching();			\
		if (__ret && !trace_recursion_test(TRACE_RECORD_RECURSION_BIT)) { \
			trace_recursion_set(TRACE_RECORD_RECURSION_BIT); \
			WARN_ONCE(true, "RCU not on for: %pS\n", (void *)ip); \
			trace_recursion_clear(TRACE_RECORD_RECURSION_BIT); \
		}							\
		__ret;							\
	})
#else
# define trace_warn_on_no_rcu(ip)	false
#endif

> 			if (FTRACE_WARN_ON(!op->func)) {
> 				pr_warn("op=%p %pS\n", op, op);
> 				goto out;
> 			}
> 			op->func(ip, parent_ip, op, fregs);
> 		}
> 	} while_for_each_ftrace_op(op);
> out:
> 	trace_clear_recursion(bit);

Thus, if RCU is always watching for all function calls, then we can remove
the rude here. Perhaps we could remove rude everywhere?
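
(For reference, the callbacks affected by that check are the ones registered
with FTRACE_OPS_FL_RCU, something like:

static struct ftrace_ops my_rcu_ops = {	/* made-up name */
	.func	= my_callback,
	.flags	= FTRACE_OPS_FL_RCU,	/* only call me when rcu_is_watching() */
};

If RCU were guaranteed to be watching for every function call, that gate
would become a no-op.)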


> > 
> > 		/*
> > 		 * We need to do a hard force of sched synchronization.
> > 		 * This is because we use preempt_disable() to do RCU,
> > but
> > 		 * the function tracers can be called where RCU is not
> > watching
> > 		 * (like before user_exit()). We can not rely on the RCU
> > 		 * infrastructure to do the synchronization, thus we
> > must do it
> > 		 * ourselves.
> > 		 */
> > 		synchronize_rcu_tasks_rude();
> > 		[...]
> > 		ftrace_trampoline_free(ops);
> > 
> > Code comment probably needs update because it says 'can not rely on
> > RCU..' ;-)  

Can not rely on normal RCU ;-)

> 
> My guess is that this comment is left over from when that call to
> synchronize_rcu_tasks_rude() was open-coded.  ;-)

Correct.

> 
> Maybe "We can not rely on vanilla RCU to do..."?

If RCU is always watching, then we can. But is that the case for all archs?

> 
> > My *guess* is the preempt_disable() mentioned in this case is
> > ftrace_ops_trampoline() where trampoline-related data structures are
> > accessed for stack unwinding purposes. This is a data structure
> > protection thing AFAICS and nothing to do with "trampoline execution"
> > itself which needs "Tasks RCU" to allow for preemption in trampolines.  
> 
> Sounds plausible to me, but let's see what Steve's thoughts are.

As mentioned above, the preempt_disable() is in the ftrace_ops_list_func()
that these callbacks are forced to go through.
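
To be specific about where that preempt_disable() lives, here's a trimmed-down
view of the helpers in include/linux/trace_recursion.h (context-bit selection
and the recursion warnings elided):

static __always_inline int trace_test_and_set_recursion(unsigned long ip,
							 unsigned long pip,
							 int start)
{
	int bit;

	bit = trace_get_context_bit() + start;
	/* ... bail out (return -1) if the bit is already set ... */

	preempt_disable_notrace();
	return bit;
}

static __always_inline void trace_clear_recursion(int bit)
{
	preempt_enable_notrace();
	barrier();
	trace_recursion_clear(bit);
}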

> 
> > Case 3: This has to do with update of function graph tracing and there
> > is the same comment as case 2, where preempt will be disabled in
> > readers, so it should be safe for PREEMPT_AUTO (famous last words).

The preempt_disable here is in kernel/trace/trace.h:

  ftrace_graph_addr()
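
Roughly (simplified; the real function also handles the empty-hash and
notrace cases):

static inline int sketch_ftrace_graph_addr(struct ftrace_graph_ent *trace)
{
	struct ftrace_hash *hash;
	int ret = 0;

	preempt_disable_notrace();

	/* ftrace_graph_hash is protected by this preempt-disabled section */
	hash = rcu_dereference_protected(ftrace_graph_hash, !preemptible());

	if (ftrace_lookup_ip(hash, trace->func))
		ret = 1;

	preempt_enable_notrace();
	return ret;
}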

-- Steve

> > 
> > Though I am not yet able to locate that preempt_disable() which is not
> > a PREEMPT_AUTO-related issue anyway. Maybe it's buried in function
> > graph tracing logic somewhere?  
> 
> With the trampolines, isn't synchronize_rcu_tasks_rude() paired with
> a call to synchronize_rcu_tasks()?  In that case, rude's only job is
> getting all CPUs out of their previous sojourn in either the entry/exit
> code or the deep idle loop.  RCU Tasks waits for each task to voluntarily
> block, which takes care of all tasks executing elsewhere.  (Recall that
> RCU Tasks ignores the idle tasks.)
> 
> > Finally, my thought also was, if any of these thread usages/cases of
> > Tasks RCU RUDE assume working only on a CONFIG_PREEMPT_NONE=y or
> > CONFIG_PREEMPT_VOLUNTARY=y kernel, that could be worrying but AFAICS,
> > they don't assume anything related to that.  
> 
> Good point, most generic code should need to tolerate preemption in
> any case.  But I have nine commits queued thus far that handle some
> CONFIG_AUTO breakage or another, so a little paranoia won't go too
> far amiss.  ;-)
> 
> Remaining on my list are uses of the various CONFIG_PREEMPT* Kconfig
> options, both in code and in Makefiles.  Though who knows?  Perhaps Ankur
> or Thomas have already done that.




