Re: Splat in kernel RT while processing incoming network packets

Paolo Abeni <pabeni@xxxxxxxxxx> · Tue, 04 Jul 2023 12:29:33 +0200

On Tue, 2023-07-04 at 12:05 +0200, Sebastian Andrzej Siewior wrote:
> On 2023-07-03 18:15:58 [-0300], Wander Lairson Costa wrote:
> > > Not sure how to proceed. One thing you could do is a hack similar to
> > > net-Avoid-the-IPI-to-free-the.patch which does it for defer_csd.
> > 
> > At first sight it seems straightforward to implement.
> > 
> > > On the other hand we could drop net-Avoid-the-IPI-to-free-the.patch and
> > > remove the warning because we now have commit
> > >    d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")
> > 
> > But I am more in favor of a solution that removes code than one that
> > adds more :)
> 
> Raising the softirq from an anonymous context (hardirq) is not ideal for
> the reasons I stated below.
> 
> > > Prior to that, raising a softirq from hardirq context would wake
> > > ksoftirqd, which in turn would collect all pending softirqs. As a
> > > consequence, all following softirqs (networking, …) would run as
> > > SCHED_OTHER and compete with SCHED_OTHER tasks for resources. That is
> > > not good, because the networking work is no longer processed within the
> > > networking interrupt thread. It is also not a DDoS kind of situation
> > > where one might want to delay processing.
> > > 
> > > With that change, this isn't the case anymore. Instead, an "unrelated"
> > > IRQ thread could pick up the networking work, which is less than ideal.
> > > That is because the softirq is added to the global pending set and
> > > ksoftirqd is marked for a wakeup, which could be delayed because other
> > > tasks are busy. Meanwhile the disk interrupt (for instance) could pick
> > > it up as part of its threaded interrupt.
> > > 
> > > Now that I think about it, we could make the backlog pseudo device a
> > > thread. NAPI threading enables one thread, but here we would need one
> > > thread per CPU, so it would remain kind of special. But we would avoid
> > > clobbering the global state and delaying everything to ksoftirqd.
> > > Processing it in ksoftirqd might not be ideal from a performance point
> > > of view.
> > 
> > Before sending this to the ML, I talked to Paolo about using a NAPI
> > thread. He explained that it is implemented per interface. For example,
> > in this specific case, it happened on the loopback interface, which
> > doesn't implement NAPI. I am cc'ing him, so he can correct me if I am
> > saying something wrong.
> 
> It is per NAPI queue/instance, and you could have multiple instances per
> interface. However, loopback has one, and you need per-CPU threads if you
> want RPS to steer your skbs to any CPU.
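
The effect Sebastian describes, where NET_RX raised from hardirq context ends up being run by whichever softirq-processing thread gets there first, can be illustrated with a toy userspace model. Everything below (the toy_* names, the bit values, the single pending mask) is a hypothetical simplification for illustration, not kernel code or kernel API:

```c
#include <string.h>

/* Toy userspace model of a per-CPU pending-softirq mask; NOT kernel code.
 * The bit values are arbitrary. */
#define TOY_NET_RX (1u << 0)
#define TOY_BLOCK  (1u << 1)

struct toy_cpu {
	unsigned int pending;       /* raised but not yet processed softirqs */
	int ksoftirqd_wakeup;       /* ksoftirqd marked for wakeup, not run yet */
	const char *net_rx_ran_in;  /* context that ended up running NET_RX */
};

/* Raise from hardirq context: the bit lands in the shared pending mask and
 * ksoftirqd is only marked for wakeup; nothing is processed yet. */
static void toy_raise_from_hardirq(struct toy_cpu *cpu, unsigned int nr)
{
	cpu->pending |= nr;
	cpu->ksoftirqd_wakeup = 1;
}

/* Raise from a threaded handler: processed when that thread re-enables BH. */
static void toy_raise_from_thread(struct toy_cpu *cpu, unsigned int nr)
{
	cpu->pending |= nr;
}

/* Whoever processes softirqs next (a threaded IRQ handler or ksoftirqd)
 * handles *every* pending bit, not just the ones it raised itself. */
static void toy_process_pending(struct toy_cpu *cpu, const char *who)
{
	if (cpu->pending & TOY_NET_RX)
		cpu->net_rx_ran_in = who;
	cpu->pending = 0;
	if (strcmp(who, "ksoftirqd") == 0)
		cpu->ksoftirqd_wakeup = 0;
}
```

If NET_RX is raised from hardirq and the disk interrupt thread raises and processes its own softirq before ksoftirqd gets to run, the model records the disk thread as the context that ran the networking work, while ksoftirqd later wakes up to find nothing left.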

Just to hopefully clarify the networking side of it, napi instances !=
network backlog (used by RPS). The network backlog (RPS) is available
for all the network devices, including the loopback and all the virtual
ones. 

The napi instances (and the threaded mode) are available only on
network device drivers implementing the napi model. The loopback driver
does not implement the napi model, nor do most virtual devices and even
some H/W NICs (mostly low-end ones).

The network backlog can't run in threaded mode: there is neither an
API/sysctl nor the infrastructure for that. A threaded mode for backlog
processing could be implemented, even if it would not be completely
trivial, and it sounds a bit weird to me.
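
For illustration only, a threaded backlog along the lines Sebastian suggested (one thread per CPU, since RPS can steer to any CPU) might look roughly like the following userspace pthread model. The toy_* names, the fixed queue sizes and the multiplicative hash standing in for the RPS flow hash are all assumptions; there is no such kernel API today, and unlike real per-CPU kthreads these threads are not bound to any CPU:

```c
#include <pthread.h>
#include <stdint.h>

#define TOY_NR_CPUS 4
#define TOY_QLEN    64
#define TOY_NPKTS   32   /* fewer than TOY_QLEN, so the ring never wraps over */

/* Hypothetical per-CPU backlog: a bounded queue drained by its own thread. */
struct toy_backlog {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	int pkts[TOY_QLEN];
	int head, tail;
	int stop;
	int cpu;
};

static struct toy_backlog toy_backlogs[TOY_NR_CPUS];
static int toy_processed_by[TOY_NPKTS];  /* CPU whose thread handled packet i */

/* Stand-in for the RPS flow hash: pick a target CPU from a packet id. */
static int toy_rps_target(int pkt)
{
	return (int)(((uint32_t)pkt * 2654435761u) % TOY_NR_CPUS);
}

/* Steer a packet to the backlog of its target CPU and kick that thread. */
static void toy_enqueue(int pkt)
{
	struct toy_backlog *b = &toy_backlogs[toy_rps_target(pkt)];

	pthread_mutex_lock(&b->lock);
	b->pkts[b->tail++ % TOY_QLEN] = pkt;
	pthread_cond_signal(&b->cond);
	pthread_mutex_unlock(&b->lock);
}

/* Per-CPU backlog thread: drain its own queue until asked to stop. */
static void *toy_backlog_thread(void *arg)
{
	struct toy_backlog *b = arg;

	pthread_mutex_lock(&b->lock);
	for (;;) {
		while (b->head == b->tail && !b->stop)
			pthread_cond_wait(&b->cond, &b->lock);
		if (b->head == b->tail)  /* stop requested and queue drained */
			break;
		int pkt = b->pkts[b->head++ % TOY_QLEN];
		pthread_mutex_unlock(&b->lock);
		toy_processed_by[pkt] = b->cpu;  /* "process" the packet */
		pthread_mutex_lock(&b->lock);
	}
	pthread_mutex_unlock(&b->lock);
	return NULL;
}

/* Start one thread per CPU, steer all packets, then drain and join. */
static void toy_run(void)
{
	pthread_t tids[TOY_NR_CPUS];

	for (int p = 0; p < TOY_NPKTS; p++)
		toy_processed_by[p] = -1;
	for (int c = 0; c < TOY_NR_CPUS; c++) {
		struct toy_backlog *b = &toy_backlogs[c];

		pthread_mutex_init(&b->lock, NULL);
		pthread_cond_init(&b->cond, NULL);
		b->head = b->tail = b->stop = 0;
		b->cpu = c;
		pthread_create(&tids[c], NULL, toy_backlog_thread, b);
	}
	for (int p = 0; p < TOY_NPKTS; p++)
		toy_enqueue(p);
	for (int c = 0; c < TOY_NR_CPUS; c++) {
		struct toy_backlog *b = &toy_backlogs[c];

		pthread_mutex_lock(&b->lock);
		b->stop = 1;
		pthread_cond_signal(&b->cond);
		pthread_mutex_unlock(&b->lock);
		pthread_join(tids[c], NULL);
	}
}
```

The design point the model shows is that each packet is handled by the thread owning the backlog it was steered to, so one NAPI-style thread per device would not be enough for RPS.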

Just for the record, I mentioned the following in the bz:

It looks like flush_smp_call_function_queue() has only 2 callers:
migration and do_idle().

What about moving softirq processing from
flush_smp_call_function_queue() into cpu_stopper_thread(), outside the
non-preemptible critical section?

I *think* (wild guess) the call from do_idle() could just be removed (at
least for the RT build), as according to:

commit b2a02fc43a1f40ef4eb2fb2b06357382608d4d84
Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Date:   Tue May 26 18:11:01 2020 +0200

	smp: Optimize send_call_function_single_ipi()

it is just an optimization.
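
The cpu_stopper_thread() proposal can be illustrated with a toy model that only tracks whether pending softirqs are handled inside or outside the IRQs-off region of the flush. It is a sketch under heavy simplification: the toy_* names are hypothetical, the callbacks stand in for queued smp function calls, and nothing here mirrors the real locking or the RT warning path:

```c
#include <stdbool.h>

/* Toy model, NOT kernel code; names only loosely mirror the kernel ones. */
struct toy_cpu_state {
	bool in_irqs_off;       /* inside the IRQs-off region of the flush */
	unsigned int pending;   /* softirqs raised by the queued callbacks */
	int ran_inside;         /* softirq batches processed with IRQs off */
	int ran_outside;        /* softirq batches processed after the region */
};

typedef void (*toy_csd_fn)(struct toy_cpu_state *);

/* Process all pending softirqs and record where that happened. */
static void toy_do_softirq(struct toy_cpu_state *cpu)
{
	if (!cpu->pending)
		return;
	if (cpu->in_irqs_off)
		cpu->ran_inside++;
	else
		cpu->ran_outside++;
	cpu->pending = 0;
}

/* Current shape, as described above (simplified): the callbacks run and the
 * softirqs they raised are handled before leaving the critical section. */
static void toy_flush_current(struct toy_cpu_state *cpu, toy_csd_fn *fns, int n)
{
	cpu->in_irqs_off = true;
	for (int i = 0; i < n; i++)
		fns[i](cpu);
	toy_do_softirq(cpu);
	cpu->in_irqs_off = false;
}

/* Proposed shape: the flush only runs the callbacks ... */
static void toy_flush_proposed(struct toy_cpu_state *cpu, toy_csd_fn *fns, int n)
{
	cpu->in_irqs_off = true;
	for (int i = 0; i < n; i++)
		fns[i](cpu);
	cpu->in_irqs_off = false;
}

/* ... and the stopper thread handles softirqs once it is preemptible again. */
static void toy_stopper_thread(struct toy_cpu_state *cpu, toy_csd_fn *fns, int n)
{
	toy_flush_proposed(cpu, fns, n);
	toy_do_softirq(cpu);
}

/* Example callback: a queued function call that raises NET_RX. */
static void toy_raise_net_rx(struct toy_cpu_state *cpu)
{
	cpu->pending |= 1u;
}
```

With a single callback that raises NET_RX, the current shape processes it inside the IRQs-off region, while the stopper-thread variant processes it only after leaving it.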

Cheers,

Paolo