On Wed, Apr 06 2022 at 15:54, Matthew Wilcox wrote: > On Wed, Apr 06, 2022 at 10:52:38AM +0800, Liao Chang wrote: >> Kernel check for pending softirqs periodically, they are performed in a >> few points of kernel code, such as irq_exit() and __local_bh_enable_ip(), >> softirqs that have been activated by a given CPU must be executed on the >> same CPU, this characteristic of softirq is always a potentially >> "dangerous" operation, because one CPU might be end up very busy while >> the other are most idle. >> >> Above concern is proven in a networking user case: recenlty, we >> engineer find out the time used for connection re-establishment on >> kernel v5.10 is 300 times larger than v4.19, meanwhile, softirq >> monopolize almost 99% of CPU. This problem stem from that the connection >> between Sender and Receiver node get lost, the NIC driver on Sender node >> will keep raising NET_TX softirq before connection recovery. The system >> log show that most of softirq is performed from __local_bh_enable_ip(), >> since __local_bh_enable_ip is used widley in kernel code, it is very >> easy to run out most of CPU, and the user-mode application can't obtain >> enough CPU cycles to establish connection as soon as possible. > > Shouldn't you fix that bug instead? This seems like papering over the > bad effects of a bug and would make it harder to find bugs like this in > the future. Essentially, it's the same as a screaming hardware interrupt, > except that it's a software interrupt, so we can fix the bug instead of > working around broken hardware. It's not necessarily broken hardware. It's a fundamental issue of our softirq processing magic which can happen in those contexts: 1) On return from interrupt 2) In local_bh_enable() 3) In ksoftirqd We have heuristics in place which delegate processing to ksoftirqd, which brings softirq processing under scheduler control to some extent, but those heuristics are rather easy to evade. Delegation to ksoftirqd happens when the runtime of the __do_softirq() loop exceeds a threshold. But if that is not exceeded then you still can get into a situation where softirq processing eats up a large quantity of CPU time in #1 and #2 which is the real problem because it prevents the scheduler from applying fairness. That's a known issue and attempts to fix that are popping up on a regular base. There are several issues here: 1) The runtime check in __do_softirq() is jiffies based and depending on CONFIG_HZ it's easy to stay under the threshold for one invocation, but still eat a large amount of CPU time. Also the runtime check happens at the end of the loop, which means that if a single softirq callback runs too long we still process all other pending ones. 2) The decision to process softirqs directly on return from interrupt or in local_bh_enable() is error prone. The logic is: if (!ksoftirq_running() || local_softirq_pending() & (SOFTIRQ_HI | SOFTIRQ_TASKLET)) process_direct(); The reason for the HI/TASKLET exception is documented here: 3c53776e29f8 ("Mark HI and TASKLET softirq synchronous") But this is nasty because tasklets are widely used in networking, crypto and quite some of them are self rearming when they take too long. See mlx5_cq_tasklet_cb() as one example, which also uses jiffies for time limiting... With the HI/TASKLET exception this means that there is no delegation to ksoftirqd simply because on the next return from interrupt or the next local_bh_enable() in task context softirqs are processed. And this processes _all_ pending bits not only the HI/TASKLET ones... The approach of just not running softirqs at all via a throttle mechanism as proposed with these patches here, is definitely wrong and going nowhere. The proper solution for load accounting is a moving average with exponential decay based on sched_clock() and not on jiffies. That gives a reasonable decision to enforce ksoftirqd processing, but of course that does neither solve the tasklet issues nor any of the other problems vs. softirqs at all. We've tried splitting ksoftirqd into different threads a couple of years ago in RT, but that turned out to be problematic in some cases. Frederic did some experiments to make local_bh_disable() take a mask argument to disable only particular softirqs, but that ran into a dead end and is problematic because quite some code, esp. networking relies on multiple softirqs being disabled. Softirqs are semantically ill defined and that's known for a very long time, but of course they are conveniant and with a few hacks piled on top to address the most urgent horrors they work by some definition of work. IOW, we are accumulating technical depth with a fast pace. TBH, I have no real good plan how to address this proper, but it's about time to tackle this in a concerted effort. Thanks, tglx