Paul, On Fri, Aug 14 2020 at 11:01, Paul E. McKenney wrote: > On Fri, Aug 14, 2020 at 04:06:04PM +0200, Michal Hocko wrote: >> > > > Vlastimil raised same question earlier, i answered, but let me answer again: >> > > > >> > > > It is hard to achieve because the logic does not stick to certain static test >> > > > case, i.e. it depends on how heavily kfree_rcu(single/double) are used. Based >> > > > on that, "how heavily" - number of pages are formed, until the drain/reclaimer >> > > > thread frees them. >> > > >> > > How many pages are talking about - ball park? 100s, 1000s? >> > >> > Under normal operation, a couple of pages per CPU, which would make >> > preallocation entirely reasonable. Except that if someone does something >> > that floods RCU callbacks (close(open) in a tight userspace loop, for but >> > one example), then 2000 per CPU might not be enough, which on a 64-CPU >> > system comes to about 500MB. This is beyond excessive for preallocation >> > on the systems I am familiar with. >> > >> > And the flooding case is where you most want the reclamation to be >> > efficient, and thus where you want the pages. As we now established that taking zone lock is impossible at all independent of raw/non-raw ordering and independent of RT/PREEMPT configs, can we just take a step back and look at the problem from scratch again? As a matter of fact I assume^Wdeclare that removing struct rcu_head which provides a fallback is not an option at all. I know that you want to, but it wont work ever. Dream on, but as we agreed on recently there is this thing called reality which ruins everything. For normal operations a couple of pages which can be preallocated are enough. What you are concerned of is the case where you run out of pointer storage space. There are two reasons why that can happen: 1) RCU call flooding 2) RCU not being able to run and mop up the backlog #1 is observable by looking at the remaining storage space and the RCU call frequency #2 is uninteresting because it's caused by RCU being stalled / delayed e.g. by a runaway of some sorts or a plain RCU usage bug. Allocating more memory in that case does not solve or improve anything. So the interesting case is #1. Which means we need to look at the potential sources of the flooding: 1) User space via syscalls, e.g. open/close 2) Kernel thread 3) Softirq 4) Device interrupt 5) System interrupts, deep atomic context, NMI ... #1 trivial fix is to force switching to an high prio thread or a soft interrupt which does the allocation #2 Similar to #1 unless that thread loops with interrupts, softirqs or preemption disabled. If that's the case then running out of RCU storage space is the least of your worries. #3 Similar to #2. The obvious candidates (e.g. NET) for monopolizing a CPU have loop limits in place already. If there is a bug which fails to care about the limit, why would RCU care and allocate more memory? #4 Similar to #3. If the interrupt handler loops forever or if the interrupt is a runaway which prevents task/softirq processing then RCU free performance is the least of your worries. #5 Clearly a bug and making RCU accomodate for that is beyond silly. So if call_rcu() detects that the remaining storage space for pointers goes below the critical point or if it observes high frequency calls then it simply should force a soft interrupt which does the allocation. Allocating from softirq context obviously without holding the raw lock which is used inside call_rcu() is safe on all configurations. If call_rcu() is forced to use the fallback for a few calls until this happens then that's not the end of the world. It is not going to be a problem ever for the most obvious issue #1, user space madness, because that case cannot delay the softirq processing unless there is a kernel bug which makes again RCU free performance irrelevant. So this will cure the problem for the most interesting case #1 and handle all sane variants of the other possible flooding sources. The other insane reasons do not justify any attempt to increase RCU performance at all. Watching the remaining storage space is good enough IMO. It clearly covers #1 and for all others the occasional fallback wont hurt. If it really matters for any case > #1 then doing a frequency based prediction is a straight forward optimization. As usual I might be missing something, but as this is the second day with reasonable temperatures here that would be caused by my ignorance and not be excusable by brain usage outside of specified temperature range. Thanks, tglx