During moving from 3.18 stable kernel to 4.9 on the switches, rebasing local -specific patches and stuff, it was found that BGP benchmarks for full view have started to hit the soft lockup detector by holding rtnl mutex for a couple of seconds on routes changes. I've found that the hard-coded MAX_WORK doesn't limit the amount of pending rebalancing work anymore. So that any route adding/removal may cause massive rebalancing of the trie. And making the hit even worse, tnodes are freed by RCU with a call to synchronise_rcu() every 512Kb. That is way too small and even on 2-cores switches is painfully noticable. To address those problems, I've introduced sysctl knob to limit the amount of rebalancing work (by default unlimited as is de facto now). I've moved synchronise_rcu() into a new shrinker to actually release memory in OOM.. I believe non-visible to userspace shrinker is better than a new sysctl knob for the limit (or any hard-coded value). Though, not sure how sane the result is. So, I send it as RFC, having qualms that it's not ready for inclusion as-is yet. I've looked further into the origin of problems here and was thinking if it make sense to do the following (rough plan): 1. Introduce a new flag RTNL_FLAG_DUMP_UNLOCKED to dump fib trie without rtnl lock. I'm a bit out of context, so probably I miss some obvious reasons why lock needs to be held at this point. 2. Add a new fib_lock mutex for updating a trie. I'm not really sure that we can always release rtnl for updates, so probably there should be a strict locking order: rtnl_lock (when needed) => fib_lock. 3. Correct current documentation, that mentions fib_lock as rwsem. 4. ?? I did some experiments on the plan above, but decided to send this RFC to get opinions of people who understand more. Maybe, my plan is nonsense and it's not worth invest amount of time it requires. I've also looked into changes between v3.18...v4.9, and found the following patches set: https://www.spinics.net/lists/netdev/msg309586.html While it has impressive results on lookups, it seems to be the reason of the regression on the big scale: one of the patches has 5 times penalty to remove a route on a big scale, another adds 10 times penalty by calling synchronise_rcu() more frequently (I've marked them by Fixes tag in the patches). [I was very lavish on Cc list, please ping me in private if you don't want to be copied on the next version] Cc: Alexander Duyck <alexander.h.duyck@xxxxxxxxxxxxxxx> Cc: Alexey Kuznetsov <kuznet@xxxxxxxxxxxxx> Cc: David Ahern <dsahern@xxxxxxxxx> Cc: "David S. Miller" <davem@xxxxxxxxxxxxx> Cc: Eric Dumazet <edumazet@xxxxxxxxxx> Cc: Hideaki YOSHIFUJI <yoshfuji@xxxxxxxxxxxxxx> Cc: Ido Schimmel <idosch@xxxxxxxxxxxx> Cc: netdev@xxxxxxxxxxxxxxx Thanks, Dmitry Dmitry Safonov (4): net/ipv4/fib: Remove run-time check in tnode_alloc() net/fib: Provide fib_balance_budget sysctl net/fib: Check budget before should_{inflate,halve}() net/ipv4/fib: Don't synchronise_rcu() every 512Kb Documentation/networking/ip-sysctl.txt | 6 ++ include/net/ip.h | 1 + net/ipv4/fib_trie.c | 121 +++++++++++++++---------- net/ipv4/sysctl_net_ipv4.c | 7 ++ 4 files changed, 87 insertions(+), 48 deletions(-) -- 2.21.0