On 4/1/19 7:09 PM, Alexander Duyck wrote: > On Tue, 2019-03-26 at 15:30 +0000, Dmitry Safonov wrote: >> Unfortunately, MAX_WORK at this moment is broken: during >> trie_rebalance() climbing upwards resize() is being called for each >> tnode. Which makes the limit useless the bigger trie gets: at each level >> there are 10 attempts to balance tnode and childs, resulting in >> O(10^n + 10^{n-1} + ... 10^{n-k}) complexity to balance trie, where k - is >> the level where we changed the trie originally (adding/removing >> alias/route). >> >> Which results in the following price of removing one route under big >> trie (similar for some other single routes that results in reallocating >> tnodes by hitting the threshold limits for halve/inflate resize): > > The info below doesn't really tell us the price. It might be useful if > you could just time the removal of that one route and provide that > information. Yes, well, that was an attempt to point that MAX_WORK in the current state doesn't limit much complexity to balance trie. So, most of the time (up to a couple of seconds) is spent on synchronise_rcu() (which was addressed by 4/4 and a patch from David Ahern in -next). But I thought fixing max_work is also worth attention, regardless the fix for synchronising rcu. And playing with the sysctl limit that is provided by the patch, I've found out that even with 4/4 applied one can tune this sysctl knob the way that results in dividing the work between consecutive route manipulations. So that on a 4-core switch the time spent to add a route in maximum was more than a second and adjusting work limit can bring it down to 300msec in max. Though, the total time spent to add the whole route table was the same. I will provide more details for v2. > >> Before: >> Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes. >> Main: >> Aver depth: 1.99 >> Max depth: 2 >> Leaves: 77825 >> Prefixes: 77825 >> Internal nodes: 77 >> 9: 1 10: 76 >> Pointers: 78336 >> Null ptrs: 435 >> Total size: 7912 kB >> Local: >> [omitted] >> >> After: >> Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes. >> Main: >> Aver depth: 3.00 >> Max depth: 3 >> Leaves: 77824 >> Prefixes: 77824 >> Internal nodes: 20491 >> 1: 2048 2: 18432 6: 1 11: 10 >> Pointers: 98368 >> Null ptrs: 54 >> Total size: 8865 kB >> Local: >> [omitted] >> > > Is this before/after the patch, or just before/after the removal of the > one route? I assume you are are just talking about the one route since > the number of prefixes has dropped by 1. Yes, just for removing one route. Should have mentioned that explicitly, sorry. > >> Provide a sysctl to control amount of pending balancing work. >> (by default unlimited as it was) >> >> Cc: linux-doc@xxxxxxxxxxxxxxx >> Cc: Jonathan Corbet <corbet@xxxxxxx> >> Fixes: ff181ed8768f ("fib_trie: Push assignment of child to parent down >> into inflate/halve") >> Signed-off-by: Dmitry Safonov <dima@xxxxxxxxxx> >> --- >> Documentation/networking/ip-sysctl.txt | 6 +++ >> include/net/ip.h | 1 + >> net/ipv4/fib_trie.c | 60 +++++++++++++++----------- >> net/ipv4/sysctl_net_ipv4.c | 7 +++ >> 4 files changed, 49 insertions(+), 25 deletions(-) >> >> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt >> index acdfb5d2bcaa..fb71dacff4dd 100644 >> --- a/Documentation/networking/ip-sysctl.txt >> +++ b/Documentation/networking/ip-sysctl.txt >> @@ -81,6 +81,12 @@ fib_multipath_hash_policy - INTEGER >> 0 - Layer 3 >> 1 - Layer 4 >> >> +fib_balance_budget - UNSIGNED INTEGER >> + Limits the number of resize attempts during balancing fib trie >> + on adding/removing new routes. >> + Possible values: >> + Default: UINT_MAX (0xFFFFFFFF) >> + > > So I am not really a fan of the use of UNSIGNED_INTEGER here. I would > much rather see this be a signed integer and the test be for the value > going negative instead of having to test in so many places if the value > is 0. Ok, sounds good to me - will change the type for v2. [..] >> @@ -874,22 +877,25 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn) >> break; >> } >> >> - max_work--; >> + (*budget)--; > > One thing we might explore here would be to look at pulling out the > check of budget at the start of the loop and instead combine the > decrement with the check for < 0 here so we have something like: > if (--*budget < 0) > break; > > That way what should happen is that we will achieve maximum depth in > any resize and make at least one pass optimizing from the bottom up > instead of the top down. Yeah, sounds reasonable - will do this way and retest. > >> + inflated = true; >> tn = get_child(tp, cindex); >> } >> >> /* update parent in case inflate failed */ >> tp = node_parent(tn); >> >> - /* Return if at least one inflate is run */ >> - if (max_work != MAX_WORK) >> + /* Return if at least one inflate is run: >> + * microoptimization to not recalculate thresholds >> + */ >> + if (inflated) >> return tp; > > An alternative to having to add another variable would be to look at > switching the loop to a combination of an if and a do/while like so: > if (should_inflate(tp, tn)) { > do { > ... > } while (should_inflate(tp, tn)); > > /* update parent in case inflate failed */ > return node_parent(tn); > } That's neat! Will drop the variable. > > >> /* Halve as long as the number of empty children in this >> * node is above threshold. >> */ >> - while (should_halve(tp, tn) && max_work) { >> - tp = halve(t, tn); >> + while (should_halve(tp, tn) && *budget) { >> + tp = halve(t, tn, budget); >> if (!tp) { >> #ifdef CONFIG_IP_FIB_TRIE_STATS >> this_cpu_inc(stats->resize_node_skipped); >> @@ -897,7 +903,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn) >> break; >> } >> >> - max_work--; >> + (*budget)--; > > Same here. It might make sense to pull the budget check out of the > start of the loop and move it here so that it becomes a depth first > optimization instead of being a shallow one. Makes sense, will rework. > >> tn = get_child(tp, cindex); >> } >> >> @@ -1005,8 +1011,10 @@ static struct fib_alias *fib_find_alias(struct hlist_head *fah, u8 slen, >> >> static void trie_rebalance(struct trie *t, struct key_vector *tn) >> { >> - while (!IS_TRIE(tn)) >> - tn = resize(t, tn); >> + unsigned int budget = fib_balance_budget; >> + >> + while (budget && !IS_TRIE(tn)) >> + tn = resize(t, tn, &budget); > > I would not have the budget check here. At a minimum we should be > replacing trie nodes with leaves in the case that we are down to one > leaf node in the trie. Yeah, I see. > >> } >> >> static int fib_insert_node(struct trie *t, struct key_vector *tp, >> @@ -1784,6 +1792,7 @@ struct fib_table *fib_trie_unmerge(struct fib_table *oldtb) >> void fib_table_flush_external(struct fib_table *tb) >> { >> struct trie *t = (struct trie *)tb->tb_data; >> + unsigned int budget = fib_balance_budget; >> struct key_vector *pn = t->kv; >> unsigned long cindex = 1; >> struct hlist_node *tmp; >> @@ -1806,7 +1815,7 @@ void fib_table_flush_external(struct fib_table *tb) >> update_suffix(pn); >> >> /* resize completed node */ >> - pn = resize(t, pn); >> + pn = resize(t, pn, &budget); >> cindex = get_index(pkey, pn); >> >> continue; > > So we would probably want a completely different budget here, or at > least we would want the budget to be multiplied by the leaves we are > removing. The problem is we aren't pulling single addresses, but are > literally splitting the tree into two separate tries. The same budget > isn't going to work for both cases. Which will multiply the work (and the time spent). Hmm, well, I guess we can multiply here and relax rtnl_mutex pressure if needed by introducing fib_lock. Does that sounds good? > >> @@ -1853,6 +1862,7 @@ void fib_table_flush_external(struct fib_table *tb) >> int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) >> { >> struct trie *t = (struct trie *)tb->tb_data; >> + unsigned int budget = fib_balance_budget; >> struct key_vector *pn = t->kv; >> unsigned long cindex = 1; >> struct hlist_node *tmp; >> @@ -1876,7 +1886,7 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) >> update_suffix(pn); >> >> /* resize completed node */ >> - pn = resize(t, pn); >> + pn = resize(t, pn, &budget); >> cindex = get_index(pkey, pn); >> >> continue; > > Similar issues here, though I don't know what the typical amount of > change is we expect to see from a fib_table_flush call versus something > like the fib_table_flush_external call. Yes. Well, I was also curious if it's possible to avoid resize() call under (flush_all == true)? So, that exiting a namespace wouldn't result in rebalancing a possibly huge routing trie. Thanks for your notes and review, Dmitry