On Sun, Dec 10, 2017 at 03:53:21PM +0200, Julian Anastasov wrote:
>
> 	Hello,
>
> On Sun, 10 Dec 2017, Inju Song wrote:
>
> > Implements Google's Maglev hashing algorithm as an IPVS scheduler.
> > Basically it provides consistent hashing but offers some special
> > features about disruption and load balancing.
> >
> > See also: [3.4 Consistent Hashing]
> > in https://static.googleusercontent.com/media/research.google.com/ko//pubs/archive/44824.pdf
> >
> > V2: Fix some methods and add features based on Julian's advice.
>
> 	It is good to see incremental changes but at the final
> step it is better to see a single patch with a well-formed subject,
> with new version, eg:
>
> [PATCHv2 nf-next] ipvs: Add Maglev consistent hashing
>

Yes, I will make the next Maglev patch a single one, but there may be a
small patchset for v2 if it is needed (e.g. adding some config options
to Kconfig).

> where the used tree is net-next, nf-next or ipvs-next.
>
> 	Few notes:
>
> - .upd_dest: changing weight to 0 should be ignored
> but if we want to allow changing the weight from/to non-0 values
> then we should save in array the latest non-0 weights for the
> current dests. As it is difficult to track the dest deletions
> maybe we can add last_weight in struct ip_vs_dest. Then MH
> will work with dest->last_weight.
>

I think we need to allow changing the weight from/to non-0 values, so I
will update last_weight only when the new weight is non-zero.
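Something like this minimal sketch of the .upd_dest callback is what I
have in mind (a sketch only, assuming last_weight is added to
struct ip_vs_dest as an atomic_t as you suggest; the function name and
the signature of ip_vs_mh_reassign(), the rebuild helper from the
patch, are only illustrative):

    /* Sketch only: remember the latest non-zero weight so that a
     * change to weight 0 does not disturb the lookup table.
     */
    static int ip_vs_mh_upd_dest(struct ip_vs_service *svc,
                                 struct ip_vs_dest *dest)
    {
            struct ip_vs_mh_state *s = svc->sched_data;
            int weight = atomic_read(&dest->weight);

            /* Changing the weight to 0 is ignored here */
            if (weight > 0 &&
                weight != atomic_read(&dest->last_weight)) {
                    atomic_set(&dest->last_weight, weight);
                    ip_vs_mh_reassign(s, svc);
            }
            return 0;
    }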
> - it is good to have configurable table size, just like
> IP_VS_SH_TAB_BITS in Kconfig. You can even use 5-6 prime numbers
> like 65537 and below, if needed. Large value is not so fatal,
> table is updated only when servers are added (at start), so
> if one uses little differences in weights and small number of
> servers the small prime number can reduce the cache usage.
> As we know, the table should be able to map all possible dests
> based on relative weights. To give few examples:
>
> - for 2 servers with same weight 5 we need 2 table entries,
> one per server
>
> - for 2 servers with weight 5 and 10 we need 3 entries,
> 1 for first server and 2 for the second server
>
> - But in practice we can run hundreds of servers with
> relative weight difference 1-64 if number of CPUs are
> considered, for example.
>
> - To avoid waste of memory due to kmalloc's orders we can use the
> previous prime number, 65521, others can be 32749, 16381, 8191,
> 4093, 2039, 1021, some not so useful values due to page size:
>
> - 509 (2KB with 4-byte ptr, 4KB for 8-byte ptr)
> - 251 (1KB/2KB of 4KB page)
>
> With such primes the table should be a separate allocation.
>

I agree with this. To avoid wasting memory and to provide several prime
numbers, they are kept in a static array like below:

    /* available prime numbers */
    static int PRIMES[] = { 1021, 2039, 4093, 8191,
                            16381, 32749, 65521, 131071 };

I want to provide a table larger than 65521, so I added 131071 as the
last entry of the primes. If a large value is not so fatal, I think it
is better for reducing the disruption of the consistent hashing. It
should be helpful for users who have high-performance machines and want
minimal disruption when destinations are added/deleted.

> - when table has constant size, it can be part of the state,
> as result, it will be allocated only once by ip_vs_mh_init_svc
> where any errors are handled properly, any reallocations in
> ip_vs_mh_reassign are unwanted because currently the errors are
> ignored by the IPVS core (add_dest/upd_dest/del_dest), so
> we should keep the s->permutation[] allocated. But as it is too
> large, see below how to avoid s->permutation[]...
>
> - no need to use "inline" for ip_vs_mh_permutate and ip_vs_mh_populate,
> they are used only on configuration time, better the compiler to decide
>

Yes, I will remove it from ip_vs_mh_permutate and ip_vs_mh_populate.

> - another possible issue is that users can use crazy values for
> weight, for example, 64 and 127. In such case even the gcd()
> function can not help enough for something that would be 1:2
> relation. But we can warn users to use as small as possible
> values for weights. If needed, we can rshift the weights, so
> that only the highest 5-6 bits of the largest weight are considered.
> I.e. if we find the largest weight is 2000 we see that 2000
> occupies 11 bits, so we decide to use the highest 6 bits and
> rshift with 5 (11-6) all weights. All resulting weights will
> fit in 6 bits.
>

I agree that huge weight values should be considered, and the rshift is
a good idea, I think. To handle this, both the rshift and last_weight
are considered; I wrote my opinion below.

> - ip_vs_mh_populate allocates IP_VS_MH_TAB_SIZE ints for next[] but
> later next[i] can go up to num_dests which can be above the
> IP_VS_MH_TAB_SIZE. We should use IP_VS_MH_TAB_SIZE as M and
> Kconfig should show the above prime numbers as options. Or
> we can configure IP_VS_MH_TAB_BITS as 8..16 and to use it as
> index to get the actual prime value from array:
>
> static int primes[] = { 251, ... 65521 };
>

I will add IP_VS_MH_TAB_INDEX and a help message to Kconfig so that it
can be configured like below:

    config IP_VS_MH_TAB_INDEX
            range 8 15
            default 12
            # and some helpful messages

and, in the header of ip_vs_mh.c:

    #define IP_VS_MH_TAB_INDEX      (CONFIG_IP_VS_MH_TAB_INDEX - 8)
    #define IP_VS_MH_TAB_SIZE       PRIMES[IP_VS_MH_TAB_INDEX]

> - we can avoid large allocations for s->permutation(N*M),
> instead we can use:
>
> struct ip_vs_mh_dest_setup {
> 	int	offset;	/* Starting offset */
> 	int	skip;	/* skip */
> 	int	turns;	/* weight / gcd() */
> 	int	perm;	/* next_offset */
> };
>
> struct ip_vs_mh_dest_setup *dests;	/* 1..num_dests */
>
> dests should be allocated (-ENOMEM) and freed on
> every ip_vs_mh_reassign.
>
> Initially, we should calculate .offset and .skip for all
> dests. Then we should set dests[i].perm to dests[i].offset.
> When adding weight to the picture, we can calculate greatest
> common divisor with the gcd() function and then to use it for
> calculating 'turns' per dest. ip_vs_wrr_gcd_weight() is a good
> example how to get gcd() among all dests.
>
> When populating table:
>
> The operation 'c = permutation[i][next[i]];' becomes
>
> c = dests[i].perm;
>
> The operation 'next[i] = next[i] + 1;' becomes:
>
> 	/* Add skip, mod M */
> 	dests[i].perm += dests[i].skip;
> 	if (dests[i].perm >= M)
> 		dests[i].perm -= M;
>
> This should be cheaper than '% 65537'. And it is correct
> because:
> 1. offset is 0 .. M-1
> 2. skip is 1 .. M-1
>

This is a very good idea, I think. I will change ip_vs_mh_permutate and
ip_vs_mh_populate to avoid allocating s->permutation. Re-configuring
the weight of the dests would then follow this flow in
ip_vs_mh_init_svc and ip_vs_mh_dest_changed:

1. Check the max weight, decide the bit range for the rshift, and set
   the rshifted value to dest->last_weight.
   - last_weight should be set only when it has changed.
   - If the weight is zero, skip setting last_weight.

2. Get the gcd over all dests.
   - gcd() would use the dest->last_weight value, not dest->weight.

3. Set offset, skip and turns in ip_vs_mh_permutate:

    dests[i].turns = dest->last_weight / gcd;

4. Use the relative weight in ip_vs_mh_populate:

    if (++dt_count >= dests[i].turns) {
            p = p->next;
            dt_count = 0;
            i++;
    }

I also think the weight re-configuration flow needs to be protected by
a spin_lock. Two small sketches of this flow follow below.
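For steps 1 and 2, I imagine helpers along these lines (sketches only;
the names ip_vs_mh_gcd_weight/ip_vs_mh_shift_weight are hypothetical,
the gcd loop is modeled on ip_vs_wrr_gcd_weight(), gcd() comes from
<linux/gcd.h> and fls() from <linux/bitops.h>, and last_weight is again
assumed to be an atomic_t):

    /* Sketch only: gcd over the last_weight of all dests,
     * modeled on ip_vs_wrr_gcd_weight(); returns at least 1.
     */
    static int ip_vs_mh_gcd_weight(struct ip_vs_service *svc)
    {
            struct ip_vs_dest *dest;
            int weight, g = 0;

            list_for_each_entry(dest, &svc->destinations, n_list) {
                    weight = atomic_read(&dest->last_weight);
                    if (weight > 0) {
                            if (g > 0)
                                    g = gcd(weight, g);
                            else
                                    g = weight;
                    }
            }
            return g ? g : 1;
    }

    /* Sketch only: rshift so that the largest weight keeps only its
     * highest 6 bits, as in the 2000 -> 11 bits -> shift 5 example.
     */
    static int ip_vs_mh_shift_weight(struct ip_vs_service *svc)
    {
            struct ip_vs_dest *dest;
            int weight, mw = 0;

            list_for_each_entry(dest, &svc->destinations, n_list) {
                    weight = atomic_read(&dest->last_weight);
                    if (weight > mw)
                            mw = weight;
            }

            /* fls() gives the number of bits used by mw, 0 for 0 */
            return fls(mw) > 6 ? fls(mw) - 6 : 0;
    }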
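And to check my understanding of steps 3 and 4, here is a small
standalone userspace sketch of the populate flow (not the kernel patch;
the toy hash functions, the table size M and the example servers and
weights are only illustrative assumptions):

    #include <stdio.h>
    #include <string.h>

    #define M 251   /* small prime table size, only for this demo */

    struct dest_setup {
            unsigned int offset;    /* starting offset, 0 .. M-1 */
            unsigned int skip;      /* step, 1 .. M-1 */
            unsigned int perm;      /* current offset in the walk */
            int turns;              /* last_weight / gcd of weights */
    };

    /* toy hash, stands in for hashing the dest address in the kernel */
    static unsigned int hash_str(const char *s, unsigned int seed)
    {
            unsigned int h = seed;

            while (*s)
                    h = h * 31 + (unsigned char)*s++;
            return h;
    }

    static int gcd(int a, int b)
    {
            while (b) {
                    int t = a % b;

                    a = b;
                    b = t;
            }
            return a;
    }

    int main(void)
    {
            const char *name[3] = { "10.0.0.1", "10.0.0.2", "10.0.0.3" };
            int weight[3] = { 1, 2, 3 };    /* pretend last_weight values */
            struct dest_setup ds[3];
            int table[M], count[3] = { 0, 0, 0 };
            int n = 3, g, i, filled = 0, dt_count = 0;

            g = weight[0];
            for (i = 1; i < n; i++)
                    g = gcd(g, weight[i]);

            for (i = 0; i < n; i++) {
                    ds[i].offset = hash_str(name[i], 0) % M;
                    ds[i].skip = hash_str(name[i], 1) % (M - 1) + 1;
                    ds[i].perm = ds[i].offset;
                    ds[i].turns = weight[i] / g;
            }

            memset(table, -1, sizeof(table));       /* -1 = free slot */
            i = 0;
            while (filled < M) {
                    unsigned int c = ds[i].perm;

                    /* walk to a free slot: add skip, mod M, no '%' */
                    while (table[c] >= 0) {
                            ds[i].perm += ds[i].skip;
                            if (ds[i].perm >= M)
                                    ds[i].perm -= M;
                            c = ds[i].perm;
                    }
                    table[c] = i;
                    count[i]++;
                    filled++;

                    /* each dest takes 'turns' entries, then moves on */
                    if (++dt_count >= ds[i].turns) {
                            dt_count = 0;
                            if (++i == n)
                                    i = 0;
                    }
            }

            for (i = 0; i < n; i++)
                    printf("%s weight=%d entries=%d\n",
                           name[i], weight[i], count[i]);
            return 0;
    }

With weights 1/2/3 this prints roughly a 1:2:3 split of the 251
entries. Since skip is in 1 .. M-1 and M is prime, each walk visits
every slot, so the loop always terminates; the same loop shape should
carry over to ip_vs_mh_populate with a bitmap instead of the int table.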
> - Let's not put a limit on the number of dests. We can add
> only such check:
>
> 	if (svc->num_dests > IP_VS_MH_LOOKUP_SIZE)
> 		return -EINVAL;
>

Yes, I will add such a check.

> - IP_VS_MH_LOOKUP_SIZE should be replaced with IP_VS_MH_TAB_SIZE
>

It will be replaced with IP_VS_MH_TAB_SIZE.

> 	Note that I only check the implementation for problems,
> I don't run any tests for the algorithm. I hope you have some
> test program that pushes a bunch of random IP[:PORT] values via the
> algorithm and the result is correct distribution based on the relative
> weights.
>

I already have a test program of this kind, with the destinations
running as containers, and I always run it when I add features and make
patches. I will attach some results about the correctness of the
distribution to the next patch if needed.

> > Inju Song (3):
> >   netfilter: ipvs: handle memory allocation fail in mh scheduler.
> >   netfilter: ipvs: release refcnt and improve ip_vs_mh_reassign()
> >   netfilter: ipvs: change method to fill lookup table based on weight.
> >
> >  net/netfilter/ipvs/ip_vs_mh.c | 147 +++++++++++++++++++++++++++++-------------
> >  1 file changed, 101 insertions(+), 46 deletions(-)
>
> Regards
>
> --
> Julian Anastasov <ja@xxxxxx>

Regards

--
Inju Song
NAVER Corporation