Hello,

Thanks for your comments. I will send the patchsets for v2 of MH.

On Sun, Dec 17, 2017 at 01:47:55PM +0200, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Sun, 17 Dec 2017, Inju Song wrote:
> 
> > On Sun, Dec 10, 2017 at 03:53:21PM +0200, Julian Anastasov wrote:
> > 
> > 	Yes. I will make the next maglev patch as a single patch, but
> > maybe there will be some patchsets for v2 if needed (i.e. adding
> > some config to Kconfig).
> 
> 	Yes, Kconfig and Makefile should be in the same patch.
> 
> > > - .upd_dest: changing weight to 0 should be ignored
> > > but if we want to allow changing the weight from/to non-0 values
> > > then we should save in array the latest non-0 weights for the
> > > current dests. As it is difficult to track the dest deletions
> > > maybe we can add last_weight in struct ip_vs_dest. Then MH
> > > will work with dest->last_weight.
> > 
> > 	I think that it needs to allow changing the weight from/to
> > non-0 values, so I will allow the change only for non-0 values.
> 
> 	Yep, then we should catch .upd_dest and use last_weight.
> In fact, adding last_weight to the IPVS core should be a separate
> patch to apply before the MH patch.

Ok. So should I send the patch that adds last_weight to the IPVS core
before applying the MH patch? Then I will.

> > > - To avoid waste of memory due to kmalloc's orders we can use the
> > > previous prime number, 65521, others can be 32749, 16381, 8191,
> > > 4093, 2039, 1021, some not so useful values due to page size:
> > > 
> > > 	- 509 (2KB with 4-byte ptr, 4KB for 8-byte ptr)
> > > 	- 251 (1KB/2KB of 4KB page)
> > > 
> > > With such primes the table should be a separate allocation.
> > 
> > 	I agree with this. To avoid waste of memory and to provide
> > various prime numbers, they are kept in a static array like below.
> > 
> > 	/* available prime numbers */
> > 	static int PRIMES[] =
> 
> 	Do not use upper case for var names...
> 
> > 		{ 1021, 2039, 4093, 8191, 16381, 32749, 65521, 131071 };
> > 
> > 	I want to provide a larger space than 65521, so I added 131071
> > as the last entry of the primes. If a large value is not a big
> > problem, I think it is better with regard to the disruption of
> > consistent hashing. I think it is helpful for users who have
> > high-performance machines and want minimal disruption when
> > destinations are added/deleted.
> 
> 	OK, maybe 131071 can help if using hundreds of servers...
> 
> > > - another possible issue is that users can use crazy values for
> > > weight, for example, 64 and 127. In such case even the gcd()
> > > function can not help enough for something that would be 1:2
> > > relation. But we can warn users to use as small as possible
> > > values for weights. If needed, we can rshift the weights, so
> > > that only the highest 5-6 bits of the largest weight are
> > > considered. I.e. if we find the largest weight is 2000 we see
> > > that 2000 occupies 11 bits, so we decide to use the highest 6
> > > bits and rshift all weights with 5 (11-6). All resulting weights
> > > will fit in 6 bits.
> > 
> > 	I agree that huge weight values should be considered, and I
> > think the rshift is a good idea. To handle this, both the rshift
> > and last_weight are considered; I wrote my opinion below.
> > 
> > > - ip_vs_mh_populate allocates IP_VS_MH_TAB_SIZE ints for next[]
> > > but later next[i] can go up to num_dests which can be above the
> > > IP_VS_MH_TAB_SIZE. We should use IP_VS_MH_TAB_SIZE as M and
> > > Kconfig should show the above prime numbers as options.
> > > Or we can configure IP_VS_MH_TAB_BITS as 8..16 and use it as
> > > an index to get the actual prime value from an array:
> > > 
> > > 	static int primes[] = { 251, ... 65521 };
> > 
> > 	I will add IP_VS_MH_TAB_INDEX and a help message in Kconfig so
> > it can be configured like below.
> > 
> > 	config IP_VS_MH_TAB_INDEX
> > 		range 8..16
> 
> 	OK. Maybe 8 .. 17 if we include 131071.
> 
> > 		default 12
> > 		// and some helpful messages
> > 
> > and
> > 
> > 	#define IP_VS_MH_TAB_INDEX	(CONFIG_IP_VS_MH_TAB_INDEX - 8)
> > 	#define IP_VS_MH_TAB_SIZE	PRIMES[IP_VS_MH_TAB_INDEX]
> > 
> > in the header of ip_vs_mh.c.
> 
> 	OK
> 
> > > - we can avoid large allocations for s->permutation (N*M),
> > > instead we can use:
> > > 
> > > 	struct ip_vs_mh_dest_setup {
> > > 		int	offset;	/* Starting offset */
> > > 		int	skip;	/* skip */
> > > 		int	turns;	/* weight / gcd() */
> > > 		int	perm;	/* next_offset */
> > > 	};
> > > 
> > > 	struct ip_vs_mh_dest_setup *dests;	/* 1..num_dests */
> > > 
> > > dests should be allocated (-ENOMEM) and freed on
> > > every ip_vs_mh_reassign.
> > > 
> > > Initially, we should calculate .offset and .skip for all
> > > dests. Then we should set dests[i].perm to dests[i].offset.
> > > When adding weight to the picture, we can calculate the greatest
> > > common divisor with the gcd() function and then use it for
> > > calculating 'turns' per dest. ip_vs_wrr_gcd_weight() is a good
> > > example of how to get gcd() among all dests.
> > > 
> > > When populating the table:
> > > 
> > > The operation 'c = permutation[i][next[i]];' becomes
> > > 
> > > 	c = dests[i].perm;
> > > 
> > > The operation 'next[i] = next[i] + 1;' becomes:
> > > 
> > > 	/* Add skip, mod M */
> > > 	dests[i].perm += dests[i].skip;
> > > 	if (dests[i].perm >= M)
> > > 		dests[i].perm -= M;
> > > 
> > > This should be cheaper than '% 65537'. And it is correct
> > > because:
> > > 1. offset is 0 .. M-1
> > > 2. skip is 1 .. M-1
> > 
> > 	I think this is a very good idea. It should be used, so I will
> > change ip_vs_mh_permutate and ip_vs_mh_populate to avoid allocating
> > s->permutation.
> > 
> > 	Re-configuring the weight of dests would then follow some flow
> > in ip_vs_init_svc and ip_vs_mh_dest_changed like below:
> > 
> > 1. Check the max weight, decide the bit range for the rshift and
> >    set the last weight to dest->last_weight with the rshift.
> >    - last_weight should be set only when it is changed.
> >    - If the weight is zero, skip setting last_weight.
> 
> 	Yes, but as I said, the last_weight change in ip_vs_ctl.c
> should be a separate patch; MH will just use last_weight for
> turns. If last_weight is 0 (server added with weight=0), we
> should ignore the server for the MH table.
> 
> > 2. Get the gcd over all dests.
> >    - maybe gcd() will use the dest->last_weight value, not
> >      dest->weight.
> 
> 	Yep
> 
> > 3. Set offset, skip and turns in ip_vs_mh_permutate:
> > 
> > 	dests[i].turns = dest->last_weight / gcd;
> 
> 	Yep. But we have to be careful: gcd is usually >= 1, but
> if all servers are with last_weight=0 then we should avoid
> division by zero and should populate the table with NULLs.
> ip_vs_mh_gcd_weight will give a clue if there is a server
> with last_weight>0, i.e. it should return 0 if all servers
> are with last_weight=0.

Of course. These cases should be handled. Thanks for your advice.

> > 4. Use the relative weight in ip_vs_mh_populate:
> > 
> > 	if (++dt_count >= dests[i].turns) {
> > 		p = p->next;
> > 		dt_count = 0;
> > 		i++;
> > 	}
> 
> 	Yep
> 
> > 	And I think that the weight re-configuration flow needs to be
> > protected by a spin_lock.
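To make sure we mean the same thing for steps 1-3, the gcd helper I have
in mind looks roughly like this. This is only a sketch: it assumes
last_weight ends up as an atomic_t in struct ip_vs_dest, mirroring
weight, as part of the separate patch we discussed, and it uses gcd()
from <linux/gcd.h>.

/* Return the gcd of last_weight over all dests; 0 means every dest
 * has last_weight == 0, so the caller should only clear the lookup
 * table and must not divide by the result.
 */
static int ip_vs_mh_gcd_weight(struct ip_vs_service *svc)
{
	struct ip_vs_dest *dest;
	int weight;
	int g = 0;

	list_for_each_entry(dest, &svc->destinations, n_list) {
		weight = atomic_read(&dest->last_weight);
		if (weight > 0) {
			if (g > 0)
				g = gcd(weight, g);
			else
				g = weight;
		}
	}
	return g;
}

For the huge weights I would then take shift = fls(max_last_weight / gcd) - 6
(not less than 0) and compute dests[i].turns = (last_weight / gcd) >> shift,
probably keeping at least 1 turn for any non-zero weight so that a small
weight does not disappear from the table completely.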
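And for the populate side (step 4), here is roughly how I would put the
pieces together. Again only a sketch to check my understanding:
ip_vs_mh_state is a tentative name, 'dests' is the ip_vs_mh_dest_setup
array prepared by ip_vs_mh_permutate, and the actual lookup-table
assignment and refcounting are left out.

static int ip_vs_mh_populate(struct ip_vs_mh_state *s,
			     struct ip_vs_service *svc)
{
	struct ip_vs_mh_dest_setup *dests = s->dest_setup;
	unsigned long *table;
	struct list_head *p;
	int n, c, i, dt_count;

	/* All dests have last_weight == 0: just clear the table */
	if (s->gcd < 1) {
		/* ... set all lookup entries to NULL ... */
		return 0;
	}

	/* Bitmap of already filled buckets */
	table = kcalloc(BITS_TO_LONGS(IP_VS_MH_TAB_SIZE),
			sizeof(unsigned long), GFP_KERNEL);
	if (!table)
		return -ENOMEM;

	p = &svc->destinations;
	n = 0;
	dt_count = 0;
	while (n < IP_VS_MH_TAB_SIZE) {
		if (p == &svc->destinations)
			p = p->next;

		i = 0;
		while (p != &svc->destinations) {
			/* Server added with last_weight == 0: skip it */
			if (dests[i].turns < 1) {
				p = p->next;
				i++;
				continue;
			}

			/* 'c = permutation[i][next[i]]' becomes: */
			c = dests[i].perm;
			while (test_bit(c, table)) {
				/* 'next[i]++': add skip, mod M */
				dests[i].perm += dests[i].skip;
				if (dests[i].perm >= IP_VS_MH_TAB_SIZE)
					dests[i].perm -= IP_VS_MH_TAB_SIZE;
				c = dests[i].perm;
			}
			__set_bit(c, table);

			/* ... assign the dest at 'p' to lookup bucket c ... */

			if (++n == IP_VS_MH_TAB_SIZE)
				goto out;

			/* Each dest fills 'turns' buckets per round */
			if (++dt_count >= dests[i].turns) {
				dt_count = 0;
				p = p->next;
				i++;
			}
		}
	}
out:
	kfree(table);
	return 0;
}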
> 	Locks are not needed; configuration is serialized by
> using __ip_vs_mutex. The configuration methods run concurrently
> only with packet processing, and we use RCU for the table.

Ok. Then I will use RCU for the mh_setup table.

> > > - Let's not put a limit on the number of dests. We can add
> > > only such a check:
> > > 
> > > 	if (svc->num_dests > IP_VS_MH_LOOKUP_SIZE)
> > > 		return -EINVAL;
> > 
> > 	Yes. I will add such a check.
> > 
> > > - IP_VS_MH_LOOKUP_SIZE should be replaced with IP_VS_MH_TAB_SIZE
> > 
> > 	It will be replaced with IP_VS_MH_TAB_SIZE.
> > 
> > > Note that I only checked the implementation for problems,
> > > I didn't run any tests for the algorithm. I hope you have some
> > > test program that pushes a bunch of random IP[:PORT] values via
> > > the algorithm and the result is a correct distribution based on
> > > the relative weights.
> > 
> > 	I already have a kind of test program with the destinations as
> > containers and always run it when I add some features and make
> > patches, so I will attach some results about the correct
> > distribution with the next patch if needed.
> 
> 	As you wish, it is enough to know that the algorithm
> was tested to work as expected.

Regards

-- 
Inju Song
NAVER Corporation
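P.S. For the RCU part, I am thinking of the same pattern ip_vs_sh and
ip_vs_dh use for their buckets, roughly like below. This is a sketch
only; ip_vs_mh_state and the lookup[] member are my tentative names.

struct ip_vs_mh_lookup {
	struct ip_vs_dest __rcu	*dest;	/* real server (cache) */
};

/* Packet path: runs under RCU read-side protection only */
static inline struct ip_vs_dest *
ip_vs_mh_get(struct ip_vs_mh_state *s, unsigned int hash)
{
	return rcu_dereference(s->lookup[hash % IP_VS_MH_TAB_SIZE].dest);
}

/* Config path: serialized by __ip_vs_mutex as you explained,
 * so no extra lock, just publish the new dest with RCU.
 */
static void ip_vs_mh_set(struct ip_vs_mh_lookup *ent,
			 struct ip_vs_dest *dest)
{
	struct ip_vs_dest *old;

	old = rcu_dereference_protected(ent->dest, 1);
	if (old == dest)
		return;

	if (dest)
		ip_vs_dest_hold(dest);
	rcu_assign_pointer(ent->dest, dest);
	if (old)
		ip_vs_dest_put(old);
}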