Re: [PATCHv3 ipvs-next 2/3] netfilter: ipvs: Add Maglev consistent hashing scheduler

Julian Anastasov <ja@xxxxxx> · Sat, 24 Mar 2018 13:26:04 +0200 (EET)

	Hello,

	Only few notes below. If all goes well, v4 can be
the last version... :)

On Wed, 21 Mar 2018, Inju Song wrote:

> Implements the Google's Maglev hashing algorithm as a IPVS scheduler.
> 
> Basically it provides consistent hashing but offers some special
> features about disruption and load balancing.
> 
>   1) minimal disruption: when the set of destinations changes,
>      a connection will likely be sent to the same destination
>      as it was before.
> 
>   2) load balancing: each destination will receive an almost
>      equal number of connections.
> 
> Seel also for detail: [3.4 Consistent Hasing] in
> https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-eisenbud.pdf

	Above commit message is properly indented but comments in
patches 1 and 3 should not be indented, there should not be
spaces at left.

> Signed-off-by: Inju Song <inju.song@xxxxxxxxxxxxx>
> ---
>  net/netfilter/ipvs/ip_vs_mh.c | 532 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 532 insertions(+)
>  create mode 100644 net/netfilter/ipvs/ip_vs_mh.c
> 
> diff --git a/net/netfilter/ipvs/ip_vs_mh.c b/net/netfilter/ipvs/ip_vs_mh.c
> new file mode 100644
> index 0000000..e531100
> --- /dev/null
> +++ b/net/netfilter/ipvs/ip_vs_mh.c
> @@ -0,0 +1,532 @@

> +		IP_VS_DBG_BUF(6,
> +			      "MH: selected unavailable server %s:%d (offset %d), reselecting",
> +			      IP_VS_DBG_ADDR(dest->af, &dest->addr),
> +			      ntohs(dest->port), roffset);

	As some vars are unsigned int, make sure all %d in
debug messages are changed to %u, for consistency. This
also includes all ports. I guess, these %d come from the
SH scheduler.

> +static void ip_vs_mh_state_free(struct rcu_head *head)
> +{
> +	struct ip_vs_mh_state *s;
> +
> +	s = container_of(head, struct ip_vs_mh_state, rcu_head);
> +	kfree(s->lookup);
> +	kfree(s);
> +}
> +
> +static int ip_vs_mh_init_svc(struct ip_vs_service *svc)
> +{
> +	struct ip_vs_mh_state *s;
> +
> +	/* allocate the MH table for this service */
> +	s = kzalloc(sizeof(*s), GFP_KERNEL);
> +	if (!s)
> +		return -ENOMEM;
> +
> +	s->lookup = kcalloc(IP_VS_MH_TAB_SIZE, sizeof(struct ip_vs_mh_lookup),
> +			    GFP_KERNEL);
> +	if (!s->lookup) {
> +		kfree(s);
> +		return -ENOMEM;
> +	}
> +
> +	generate_hash_secret(&s->hash1, &s->hash2);
> +	s->gcd = ip_vs_mh_gcd_weight(svc);
> +	s->rshift = ip_vs_mh_shift_weight(svc, s->gcd);
> +
> +	svc->sched_data = s;
> +	IP_VS_DBG(6,
> +		  "MH lookup table (memory=%zdbytes) allocated for current service\n",
> +		  sizeof(struct ip_vs_mh_lookup) * IP_VS_MH_TAB_SIZE);
> +
> +	/* permutate & populate with current dests */
> +	return ip_vs_mh_reassign(s, svc);

	Looks like we must not set svc->sched_data on error.
We can do it like this:

	int ret;

	...
	IP_VS_DBG()
	ret = ip_vs_mh_reassign(s, svc);
	if (ret < 0) {
		ip_vs_mh_reset(s); /* This is here for completeness */
		ip_vs_mh_state_free(&s->rcu_head);
		return ret;
	}
	/* No more failures, attach state */
	svc->sched_data = s;
	return 0;

	It should be ok because we use svc->sched_data only
from entry points, the scheduler does not use it internally.

> +}
> +
> +static void ip_vs_mh_done_svc(struct ip_vs_service *svc)
> +{
> +	struct ip_vs_mh_state *s = svc->sched_data;
> +
> +	/* got to clean up hash buckets here */
> +	ip_vs_mh_reset(s);
> +
> +	/* release the table itself */
> +	ip_vs_mh_state_free(&s->rcu_head);

	Above should be:

	call_rcu(&s->rcu_head, ip_vs_mh_state_free);

	As result, the RCU callback will be called after
RCU grace period giving chance to RCU readers to stop
using the svc->sched_data state.

> +	IP_VS_DBG(6, "MH lookup table (memory=%zdbytes) released\n",
> +		  sizeof(struct ip_vs_mh_lookup) * IP_VS_MH_TAB_SIZE);

Regards

--
Julian Anastasov <ja@xxxxxx>