Hi Simon et al.

On Wed, Mar 21, 2012 at 05:56:20PM +0900, Simon Horman wrote:
> From: Julian Anastasov <ja@xxxxxx>
>
> High rate of sync messages in master can lead to
> overflowing the socket buffer and dropping the messages.
> Instead of using fixed pause try to adapt to the sync
> rate, so that we do not wakeup too often while at the same
> time staying below the socket buffer limit.

I see. You are avoiding congestion in the socket queue by putting the
pressure on your sync_buff queue instead (I don't find any limit in the
code for the amount of memory that it may consume, btw).

Please correct me if I'm wrong, but from this I can see you're assuming
that there is always room in the sync_buff queue to store messages. That
holds if the high peak of traffic lasts for a short time. However, if it
lasts long, the worker thread may not be able to consume all the messages
in time, and under sustained high stress the sync_buff queue may keep
growing and growing. Moreover, the backup may then receive sync messages
describing old states that are no longer useful to recover the
load balancing in case of failover.

One more concern, please see below.
> Signed-off-by: Julian Anastasov <ja@xxxxxx>
> Tested-by: Aleksey Chudov <aleksey.chudov@xxxxxxxxx>
> Signed-off-by: Simon Horman <horms@xxxxxxxxxxxx>
> ---
>  net/netfilter/ipvs/ip_vs_sync.c |   60 +++++++++++++++++++++++++++++---------
>  1 files changed, 46 insertions(+), 14 deletions(-)
>
> diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
> index 0e36679..9201c43 100644
> --- a/net/netfilter/ipvs/ip_vs_sync.c
> +++ b/net/netfilter/ipvs/ip_vs_sync.c
> @@ -1392,18 +1392,22 @@ ip_vs_send_async(struct socket *sock, const char *buffer, const size_t length)
>  	return len;
>  }
>
> -static void
> +static int
>  ip_vs_send_sync_msg(struct socket *sock, struct ip_vs_sync_mesg *msg)
>  {
>  	int msize;
> +	int ret;
>
>  	msize = msg->size;
>
>  	/* Put size in network byte order */
>  	msg->size = htons(msg->size);
>
> -	if (ip_vs_send_async(sock, (char *)msg, msize) != msize)
> -		pr_err("ip_vs_send_async error\n");
> +	ret = ip_vs_send_async(sock, (char *)msg, msize);
> +	if (ret >= 0 || ret == -EAGAIN)
> +		return ret;
> +	pr_err("ip_vs_send_async error %d\n", ret);
> +	return 0;
>  }
>
>  static int
> @@ -1428,33 +1432,61 @@ ip_vs_receive(struct socket *sock, char *buffer, const size_t buflen)
>  	return len;
>  }
>
> +/* Get next buffer to send */
> +static inline struct ip_vs_sync_buff *
> +next_sync_buff(struct netns_ipvs *ipvs)
> +{
> +	struct ip_vs_sync_buff *sb;
> +
> +	sb = sb_dequeue(ipvs);
> +	if (sb)
> +		return sb;
> +	/* Do not delay entries in buffer for more than 2 seconds */
> +	return get_curr_sync_buff(ipvs, 2 * HZ);
> +}
>
>  static int sync_thread_master(void *data)
>  {
>  	struct ip_vs_sync_thread_data *tinfo = data;
>  	struct netns_ipvs *ipvs = net_ipvs(tinfo->net);
> -	struct ip_vs_sync_buff *sb;
> +	struct ip_vs_sync_buff *sb = NULL;
> +	int pause = HZ / 10;
>
>  	pr_info("sync thread started: state = MASTER, mcast_ifn = %s, "
>  		"syncid = %d\n",
>  		ipvs->master_mcast_ifn, ipvs->master_syncid);
>
>  	while (!kthread_should_stop()) {
> -		while ((sb = sb_dequeue(ipvs))) {
> -			ip_vs_send_sync_msg(tinfo->sock, sb->mesg);
> -			ip_vs_sync_buff_release(sb);
> -		}
> +		int count = 0;
>
> -		/* check if entries stay in ipvs->sync_buff for 2 seconds */
> -		sb = get_curr_sync_buff(ipvs, 2 * HZ);
> -		if (sb) {
> -			ip_vs_send_sync_msg(tinfo->sock, sb->mesg);
> -			ip_vs_sync_buff_release(sb);
> +		for (;;) {
> +			if (!sb) {
> +				sb = next_sync_buff(ipvs);
> +				if (!sb)
> +					break;
> +			}
> +			if (ip_vs_send_sync_msg(tinfo->sock, sb->mesg) >= 0 ||
> +			    pause == 1) {
> +				ip_vs_sync_buff_release(sb);
> +				sb = NULL;
> +				count++;
> +			} else {
> +				pause = max(1, (pause >> 1));
> +				break;
> +			}
>  		}
>
> -		schedule_timeout_interruptible(HZ);
> +		/* Max packets to send at once */
> +		if (count > 200)
> +			pause = max(1, (pause - HZ / 20));
> +		else if (count < 20)
> +			pause = min(HZ / 4, (pause + HZ / 20));
> +		schedule_timeout_interruptible(sb ? 1 : pause);

This rate-limits how often the worker is woken up. Still, this looks to
me like an ad-hoc congestion-control solution. There is no justification
for why those numbers have been chosen, e.g. why do we assume that we're
congested once we've sent 200 packets in a single loop?

To me, congestion control is a complicated thing (there is plenty of
literature on how TCP avoids it). I'm not sure how many patches will
follow after this one trying to improve your congestion-control
solution. So, my question is: are you sure you want to enter this
domain?

IMO, it is better to stick to some simple solution, i.e. just drop
messages if the worker thread receives a high peak of work, than to
define some sort of rate-limit solution based on assumptions that are
not generic enough for every setup.

--
To unsubscribe from this list: send the line "unsubscribe lvs-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html