Re: [PATCH net-next v2] net: ipvs: randomize starting destination of RR/WRR scheduler

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



	Hello,

On Tue, 10 May 2022, menglong8.dong@xxxxxxxxx wrote:

> From: Menglong Dong <imagedong@xxxxxxxxxxx>
> 
> For now, the start of the RR/WRR scheduler is in order of added
> destinations, it will result in imbalance if the director is local
> to the clients and the number of connection is small.
> 
> For example, we have client1, client2, ..., client100 and real service
> service1, service2, ..., service10. All clients have local director with
> the same ipvs config, and each of them will create 2 long TCP connect to
> the virtual service. Therefore, all the clients will connect to service1
> and service2, leaving others free, which will make service1 and service2
> overloaded.

	More time I spend on this topic, I'm less
convinced that it is worth the effort. Randomness can come
from another place: client address/port. Schedulers
like MH and SH probably can help for such case.
RR is so simple scheduler that I doubt it is used in
practice. People prefer WLC, WRR and recently MH which
has many features:

- lockless
- supports server weights
- safer on dest adding/removal or weight changes
- fallback, optional
- suitable for sloppy TCP/SCTP mode to avoid the
SYNC mechanism in active-active setups

> Fix this by randomizing the starting destination when
> IP_VS_SVC_F_SCHED_RR_RANDOM/IP_VS_SVC_F_SCHED_WRR_RANDOM is set.
> 
> I start the randomizing from svc->destinations, as we choose the starting
> destination from all of the destinations, so it makes no different to
> start from svc->sched_data or svc->destinations. If we start from
> svc->sched_data, we have to avoid the 'head' node of the list being the
> next node of svc->sched_data, to make the choose totally random.

	Yes, first dest has two times more chance if we do
not account the head.

> Reviewed-by: Jiang Biao <benbjiang@xxxxxxxxxxx>
> Reviewed-by: Hao Peng <flyingpeng@xxxxxxxxxxx>
> Signed-off-by: Menglong Dong <imagedong@xxxxxxxxxxx>
> ---
> v2:
> - randomizing the starting of WRR scheduler too
> - Replace '|' with '&' in ip_vs_rr_random_start(Julian Anastasov)
> - Replace get_random_u32() with prandom_u32_max() (Julian Anastasov)
> ---
>  include/uapi/linux/ip_vs.h     |  3 +++
>  net/netfilter/ipvs/ip_vs_rr.c  | 25 ++++++++++++++++++++++++-
>  net/netfilter/ipvs/ip_vs_wrr.c | 20 ++++++++++++++++++++
>  3 files changed, 47 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/ip_vs.h b/include/uapi/linux/ip_vs.h
> index 4102ddcb4e14..9543906dae7d 100644
> --- a/include/uapi/linux/ip_vs.h
> +++ b/include/uapi/linux/ip_vs.h
> @@ -28,6 +28,9 @@
>  #define IP_VS_SVC_F_SCHED_SH_FALLBACK	IP_VS_SVC_F_SCHED1 /* SH fallback */
>  #define IP_VS_SVC_F_SCHED_SH_PORT	IP_VS_SVC_F_SCHED2 /* SH use port */
>  
> +#define IP_VS_SVC_F_SCHED_WRR_RANDOM	IP_VS_SVC_F_SCHED1 /* random start */
> +#define IP_VS_SVC_F_SCHED_RR_RANDOM	IP_VS_SVC_F_SCHED1 /* random start */
> +
>  /*
>   *      Destination Server Flags
>   */
> diff --git a/net/netfilter/ipvs/ip_vs_rr.c b/net/netfilter/ipvs/ip_vs_rr.c
> index 38495c6f6c7c..d53bfaf7aadf 100644
> --- a/net/netfilter/ipvs/ip_vs_rr.c
> +++ b/net/netfilter/ipvs/ip_vs_rr.c
> @@ -22,13 +22,36 @@
>  
>  #include <net/ip_vs.h>
>  
> +static void ip_vs_rr_random_start(struct ip_vs_service *svc)
> +{
> +	struct list_head *cur;
> +	u32 start;
> +
> +	if (!(svc->flags & IP_VS_SVC_F_SCHED_RR_RANDOM) ||
> +	    svc->num_dests <= 1)
> +		return;
> +
> +	start = prandom_u32_max(svc->num_dests);
> +	spin_lock_bh(&svc->sched_lock);
> +	cur = &svc->destinations;
> +	while (start--)
> +		cur = cur->next;
> +	svc->sched_data = cur;
> +	spin_unlock_bh(&svc->sched_lock);
> +}
>  
>  static int ip_vs_rr_init_svc(struct ip_vs_service *svc)
>  {
>  	svc->sched_data = &svc->destinations;
> +	ip_vs_rr_random_start(svc);
>  	return 0;
>  }
>  
> +static int ip_vs_rr_add_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest)
> +{
> +	ip_vs_rr_random_start(svc);
> +	return 0;
> +}
>  
>  static int ip_vs_rr_del_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest)
>  {
> @@ -104,7 +127,7 @@ static struct ip_vs_scheduler ip_vs_rr_scheduler = {
>  	.module =		THIS_MODULE,
>  	.n_list =		LIST_HEAD_INIT(ip_vs_rr_scheduler.n_list),
>  	.init_service =		ip_vs_rr_init_svc,
> -	.add_dest =		NULL,
> +	.add_dest =		ip_vs_rr_add_dest,
>  	.del_dest =		ip_vs_rr_del_dest,
>  	.schedule =		ip_vs_rr_schedule,
>  };
> diff --git a/net/netfilter/ipvs/ip_vs_wrr.c b/net/netfilter/ipvs/ip_vs_wrr.c
> index 1bc7a0789d85..ed6230976379 100644
> --- a/net/netfilter/ipvs/ip_vs_wrr.c
> +++ b/net/netfilter/ipvs/ip_vs_wrr.c
> @@ -102,6 +102,24 @@ static int ip_vs_wrr_max_weight(struct ip_vs_service *svc)
>  	return weight;
>  }
>  
> +static void ip_vs_wrr_random_start(struct ip_vs_service *svc)
> +{
> +	struct ip_vs_wrr_mark *mark = svc->sched_data;
> +	struct list_head *cur;
> +	u32 start;
> +
> +	if (!(svc->flags & IP_VS_SVC_F_SCHED_WRR_RANDOM) ||
> +	    svc->num_dests <= 1)
> +		return;
> +
> +	start = prandom_u32_max(svc->num_dests);
> +	spin_lock_bh(&svc->sched_lock);
> +	cur = &svc->destinations;
> +	while (start--)
> +		cur = cur->next;
> +	mark->cl = list_entry(cur, struct ip_vs_dest, n_list);

	The problem with WRR is that mark->cl and mark->cw
work together, we can not change just cl.

> +	spin_unlock_bh(&svc->sched_lock);
> +}
>  
>  static int ip_vs_wrr_init_svc(struct ip_vs_service *svc)
>  {
> @@ -119,6 +137,7 @@ static int ip_vs_wrr_init_svc(struct ip_vs_service *svc)
>  	mark->mw = ip_vs_wrr_max_weight(svc) - (mark->di - 1);
>  	mark->cw = mark->mw;
>  	svc->sched_data = mark;
> +	ip_vs_wrr_random_start(svc);
>  
>  	return 0;
>  }
> @@ -149,6 +168,7 @@ static int ip_vs_wrr_dest_changed(struct ip_vs_service *svc,
>  	else if (mark->di > 1)
>  		mark->cw = (mark->cw / mark->di) * mark->di + 1;
>  	spin_unlock_bh(&svc->sched_lock);
> +	ip_vs_wrr_random_start(svc);

	This will be called even on upd_dest (while processing
packets), so the region under lock should be short and cl/cw
state should not be damaged.

Regards

--
Julian Anastasov <ja@xxxxxx>




[Index of Archives]     [Linux Filesystem Devel]     [Linux NFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux SCSI]     [X.Org]

  Powered by Linux