Hello, On Mon, 4 Dec 2023, Lev Pantiukhin wrote: > Maglev Hashing Stateless > ======================== > > Introduction > ------------ > > This patch to Linux kernel provides the following changes to IPVS: > > 1. Adds a new type (IP_VS_SVC_F_STATELESS) of scheduler that computes the > need for connection entry addition; I see the intention to avoid keeping connections. IPVS still creates connection struct for every packet for the IP_VS_CONN_F_ONE_PACKET mode but I'm not sure if this is faster than keeping conns in hash table. You probably have stats for this. > 2. Adds a new mhs (Maglev Hashing Stateless) scheduler based on the mh > scheduler that implements a new algorithm (more details below); > 3. Adds scheduling for ACK packets; > 4. Adds destination sorting (more details below). > > This approach shows a significant reduction in CPU usage, even in the > case of 10% of endpoints constantly flapping. It also makes the L4 It is crucial what strategy is used to deactivate dests. MH with setting weight to 0 should not change the lookup table. But add/remove always lead to problems. > balancer less vulnerable to DDoS activity. > > The Description of a New Algorithm > ---------------------------------- > > This patch provides a modified version of the Maglev consistent hashing > scheduling algorithm (scheduler mh). It simultaneously uses two hash > tables instead of one. One of them is for old destinations, and the other > (the candidate table) is for new ones. A hash key corresponds to two > destinations, and if both hash tables point to the same destination, then > the hash key is called stable; otherwise, it is called unstable. A new > connection entry is created only in the event of an unstable hash key; > otherwise, the packet goes through stateless processing. If the hash key > is unstable: > > * In the case of a SYN packet, it will pick up the destination from the > newer (candidate) hash table; > * In the case of an ACK packet, it will use the old hash table. > > Upon changing the set of destinations, mhs populates a new candidate hash > table and initializes a timer equal to the TCP session timeout. When the > timer expires, the candidate hash table value is merged into the old hash > table, and the corresponding hash key again becomes stable. If there are > changes in the destinations before the timer expires, mhs overwrites the > candidate hash table without the timer reset. If the set of destinations > is unchanged, the connection tracking table will be empty. > > IPVS stores destinations in an unordered way, so the same destination set > may generate different hash tables. To guarantee proper generation of the > Maglev hash table, the sorting of the destination list was added. This is > important in the case of destination flaps, which return the candidate > hash table to its original state. This patch implements sorting via > simple insertion with linear complexity. However, this complexity may be > simplified. > > Signed-off-by: Lev Pantiukhin <kndrvt@xxxxxxxxxxxxxx> > --- > include/net/ip_vs.h | 6 + > include/uapi/linux/ip_vs.h | 1 + > net/netfilter/ipvs/Kconfig | 9 + > net/netfilter/ipvs/Makefile | 1 + > net/netfilter/ipvs/ip_vs_core.c | 34 +- > net/netfilter/ipvs/ip_vs_ctl.c | 54 +- > net/netfilter/ipvs/ip_vs_mhs.c | 740 +++++++++++++++++++++++++++ > net/netfilter/ipvs/ip_vs_proto_tcp.c | 18 +- > 8 files changed, 851 insertions(+), 12 deletions(-) > create mode 100644 net/netfilter/ipvs/ip_vs_mhs.c > > diff --git a/include/uapi/linux/ip_vs.h b/include/uapi/linux/ip_vs.h > index 1ed234e7f251..cc205c1c796c 100644 > --- a/include/uapi/linux/ip_vs.h > +++ b/include/uapi/linux/ip_vs.h > @@ -24,6 +24,7 @@ > #define IP_VS_SVC_F_SCHED1 0x0008 /* scheduler flag 1 */ > #define IP_VS_SVC_F_SCHED2 0x0010 /* scheduler flag 2 */ > #define IP_VS_SVC_F_SCHED3 0x0020 /* scheduler flag 3 */ > +#define IP_VS_SVC_F_STATELESS 0x0040 /* stateless scheduling */ > > #define IP_VS_SVC_F_SCHED_SH_FALLBACK IP_VS_SVC_F_SCHED1 /* SH fallback */ > #define IP_VS_SVC_F_SCHED_SH_PORT IP_VS_SVC_F_SCHED2 /* SH use port */ > diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig > index 2a3017b9c001..886b75c48551 100644 > --- a/net/netfilter/ipvs/Kconfig > +++ b/net/netfilter/ipvs/Kconfig > @@ -246,6 +246,15 @@ config IP_VS_MH > If you want to compile it in kernel, say Y. To compile it as a > module, choose M here. If unsure, say N. > > +config IP_VS_MHS > + tristate "stateless maglev hashing scheduling" > + help > + The usual Maglev consistent hashing scheduling algorithm provides > + Google's Maglev hashing algorithm as an IPVS scheduler. > + This is a modified version of maglev consistent hashing scheduling algorithm. > + It simultaneously uses two hash tables instead of one. > + One of them is for old destinations, and the other is for new ones. Looks like MHS implicitly uses the CONFIG_IP_VS_MH_TAB_INDEX configuration. May be we should note it here. > + > config IP_VS_SED > tristate "shortest expected delay scheduling" > help > diff --git a/net/netfilter/ipvs/Makefile b/net/netfilter/ipvs/Makefile > index bb5d8125c82a..ffe9977397e0 100644 > --- a/net/netfilter/ipvs/Makefile > +++ b/net/netfilter/ipvs/Makefile > @@ -34,6 +34,7 @@ obj-$(CONFIG_IP_VS_LBLCR) += ip_vs_lblcr.o > obj-$(CONFIG_IP_VS_DH) += ip_vs_dh.o > obj-$(CONFIG_IP_VS_SH) += ip_vs_sh.o > obj-$(CONFIG_IP_VS_MH) += ip_vs_mh.o > +obj-$(CONFIG_IP_VS_MHS) += ip_vs_mhs.o > obj-$(CONFIG_IP_VS_SED) += ip_vs_sed.o > obj-$(CONFIG_IP_VS_NQ) += ip_vs_nq.o > obj-$(CONFIG_IP_VS_TWOS) += ip_vs_twos.o > diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c > index a2c16b501087..6aaf762c0a1d 100644 > --- a/net/netfilter/ipvs/ip_vs_core.c > +++ b/net/netfilter/ipvs/ip_vs_core.c > @@ -449,6 +449,7 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb, > __be16 _ports[2], *pptr, cport, vport; > const void *caddr, *vaddr; > unsigned int flags; > + bool need_state; > > *ignored = 1; > /* > @@ -525,7 +526,11 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb, > if (sched) { > /* read svc->sched_data after svc->scheduler */ > smp_rmb(); > - dest = sched->schedule(svc, skb, iph); > + /* we use distinct handler for stateless service */ > + if (svc->flags & IP_VS_SVC_F_STATELESS) Sometimes scheduler can be changed for svc, we should see if this should be per-scheduler flag somewhere in struct ip_vs_scheduler or simply to check for present schedule_sl. But probably in the end, it should go as a svc flag as you use it now. > + dest = sched->schedule_sl(svc, skb, iph, &need_state); > + else > + dest = sched->schedule(svc, skb, iph); > } else { > dest = NULL; > } > @@ -534,9 +539,11 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb, > return NULL; > } > > - flags = (svc->flags & IP_VS_SVC_F_ONEPACKET > - && iph->protocol == IPPROTO_UDP) ? > - IP_VS_CONN_F_ONE_PACKET : 0; > + /* We use IP_VS_SVC_F_ONEPACKET flag to create no state */ > + flags = ((svc->flags & IP_VS_SVC_F_ONEPACKET && > + iph->protocol == IPPROTO_UDP) || > + (svc->flags & IP_VS_SVC_F_STATELESS && !need_state)) > + ? IP_VS_CONN_F_ONE_PACKET : 0; > > /* > * Create a connection entry. > @@ -563,7 +570,10 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb, > IP_VS_DBG_ADDR(cp->daf, &cp->daddr), ntohs(cp->dport), > cp->flags, refcount_read(&cp->refcnt)); > > - ip_vs_conn_stats(cp, svc); > + if (!(svc->flags & IP_VS_SVC_F_STATELESS) || > + (svc->flags & IP_VS_SVC_F_STATELESS && need_state)) { > + ip_vs_conn_stats(cp, svc); So, here we do not know if it is a new connection... Then lets check IP_VS_HDR_NEW_CONN via new function ip_vs_iph_new_conn, we should create it like ip_vs_iph_inverse and ip_vs_iph_icmp. See below. > + } > return cp; > } > > @@ -1915,6 +1925,7 @@ ip_vs_in_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state > int ret, pkts; > struct sock *sk; > int af = state->pf; > + struct ip_vs_service *svc; > > /* Already marked as IPVS request or reply? */ > if (skb->ipvs_property) > @@ -1990,6 +2001,19 @@ ip_vs_in_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state > cp = INDIRECT_CALL_1(pp->conn_in_get, ip_vs_conn_in_get_proto, > ipvs, af, skb, &iph); > > + /* Don't use expired connection in stateless service case; > + * otherwise reuse can maintain the number connection entries > + */ > + if (cp && cp->dest) { > + svc = rcu_dereference(cp->dest->svc); > + > + if ((svc->flags & IP_VS_SVC_F_STATELESS) && > + !(timer_pending(&cp->timer) && time_after(cp->timer.expires, jiffies))) { > + __ip_vs_conn_put(cp); > + cp = NULL; > + } > + } Do we need special treatment here? Is it possible to see connections that do not expire? At least, it will advance its timer and it is impossible to see unexpired timer. > + > if (!iph.fragoffs && is_new_conn(skb, &iph) && cp) { > int conn_reuse_mode = sysctl_conn_reuse_mode(ipvs); > bool old_ct = false, resched = false; > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c > index 143a341bbc0a..fda321edbd9c 100644 > --- a/net/netfilter/ipvs/ip_vs_ctl.c > +++ b/net/netfilter/ipvs/ip_vs_ctl.c > @@ -960,6 +960,43 @@ void ip_vs_stats_free(struct ip_vs_stats *stats) > } > } > > +static int __ip_vs_mh_compare_dests(struct list_head *a, struct list_head *b) > +{ > + struct ip_vs_dest *dest_a = list_entry(a, struct ip_vs_dest, n_list); > + struct ip_vs_dest *dest_b = list_entry(b, struct ip_vs_dest, n_list); > + unsigned int i = 0; > + __be32 diff; > + > + switch (dest_a->af) { > + case AF_INET: > + return (int)(dest_a->addr.ip - dest_b->addr.ip); > + > + case AF_INET6: > + for (; i < ARRAY_SIZE(dest_a->addr.ip6); i++) { > + diff = dest_a->addr.ip6[i] - dest_b->addr.ip6[i]; > + if (diff) > + return (int)diff; > + } > + } > + > + return 0; > +} > + > +static struct list_head * > +__ip_vs_find_insertion_place(struct list_head *new, struct list_head *head) > +{ > + struct list_head *p = head; > + int ret; > + > + while ((p = p->next) != head) { > + ret = __ip_vs_mh_compare_dests(new, p); > + if (ret < 0) > + break; > + } > + > + return p->prev; > +} > + > /* > * Update a destination in the given service > */ > @@ -1038,7 +1075,10 @@ __ip_vs_update_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest, > spin_unlock_bh(&dest->dst_lock); > > if (add) { > - list_add_rcu(&dest->n_list, &svc->destinations); > + /* sorting of dests list */ > + list_add_rcu(&dest->n_list, > + __ip_vs_find_insertion_place(&dest->n_list, > + &svc->destinations)); About the sorting of dests. There is no guarantee that sorting prevents hash mismatch on reconfiguration. In MH, ip_vs_mh_permutate() independently calculates primary offset for every dest (ds->perm) and later ip_vs_mh_populate() walks all dests in the order they are added (probably reverse order). Every dest gets chance to occupy primary slots in the table based on its weight. As the hash functions often result in collision, the next dests in the list has less chance to occupy their primary slots. So, the strategy of admin should be newly added dests to be considered last in the list. If list is sorted, this even complicates the addition of new servers because if they are inserted in the middle of the list they will disturb the hashing for the next dests in the list. In any case, the adding/deleting of dest is considered a disturbing operation for MH but MH allowed weight to be safely changed to 0 without reordering the lookup table, thanks to last_weight. In short, with sorting or no, it is enough to add the dests in the same order to duplicate the lookup table on reconfiguration. Sorting helps only if we add dests by hand in different order. Or may be I'm wrong? > svc->num_dests++; > sched = rcu_dereference_protected(svc->scheduler, 1); > if (sched && sched->add_dest) > @@ -1276,7 +1316,9 @@ static void __ip_vs_unlink_dest(struct ip_vs_service *svc, > struct ip_vs_dest *dest, > int svcupd) > { > - dest->flags &= ~IP_VS_DEST_F_AVAILABLE; > + /* dest must be available from trash for stateless service */ > + if (!(svc->flags & IP_VS_SVC_F_STATELESS)) > + dest->flags &= ~IP_VS_DEST_F_AVAILABLE; Not nice, see below > > /* > * Remove it from the d-linked destination list. > @@ -1440,6 +1482,10 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u, > svc->port = u->port; > svc->fwmark = u->fwmark; > svc->flags = u->flags & ~IP_VS_SVC_F_HASHED; > + if (!strcmp(u->sched_name, "mhs")) { > + svc->flags |= IP_VS_SVC_F_STATELESS; > + svc->flags &= ~IP_VS_SVC_F_PERSISTENT; > + } Should be part of ip_vs_mhs_init_svc, we can return -EINVAL there if IP_VS_SVC_F_PERSISTENT is set. Or to avoid stateless mode in such case with all consequences: if (!(svc->flags & IP_VS_SVC_F_PERSISTENT)) svc->flags |= IP_VS_SVC_F_STATELESS; Can we work in different mode if we can not set IP_VS_SVC_F_STATELESS due to some flags? But in any case ip_vs_mhs_done_svc() should clear IP_VS_SVC_F_STATELESS because ip_vs_edit_service() can be changing the scheduler. > svc->timeout = u->timeout * HZ; > svc->netmask = u->netmask; > svc->ipvs = ipvs; > @@ -1578,6 +1624,10 @@ ip_vs_edit_service(struct ip_vs_service *svc, struct ip_vs_service_user_kern *u) > * Set the flags and timeout value > */ > svc->flags = u->flags | IP_VS_SVC_F_HASHED; > + if (!strcmp(u->sched_name, "mhs")) { > + svc->flags |= IP_VS_SVC_F_STATELESS; > + svc->flags &= ~IP_VS_SVC_F_PERSISTENT; > + } Will be done in ip_vs_mhs_init_svc > svc->timeout = u->timeout * HZ; > svc->netmask = u->netmask; > > diff --git a/net/netfilter/ipvs/ip_vs_mhs.c b/net/netfilter/ipvs/ip_vs_mhs.c > new file mode 100644 > index 000000000000..ab19ac0f5b02 > --- /dev/null > +++ b/net/netfilter/ipvs/ip_vs_mhs.c > @@ -0,0 +1,740 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* IPVS: Stateless Maglev Hashing scheduling module > + * > + * Authors: Lev Pantiukhin <kndrvt@xxxxxxxxxxxxxx> > + * > + */ > + > +/* The mh algorithm is to assign a preference list of all the lookup > + * table positions to each destination and populate the table with > + * the most-preferred position of destinations. Then it is to select > + * destination with the hash key of source IP address through looking > + * up a the lookup table. > + * The mhs algorithm is modificated stateless version of mh algorithm. > + * It uses 2 look up tables and chooses one of 2 destinations. > + * > + * The mh algorithm is detailed in: > + * [3.4 Consistent Hasing] > +https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-eisenbud.pdf > + * > + */ > + > +#define KMSG_COMPONENT "IPVS" > +#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt > + > +#include <linux/ip.h> > +#include <linux/slab.h> > +#include <linux/module.h> > +#include <linux/kernel.h> > +#include <linux/skbuff.h> > + > +#include <net/ip_vs.h> > + > +#include <linux/siphash.h> > +#include <linux/bitops.h> > +#include <linux/gcd.h> > + > +#include <linux/list_sort.h> > + > +#define IP_VS_SVC_F_SCHED_MH_FALLBACK IP_VS_SVC_F_SCHED1 /* MH fallback */ > +#define IP_VS_SVC_F_SCHED_MH_PORT IP_VS_SVC_F_SCHED2 /* MH use port */ > + > +struct ip_vs_mhs_lookup { > + struct ip_vs_dest __rcu *dest; /* real server (cache) */ > +}; > + > +struct ip_vs_mhs_dest_setup { > + unsigned int offset; /* starting offset */ > + unsigned int skip; /* skip */ > + unsigned int perm; /* next_offset */ > + int turns; /* weight / gcd() and rshift */ > +}; > + > +/* Available prime numbers for MH table */ > +static int primes[] = {251, 509, 1021, 2039, 4093, > + 8191, 16381, 32749, 65521, 131071}; > + > +/* For IPVS MH entry hash table */ > +#ifndef CONFIG_IP_VS_MH_TAB_INDEX > +#define CONFIG_IP_VS_MH_TAB_INDEX 12 > +#endif > +#define IP_VS_MH_TAB_BITS (CONFIG_IP_VS_MH_TAB_INDEX / 2) > +#define IP_VS_MH_TAB_INDEX (CONFIG_IP_VS_MH_TAB_INDEX - 8) > +#define IP_VS_MH_TAB_SIZE primes[IP_VS_MH_TAB_INDEX] > + > +struct ip_vs_mhs_state { > + struct rcu_head rcu_head; > + struct ip_vs_mhs_lookup *lookup; > + struct ip_vs_mhs_dest_setup *dest_setup; > + hsiphash_key_t hash1, hash2; > + int gcd; > + int rshift; > +}; > + > +struct ip_vs_mhs_two_states { > + struct ip_vs_mhs_state *first; > + struct ip_vs_mhs_state *second; > + ktime_t *timestamps; > + ktime_t unstable_timeout; > +}; > + > +struct ip_vs_mhs_two_dests { > + struct ip_vs_dest *dest; > + struct ip_vs_dest *new_dest; > + bool unstable; > +}; > + > +static inline bool > +ip_vs_mhs_is_new_conn(const struct sk_buff *skb, struct ip_vs_iphdr *iph) > +{ > + switch (iph->protocol) { > + case IPPROTO_TCP: { > + struct tcphdr _tcph, *th; > + > + th = skb_header_pointer(skb, iph->len, sizeof(_tcph), &_tcph); > + if (!th) > + return false; > + return th->syn; > + } > + default: > + return false; > + } > +} > + > +static inline void > +generate_hash_secret(hsiphash_key_t *hash1, hsiphash_key_t *hash2) > +{ > + hash1->key[0] = 2654435761UL; > + hash1->key[1] = 2654435761UL; > + > + hash2->key[0] = 2654446892UL; > + hash2->key[1] = 2654446892UL; > +} > + > +/* Returns hash value for IPVS MH entry */ > +static inline unsigned int > +ip_vs_mhs_hashkey(int af, const union nf_inet_addr *addr, __be16 port, > + hsiphash_key_t *key, unsigned int offset) > +{ > + unsigned int v; > + __be32 addr_fold = addr->ip; > + > +#ifdef CONFIG_IP_VS_IPV6 > + if (af == AF_INET6) > + addr_fold = addr->ip6[0] ^ addr->ip6[1] ^ > + addr->ip6[2] ^ addr->ip6[3]; > +#endif > + v = (offset + ntohs(port) + ntohl(addr_fold)); > + return hsiphash(&v, sizeof(v), key); > +} > + > +/* Reset all the hash buckets of the specified table. */ > +static void ip_vs_mhs_reset(struct ip_vs_mhs_state *s) > +{ > + int i; > + struct ip_vs_mhs_lookup *l; > + struct ip_vs_dest *dest; > + > + l = &s->lookup[0]; > + for (i = 0; i < IP_VS_MH_TAB_SIZE; i++) { > + dest = rcu_dereference_protected(l->dest, 1); > + if (dest) { > + ip_vs_dest_put(dest); > + RCU_INIT_POINTER(l->dest, NULL); > + } > + l++; > + } > +} > + > +/* Update timestamps with new lookup table */ > +static void > +ip_vs_mhs_update_timestamps(struct ip_vs_mhs_two_states *states) > +{ > + unsigned int offset = 0; > + > + while (offset < IP_VS_MH_TAB_SIZE) { > + if (states->first->lookup[offset].dest == > + states->second->lookup[offset].dest) { > + if (states->timestamps[offset]) { > + /* stabilization */ > + states->timestamps[offset] = (ktime_t)0; > + } > + } else { > + if (!states->timestamps[offset]) { > + /* destabilization */ > + states->timestamps[offset] = ktime_get(); > + } > + } > + ++offset; Can't we use jiffies? At least to call ktime_get() once? > + } > +} > + > +static int > +ip_vs_mhs_permutate(struct ip_vs_mhs_state *s, struct ip_vs_service *svc) > +{ > + struct list_head *p; > + struct ip_vs_mhs_dest_setup *ds; > + struct ip_vs_dest *dest; > + int lw; > + > + /* If gcd is smaller then 1, number of dests or > + * all weight of dests are zero. So, skip > + * permutation for the dests. > + */ > + if (s->gcd < 1) > + return 0; > + > + /* Set dest_setup for the dests permutation */ > + p = &svc->destinations; > + ds = &s->dest_setup[0]; > + while ((p = p->next) != &svc->destinations) { > + dest = list_entry(p, struct ip_vs_dest, n_list); > + > + ds->offset = ip_vs_mhs_hashkey(svc->af, &dest->addr, dest->port, > + &s->hash1, 0) % > + IP_VS_MH_TAB_SIZE; > + ds->skip = ip_vs_mhs_hashkey(svc->af, &dest->addr, dest->port, > + &s->hash2, 0) % > + (IP_VS_MH_TAB_SIZE - 1) + 1; > + ds->perm = ds->offset; > + > + lw = atomic_read(&dest->weight); > + ds->turns = ((lw / s->gcd) >> s->rshift) ?: (lw != 0); > + ds++; > + } > + return 0; > +} > + > +static int > +ip_vs_mhs_populate(struct ip_vs_mhs_state *s, struct ip_vs_service *svc) > +{ > + int n, c, dt_count; > + unsigned long *table; > + struct list_head *p; > + struct ip_vs_mhs_dest_setup *ds; > + struct ip_vs_dest *dest, *new_dest; > + > + /* If gcd is smaller then 1, number of dests or > + * all last_weight of dests are zero. So, skip > + * the population for the dests and reset lookup table. > + */ > + if (s->gcd < 1) { > + ip_vs_mhs_reset(s); > + return 0; > + } > + > + table = kcalloc(BITS_TO_LONGS(IP_VS_MH_TAB_SIZE), sizeof(unsigned long), > + GFP_KERNEL); MH uses bitmap_zalloc for this... > + if (!table) > + return -ENOMEM; > + > + p = &svc->destinations; > + n = 0; > + dt_count = 0; > + while (n < IP_VS_MH_TAB_SIZE) { > + if (p == &svc->destinations) > + p = p->next; > + > + ds = &s->dest_setup[0]; > + while (p != &svc->destinations) { > + /* Ignore added server with zero weight */ > + if (ds->turns < 1) { > + p = p->next; > + ds++; > + continue; > + } > + > + c = ds->perm; > + while (test_bit(c, table)) { > + /* Add skip, mod s->tab_size */ IP_VS_MH_TAB_SIZE, no s->tab_size > + ds->perm += ds->skip; > + if (ds->perm >= IP_VS_MH_TAB_SIZE) > + ds->perm -= IP_VS_MH_TAB_SIZE; > + c = ds->perm; > + } > + > + __set_bit(c, table); > + > + dest = rcu_dereference_protected(s->lookup[c].dest, 1); > + new_dest = list_entry(p, struct ip_vs_dest, n_list); > + if (dest != new_dest) { > + if (dest) > + ip_vs_dest_put(dest); > + ip_vs_dest_hold(new_dest); > + RCU_INIT_POINTER(s->lookup[c].dest, new_dest); > + } > + > + if (++n == IP_VS_MH_TAB_SIZE) > + goto out; > + > + if (++dt_count >= ds->turns) { > + dt_count = 0; > + p = p->next; > + ds++; > + } > + } > + } > + > +out: > + kfree(table); bitmap_free > + return 0; > +} > + > +/* Assign all the hash buckets of the specified table with the service. */ > +static int > +ip_vs_mhs_reassign(struct ip_vs_mhs_state *s, struct ip_vs_service *svc) > +{ > + int ret; > + > + if (svc->num_dests > IP_VS_MH_TAB_SIZE) > + return -EINVAL; > + > + if (svc->num_dests >= 1) { > + s->dest_setup = kcalloc(svc->num_dests, > + sizeof(struct ip_vs_mhs_dest_setup), > + GFP_KERNEL); > + if (!s->dest_setup) > + return -ENOMEM; > + } > + > + ip_vs_mhs_permutate(s, svc); > + > + ret = ip_vs_mhs_populate(s, svc); > + if (ret < 0) > + goto out; > + > + IP_VS_DBG_BUF(6, "MHS: %s(): reassign lookup table of %s:%u\n", > + __func__, > + IP_VS_DBG_ADDR(svc->af, &svc->addr), > + ntohs(svc->port)); > + > +out: > + if (svc->num_dests >= 1) { > + kfree(s->dest_setup); > + s->dest_setup = NULL; > + } > + return ret; > +} > + > +static int > +ip_vs_mhs_gcd_weight(struct ip_vs_service *svc) > +{ > + struct ip_vs_dest *dest; > + int weight; > + int g = 0; > + > + list_for_each_entry(dest, &svc->destinations, n_list) { > + weight = atomic_read(&dest->weight); > + if (weight > 0) { > + if (g > 0) > + g = gcd(weight, g); > + else > + g = weight; > + } > + } > + return g; > +} > + > +/* To avoid assigning huge weight for the MH table, > + * calculate shift value with gcd. > + */ > +static int > +ip_vs_mhs_shift_weight(struct ip_vs_service *svc, int gcd) > +{ > + struct ip_vs_dest *dest; > + int new_weight, weight = 0; > + int mw, shift; > + > + /* If gcd is smaller then 1, number of dests or > + * all weight of dests are zero. So, return > + * shift value as zero. > + */ > + if (gcd < 1) > + return 0; > + > + list_for_each_entry(dest, &svc->destinations, n_list) { > + new_weight = atomic_read(&dest->weight); > + if (new_weight > weight) > + weight = new_weight; > + } > + > + /* Because gcd is greater than zero, > + * the maximum weight and gcd are always greater than zero > + */ > + mw = weight / gcd; > + > + /* shift = occupied bits of weight/gcd - MH highest bits */ > + shift = fls(mw) - IP_VS_MH_TAB_BITS; > + return (shift >= 0) ? shift : 0; > +} > + > +static ktime_t > +ip_vs_mhs_get_unstable_timeout(struct ip_vs_service *svc) > +{ > + struct ip_vs_proto_data *pd; > + u64 tcp_to, tcp_fin_to; > + > + pd = ip_vs_proto_data_get(svc->ipvs, IPPROTO_TCP); > + tcp_to = pd->timeout_table[IP_VS_TCP_S_ESTABLISHED]; > + tcp_fin_to = pd->timeout_table[IP_VS_TCP_S_FIN_WAIT]; > + return ns_to_ktime(jiffies64_to_nsecs(max(tcp_to, tcp_fin_to))); > +} > + > +static void > +ip_vs_mhs_state_free(struct rcu_head *head) > +{ > + struct ip_vs_mhs_state *s; > + > + s = container_of(head, struct ip_vs_mhs_state, rcu_head); > + kfree(s->lookup); > + kfree(s); > +} > + > +static int > +ip_vs_mhs_init_svc(struct ip_vs_service *svc) > +{ > + struct ip_vs_mhs_state *s0, *s1; > + struct ip_vs_mhs_two_states *states; > + ktime_t *tss; > + int ret; Scheduler is assigned to virtual service in 2 cases: - common case: new service is created, no dests - rare case: scheduler is changed for existing service with present dests in svc->destinations See when ip_vs_bind_scheduler() is called So, when ip_vs_mhs_init_svc() is called, for the common case, we will build empty states->first table. As result, we will start initially with unstable period of 15 mins. But it is hard to tell when all initial dests are added if we want to avoid it. > + > + /* Allocate timestamps */ > + tss = kcalloc(IP_VS_MH_TAB_SIZE, sizeof(ktime_t), GFP_KERNEL); > + if (!tss) > + return -ENOMEM; > + > + /* Allocate the first MH table for this service */ > + s0 = kzalloc(sizeof(*s0), GFP_KERNEL); > + if (!s0) { > + kfree(tss); > + return -ENOMEM; > + } > + > + s0->lookup = kcalloc(IP_VS_MH_TAB_SIZE, sizeof(struct ip_vs_mhs_lookup), > + GFP_KERNEL); > + if (!s0->lookup) { > + kfree(tss); > + kfree(s0); > + return -ENOMEM; > + } > + > + generate_hash_secret(&s0->hash1, &s0->hash2); > + s0->gcd = ip_vs_mhs_gcd_weight(svc); > + s0->rshift = ip_vs_mhs_shift_weight(svc, s0->gcd); > + > + IP_VS_DBG(6, > + "MHS: %s(): The first lookup table (memory=%zdbytes) allocated\n", > + __func__, > + sizeof(struct ip_vs_mhs_lookup) * IP_VS_MH_TAB_SIZE); > + > + /* Assign the first lookup table with current dests */ > + ret = ip_vs_mhs_reassign(s0, svc); > + if (ret < 0) { > + kfree(tss); > + ip_vs_mhs_reset(s0); > + ip_vs_mhs_state_free(&s0->rcu_head); > + return ret; > + } > + > + /* Allocate the second MH table for this service */ > + s1 = kzalloc(sizeof(*s1), GFP_KERNEL); > + if (!s1) { > + kfree(tss); > + ip_vs_mhs_reset(s0); > + ip_vs_mhs_state_free(&s0->rcu_head); > + return -ENOMEM; > + } > + s1->lookup = kcalloc(IP_VS_MH_TAB_SIZE, sizeof(struct ip_vs_mhs_lookup), > + GFP_KERNEL); > + if (!s1->lookup) { > + kfree(tss); > + ip_vs_mhs_reset(s0); > + ip_vs_mhs_state_free(&s0->rcu_head); > + kfree(s1); > + return -ENOMEM; > + } > + > + s1->hash1 = s0->hash1; > + s1->hash2 = s0->hash2; > + s1->gcd = s0->gcd; > + s1->rshift = s0->rshift; > + > + IP_VS_DBG(6, > + "MHS: %s(): The second lookup table (memory=%zdbytes) allocated\n", > + __func__, > + sizeof(struct ip_vs_mhs_lookup) * IP_VS_MH_TAB_SIZE); > + > + /* Assign the second lookup table with current dests */ > + ret = ip_vs_mhs_reassign(s1, svc); > + if (ret < 0) { > + kfree(tss); > + ip_vs_mhs_reset(s0); > + ip_vs_mhs_state_free(&s0->rcu_head); > + ip_vs_mhs_reset(s1); > + ip_vs_mhs_state_free(&s1->rcu_head); Too much things to release, probably, a common release point will look less risky. > + return ret; > + } > + > + /* Allocate, initialize and attach states */ > + states = kcalloc(1, sizeof(struct ip_vs_mhs_two_states), GFP_KERNEL); > + if (!states) { > + kfree(tss); > + ip_vs_mhs_reset(s0); > + ip_vs_mhs_state_free(&s0->rcu_head); > + ip_vs_mhs_reset(s1); > + ip_vs_mhs_state_free(&s1->rcu_head); > + return -ENOMEM; > + } > + > + states->first = s0; > + states->second = s1; > + states->timestamps = tss; > + states->unstable_timeout = ip_vs_mhs_get_unstable_timeout(svc); > + svc->sched_data = states; > + return 0; > +} > + > +static void > +ip_vs_mhs_done_svc(struct ip_vs_service *svc) > +{ > + struct ip_vs_mhs_two_states *states = svc->sched_data; > + > + kfree(states->timestamps); Freeing in done_svc is not RCU safe. You can call ip_vs_mhs_reset but RCU callback should free 'states'. And we can not run many RCU callbacks in parallel because their execution order is not guaranteed. So, single call_rcu for states should be used where we should free the first/second states and also timestamps and finally 'states'. > + > + /* Got to clean up the first lookup entry here */ > + ip_vs_mhs_reset(states->first); > + > + call_rcu(&states->first->rcu_head, ip_vs_mhs_state_free); > + IP_VS_DBG(6, > + "MHS: The first MH lookup table (memory=%zdbytes) released\n", > + sizeof(struct ip_vs_mhs_lookup) * IP_VS_MH_TAB_SIZE); > + > + /* Got to clean up the second lookup entry here */ > + ip_vs_mhs_reset(states->second); > + > + call_rcu(&states->second->rcu_head, ip_vs_mhs_state_free); > + IP_VS_DBG(6, > + "MHS: The second MH lookup table (memory=%zdbytes) released\n", > + sizeof(struct ip_vs_mhs_lookup) * IP_VS_MH_TAB_SIZE); > + > + kfree(states); > +} > + > +static int > +ip_vs_mhs_dest_changed(struct ip_vs_service *svc, > + struct ip_vs_dest *dest) > +{ > + struct ip_vs_mhs_two_states *states = svc->sched_data; > + struct ip_vs_mhs_state *s1 = states->second; > + int ret; > + > + s1->gcd = ip_vs_mhs_gcd_weight(svc); > + s1->rshift = ip_vs_mhs_shift_weight(svc, s1->gcd); > + > + /* Assign the lookup table with the updated service */ > + ret = ip_vs_mhs_reassign(s1, svc); > + > + ip_vs_mhs_update_timestamps(states); > + states->unstable_timeout = ip_vs_mhs_get_unstable_timeout(svc); > + IP_VS_DBG(6, > + "MHS: %s: set unstable timeout: %llu", > + __func__, > + ktime_divns(states->unstable_timeout, > + NSEC_PER_SEC)); > + return ret; > +} > + > +/* Helper function to get port number */ > +static inline __be16 > +ip_vs_mhs_get_port(const struct sk_buff *skb, struct ip_vs_iphdr *iph) > +{ > + __be16 _ports[2], *ports; > + > + /* At this point we know that we have a valid packet of some kind. > + * Because ICMP packets are only guaranteed to have the first 8 > + * bytes, let's just grab the ports. Fortunately they're in the > + * same position for all three of the protocols we care about. > + */ > + switch (iph->protocol) { > + case IPPROTO_TCP: > + case IPPROTO_UDP: > + case IPPROTO_SCTP: > + ports = skb_header_pointer(skb, iph->len, sizeof(_ports), > + &_ports); > + if (unlikely(!ports)) > + return 0; > + > + if (likely(!ip_vs_iph_inverse(iph))) > + return ports[0]; > + else > + return ports[1]; > + default: > + return 0; > + } > +} > + > +/* Get ip_vs_dest associated with supplied parameters. */ > +static inline void > +ip_vs_mhs_get(struct ip_vs_service *svc, > + struct ip_vs_mhs_two_states *states, > + struct ip_vs_mhs_two_dests *dests, > + const union nf_inet_addr *addr, > + __be16 port) > +{ > + unsigned int hash; > + ktime_t timestamp; > + > + hash = ip_vs_mhs_hashkey(svc->af, addr, port, &states->first->hash1, > + 0) % IP_VS_MH_TAB_SIZE; > + dests->dest = rcu_dereference(states->first->lookup[hash].dest); > + dests->new_dest = rcu_dereference(states->second->lookup[hash].dest); > + timestamp = states->timestamps[hash]; > + > + /* only unstable hashes have non-zero value */ > + if (timestamp > 0) { > + /* unstable */ > + if (timestamp + states->unstable_timeout > ktime_get()) { > + /* timer didn't expire */ > + dests->unstable = true; > + return; > + } > + /* unstable -> stable */ > + if (dests->dest) > + ip_vs_dest_put(dests->dest); > + if (dests->new_dest) > + ip_vs_dest_hold(dests->new_dest); > + dests->dest = dests->new_dest; > + RCU_INIT_POINTER(states->first->lookup[hash].dest, > + dests->new_dest); > + states->timestamps[hash] = (ktime_t)0; These operations are not SMP safe, many readers may try to switch to stable state at the same time. May be some xchg operation for timestamps[] can help. But it also races with reconfiguration, i.e. ip_vs_mhs_update_timestamps(), ip_vs_mhs_populate(), etc. As it is a rare condition, spin_lock_bh(&state->lock) will help instead. You should revalidate states->timestamps[hash] under lock. > + } > + /* stable */ > + dests->unstable = false; > +} > + > +/* Stateless Maglev Hashing scheduling */ > +static struct ip_vs_dest * > +ip_vs_mhs_schedule(struct ip_vs_service *svc, > + const struct sk_buff *skb, > + struct ip_vs_iphdr *iph, > + bool *need_state) > +{ > + struct ip_vs_mhs_two_dests dests; > + struct ip_vs_dest *final_dest = NULL; > + struct ip_vs_mhs_two_states *states = svc->sched_data; > + __be16 port = 0; > + const union nf_inet_addr *hash_addr; > + > + *need_state = false; > + hash_addr = ip_vs_iph_inverse(iph) ? &iph->daddr : &iph->saddr; > + > + if (svc->flags & IP_VS_SVC_F_SCHED_MH_PORT) > + port = ip_vs_mhs_get_port(skb, iph); > + > + ip_vs_mhs_get(svc, states, &dests, hash_addr, port); > + IP_VS_DBG_BUF(6, > + "MHS: %s(): source IP address %s:%u --> server %s and %s\n", > + __func__, > + IP_VS_DBG_ADDR(svc->af, hash_addr), > + ntohs(port), > + dests.dest > + ? IP_VS_DBG_ADDR(dests.dest->af, &dests.dest->addr) > + : "NULL", > + dests.new_dest > + ? IP_VS_DBG_ADDR(dests.new_dest->af, > + &dests.new_dest->addr) > + : "NULL"); > + > + if (!dests.dest && !dests.new_dest) { > + /* Both dests is NULL */ > + return NULL; > + } > + > + if (!(dests.dest && dests.new_dest)) { > + /* dest is NULL or new_dest is NULL, > + * so we send all packets to singular available dest > + * and create state > + */ > + if (dests.new_dest) { > + /* dest is NULL */ > + final_dest = dests.new_dest; > + } else { > + /* new_dest is NULL */ > + final_dest = dests.dest; In two cases we return dests.dest without checking for IP_VS_DEST_F_AVAILABLE, even, you keep the flag set after dest is removed which is not nice. If we do not want to fallback, in this case we should return NULL, eg. for ACK. Any traffic should stop if !IP_VS_DEST_F_AVAILABLE and if weight=0 only established connections should work. As for IP_VS_DEST_F_OVERLOAD, if used, it should lead to allocating connection to fallback server, something not suitable for every scheduler. > + } > + *need_state = true; > + IP_VS_DBG(6, > + "MHS: %s(): One dest, need_state=%s\n", > + __func__, > + *need_state ? "true" : "false"); > + } else if (dests.unstable) { > + /* unstable */ > + if (iph->protocol == IPPROTO_TCP) { > + /* TCP */ > + *need_state = true; Looks like we can use iph.hdr_flags & IP_VS_HDR_NEW_CONN instead of ip_vs_mhs_is_new_conn. IP_VS_HDR_NEW_CONN can be set where we call is_new_conn in ip_vs_in_hook: if (!iph.fragoffs && is_new_conn(skb, &iph)) iph.hdr_flags |= IP_VS_HDR_NEW_CONN; if (iph.hdr_flags & IP_VS_HDR_NEW_CONN && cp) { > + if (ip_vs_mhs_is_new_conn(skb, iph)) { > + /* SYN packet */ > + final_dest = dests.new_dest; > + IP_VS_DBG(6, > + "MHS: %s(): Unstable, need_state=%s, SYN packet\n", > + __func__, > + *need_state ? "true" : "false"); > + } else { > + /* Not SYN packet */ > + final_dest = dests.dest; > + IP_VS_DBG(6, > + "MHS: %s(): Unstable, need_state=%s, not SYN packet\n", > + __func__, > + *need_state ? "true" : "false"); > + } > + } else if (iph->protocol == IPPROTO_UDP) { > + /* UDP */ > + final_dest = dests.new_dest; > + IP_VS_DBG(6, > + "MHS: %s(): Unstable, need_state=%s, UDP packet\n", > + __func__, > + *need_state ? "true" : "false"); > + } > + } else { > + /* stable */ > + final_dest = dests.dest; > + IP_VS_DBG(6, > + "MHS: %s(): Stable, need_state=%s\n", > + __func__, > + *need_state ? "true" : "false"); > + } > + return final_dest; > +} > + > +/* IPVS MHS Scheduler structure */ > +static struct ip_vs_scheduler ip_vs_mhs_scheduler = { > + .name = "mhs", > + .refcnt = ATOMIC_INIT(0), > + .module = THIS_MODULE, > + .n_list = LIST_HEAD_INIT(ip_vs_mhs_scheduler.n_list), > + .init_service = ip_vs_mhs_init_svc, > + .done_service = ip_vs_mhs_done_svc, > + .add_dest = ip_vs_mhs_dest_changed, > + .del_dest = ip_vs_mhs_dest_changed, > + .upd_dest = ip_vs_mhs_dest_changed, > + .schedule_sl = ip_vs_mhs_schedule, > +}; > + > +static int __init > +ip_vs_mhs_init(void) > +{ > + return register_ip_vs_scheduler(&ip_vs_mhs_scheduler); > +} > + > +static void __exit > +ip_vs_mhs_cleanup(void) > +{ > + unregister_ip_vs_scheduler(&ip_vs_mhs_scheduler); > + rcu_barrier(); > +} > + > +module_init(ip_vs_mhs_init); > +module_exit(ip_vs_mhs_cleanup); > +MODULE_DESCRIPTION("Stateless Maglev hashing ipvs scheduler"); > +MODULE_LICENSE("GPL"); > +MODULE_AUTHOR("Lev Pantiukhin <kndrvt@xxxxxxxxxxxxxx>"); > diff --git a/net/netfilter/ipvs/ip_vs_proto_tcp.c b/net/netfilter/ipvs/ip_vs_proto_tcp.c > index 7da51390cea6..31a8c1bfc863 100644 > --- a/net/netfilter/ipvs/ip_vs_proto_tcp.c > +++ b/net/netfilter/ipvs/ip_vs_proto_tcp.c > @@ -38,7 +38,7 @@ tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb, > struct ip_vs_iphdr *iph) > { > struct ip_vs_service *svc; > - struct tcphdr _tcph, *th; > + struct tcphdr _tcph, *th = NULL; > __be16 _ports[2], *ports = NULL; > > /* In the event of icmp, we're only guaranteed to have the first 8 > @@ -47,11 +47,8 @@ tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb, > */ > if (likely(!ip_vs_iph_icmp(iph))) { > th = skb_header_pointer(skb, iph->len, sizeof(_tcph), &_tcph); > - if (th) { > - if (th->rst || !(sysctl_sloppy_tcp(ipvs) || th->syn)) > - return 1; > + if (th) > ports = &th->source; > - } > } else { > ports = skb_header_pointer( > skb, iph->len, sizeof(_ports), &_ports); > @@ -74,6 +71,17 @@ tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb, > if (svc) { > int ignored; > > + if (th) { > + /* If sloppy_tcp or IP_VS_SVC_F_STATELESS is true, > + * all SYN packets are scheduled except packets > + * with set RST flag. > + */ > + if (!sysctl_sloppy_tcp(ipvs) && > + !(svc->flags & IP_VS_SVC_F_STATELESS) && > + (!th->syn || th->rst)) > + return 1; > + } Probably same can be done for sctp_conn_schedule() > + > if (ip_vs_todrop(ipvs)) { > /* > * It seems that we are very loaded. > -- > 2.17.1 Regards -- Julian Anastasov <ja@xxxxxx>