	Hello,

On Mon, 21 May 2018, Michal Koutný wrote:

> IPVS includes protection against filling the ip_vs_conn_tab by dropping 1/32 of
> feasible entries every second. The template entries (for persistent services)
> are never directly deleted by this mechanism but when a picked TCP connection
> entry is being dropped (1), the respective template entry is dropped too
> (realized by expiring 60 seconds after the connection entry being dropped).

	We try to drop the template in ip_vs_random_dropentry()
but I guess kernel/time/timer.c:enqueue_timer() puts both timers in
reverse order for expiration by using hlist_add_head().

> There is another mechanism that removes connection entries when they
> time out (2), in this case the associated template entry is not deleted.
> Under SYN flood template entries would accumulate (due to their entry
> longer timeout).

	There is also ip_vs_todrop() called in tcp_conn_schedule().
It just drops a portion of the SYNs under memory pressure.

> The accumulation takes place also with drop_entry being enabled. Roughly
> 15% ((31/32)^60) of SYN_RECV connections survive the dropping mechanism
> (1) and are removed by the timeout mechanism (2)(defaults to 60 seconds
> for SYN_RECV), thus template entries would still accumulate.
>
> The patch ensures that when a connection entry times out, we also remove the
> template entry from the table. To prevent breaking persistent services (since
> the connection may time out in already established state) we add a new entry
> flag to protect templates what spawned at least one established TCP connection.
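
	As a side check of the 15% figure quoted above: assuming each
one-second pass of the dropper removes an independent 1/32 of the
entries, the fraction still present after 60 passes can be computed
with a small user-space helper. This is only a sketch of the arithmetic;
survival_fraction() is a hypothetical name, not a kernel function.

```c
#include <assert.h>

/*
 * Fraction of entries still present after `seconds` passes of a dropper
 * that removes 1/`divisor` of the entries on every pass (one pass per
 * second). Idealised model: each pass hits every entry independently.
 */
static double survival_fraction(unsigned int divisor, unsigned int seconds)
{
	double keep, frac = 1.0;
	unsigned int i;

	assert(divisor > 0);

	/* probability that one entry survives a single pass */
	keep = 1.0 - 1.0 / divisor;

	/* repeated multiplication, so no libm dependency */
	for (i = 0; i < seconds; i++)
		frac *= keep;

	return frac;
}
```

Under this model survival_fraction(32, 60) comes out at roughly 0.149,
which is consistent with the "roughly 15%" estimate for SYN_RECV
connections that outlive the dropper and reach the 60 second timeout.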
>
> Cc: Michal Kubeček <mkubecek@xxxxxxxx>
> Signed-off-by: Michal Koutný <mkoutny@xxxxxxxx>
> ---
>  include/uapi/linux/ip_vs.h           | 33 +++++++++++++++++----------------
>  net/netfilter/ipvs/ip_vs_conn.c      | 10 +++++++++-
>  net/netfilter/ipvs/ip_vs_core.c      | 15 ++++++++++++++-
>  net/netfilter/ipvs/ip_vs_proto_tcp.c |  6 ++++++
>  4 files changed, 46 insertions(+), 18 deletions(-)
>
> diff --git a/include/uapi/linux/ip_vs.h b/include/uapi/linux/ip_vs.h
> index 1c916b2f89dc..ef3bbc001fcd 100644
> --- a/include/uapi/linux/ip_vs.h
> +++ b/include/uapi/linux/ip_vs.h
> @@ -79,22 +79,23 @@
>   * IPVS Connection Flags
>   * Only flags 0..15 are sent to backup server
>   */
> -#define IP_VS_CONN_F_FWD_MASK 0x0007 /* mask for the fwd methods */
> -#define IP_VS_CONN_F_MASQ 0x0000 /* masquerading/NAT */
> -#define IP_VS_CONN_F_LOCALNODE 0x0001 /* local node */
> -#define IP_VS_CONN_F_TUNNEL 0x0002 /* tunneling */
> -#define IP_VS_CONN_F_DROUTE 0x0003 /* direct routing */
> -#define IP_VS_CONN_F_BYPASS 0x0004 /* cache bypass */
> -#define IP_VS_CONN_F_SYNC 0x0020 /* entry created by sync */
> -#define IP_VS_CONN_F_HASHED 0x0040 /* hashed entry */
> -#define IP_VS_CONN_F_NOOUTPUT 0x0080 /* no output packets */
> -#define IP_VS_CONN_F_INACTIVE 0x0100 /* not established */
> -#define IP_VS_CONN_F_OUT_SEQ 0x0200 /* must do output seq adjust */
> -#define IP_VS_CONN_F_IN_SEQ 0x0400 /* must do input seq adjust */
> -#define IP_VS_CONN_F_SEQ_MASK 0x0600 /* in/out sequence mask */
> -#define IP_VS_CONN_F_NO_CPORT 0x0800 /* no client port set yet */
> -#define IP_VS_CONN_F_TEMPLATE 0x1000 /* template, not connection */
> -#define IP_VS_CONN_F_ONE_PACKET 0x2000 /* forward only one packet */
> +#define IP_VS_CONN_F_FWD_MASK 0x0007 /* mask for the fwd methods */
> +#define IP_VS_CONN_F_MASQ 0x0000 /* masquerading/NAT */
> +#define IP_VS_CONN_F_LOCALNODE 0x0001 /* local node */
> +#define IP_VS_CONN_F_TUNNEL 0x0002 /* tunneling */
> +#define IP_VS_CONN_F_DROUTE 0x0003 /* direct routing */
> +#define IP_VS_CONN_F_BYPASS 0x0004 /* cache bypass */
> +#define IP_VS_CONN_F_SYNC 0x0020 /* entry created by sync */
> +#define IP_VS_CONN_F_HASHED 0x0040 /* hashed entry */
> +#define IP_VS_CONN_F_NOOUTPUT 0x0080 /* no output packets */
> +#define IP_VS_CONN_F_INACTIVE 0x0100 /* not established */
> +#define IP_VS_CONN_F_OUT_SEQ 0x0200 /* must do output seq adjust */
> +#define IP_VS_CONN_F_IN_SEQ 0x0400 /* must do input seq adjust */
> +#define IP_VS_CONN_F_SEQ_MASK 0x0600 /* in/out sequence mask */
> +#define IP_VS_CONN_F_NO_CPORT 0x0800 /* no client port set yet */
> +#define IP_VS_CONN_F_TEMPLATE 0x1000 /* template, not connection */
> +#define IP_VS_CONN_F_ONE_PACKET 0x2000 /* forward only one packet */
> +#define IP_VS_CONN_F_TMPL_PERSISTED 0x4000 /* template, confirmed persistent */
>
>  /* Initial bits allowed in backup server */
>  #define IP_VS_CONN_F_BACKUP_MASK (IP_VS_CONN_F_FWD_MASK | \
> diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> index 370abbf6f421..6afc606a388c 100644
> --- a/net/netfilter/ipvs/ip_vs_conn.c
> +++ b/net/netfilter/ipvs/ip_vs_conn.c
> @@ -820,6 +820,7 @@ static void ip_vs_conn_rcu_free(struct rcu_head *head)
>  static void ip_vs_conn_expire(struct timer_list *t)
>  {
>  	struct ip_vs_conn *cp = from_timer(cp, t, timer);
> +	struct ip_vs_conn *cp_c;
>  	struct netns_ipvs *ipvs = cp->ipvs;
>
>  	/*
> @@ -834,8 +835,15 @@ static void ip_vs_conn_expire(struct timer_list *t)
>  		del_timer(&cp->timer);
>
>  		/* does anybody control me? */
> -		if (cp->control)
> +		cp_c = cp->control;
> +		if (cp_c) {
>  			ip_vs_control_del(cp);
> +			if (cp_c->flags & IP_VS_CONN_F_TEMPLATE &&
> +			    !(cp_c->flags & IP_VS_CONN_F_TMPL_PERSISTED)) {
> +				IP_VS_DBG(4, "del conn template\n");
> +				ip_vs_conn_expire_now(cp_c);

	So, we have current conn expired after 60 seconds in
IP_VS_TCP_S_SYN_RECV state and possibly other conns in same state
that are not expired yet.
	Another option is just to use something like:

	if (cp_c) {
		ip_vs_control_del(cp);
		/* Restart cp_c timer only for last conn */
		if (!atomic_read(&cp_c->n_control) &&
		    (cp_c->flags & IP_VS_CONN_F_TEMPLATE) &&
		    /* Some func to decide when to drop cp_c:
		     * - it can be for SYN state
		     * - it can be when cp was dropped on load
		     */
		    cp->state == IP_VS_TCP_S_SYN_RECV) {
			IP_VS_DBG(4, "del conn template\n");
			ip_vs_conn_expire_now(cp_c);
		}
	}

	It is not perfect, i.e. it does not know if there was
some conn that was established in the past:

- CONN1: SYN, SYN+ACK, ESTABLISH, FIN, FIN+ACK, expire
- CONN2: expire in SYN state, drop tpl before persistent timeout

	But it should work in the general case. Anyways,
give me some days to think more on this issue.

Regards

--
Julian Anastasov <ja@xxxxxx>
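
For experimenting with the check suggested above outside the kernel, the
decision can be modelled in user space roughly as follows. This is only
a sketch under stated assumptions: every sk_* name is a hypothetical
stand-in for the corresponding ip_vs_conn field or helper, the template
flag value is copied from the quoted header, and the state values are
arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; only the 0x1000 value comes from the header. */
#define SK_CONN_F_TEMPLATE	0x1000

enum sk_tcp_state {
	SK_TCP_S_SYN_RECV,
	SK_TCP_S_ESTABLISHED,
	SK_TCP_S_FIN_WAIT,
};

struct sk_conn {
	unsigned int flags;
	int n_control;			/* conns controlled by this template */
	enum sk_tcp_state state;
	struct sk_conn *control;	/* controlling template, if any */
};

/*
 * Model of the expire path for one connection: detach it from its
 * template and report whether the template should be expired as well.
 * The template goes only when this was its last controlled conn and
 * the conn timed out while still in SYN_RECV.
 */
static bool sk_expire_drops_template(struct sk_conn *cp)
{
	struct sk_conn *tpl = cp->control;

	if (!tpl)
		return false;

	/* stands in for ip_vs_control_del(cp) */
	assert(tpl->n_control > 0);
	cp->control = NULL;
	tpl->n_control--;

	return tpl->n_control == 0 &&
	       (tpl->flags & SK_CONN_F_TEMPLATE) &&
	       cp->state == SK_TCP_S_SYN_RECV;
}
```

With two SYN_RECV conns under one template, only the expiry of the
second (last) one reports that the template should go. A conn expiring
in any other state keeps it. The CONN1/CONN2 limitation shows up here
too: once CONN1 is gone, nothing in the model remembers that the
template ever had an established connection.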