Hi Joerg, I'm cc'ing netfilter-devel instead, that is used for development stuff. Joerg Marx wrote: > Because nobody commented this so far on the 'linux-netdev' list, now a > re-post on the 'netfilter' list with a kind request for confirmation > and/or discussion... > > ################################################################### > > > Hi, > > we encountered a weird problem of a 'stalled' connection when using > 'conntrack -F' on a box with heavy network load. 'conntrack -L' gave us > sometimes a [UNREPLIED] entry for the traffic in question, but no traffic > flow, no matching of packets in or out, only the timer went down from 600 > to 0 (it was ESP traffic - the default generic timeout = 600 seconds). > After the entry vanished by timeout (or doing a 'conntrack -F' once more), > all worked normal again. > > The reason we finally found, is a race window in 'nf_conntrack_confirm' > when calling '__nf_conntrack_confirm': > > In 'nf_conntrack_confirm' is checked (without holding a lock), if the entry > to be confirmed is possibly dying: !nf_ct_is_dying(ct). > If not, then __nf_conntrack_confirm will do some sanity checking, grab a > spin_lock_bh and insert the 'ct' into the lookup cache. > > Now consider the following scenario: > > 1. a connection has seen the first packet already -> state is UNREPLIED > 2. now the answer is to be sent, conntrack wants to confirm the connection > 3. the !nf_ct_is_dying(ct) check is passed, __nf_conntrack_confirm is just > started > 4. in a user context a 'conntrack -F' command is running right now e.g. on > another CPU > 5. this will flag all unconfirmed connections as 'dying' in > get_next_corpse(...), including the entry going to be confirmed! > 6. now the already 'dying' entry is included into the hash cache in > __nf_conntrack_confirm - BOOM! > > After this step the connection in question is dead, because no packets are > forwarded until the entry is purged from hash cache. This was a big blocker > for us, because each dead IPsec tunnel is a dead branch network for 10 > minutes... > > For every packet from now on 'nf_conntrack_find_get' will ignore the entry, > because it is dying and because __nf_conntrack_confirm finds the hash in > the cache already, it will NF_DROP the packet. > > The key for finding this was 'NF_CT_STAT_INC(net, insert_failed)' in > __nf_conntrack_confirm. > > The suggested solution is to check for '!nf_ct_is_dying(ct)' again, _after_ > the spin_lock_bh is grabbed in __nf_conntrack_confirm. So it is clear, that > no other softirq or user context can set that 'evil' dying flag ;-) > The return value in this case should be NF_ACCEPT, so we loose no packets > then, this is also important for us. > > > --- > net/netfilter/nf_conntrack_core.c | 5 +++++ > 1 files changed, 5 insertions(+), 0 deletions(-) > > diff --git a/net/netfilter/nf_conntrack_core.c > b/net/netfilter/nf_conntrack_core.c > index 1374179..e2c8bfe 100644 > --- a/net/netfilter/nf_conntrack_core.c > +++ b/net/netfilter/nf_conntrack_core.c > @@ -413,6 +413,11 @@ __nf_conntrack_confirm(struct sk_buff *skb) > > spin_lock_bh(&nf_conntrack_lock); > > + if (unlikely(nf_ct_is_dying(ct))) { > + spin_unlock_bh(&nf_conntrack_lock); > + return NF_ACCEPT; > + } > + > /* See if there's one in the list already, including reverse: > NAT could have grabbed it without realizing, since we're > not in the hash. If there is, we lost race. */ > -- 1.5.6.5 I think this patch is fine. Patrick, would you apply it and pass it to -stable, please? Thanks! Acked-by: Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> -- To unsubscribe from this list: send the line "unsubscribe netfilter" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html