On Tue, 2014-01-07 at 14:31 +0400, Andrey Vagin wrote: > Lets look at destroy_conntrack: > > hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); > ... > nf_conntrack_free(ct) > kmem_cache_free(net->ct.nf_conntrack_cachep, ct); > > net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU. > > The hash is protected by rcu, so readers look up conntracks without > locks. > A conntrack is removed from the hash, but in this moment a few readers > still can use the conntrack. Then this conntrack is released and another > thread creates conntrack with the same address and the equal tuple. > After this a reader starts to validate the conntrack: > * It's not dying, because a new conntrack was created > * nf_ct_tuple_equal() returns true. > > But this conntrack is not initialized yet, so it can not be used by two > threads concurrently. In this case BUG_ON may be triggered from > nf_nat_setup_info(). > > Florian Westphal suggested to check the confirm bit too. I think it's > right. > > task 1 task 2 task 3 > nf_conntrack_find_get > ____nf_conntrack_find > destroy_conntrack > hlist_nulls_del_rcu > nf_conntrack_free > kmem_cache_free > __nf_conntrack_alloc > kmem_cache_alloc > memset(&ct->tuplehash[IP_CT_DIR_MAX], > if (nf_ct_is_dying(ct)) > if (!nf_ct_tuple_equal() > > I'm not sure, that I have ever seen this race condition in a real life. > Currently we are investigating a bug, which is reproduced on a few node. > In our case one conntrack is initialized from a few tasks concurrently, > we don't have any other explanation for this. > > Cc: Florian Westphal <fw@xxxxxxxxx> > Cc: Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> > Cc: Patrick McHardy <kaber@xxxxxxxxx> > Cc: Jozsef Kadlecsik <kadlec@xxxxxxxxxxxxxxxxx> > Cc: "David S. Miller" <davem@xxxxxxxxxxxxx> > Cc: Cyrill Gorcunov <gorcunov@xxxxxxxxxx> > Signed-off-by: Andrey Vagin <avagin@xxxxxxxxxx> > --- > net/netfilter/nf_conntrack_core.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c > index 43549eb..7a34bb2 100644 > --- a/net/netfilter/nf_conntrack_core.c > +++ b/net/netfilter/nf_conntrack_core.c > @@ -387,8 +387,12 @@ begin: > !atomic_inc_not_zero(&ct->ct_general.use))) > h = NULL; > else { > + /* A conntrack can be recreated with the equal tuple, > + * so we need to check that the conntrack is initialized > + */ > if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) || > - nf_ct_zone(ct) != zone)) { > + nf_ct_zone(ct) != zone) || > + !nf_ct_is_confirmed(ct)) { > nf_ct_put(ct); > goto begin; > } I do not think this is the right way to fix this problem (if said problem is confirmed) Remember the rule about SLAB_DESTROY_BY_RCU : When a struct is freed, then reused, its important to set the its refcnt (from 0 to 1) only when the structure is fully ready for use. If a lookup finds a structure which is not yet setup, the atomic_inc_not_zero() will fail. Take a look at sk_clone_lock() and Documentation/RCU/rculist_nulls.txt -- To unsubscribe from this list: send the line "unsubscribe netfilter" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html