On Tue, Jan 07, 2014 at 07:08:25AM -0800, Eric Dumazet wrote: > On Tue, 2014-01-07 at 14:31 +0400, Andrey Vagin wrote: > > Lets look at destroy_conntrack: > > > > hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); > > ... > > nf_conntrack_free(ct) > > kmem_cache_free(net->ct.nf_conntrack_cachep, ct); > > > > net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU. > > > > The hash is protected by rcu, so readers look up conntracks without > > locks. > > A conntrack is removed from the hash, but in this moment a few readers > > still can use the conntrack. Then this conntrack is released and another > > thread creates conntrack with the same address and the equal tuple. > > After this a reader starts to validate the conntrack: > > * It's not dying, because a new conntrack was created > > * nf_ct_tuple_equal() returns true. > > > > But this conntrack is not initialized yet, so it can not be used by two > > threads concurrently. In this case BUG_ON may be triggered from > > nf_nat_setup_info(). > > > > Florian Westphal suggested to check the confirm bit too. I think it's > > right. > > > > task 1 task 2 task 3 > > nf_conntrack_find_get > > ____nf_conntrack_find > > destroy_conntrack > > hlist_nulls_del_rcu > > nf_conntrack_free > > kmem_cache_free > > __nf_conntrack_alloc > > kmem_cache_alloc > > memset(&ct->tuplehash[IP_CT_DIR_MAX], > > if (nf_ct_is_dying(ct)) > > if (!nf_ct_tuple_equal() > > > > I'm not sure, that I have ever seen this race condition in a real life. > > Currently we are investigating a bug, which is reproduced on a few node. > > In our case one conntrack is initialized from a few tasks concurrently, > > we don't have any other explanation for this. > > > > > Cc: Florian Westphal <fw@xxxxxxxxx> > > Cc: Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> > > Cc: Patrick McHardy <kaber@xxxxxxxxx> > > Cc: Jozsef Kadlecsik <kadlec@xxxxxxxxxxxxxxxxx> > > Cc: "David S. Miller" <davem@xxxxxxxxxxxxx> > > Cc: Cyrill Gorcunov <gorcunov@xxxxxxxxxx> > > Signed-off-by: Andrey Vagin <avagin@xxxxxxxxxx> > > --- > > net/netfilter/nf_conntrack_core.c | 6 +++++- > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c > > index 43549eb..7a34bb2 100644 > > --- a/net/netfilter/nf_conntrack_core.c > > +++ b/net/netfilter/nf_conntrack_core.c > > @@ -387,8 +387,12 @@ begin: > > !atomic_inc_not_zero(&ct->ct_general.use))) > > h = NULL; > > else { > > + /* A conntrack can be recreated with the equal tuple, > > + * so we need to check that the conntrack is initialized > > + */ > > if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) || > > - nf_ct_zone(ct) != zone)) { > > + nf_ct_zone(ct) != zone) || > > + !nf_ct_is_confirmed(ct)) { > > nf_ct_put(ct); > > goto begin; > > } > > I do not think this is the right way to fix this problem (if said > problem is confirmed) > > Remember the rule about SLAB_DESTROY_BY_RCU : > > When a struct is freed, then reused, its important to set the its refcnt > (from 0 to 1) only when the structure is fully ready for use. > > If a lookup finds a structure which is not yet setup, the > atomic_inc_not_zero() will fail. > > Take a look at sk_clone_lock() and Documentation/RCU/rculist_nulls.txt > I have one more question. Looks like I found another problem. init_conntrack: hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, &net->ct.unconfirmed); __nf_conntrack_hash_insert: hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, &net->ct.hash[hash]); We use one hlist_nulls_node to add a conntrack into two different lists. As far as I understand, it's unacceptable in case of SLAB_DESTROY_BY_RCU. Lets imagine that we have two threads. The first one enumerates objects from a first list, the second one recreates an object and insert it in a second list. If a first thread in this moment stays on the object, it can read "next", when it's in the second list, so it will continue to enumerate objects from the second list. It is not what we want, isn't it? cpu1 cpu2 hlist_nulls_for_each_entry_rcu(ct) destroy_conntrack kmem_cache_free init_conntrack hlist_nulls_add_head_rcu ct = ct->next -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html