Re: Kernel panic nf_nat_setup_info+0x5b3/0x6e0

Eric Dumazet <eric.dumazet@xxxxxxxxx> · Fri, 01 Apr 2011 07:30:40 +0200

Le vendredi 01 avril 2011 Ã 10:02 +0800, Changli Gao a Ãcrit :
> On Thu, Mar 31, 2011 at 10:47 PM, Eric Dumazet <eric.dumazet@xxxxxxxxx> wrote:
> >
> > I wonder if this is not hiding another bug.
> >
> > Adding an RCU grace period might reduce the probability window.
> 
> What bug? This one?
> 

Yes. I am saying your patch is a brown paper bag.

Real bug is elsewhere, and we must find it, not make it happen less
frequently.

Maybe its a missing barrier, and adding a full RCU grace period is a
waste (of cpu caches space, since call_rcu() fill a potential big list)

> >
> > By the time nf_conntrack_free(ct) is called, no other cpu/thread
> > could/should use ct, or ct->ext ?
> 
> nat->ct may refer it.
> 

conntrack must have a real problem of refcounting then.

Each time an object A has a reference to object B, we should increase B
refcount, so B cannot disappear.

nat->ct _cannot_ refer ct if we are freeing ct. Thats quite simple rule.

> >
> > Sure, another thread can find/pass_on ct in a lookup but should not use
> > it, since its refcount (ct_general.use) should be 0.
> >
> >
> 
> As SLAB_DESTROY_BY_RCU is used, we should validate ct->orig_tuple too.
> 
> There is another minor problem.
> 

minor or serious ?

> nf_nat_core.c:
> 
> 133         rcu_read_lock();
> 134         hlist_for_each_entry_rcu(nat, n,
> &net->ipv4.nat_bysource[h], bysourc    e) {
> 135                 ct = nat->ct;
> 136                 if (same_src(ct, tuple) && nf_ct_zone(ct) == zone) {
> 137                         /* Copy source part from reply tuple. */
> 138                         nf_ct_invert_tuplepr(result,
> 139
> &ct->tuplehash[IP_CT_DIR_REPLY].tuple    );
> 140                         result->dst = tuple->dst;
> 141
> 142                         if (in_range(result, range)) {
> 143                                 rcu_read_unlock();
> 144                                 return 1;
> 145                         }
> 146                 }
> 147         }
> 148         rcu_read_unlock();
> 
> If the ct is reused, NAT mapping will be wrong. It isn't a serious
> problem, and can't be fixed, even though we check the reference
> counter before using it, but we can't validate it with the original
> tuple.
> 

I call this a serious problem. netfilter is not a fuzzy logic.

> Maybe we can do it in this way
> 
>                 ct = nat->ct;
>                 if (!nf_ct_is_dying(ct) &&
> atomic_inc_not_zero(&ct->ct_general.use))) {
>                                if (nf_ct_ext_find(ct, NF_CT_EXT_NAT) == nat) {
>                                         /* ct is valid, do sth... */
>                                }
>                                nf_ct_put(ct);
>                 }
> 
> I think two additional atomic operations are expensive. It isn't a good idea.
> 

It depends. This is better than taking the conntrack lock.

SLAB_DESTROY_BY_RCU is not allowing for fuzzy logic. Rules are we _must_
take a reference on object after finding it, and _recheck_ validity of
the object before using it.

Another way to avoid atomic operations in find_appropriate_src() is to
use a seqcount (or seqlock), and change the seqcount every time
something is changed in ct.

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html