Jozsef Kadlecsik <kadlec@xxxxxxxxxxxxx> wrote: > Linkui Xiao reported that there's a race condition when ipset swap and destroy is > called, which can lead to crash in add/del/test element operations. Swap then > destroy are usual operations to replace a set with another one in a production > system. The issue can in some cases be reproduced with the script: Hi Jozsef, > iptables -A INPUT -m set --match-set hash_ip1 src -j ACCEPT > while [ 1 ] > do > # ... Ongoing traffic... > ipset create hash_ip2 hash:net family inet hashsize 1024 maxelem 1048576 > ipset add hash_ip2 172.20.0.0/16 > ipset swap hash_ip1 hash_ip2 > ipset destroy hash_ip2 > sleep 0.05 > done > > In the race case the possible order of the operations are > > CPU0 CPU1 > ip_set_test > ipset swap hash_ip1 hash_ip2 > ipset destroy hash_ip2 > hash_net_kadt > > Swap replaces hash_ip1 with hash_ip2 and then destroy removes hash_ip2 which > is the original hash_ip1. ip_set_test was called on hash_ip1 and because destroy > removed it, hash_net_kadt crashes. > > The fix is to protect both the list of the sets and the set pointers in an extended RCU > region and before exiting ip_set_swap(), wait to finish all started rcu_read_lock(). > The first version of the patch was written by Linkui Xiao <xiaolinkui@xxxxxxxxxx>. > > v2: synchronize_rcu() is moved into ip_set_swap() in order not to burden > ip_set_destroy() unnecessarily when all sets are destroyed. Hmm. Isn't it enough to only call synchronize_rcu() in ip_set_swap? All netfilter hooks run with rcu_read_lock() held, em_ipset.c wraps the entire ip_set_test() in rcu read lock/unlock pair. > @@ -704,13 +704,18 @@ ip_set_rcu_get(struct net *net, ip_set_id_t index) > struct ip_set_net *inst = ip_set_pernet(net); > > rcu_read_lock(); > - /* ip_set_list itself needs to be protected */ > + /* ip_set_list and the set pointer need to be protected */ > set = rcu_dereference(inst->ip_set_list)[index]; > - rcu_read_unlock(); > > return set; > } ... so I don't understand why ip_set_rcu_get() has to extend the locked section. AFAICS there are only two type of callers: 1. rcu read lock is already held (datapath) 2. ipset nfnl subsys mutex is held *probably* This could be changed in a separate patch to: - rcu_read_lock(); - /* ip_set_list itself needs to be protected */ - set = rcu_dereference(inst->ip_set_list)[index]; - rcu_read_unlock(); + /* ip_set_list and the set pointer need to be protected */ + return rcu_dereference_check(inst->ip_set_list, lockdep_nfnl_is_held(NFNL_SUBSYS_IPSET)[index]; This is an example, probably better to add a small "ip_set_dereference_nfnl" helper to hide the lockdep construct... Not saying the patch is wrong; rcu read locks nest and ipset locking is not simple so I might be missing something.