> > I did not say that you should do anything, I just give an option (like > > branch) to let you quickly see changes I made if you wish. Maybe you > > find them totally useless:) > > I see - a branch costs nothing, so that's no problem. Then create a > numbered list of the patches, put them in a tar file and let me know from > where I can download it. Thanks Jozsef, I have prepare patches for you and put link to the series. http://ftp.payport.com.ua/users/sergey/public/netfilter-devel-2015031701.tar These patches will be against nf-next for simplicity and I'm already rebase them, so for first review just fresh nf-next will be needed. > > > > For example quick overview of the current RCU implementation in the > > > > ip_set_core.c gives no hint on how all further readers are prevented > > > > from entering set before it is being destroyed (i.e. we can destroy > > > > core set data structures with ip_set_destroy_set() while they being > > > > accessed by the ip_set_test(). > > > > > > That's excluded by the set references: set can be destroyed only when > > > it's > > > reference counter is zero and ip_set_test() can be called for referenced > > > sets only. > > > > I did not agree with this since ip_set.c:ip_set_destroy_set() after > > checking of reference counters is called, but ip_set(inst, index) = NULL > > (which itself looks unclear since pointer assignments alone not > > guaranteed to be atomic, thus rcu_assign_pointer() and friends exists) > > used without any RCU grace period waiters like synchronize_rcu(). > > No. Please check xt_set.c and all the checkentry functions: sets are > referenced there and until such reference is not created, no > ip_set_test|add|del() function can use the set. The reference is released > in the match/target destroy functions - and the set can be destroyed only > after that. When the pointer assigment happens above, there cannot be any > user of the set, there's no need for any protection. If you mean ip_set*nfnl*() then they just needed to be used in the properly written code (such as xt_set.c and em_ipset.c). However they just reference set to prevent it from deletion. Set already published in netns ip_set_list and available to ip_set_add() and ip_set_del() (sure anyone should not call these functions from the kernel without first referencing the set with ip_set*nfnl*()). On the other hand, code using ip_set_{add,del,test}() should also ensure there is no callers to such ones after ip_set*nfnl*(). I reviewed all of such code in xt_set.c and em_ipset.c and you are right. Im just try to be more paranoid in the subsystem implementation. > > > Yes I see rcu_barrier() is used in each set module, however it is more > > for waiting of callbacks to complete (or kfree_rcu). > > It's called from the module unload path. Actually, at that time there > isn't any set belonging to the given set type (because otherwise the > module could not be removed). I think it is not necessary at all if no outstanding rcu callbacks are used in favor to kfree_rcu() which itselfs rcu callback to kfree() to just free data without any actions taken (for which rcu callbacks intended). In my series callbacks used to destroy memory cache allocations and extensions. I like to take full copy of the bucket before modification with extensions and not rely on set_bit()/clear_bit() it gives on guarantee on reordering on arches other than x86 (see comment in include/asm-generic/atomic.h). So adding new element takes copy of the full bucket (which is up to 12 positions for all hash:* except netiface, where it 64 positions) and extensions. Yes, it seems expensive, but correct in all aspects including extensions. Previous copies destroyed with rcu callbacks. Callbacks itself wery simple pose minimal overhead. I split implementation of bitmap:* sets to the two variants: ordinary array of bits when set created w/o extensions and array of pointers (hash table w/o collisions) for sets w/ extensions, since having bitmap:* with extensions anyway uses array of extensions data and accessing pointer is a cheap on most platforms. This also simplifies my RCU implementation. For bitmap:* where bitmap (array of bits, no extensions) is used there is no RCU needed since whenever we have bit set or bit clear for concurrent readers on modify does not play any significant role. For bitmap:* implemented as array of pointers (element is in set if pointer points to the data) this is much like for hash:* sets. For list:set it is standard linux double linked list RCU variant. Destroy procedure is split into parts: hide set from the new readers flush all elements from the set, and mark it as inactive under set lock, so other writers (ip_set_test() -EAGAIN for bitmap:ipmac and ip_set_{add,del}() and gc routine) won't modify it. Current readers may still use set data. wait for RCU grace period to let readers to finish with set structures wait for RCU callbacks to complete (after flushing elements) finally destroy set data without any lock. All tests I made are done on relatively ancient hardware (Pentium D 2.8Ghz, DDR 2 for SMP x86, x86_64 and PIII 1Ghz for UP). > > There is other things to consider: ip_set_add()/ip_set_del() and of course > > timeout enabled set gc routine (I known del_timer_sync() is the first > > in ->destroy() method). > > > > set_bit(j, n->used); > > > > if (old != ERR_PTR(-ENOENT)) { > > > > rcu_assign_pointer(hbucket(t, key), n); > > > > This looks as data publishing prior to using it where bit is set before > > pointer to the new data assigned to the bucket. > > When new hashbucket is created, the set_bit() call is actually an > initialization. Yes, sorry for that, n is an unpublished bucket. Correct. But what about memcpy()? Extensions assignment? These operations is not atomic. > > > Calling ip_set_ext_destroy() concurrently with readers seems also not > > friendly. Currently this is only for commens, but if any other extension, > > used by the readers is added problems will arise. > > As I mentioned, that's the only part which needs fixing. > > > In general any modification to the data should be consistent for > > concurrent readers all times, and should be only modified/released after > > RCU grace period (or in case of RCU-bh after one softirq scheduling). > > Modification data inline seems worse to me. > > When data can safely be modified, then that's the better approach. > > > > > Also I don't think there is a good idea to modify cidr book keeping > > > > data without any protection from the concurrent readers. > > > > > > cidr book keeping was changed so that it is safe to modify it without > > > any protection. The worst case is that a reader loops at the same cidr > > > value twice. > > > > Anyway this seems worse to me. > > Why? No data corruption can happen and that's the fastest way. The worst > case scenario is not the common case. > > > > The main point of the ipset git tree is to keep and maintain support > > > kernel versions back to the last (i.e. 2.6.32) longterm kernel tree. > > > > However debugging locking issues without lockdep is much harder and > > having one when it is available seems good. > > It can probably be added in a wrapper which gives noop for older kernels. Yes, supporting old kernels is a good choise. On the other hand I think that it would be nice to have branch, similar to the nf-next (say ipset-next) for supplying patches to the upstream and make all code in this branch inline with upstream kernel. I think this eases work for supplying patches from ipset to nf-next, or request for pull may be send to mailing list instead of individual patches (this is how nf-next pulled by DM into net-next). Then new changes from the ipset-next could be backported into ipset git tree (say it would be named as ipset-stable to the analogy with linux-stable etc.). > > Best regards, > Jozsef > - > E-mail : kadlec@xxxxxxxxxxxxxxxxx, kadlecsik.jozsef@xxxxxxxxxxxxx > PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt > Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences > H-1525 Budapest 114, POB. 49, Hungary -- SP5474-RIPE Sergey Popovich -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html