Hi Patrick, On Tue, Feb 19, 2013 at 11:02:58PM +0100, Patrick McHardy wrote: > On Thu, Jan 31, 2013 at 01:03:59AM +0100, pablo@xxxxxxxxxxxxx wrote: > > From: Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> > > > > This patch adds global atomic rule updates and dumps based on > > bitmask generations. This allows to atomically commit a set of > > rule-set updates incrementally without altering the internal > > state of existing nf_tables expressions/matches/targets. > > > > The idea consists of using a generation cursor of 1 bit and > > a bitmask of 2 bits per rule. Assuming the gencursor is 0, > > then the genmask (expressed as a bitmask) can be interpreted > > as: > > > > 00 active in the present, will be active in the next generation. > > 01 inactive in the present, will be active in the next generation. > > 10 active in the present, will be deleted in the next generation. > > ^ > > gencursor > > > > Once you invoke the transition to the next generation, the global > > gencursor is updated: > > > > 00 active in the present, will be active in the next generation. > > 01 active in the present, needs to zero its future, it becomes 00. > > 10 inactive in the present, delete now. > > ^ > > gencursor > > > > If a dump is in progress and nf_tables enters a new generation, > > the dump will stop and return -EBUSY to let userspace know that > > it has to retry again. In order to invalidate dumps, a global > > genctr counter is increased everytime nf_tables enters a new > > generation. > > > > This new operation can be used from the user-space utility > > that controls the firewall, eg. > > > > nft restore < file > > > > The rule updates contained in `file' will be applied atomically. > > > > cat file > > ----- > > add filter INPUT ip saddr 1.1.1.1 counter accept #1 > > del filter INPUT ip daddr 2.2.2.2 counter drop #2 > > commit #3 > > -EOF- > > > > Note that the rule 1 will be inactive until the transition to the > > next generation, the rule 2 will be evicted in the next generation. > > > > There is a penalty during the rule update due to the branch > > misprediction in the packet matching framework. But that should be > > quickly resolved once the iteration over the dirty list that > > contain rules that require updates is finished. > > > > Event notification happens once the rule-set update has been > > committed. So we skip notifications is case the rule-set update > > is aborted, which can happen in case that the rule-set is tested > > to apply correctly. > > > > This patch is based on ideas extracted from discussions with > > Patrick McHardy. > > Generally, I think this has a pretty high penalty both memory-wise > and performance-wise, so we should reconsider the idea of moving > most of the costs to userspace by temporarily adding "skip" rules > that skip over uncommitted rules. > > A couple of comments on the patch below though. > > > @@ -37,6 +37,8 @@ enum nf_tables_msg_types { > > NFT_MSG_NEWSETELEM, > > NFT_MSG_GETSETELEM, > > NFT_MSG_DELSETELEM, > > + NFT_MSG_COMMIT, > > + NFT_MSG_ABORT, > > I see why you're doing this, but this is similar to the NEWCHAINNAME > attribute, it doesn't really fit into the scheme how the netlink API > is designed. What would you suggest as an alternative? > > index 3ba63b6..1131e49 100644 > > --- a/include/net/netfilter/nf_tables.h > > +++ b/include/net/netfilter/nf_tables.h > > @@ -319,15 +319,19 @@ static inline void *nft_expr_priv(const struct nft_expr *expr) > > * struct nft_rule - nf_tables rule > > * > > * @list: used internally > > + * @dirty_list: this rule needs an update after new generation > > * @rcu_head: used internally for rcu > > * @handle: rule handle > > + * @genmask: generation mask > > * @dlen: length of expression data > > * @data: expression data > > */ > > struct nft_rule { > > struct list_head list; > > + struct list_head dirty_list; > > This is a quite heavy price. To reduce memory overhead, we could just have > dirty chains and iterate over all rules within them. We can save that the extra struct list_head for the dirty_list in the rule by allocating temporary container structure, I have a follow up patch for that. > > diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c > > index db6150b..7f08381 100644 > > --- a/net/netfilter/nf_tables_api.c > > +++ b/net/netfilter/nf_tables_api.c > > static int nf_tables_dump_rules(struct sk_buff *skb, > > struct netlink_callback *cb) > > { > > @@ -1250,6 +1288,7 @@ static int nf_tables_dump_rules(struct sk_buff *skb, > > unsigned int idx = 0, s_idx = cb->args[0]; > > struct net *net = sock_net(skb->sk); > > int family = nfmsg->nfgen_family; > > + unsigned int genctr = net->nft.genctr, gencursor = net->nft.gencursor; > > > > list_for_each_entry(afi, &net->nft.af_info, list) { > > if (family != NFPROTO_UNSPEC && family != afi->family) > > @@ -1258,6 +1297,8 @@ static int nf_tables_dump_rules(struct sk_buff *skb, > > list_for_each_entry(table, &afi->tables, list) { > > list_for_each_entry(chain, &table->chains, list) { > > list_for_each_entry(rule, &chain->rules, list) { > > + if (!nft_rule_is_active(net, rule)) > > + goto cont; > > Not sure whether we should really skip them during the dump. That hides > information from userspace, we could just as well include an INACTIVE flag > and have userspace decide whether to process them or not. Agreed, they should be dumped. > > +static void > > +nf_tables_delrule_one(struct nft_ctx *ctx, struct nft_rule *rule, u32 flags) > > +{ > > + if (flags & NFT_RULE_F_COMMIT) { > > + struct nft_chain *chain = (struct nft_chain *)ctx->chain; > > + > > + nft_rule_disactivate_next(ctx->net, rule); > > deactivate :) Hm, I swear I checked this in the dictionary before writing it, you know how broken my English can be ;-) > > + list_add(&rule->dirty_list, &chain->dirty_rules); > > + } else { > > + list_del_rcu(&rule->list); > > + nf_tables_rule_notify(ctx->skb, ctx->nlh, ctx->table, > > + ctx->chain, rule, NFT_MSG_DELRULE, > > + 0, ctx->afi->family); > > + nf_tables_rule_destroy(rule); > > + } > > +} > > + > > static int nf_tables_delrule(struct sock *nlsk, struct sk_buff *skb, > > const struct nlmsghdr *nlh, > > const struct nlattr * const nla[]) > > +static int nf_tables_commit(struct sock *nlsk, struct sk_buff *skb, > > + const struct nlmsghdr *nlh, > > + const struct nlattr * const nla[]) > > +{ > > + const struct nfgenmsg *nfmsg = nlmsg_data(nlh); > > + const struct nft_af_info *afi; > > + struct net *net = sock_net(skb->sk); > > + struct nft_table *table; > > + struct nft_chain *chain; > > + struct nft_rule *rule, *tmp; > > + int family = nfmsg->nfgen_family; > > + bool create; > > + > > + create = nlh->nlmsg_flags & NLM_F_CREATE ? true : false; > > + > > + afi = nf_tables_afinfo_lookup(net, nfmsg->nfgen_family, create); > > + if (IS_ERR(afi)) > > + return PTR_ERR(afi); > > + > > + /* Bump generation counter, invalidate any dump in progress */ > > + net->nft.genctr++; > > + > > + /* A new generation has just started */ > > + net->nft.gencursor++; > > + > > + /* Make sure all packets have left the previous generation before > > + * purging old rules. > > + */ > > + synchronize_rcu(); > > This doesn't work. As soon as synchronize_rcu() is finished, new packets > might enter nft_do_chain(). And this also seems to be the main problem > with this approach, you can't iteratively flip the active bits without > locking out packet processing, otherwise it will be just as non-atomic as > it is now. For this to work you'd need a bigger generation counter in the > rules that doesn't overflow and doesn't require changes to its values. The gencursor is being cached at the entrace of nft_do_chain_pktinfo to avoid that intermediate state. So AFAICS we wait until old packets have finished iterating over the rule-set with the old gencursor and, then, new packets will see the new rule-set using the new gencursor. > > + list_for_each_entry(table, &afi->tables, list) { > > + list_for_each_entry(chain, &table->chains, list) { > > + list_for_each_entry_safe(rule, tmp, &chain->dirty_rules, dirty_list) { > > + /* Delete this rule from the dirty list */ > > + list_del(&rule->dirty_list); > > + > > + /* This rule was inactive in the past and just > > + * became active. Clear the next bit of the > > + * genmask since its meaning has changed, now > > + * it is the future. > > + */ > > + if (nft_rule_is_active(net, rule)) { > > + nft_rule_clear(net, rule); > > + nf_tables_rule_notify(skb, nlh, table, > > + chain, rule, > > + NFT_MSG_NEWRULE, > > + 0, > > + nfmsg->nfgen_family); > > + continue; > > + } > > + > > + /* This rule is in the past, get rid of it */ > > + list_del_rcu(&rule->list); > > + nf_tables_rule_notify(skb, nlh, table, chain, > > + rule, NFT_MSG_DELRULE, 0, > > + family); > > + nf_tables_rule_destroy(rule); > > + } -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html