Patrick McHardy wrote: > Pablo Neira Ayuso wrote: >> Patrick McHardy wrote: >>> We have at least one case where the caller wants to know of >>> any successful delivery. Keymanager queries done by xfrm_state >>> want to know whether an acquire was delivered to any keymanager. >>> So we need to continue to indicate this, maybe using a different >>> errno code than -ENOBUFS. I don't have a suggestion which one to >>> use though. >> >> Indeed, I have missed that spot. I'm not very familiar with that code, >> however, I see that the creation of a state depends on the netlink >> broadcast return value, but how useful is that? I think that the state >> should be created even if the broadcast fails, the userspace daemon >> should request a resync to the kernel as soon as it hits ENOBUFS, then >> it would be in sync again with that state. > > The idea is that the kernel is performing an active query. I agree > that there's nothing wrong with installing the SA and indicating the > error to userspace. Userspace could dump the SADB and look for new > larval states, however thats unlikely to be very useful since once > an overflow occurs, you probably have a lot of states. More situations may trigger overflows: a "slow" reader (for example, spending time on whatever while not retrieving messages) and a userspace process with too small receive buffer. > But unless I'm missing something, there's nothing wrong with this > as long as the error is ignored. The fact that something was received > by some listener doesn't have any meaning anyways, it might have > been "ip monitor". Which somehow raises doubt about your proposed > interface change though, I think anything that wants a reliable > answer whether a packet was delivered to a process handling it > appropriately should use unicast. Don't get me wrong, I agree with you that all netlink_broadcast callers in the kernel should ignore the return value... ... unless they have "some way" (like in Netfilter) to make event delivery reliable: I have attached a patch that I didn't send you yet, I'm still reviewing and testing it. It adds an entry to /proc to enable reliable event delivery over netlink by dropping packets whose events were not delivered, you mentioned that possibility once during one of our conversations ;). I'm aware of that this option may be dangerous if used by a buggy process that trigger frequent overflows but it the cost of having realible logging for ctnetlink (still, this behaviour is not the one by default!). And I need this option to make conntrackd synchronize state-changes appropriately under very heavy load: I've testing the daemon with these patches and it reliably synchronizes state-changes (my system were 100% busy filtering traffic and fully synchronizing all TCP state-changes in near real-time effort, with a noticeable performance drop of 30% in terms of filtered connections). -- "Los honestos son inadaptados sociales" -- Les Luthiers
ctnetlink: optional packet drop to make event delivery reliable From: Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> This patch adds /proc entry to enable reliable ctnetlink event delivery. The entry is located at: /proc/sys/net/netfilter/nf_conntrack_netlink_broadcast_reliable When this entry is != 0, ctnetlink drops the packet if the delivery of an event over netlink fails. This patch is useful to provide reliable state synchronization for conntrackd. Signed-off-by: Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> --- include/linux/netfilter/nfnetlink.h | 4 + include/net/netfilter/nf_conntrack_core.h | 6 +- include/net/netfilter/nf_conntrack_ecache.h | 2 - include/net/netns/conntrack.h | 2 + net/netfilter/nf_conntrack_ecache.c | 18 +++-- net/netfilter/nf_conntrack_netlink.c | 108 ++++++++++++++++++++++++++- net/netfilter/nfnetlink.c | 24 +++++- 7 files changed, 146 insertions(+), 18 deletions(-) diff --git a/include/linux/netfilter/nfnetlink.h b/include/linux/netfilter/nfnetlink.h index 7d8e045..b89d5f3 100644 --- a/include/linux/netfilter/nfnetlink.h +++ b/include/linux/netfilter/nfnetlink.h @@ -74,8 +74,8 @@ extern int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n); extern int nfnetlink_subsys_unregister(const struct nfnetlink_subsystem *n); extern int nfnetlink_has_listeners(unsigned int group); -extern int nfnetlink_send(struct sk_buff *skb, u32 pid, unsigned group, - int echo); +extern int nfnetlink_notify(struct sk_buff *skb, u32 pid, unsigned group, + int echo); extern int nfnetlink_unicast(struct sk_buff *skb, u_int32_t pid, int flags); extern void nfnl_lock(void); diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h index e78afe7..0c6826d 100644 --- a/include/net/netfilter/nf_conntrack_core.h +++ b/include/net/netfilter/nf_conntrack_core.h @@ -62,7 +62,11 @@ static inline int nf_conntrack_confirm(struct sk_buff *skb) if (ct) { if (!nf_ct_is_confirmed(ct) && !nf_ct_is_dying(ct)) ret = __nf_conntrack_confirm(skb); - nf_ct_deliver_cached_events(ct); + if (ret == NF_ACCEPT && nf_ct_deliver_cached_events(ct) < 0) { + struct net *net = nf_ct_net(ct); + NF_CT_STAT_INC_ATOMIC(net, drop); + return NF_DROP; + } } return ret; } diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h index 0ff0dc6..6e9e1f7 100644 --- a/include/net/netfilter/nf_conntrack_ecache.h +++ b/include/net/netfilter/nf_conntrack_ecache.h @@ -28,7 +28,7 @@ extern struct atomic_notifier_head nf_conntrack_chain; extern int nf_conntrack_register_notifier(struct notifier_block *nb); extern int nf_conntrack_unregister_notifier(struct notifier_block *nb); -extern void nf_ct_deliver_cached_events(const struct nf_conn *ct); +extern int nf_ct_deliver_cached_events(const struct nf_conn *ct); extern void __nf_ct_event_cache_init(struct nf_conn *ct); extern void nf_ct_event_cache_flush(struct net *net); diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h index f4498a6..1ff61dd 100644 --- a/include/net/netns/conntrack.h +++ b/include/net/netns/conntrack.h @@ -20,9 +20,11 @@ struct netns_ct { int sysctl_acct; int sysctl_checksum; unsigned int sysctl_log_invalid; /* Log invalid packets */ + int sysctl_ctnetlink_event_reliable; #ifdef CONFIG_SYSCTL struct ctl_table_header *sysctl_header; struct ctl_table_header *acct_sysctl_header; + struct ctl_table_header *ctnetlink_sysctl_header; #endif int hash_vmalloc; int expect_vmalloc; diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c index dee4190..9c21269 100644 --- a/net/netfilter/nf_conntrack_ecache.c +++ b/net/netfilter/nf_conntrack_ecache.c @@ -31,9 +31,11 @@ EXPORT_SYMBOL_GPL(nf_ct_expect_chain); /* deliver cached events and clear cache entry - must be called with locally * disabled softirqs */ -static inline void +static inline int __nf_ct_deliver_cached_events(struct nf_conntrack_ecache *ecache) { + int ret = 0; + if (nf_ct_is_confirmed(ecache->ct) && !nf_ct_is_dying(ecache->ct) && ecache->events) { struct nf_ct_event item = { @@ -42,28 +44,32 @@ __nf_ct_deliver_cached_events(struct nf_conntrack_ecache *ecache) .report = 0 }; - atomic_notifier_call_chain(&nf_conntrack_chain, - ecache->events, - &item); + ret = atomic_notifier_call_chain(&nf_conntrack_chain, + ecache->events, + &item); + ret = notifier_to_errno(ret); } ecache->events = 0; nf_ct_put(ecache->ct); ecache->ct = NULL; + return ret; } /* Deliver all cached events for a particular conntrack. This is called * by code prior to async packet handling for freeing the skb */ -void nf_ct_deliver_cached_events(const struct nf_conn *ct) +int nf_ct_deliver_cached_events(const struct nf_conn *ct) { struct net *net = nf_ct_net(ct); struct nf_conntrack_ecache *ecache; + int ret = 0; local_bh_disable(); ecache = per_cpu_ptr(net->ct.ecache, raw_smp_processor_id()); if (ecache->ct == ct) - __nf_ct_deliver_cached_events(ecache); + ret = __nf_ct_deliver_cached_events(ecache); local_bh_enable(); + return ret; } EXPORT_SYMBOL_GPL(nf_ct_deliver_cached_events); diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 47c2f54..3e0ffb6 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -517,6 +517,8 @@ static int ctnetlink_conntrack_event(struct notifier_block *this, unsigned int type; sk_buff_data_t b; unsigned int flags = 0, group; + struct net *net = nf_ct_net(ct); + int err; /* ignore our fake conntrack entry */ if (ct == &nf_conntrack_untracked) @@ -613,13 +615,20 @@ static int ctnetlink_conntrack_event(struct notifier_block *this, rcu_read_unlock(); nlh->nlmsg_len = skb->tail - b; - nfnetlink_send(skb, item->pid, group, item->report); + err = nfnetlink_notify(skb, item->pid, group, item->report); + if (net->ct.sysctl_ctnetlink_event_reliable && + (err == -ENOBUFS || err == -EAGAIN)) + return notifier_from_errno(err); + return NOTIFY_DONE; nla_put_failure: rcu_read_unlock(); nlmsg_failure: kfree_skb(skb); + if (net->ct.sysctl_ctnetlink_event_reliable) + return notifier_from_errno(-ENOSPC); + return NOTIFY_DONE; } #endif /* CONFIG_NF_CONNTRACK_EVENTS */ @@ -1604,7 +1613,8 @@ static int ctnetlink_expect_event(struct notifier_block *this, struct sk_buff *skb; unsigned int type; sk_buff_data_t b; - int flags = 0; + int flags = 0, err; + struct net *net = nf_ct_exp_net(exp); if (events & IPEXP_NEW) { type = IPCTNL_MSG_EXP_NEW; @@ -1637,13 +1647,21 @@ static int ctnetlink_expect_event(struct notifier_block *this, rcu_read_unlock(); nlh->nlmsg_len = skb->tail - b; - nfnetlink_send(skb, item->pid, NFNLGRP_CONNTRACK_EXP_NEW, item->report); + err = nfnetlink_notify(skb, item->pid, NFNLGRP_CONNTRACK_EXP_NEW, + item->report); + if (net->ct.sysctl_ctnetlink_event_reliable && + (err == -ENOBUFS || err == -EAGAIN)) + return notifier_from_errno(err); + return NOTIFY_DONE; nla_put_failure: rcu_read_unlock(); nlmsg_failure: kfree_skb(skb); + if (net->ct.sysctl_ctnetlink_event_reliable) + return notifier_from_errno(-ENOSPC); + return NOTIFY_DONE; } #endif @@ -2003,7 +2021,63 @@ MODULE_ALIAS("ip_conntrack_netlink"); MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_CTNETLINK); MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_CTNETLINK_EXP); -static int __init ctnetlink_init(void) +#ifdef CONFIG_SYSCTL +static struct ctl_table ctnetlink_sysctl_table[] = { + { + .ctl_name = CTL_UNNUMBERED, + .procname = "nf_conntrack_netlink_broadcast_reliable", + .data = &init_net.ct.sysctl_ctnetlink_event_reliable, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + {} +}; + +static int ctnetlink_init_sysctl(struct net *net) +{ + struct ctl_table *table; + + table = kmemdup(ctnetlink_sysctl_table, sizeof(ctnetlink_sysctl_table), + GFP_KERNEL); + if (!table) + goto out; + + table[0].data = &net->ct.sysctl_ctnetlink_event_reliable; + + net->ct.ctnetlink_sysctl_header = register_net_sysctl_table(net, + nf_net_netfilter_sysctl_path, table); + if (!net->ct.ctnetlink_sysctl_header) + goto out_register; + + return 0; + +out_register: + kfree(table); +out: + return -ENOMEM; +} + +static void ctnetlink_fini_sysctl(struct net *net) +{ + struct ctl_table *table; + + table = net->ct.ctnetlink_sysctl_header->ctl_table_arg; + unregister_net_sysctl_table(net->ct.ctnetlink_sysctl_header); + kfree(table); +} +#else +static int ctnetlink_init_sysctl(struct net *net) +{ + return 0; +} + +static void ctnetlink_fini_sysctl(struct net *net) +{ +} +#endif /* CONFIG_SYSCTL */ + +static int ctnetlink_net_init(struct net *net) { int ret; @@ -2033,10 +2107,18 @@ static int __init ctnetlink_init(void) goto err_unreg_notifier; } #endif + ret = ctnetlink_init_sysctl(net); + if (ret < 0) { + printk("ctnetlink_init: cannot register sysctl.\n"); + goto err_unreg_exp_notifier; + } + net->ct.sysctl_ctnetlink_event_reliable = 0; return 0; #ifdef CONFIG_NF_CONNTRACK_EVENTS +err_unreg_exp_notifier: + nf_ct_expect_unregister_notifier(&ctnl_notifier_exp); err_unreg_notifier: nf_conntrack_unregister_notifier(&ctnl_notifier); err_unreg_exp_subsys: @@ -2048,7 +2130,7 @@ err_out: return ret; } -static void __exit ctnetlink_exit(void) +static void ctnetlink_net_exit(struct net *net) { printk("ctnetlink: unregistering from nfnetlink.\n"); @@ -2059,8 +2141,24 @@ static void __exit ctnetlink_exit(void) nfnetlink_subsys_unregister(&ctnl_exp_subsys); nfnetlink_subsys_unregister(&ctnl_subsys); + ctnetlink_fini_sysctl(net); return; } +static struct pernet_operations ctnetlink_net_ops = { + .init = ctnetlink_net_init, + .exit = ctnetlink_net_exit, +}; + +static int __init ctnetlink_init(void) +{ + return register_pernet_subsys(&ctnetlink_net_ops); +} + +static void __exit ctnetlink_exit(void) +{ + unregister_pernet_subsys(&ctnetlink_net_ops); +} + module_init(ctnetlink_init); module_exit(ctnetlink_exit); diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c index 9c0ba17..fd7bbf4 100644 --- a/net/netfilter/nfnetlink.c +++ b/net/netfilter/nfnetlink.c @@ -107,11 +107,29 @@ int nfnetlink_has_listeners(unsigned int group) } EXPORT_SYMBOL_GPL(nfnetlink_has_listeners); -int nfnetlink_send(struct sk_buff *skb, u32 pid, unsigned group, int echo) +/* like nlmsg_notify, but we return the multicast error */ +int nfnetlink_notify(struct sk_buff *skb, u32 pid, unsigned group, int report) { - return nlmsg_notify(nfnl, skb, pid, group, echo, gfp_any()); + int err = 0, mcast_err = 0; + + if (group) { + int exclude_pid = 0; + + if (report) { + atomic_inc(&skb->users); + exclude_pid = pid; + } + + mcast_err = nlmsg_multicast(nfnl, skb, exclude_pid, + group, gfp_any()); + } + + if (report) + err = nlmsg_unicast(nfnl, skb, pid); + + return mcast_err ? mcast_err : err; } -EXPORT_SYMBOL_GPL(nfnetlink_send); +EXPORT_SYMBOL_GPL(nfnetlink_notify); int nfnetlink_unicast(struct sk_buff *skb, u_int32_t pid, int flags) {