On Tue, Mar 11, 2014 at 3:29 AM, Daniel Borkmann <dborkman@xxxxxxxxxx> wrote:
> On 03/11/2014 10:19 AM, Pablo Neira Ayuso wrote:
>>
>> Hi!
>>
>> The following patchset provides a socket filtering alternative to BPF
>> which allows you to define your filter using the nf_tables expressions.
>>
>> Similarly to BPF, you can attach filters via setsockopt()
>> SO_ATTACH_NFT_FILTER. The filter that is passed to the kernel is
>> expressed in netlink TLV format which looks like:
>>
>> expression list (nested attribute)
>>   expression element (nested attribute)
>>     expression name (string)
>>     expression data (nested attribute)
>>       ... specific attribute for this expression go here
>>
>> This is similar to the netlink format of the nf_tables rules, so we
>> can re-use most of the infrastructure that we already have in userspace.
>> The kernel takes the TLV representation and translates it to the native
>> nf_tables representation.
>>
>> The patches 1-3 have helped to generalize the existing socket filtering
>> infrastructure to allow pluging new socket filtering frameworks. Then,
>> patches 4-8 generalize the nf_tables code by move the neccessary nf_tables
>> expression and data initialization core infrastructure. Then, patch 9
>> provides the nf_tables socket filtering capabilities.
>>
>> Patrick and I have been discussing for a while that part of this
>> generalisation works should also help to add support for providing a
>> replacement to the tc framework, so with the necessary work, nf_tables
>> may provide in the near future packet a single packet classification
>> framework for Linux.
>
>
> I'm being curious here ;) as there's currently an ongoing effort on
> netdev for Alexei's eBPF engine (part 1 at [1,2,3]), which addresses
> shortcomings of current BPF and shall long term entirely replace the
> current BPF engine code to let filters entirely run in eBPF resp.
> eBPF's JIT engine, as I understand, which is also transparently usable
> in cls_bpf for classification in tc w/o rewriting on a different filter
> language. Performance figures have been posted/provided in [1] as well.
>
> So the plan on your side would be to have an alternative to eBPF, or
> build on top of it to reuse its in-kernel JIT compiler?
>
> [1] http://patchwork.ozlabs.org/patch/328927/
> [2] http://patchwork.ozlabs.org/patch/328926/
> [3] http://patchwork.ozlabs.org/patch/328928/
>
>
>> There is an example of the userspace code available at:
>>
>> http://people.netfilter.org/pablo/nft-sock-filter-test.c
>>
>> I'm currently reusing the existing libnftnl interfaces, my plan is to
>> new interfaces in that library for easier and more simple filter
>> definition for socket filtering.
>>
>> Note that the current nf_tables expression-set is also limited with
>> regards to BPF, but the infrastructure that we have can be easily
>> extended with new expressions.
>>
>> Comments welcome!

Hi Pablo,

Could you share what performance you're getting when doing an nft filter
equivalent to 'tcpdump port 22'?
Meaning your filter needs to parse eth->proto, the ip or ipv6 header, and
check both ports.
How will it compare with JITed bpf/ebpf?

I was trying to go the other way: improve nft performance with ebpf.
10/40G links are way too fast for interpreters. imo JIT is the only way.

Here are some comments about the patches:

1/9:
- if (fp->bpf_func != sk_run_filter)
-         module_free(NULL, fp->bpf_func);
+ if (fp->run_filter != sk_run_filter)
+         module_free(NULL, fp->run_filter);

David suggested that these comparisons in all jits are ugly.
I've fixed it in my patches. When they're in, you wouldn't need to
mess with this.

2/9:
- atomic_sub(sk_filter_size(fp->len), &sk->sk_omem_alloc);
+ atomic_sub(fp->size, &sk->sk_omem_alloc);

That's a big change in socket memory accounting.
We used to account for the whole sk_filter... now you're counting
the filter size only.
Is it valid?
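
To make the 2/9 accounting question concrete, here is a minimal userspace sketch of the two charging schemes. It is only a model: the struct layouts and every `mock_`/`charge_` name are hypothetical stand-ins rather than kernel definitions, and it assumes fp->size covers the instruction bytes only, which is how the hunk above reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins; NOT the kernel definitions. */
struct mock_sock_filter {          /* shaped like one classic BPF insn */
	uint16_t code;
	uint8_t  jt;
	uint8_t  jf;
	uint32_t k;
};

struct mock_sk_filter {            /* hypothetical header + insn array */
	int          refcnt;
	unsigned int len;              /* number of instructions */
	void        *run_filter;
	struct mock_sock_filter insns[];
};

/* Old-style charge: header plus instructions. */
static size_t charge_whole_filter(unsigned int len)
{
	size_t with_insns = offsetof(struct mock_sk_filter, insns) +
			    (size_t)len * sizeof(struct mock_sock_filter);

	return with_insns > sizeof(struct mock_sk_filter) ?
	       with_insns : sizeof(struct mock_sk_filter);
}

/* Assumed new-style charge: instruction bytes only. */
static size_t charge_insns_only(unsigned int len)
{
	return (size_t)len * sizeof(struct mock_sock_filter);
}

/* Bytes per attached filter that would escape accounting. */
static size_t uncharged_bytes(unsigned int len)
{
	assert(charge_whole_filter(len) >= charge_insns_only(len));
	return charge_whole_filter(len) - charge_insns_only(len);
}
```

In this model the difference is constant: for any non-empty filter, uncharged_bytes() equals the header size regardless of how many instructions are attached. Whether that is acceptable for optmem limits is the open question.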

7/9:
The whole nft_expr_autoload() looks scary from a security point of view.
If I'm reading it correctly, the code will do request_module() based on
a userspace request to attach a filter?

9/9:
+ case SO_NFT_GET_FILTER:
+         len = sk_nft_get_filter(sk, (struct sock_filter __user *)optval, len);

With my patches there was a concern regarding socket checkpoint/restore
and I had to preserve the existing filter image to make sure it's not broken.
Could you please coordinate with Pavel and co to test this piece?

What will happen if nft_filter is attached, but so_get_filter is called? crash?

+static int nft_sock_expr_autoload(const struct nft_ctx *ctx,
+                                  const struct nlattr *nla)
+{
+#ifdef CONFIG_MODULES
+        mutex_unlock(&nft_expr_info_mutex);
+        request_module("nft-expr-%.*s", nla_len(nla), (char *)nla_data(nla));
+        mutex_lock(&nft_expr_info_mutex);

same security concern here...

+int sk_nft_attach_filter(char __user *optval, struct sock *sk)
+{

What about sk_clone_lock()? Since the filter program is in nft, do you need
to do special steps during copy of the socket?

+ fp = sock_kmalloc(sk, sizeof(struct sk_filter) + size, GFP_KERNEL);

This may allocate more memory than you need.
Currently sk_filter_size() computes it in an accurate way.
Also the same issue of optmem accounting as I mentioned in 2/9.

+err4:
+ sock_kfree_s(sk, fp, size);

a small bug: allocated sizeof(sk_filter)+size, but freeing 'size' only...

Overall I think it's very interesting work.
Not sure what the use case for it is, though.

I'll cook up a patch for the opposite approach (use ebpf inside nft) and
will send it to you for review.
I would prefer to work together to satisfy your and our user requests.
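
The err4 size mismatch noted under 9/9 can be illustrated with a small userspace model of the charge/uncharge pair. This is a sketch under stated assumptions, not kernel code: `mock_sock`, `mock_sock_kmalloc()`, `mock_sock_kfree_s()` and `leaked_charge()` are hypothetical names that only mimic how sock_kmalloc() adds the size to the socket's optmem counter and sock_kfree_s() subtracts whatever size it is given.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace model of the optmem counter; all names are hypothetical. */
struct mock_sock {
	size_t omem_alloc;             /* stands in for sk->sk_omem_alloc */
};

/* Allocate 'size' bytes and charge them to the socket. */
static void *mock_sock_kmalloc(struct mock_sock *sk, size_t size)
{
	void *p = malloc(size);

	if (p)
		sk->omem_alloc += size;
	return p;
}

/* Free the memory and uncharge 'size' bytes. */
static void mock_sock_kfree_s(struct mock_sock *sk, void *p, size_t size)
{
	assert(sk->omem_alloc >= size);
	free(p);
	sk->omem_alloc -= size;
}

/* Returns the bytes still charged after one alloc/free pair. */
static size_t leaked_charge(size_t hdr, size_t size, int free_full)
{
	struct mock_sock sk = { 0 };
	void *fp = mock_sock_kmalloc(&sk, hdr + size);

	if (!fp)
		return 0;
	/* free_full == 0 models the err4 path: uncharge 'size' only. */
	mock_sock_kfree_s(&sk, fp, free_full ? hdr + size : size);
	return sk.omem_alloc;
}
```

With a 24-byte header and a 64-byte program, the mismatched free leaves 24 bytes charged on every failed attach, while freeing with the full size leaves the counter at zero.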

Thanks
Alexei

>> Pablo Neira Ayuso (9):
>>   net: rename fp->bpf_func to fp->run_filter
>>   net: filter: account filter length in bytes
>>   net: filter: generalise sk_filter_release
>>   netfilter: nf_tables: move fast operations to header
>>   netfilter: nf_tables: add nft_value_init
>>   netfilter: nf_tables: rename nf_tables_core.c to nf_tables_nf.c
>>   netfilter: nf_tables: move expression infrastructure to built-in core
>>   netfilter: nf_tables: generalize verdict handling and introduce scopes
>>   netfilter: nf_tables: add support for socket filtering
>>
>>  arch/arm/net/bpf_jit_32.c              |  25 +-
>>  arch/powerpc/net/bpf_jit_comp.c        |  10 +-
>>  arch/s390/net/bpf_jit_comp.c           |  16 +-
>>  arch/sparc/net/bpf_jit_comp.c          |   8 +-
>>  arch/x86/net/bpf_jit_comp.c            |   8 +-
>>  include/linux/filter.h                 |  28 +-
>>  include/net/netfilter/nf_tables.h      |  27 +-
>>  include/net/netfilter/nf_tables_core.h |  84 +++++
>>  include/net/netfilter/nft_reject.h     |   3 +-
>>  include/net/sock.h                     |   8 +-
>>  include/uapi/asm-generic/socket.h      |   4 +
>>  net/core/filter.c                      |  28 +-
>>  net/core/sock.c                        |  19 ++
>>  net/core/sock_diag.c                   |   4 +-
>>  net/netfilter/Kconfig                  |  13 +
>>  net/netfilter/Makefile                 |   9 +-
>>  net/netfilter/nf_tables_api.c          | 440 ++++---------------------
>>  net/netfilter/nf_tables_core.c         | 564 +++++++++++++++++++++-----------
>>  net/netfilter/nf_tables_nf.c           | 189 +++++++++++
>>  net/netfilter/nf_tables_sock.c         | 327 ++++++++++++++++++
>>  net/netfilter/nft_bitwise.c            |  35 +-
>>  net/netfilter/nft_byteorder.c          |  28 +-
>>  net/netfilter/nft_cmp.c                |  43 ++-
>>  net/netfilter/nft_compat.c             |   6 +-
>>  net/netfilter/nft_counter.c            |   3 +-
>>  net/netfilter/nft_ct.c                 |   9 +-
>>  net/netfilter/nft_exthdr.c             |   3 +-
>>  net/netfilter/nft_hash.c               |  12 +-
>>  net/netfilter/nft_immediate.c          |  35 +-
>>  net/netfilter/nft_limit.c              |   3 +-
>>  net/netfilter/nft_log.c                |   3 +-
>>  net/netfilter/nft_lookup.c             |   3 +-
>>  net/netfilter/nft_meta.c               |  51 ++-
>>  net/netfilter/nft_nat.c                |   3 +-
>>  net/netfilter/nft_payload.c            |  29 +-
>>  net/netfilter/nft_queue.c              |   3 +-
>>  net/netfilter/nft_rbtree.c             |  12 +-
>>  net/netfilter/nft_reject.c             |   3 +-
>>  38 files changed, 1416 insertions(+), 682 deletions(-)
>>  create mode 100644 net/netfilter/nf_tables_nf.c
>>  create mode 100644 net/netfilter/nf_tables_sock.c