On 02/20/2018 11:44 AM, Pablo Neira Ayuso wrote:
> Hi David!
>
> On Mon, Feb 19, 2018 at 12:22:26PM -0500, David Miller wrote:
> [...]
>> Netfilter's chronic performance differential is why a lot of mindshare
>> was lost to userspace networking technologies.
>
> Claiming that Netfilter is the reason for the massive adoption of
> userspace networking isn't a fair statement at all.
>
> Let's talk about performance if this is what you want:
>
> * Our benchmarks here are delivering a ~9.5x performance boost for
>   IPv4 load balancing from netfilter ingress.
>
> * ~2x faster than iptables prerouting when dropping packets at a very
>   early stage in the network datapath - DoS attack scenario - again
>   from the ingress hook.
>
> * The new flowtable infrastructure that will show up in 4.16 provides
>   a faster forwarding path, measuring ~2x faster forwarding here, _by
>   simply adding one single rule to your FORWARD chain_. And that's
>   just the initial implementation that got merged upstream; we have
>   room to fly even faster.
>
> And that's just the beginning: we have more ongoing work, incrementally
> based on top of what we have, to provide even faster datapaths with
> very simple configurations.
>
> Note that the numbers above are very similar to what we have seen in
> bpf. Well, to be honest, we're just slightly behind bpf, since the
> benchmarks I have seen on load balancing IPv4 show 10x from XDP, and
> dropping packets is also slightly more than 2x, which actually happens
> way earlier than ingress - naturally, dropping earlier gives better
> numbers.
>
> But it's not all about performance... let's have a look at the "iron
> triangle"...
>
> We keep usability on our radar; that's paramount for us. Netfilter is
> probably so much more widely adopted than tc because of this. The kind

Right, in terms of performance, the above is what tc ingress has been
able to do for a long time already, ever since the central spinlock
could be lifted, which was an important step in that direction.
In terms of usability, sure, it's always a 'fun' topic for a number of
classifiers / actions, mostly from the older days. I think it has
improved a bit over time, but at least speaking of things like cls_bpf,
it's trivial to attach an object somewhere via the tc cmdline.

> of problems that big Silicon datacenters have to face are simply
> different from the millions of devices running Linux out there; there
> are plenty of smart devops out there who sacrifice the little
> performance loss at the cost of keeping it easy to configure and
> maintain things.
>
> If we want to talk about problems...
>
> Every project has its own subset of problems. In that sense, anyone
> who has spent time playing with the bpf infrastructure is very much
> aware of all of its usability problems:
>
> * You have to disable optimizations in llvm, otherwise the verifier
>   gets confused by too smart compiler optimizations and rejects the
>   code.

That is actually a false claim, which makes me think that you didn't
even give this a try at all before stating the above. Funny enough, for
a very long period of time in LLVM's BPF back end, when you used
optimization levels other than -O2, clang would bark with an internal
error, for example:

$ clang-3.9 -target bpf -O0 -c foo.c -o /tmp/foo.o
fatal error: error in backend: Cannot select: 0x5633ae698280: ch,glue = BPFISD::CALL 0x5633ae698210,
  0x5633ae697e90, Register:i64 %R1, Register:i64 %R2, Register:i64 %R3, 0x5633ae698210:1
    0x5633ae697e90: i64,ch = load<LD8[@tail_call]> 0x5633ae6955e0, 0x5633ae694fc0, undef:i64
      0x5633ae694fc0: i64 = BPFISD::Wrapper TargetGlobalAddress:i64<void (%struct.__sk_buff*, i8*, i32)** @tail_call> 0
[...]

Whereas -O2 *is* the general recommendation for everyone to use:

$ clang-3.9 -target bpf -O2 -c foo.c -o /tmp/foo.o
$

This is fixed in later versions; e.g. in clang-7.0 such a back end
error is gone anyway, fwiw.
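To illustrate the cls_bpf point above, the whole workflow fits in a
few commands - a minimal sketch, where foo.c, the "ingress" section
name and eth0 are placeholder examples, not taken from this thread:

```shell
# Build the classifier object with the recommended -O2 level
# (foo.c contains a cls_bpf program, e.g. one returning TC_ACT_OK)
clang -O2 -target bpf -c foo.c -o foo.o

# Create the clsact qdisc once, then attach the object on ingress;
# 'da' (direct-action) lets the BPF program return tc verdicts itself
tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress bpf da obj foo.o sec ingress
```

Swapping in a new program is then just a matter of recompiling and
replacing the filter, no tool beyond clang and tc needed.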
But in any case, we've been running complex programs with -O2
optimization levels for several years now just fine. Yes, given we do
push BPF to the limits, we had some corner cases where the verifier had
to be adjusted, but overall the number of such cases has reduced over
time, which is also a natural progression as people use it in various
advanced ways. In fact, it's a much better choice to use clang with -O2
here, since the majority of people simply use it that way. And if you
consume it via higher-level front ends, e.g. bcc, ply, bpftrace to name
a few from the tracing side, then you don't need to care about this at
all. (But in addition to that, there's also continuous effort on the
LLVM side to optimize BPF code generation in various ways.)

> * Very hard to debug the reason why the verifier is rejecting
>   apparently valid code. That results in people playing strange
>   "moving code around up and down".

Please show me your programs and I'm happy to help you out. :-) Yes, in
the earlier days I would concede it might have been hard; over the
course of the last few years, the verifier and the LLVM back end have
both seen heavy improvements all over the place - e.g. llvm-objdump
correlating verifier errors back to the pseudo C code via dwarf was a
bigger one on the latter side, for example. Writing BPF programs has
definitely become easier, although there's always undoubted room for
improvement, and the work we're heading towards will make it more
natural to develop programs against the C front end it provides,
further reducing potential contention with the verifier. It takes a bit
to get used to the verifier's analysis, but then there's always a
learning curve for getting into new frameworks and developing a basic
understanding of their semantics. The same holds true when people
switch from their known ip*tables-translate syntax to using nft
directly.
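For reference, the llvm-objdump correlation mentioned above looks
roughly like this - a sketch assuming a clang/LLVM build with the bpf
target, with prog.c as a placeholder name:

```shell
# Build with -g so the BPF object carries dwarf debug info
clang -O2 -g -target bpf -c prog.c -o prog.o

# Dump the BPF instructions interleaved with the original C source;
# the instruction offsets shown here can be matched against the
# instruction numbers the verifier prints in its rejection log
llvm-objdump -S prog.o
```

That turns "insn 42 rejected" from the verifier into a pointer at a
concrete line of your C code instead of a guessing game.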
Anyway, aside from this, for BPF we also have the case that the people
who develop programs to solve problems with the help of this technology
are just a small subset of the ones using it. The best example is
probably, as mentioned earlier in the thread, the hard work from
Brendan Gregg and many others on bcc to develop all the really easy to
consume tracing cmdline tools.

> * Lack of sufficient abstraction: bpf is not only exposing its own
>   software bugs through its interface, but it will also bite the dust
>   with CPU bugs due to lack of glue code to hide details behind the
>   syscall interface curtain. That will need a kernel upgrade after all
>   to fix, so all benefits of adding new programs. We've even seen
>   claims on this mailing list that performance is more important than
>   security. Don't get me wrong, no software is safe from security
>   issues, but if you don't abstract your resources in the right way,
>   you have more chance to experience more problems.

Sorry, but this is just nebulous FUD. Yes, every software has bugs. If
there are bugs, we handle them and fix them, period. So? You've
probably seen the extensive kernel selftest suite we have developed
over time, which by now contains more than 1k test cases on the BPF
core infrastructure, with many more to come. Quite frankly, I'm
actually very happy about the progress from the syzkaller folks in
recent months as well in stressing BPF continuously - and it finds bugs
just as well in other areas (like netfilter), so yeah, we all keep our
heads down and fix them properly in order to make everything more
robust.

> Just to mention a few of them.
>
> So, please, let's each of us focus on our own work. Let me remind you
> of your own wise words - I think just one year ago, in another of
> these episodes of bpf vs. netfilter: "We're all working to achieve
> the same goals", even if we're working on competing projects inside
> Linux.
>
> Thanks!
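For anyone who wants to check, the selftest suite referenced above
lives in the mainline kernel tree and can be run locally - a rough
sketch, assuming you're at the top of a kernel source tree and have
root (most tests load programs into the kernel):

```shell
# The BPF selftests live under tools/testing/selftests/bpf
cd tools/testing/selftests/bpf
make

# test_verifier runs the verifier's accept/reject test corpus,
# i.e. the bulk of those 1k+ core infrastructure test cases
sudo ./test_verifier
```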