Hi Harald,

On 02/17/2018 01:11 PM, Harald Welte wrote:
[...]
>> As rule translation can potentially become very complex, this is performed
>> entirely in user space. In order to ease deployment, request_module() code
>> is extended to allow user mode helpers to be invoked. Idea is that user mode
>> helpers are built as part of the kernel build and installed as traditional
>> kernel modules with .ko file extension into distro specified location,
>> such that from a distribution point of view, they are no different than
>> regular kernel modules.
>
> That just blew my mind, sorry :) This goes much beyond
> netfilter/iptables, and adds some quite significant new piece of
> kernel/userspace infrastructure. To me, my apologies, it just sounds
> like a quite strange hack. But then, I may lack the vision of how this
> might be useful in other contexts.

The thought was that it would be more suitable to push all the complexity
of such a translation into user space, which brings a couple of additional
advantages as well: the translation can become very complex, and keeping
it in user space puts all of it behind the syscall boundary, where the
natural path of loading the generated programs goes through the verifier.
Since the tool would reside in user space, development becomes easier and
testing can happen without recompiling the kernel. It would also allow all
the clang sanitizers to run there, and a comprehensive test suite could
verify and dry-test translations against traffic test patterns (e.g. the
BPF infrastructure already provides possibilities for this without a
complex setup). Since normal user mode helpers make deployment rather
painful, as they need to be shipped as an extra package by the various
distros, the idea was that the module loader back end could treat such
helpers similarly to kernel modules and hook them in through the
request_module() approach while still operating out of user space. In any
case, I could imagine this approach being interesting and useful in
general for other subsystems requiring a umh in one way or another.

> I'm trying to understand why exactly one would
> * use an 18 year old iptables userspace program with its equally old
>   setsockopt based interface between kernel and userspace
> * insert an entire table with many chains of rules into the kernel
> * re-eject that ruleset into another userspace program which then
>   compiles it into an eBPF program
> * insert that back into the kernel
>
> To me, this looks like some kind of legacy backwards compatibility
> mechanism that one would find in proprietary operating systems, but not
> in Linux. iptables, libiptc etc. are all free software. The source
> code can be edited, and you could just as well have a new version of
> iptables and/or libiptc which would pass the ruleset in userspace to
> your compiler, which would then insert the resulting eBPF program.
>
> You could even have a LD_PRELOAD wrapper doing the same. That one
> would even work with direct users of the iptables setsockopt interface.
>
> Why add quite comprehensive kernel infrastructure? What's the motivation
> here?

Right, a custom iptables, libiptc or LD_PRELOAD approach would work as
well, of course, but it still wouldn't address applications that have
their own custom libraries programmed against the iptables uapi directly,
or those that reused a built-in or modified libiptc in their application.
Such requests can only be covered transparently by a small shim layer in
the kernel, and that approach also doesn't require any extra packages
from the distro side.

[...]
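To make the LD_PRELOAD variant discussed above a bit more concrete, here
is a rough sketch of what such a wrapper would have to do. It is not part
of the patch set; translate_and_load_bpf() is only a hypothetical
stand-in for the user space translator:

/* ipt_shim.c - hypothetical LD_PRELOAD sketch, not an existing tool */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/netfilter_ipv4/ip_tables.h>

typedef int (*setsockopt_fn)(int, int, int, const void *, socklen_t);

/* Hypothetical placeholder: would compile the ipt_replace blob into an
 * eBPF program and attach it; here it only pretends to succeed. */
static int translate_and_load_bpf(const struct ipt_replace *repl)
{
	(void)repl;
	return 0;
}

int setsockopt(int fd, int level, int optname, const void *optval,
	       socklen_t optlen)
{
	static setsockopt_fn real_setsockopt;

	if (!real_setsockopt)
		real_setsockopt = (setsockopt_fn)dlsym(RTLD_NEXT, "setsockopt");

	/* iptables replaces a whole table via IPT_SO_SET_REPLACE on a raw
	 * IPv4 socket; divert that blob to the translator instead. */
	if (level == IPPROTO_IP && optname == IPT_SO_SET_REPLACE &&
	    optval && optlen >= sizeof(struct ipt_replace))
		return translate_and_load_bpf(optval);

	return real_setsockopt(fd, level, optname, optval, optlen);
}

Built as a shared object (e.g. gcc -shared -fPIC -o ipt_shim.so
ipt_shim.c -ldl, file name made up) and run as
LD_PRELOAD=$PWD/ipt_shim.so iptables ..., this would divert the table
replace. But anything that is statically linked or issues the syscall
directly bypasses the wrapper, which is exactly the gap the in-kernel
shim is meant to close.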
>> In the implemented proof of concept we show that simple /32 src/dst IPs
>> are translated in such manner.
>
> Of course this is the first thing that one starts with. However, as we all
> know, iptables was never very good or efficient about 5-tuple matching.
> If you want a fast implementation of this, you don't use iptables which
> does linear list iteration. The reason/rationale/use-case of iptables
> is its many (I believe more than 100 now?) extensions both on the area
> of matches and targets.
>
> Some of those can be implemented easily in BPF (like recomputing the
> checksum or the like). Some others I would find much more difficult -
> particularly if you want to off-load it to the NIC. They require access
> to state that only the kernel has (like 'cgroup' or 'owner' matching).

Yeah, when it comes to offloading, the latter two examples are heavily
tied to upper layers of the (local) stack, so for cases like those it
wouldn't make much sense. But matches, mangling or forwarding based on
packet data are obvious candidates that can already be offloaded today
in a flexible and programmable manner with the existing BPF
infrastructure, so for those it could definitely be interesting to make
use of it (a minimal hand-written sketch of such a packet-data match is
appended below the mail).

>> In the below example, we show that dumping, loading and offloading of
>> one or multiple simple rules work; we show the bpftool XDP dump of the
>> generated BPF instruction sequence as well as a simple functional ping
>> test to enforce policy in such way.
>
> Could you please clarify why the 'filter' table INPUT chain was used if
> you're using XDP? AFAICT they have completely different semantics.
>
> There is a well-conceived and generally understood notion of where
> exactly the filter/INPUT table processing happens. And that's not as
> early as in the NIC, but it's much later in the processing of the
> packet.
>
> I believe _if_ one wants to use the approach of "hiding" eBPF behind
> iptables, then either
>
> a) the eBPF programs must be executed at the exact same points in the
>    stack as the existing hooks of the built-in chains of the
>    filter/nat/mangle/raw tables, or
>
> b) you must introduce new 'tables', like an 'xdp' table which then has
>    the notion of processing very early in processing, way before the
>    normal filter table INPUT processing happens.

Fully agree that the semantics would need to match the exact same points
in the stack, so with regard to hooks, the ingress one would have been a
closer candidate for the provided example.

With regard to why iptables and not nftables: definitely a legitimate
question. Right now most of the infrastructure still relies on iptables
in one way or another. Is there a long-term answer for the existing
applications that programmed against the uapi directly? Meaning, once
the switch is flipped to nftables entirely, would they stop working when
the old iptables gets removed, and what would be the ETA for that? Not
opposed at all to some sort of translation for nftables as well, just
wondering, since for the majority it would probably still take years to
complete such a migration.

Thanks,
Daniel
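As referenced above, a minimal hand-written sketch of roughly what a
simple "-s <ip>/32 -j DROP" rule corresponds to at the XDP layer. This
uses current libbpf conventions, is not the generated output of the PoC,
and the 192.168.0.1 address is only an example; the hook-semantics caveat
raised above still applies, since XDP runs well before filter/INPUT:

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_drop_src(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;
	struct iphdr *iph;

	/* Bounds checks are mandatory, otherwise the verifier rejects
	 * the program.
	 */
	if ((void *)(eth + 1) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;

	/* Match on the /32 source address (network byte order). */
	if (iph->saddr == bpf_htonl(0xc0a80001)) /* 192.168.0.1 */
		return XDP_DROP;

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

Such an object can be built with clang -O2 -target bpf -c and attached
with ip link set dev <dev> xdp obj <file.o>; bpftool can then dump the
resulting instruction sequence as in the cover letter example.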