Hi Daniel,

On Mon, Feb 19, 2018 at 01:03:17PM +0100, Daniel Borkmann wrote:
> Hi Harald,
>
> On 02/17/2018 01:11 PM, Harald Welte wrote:
> [...]
> >> As rule translation can potentially become very complex, this is performed
> >> entirely in user space. In order to ease deployment, request_module() code
> >> is extended to allow user mode helpers to be invoked. Idea is that user mode
> >> helpers are built as part of the kernel build and installed as traditional
> >> kernel modules with .ko file extension into distro specified location,
> >> such that from a distribution point of view, they are no different than
> >> regular kernel modules.
> >
> > That just blew my mind, sorry :)  This goes much beyond
> > netfilter/iptables, and adds quite a significant new piece of
> > kernel/userspace infrastructure.  To me, my apologies, it just sounds
> > like a quite strange hack.  But then, I may lack the vision of how this
> > might be useful in other contexts.
>
> Thought was that it would be more suitable to push all the complexity of
> such translation into user space [...]

Sure, you have no complaints from my side about that goal.  I'm just not
sure if turning the kernel module loader into a new mechanism to start
userspace processes is the right way to get there.  I guess that's a
question that the people involved with core kernel code and the module
loader have to answer.  To me it seems like a very loooong detour away
from the actual topic (packet filtering).

> Given normal user mode helpers make this rather painful since they
> need to be shipped as extra package by the various distros, the idea
> was that the module loader back end could treat umh similarly as
> kernel modules and hook them in through the request_module() approach
> while still operating out of user space. In any case, I could imagine
> this approach might be interesting and useful in general also for
> other subsystems requiring umh in one way or another.

I completely agree this approach has some logic to it.
I just think the approach taken is *very* different from what has been
traditionally done in the Linux world.  All sorts of userspace programs
to configure kernel features (iptables being one of them, iproute2,
etc.) have always been distributed as separate/independent application
programs, which are packaged separately, etc.

Making the kernel source tree build such userspace utilities, and
executing them in a new fashion via the kernel module loader, are to me
two quite large conceptual changes to how "Linux works", and I believe
you will have to "sell" this idea to many people outside the kernel
networking community, i.e. core kernel developers, people who do
packaging, etc.

I'm not saying I'm fundamentally opposed to it.  Will be curious to see
how the wider kernel community thinks of that architecture.

> Right, having a custom iptables, libiptc or LD_PRELOAD approach would work
> as well of course, but it still wouldn't address applications that have
> their own custom libs programmed against iptables uapi directly or those
> that reused a builtin or modified libiptc directly in their application.

How many such wide-spread applications are you aware of?  The two
projects you have pointed out (docker and kubernetes) don't do this.  As
the assumption that many such tools would need to be supported drives a
lot of the design decisions, I would argue one needs a solid empirical
basis for it.

Also, the LD_PRELOAD wrapper *would* work with all those programs.  Only
the iptables command line replacement wouldn't catch them.

> Such requests could only be covered transparently by having a small shim
> layer in kernel and it also wouldn't require any extra packages from distro
> side.

What is wrong with extra packages in distributions?  Distributions will
also have to update the kernel to include your new code, so they could
at the same time ship a new iptables (or $whatever) package.  This is
true for virtually all new kernel features.
Your userland needs to go along with the kernel if it wants to use those
new features.

> > Some of those can be implemented easily in BPF (like recomputing the
> > checksum or the like). Some others I would find much more difficult -
> > particularly if you want to off-load it to the NIC. They require access
> > to state that only the kernel has (like 'cgroup' or 'owner' matching).
>
> Yeah, when it comes to offloading, the latter two examples are heavily tied
> to upper layers of the (local) stack, so for cases like those it wouldn't
> make much sense, but e.g. matches, mangling or forwarding based on packet
> data are obvious candidates that can already be offloaded today in a
> flexible and programmable manner all with existing BPF infra, so for those
> it could definitely be highly interesting to make use of it.

While I believe you that there are many ways one can offload things
flexibly with eBPF, I still have a hard time understanding how you want
to merge this with the existing well-defined notion of when exactly a
given chain of a given table is executed:

* compatibility with iptables only makes sense if you use the legacy
  filter/nat/raw/mangle tables and the existing points in the stack,
  such as PRE_ROUTING/POST_ROUTING/LOCAL_IN/LOCAL_OUT.

* offloading those to a NIC seems rather hard to me, as at those points
  you have basically already traversed half of the Linux kernel stack
  and would then suddenly need to feed the packet back into your
  smart-nic, only to pull it back from there to continue traversing the
  Linux stack.  Forgive me if I'm not following recent developments
  closely enough and that problem has already been solved.

* as soon as you define new 'netfilter hooks' to which the
  ip/ip6/arp/... tables can bind, you can of course define such new
  tables with new built-in chains.
However, at that point I'm starting to wonder why you would then want to
go for "iptables compatibility" in the first place, as no existing
rulesets or programs could be used 1:1; they would need to redesign
their rulesets around those new tables/hooks/chains.

So conceptually, I think if you want semantic backwards compatibility,
all you can do is translate all rules attached to a given netfilter hook
into an eBPF program, and then execute that program instead of the
normal ip_tables traversal.  And that eBPF program would then run on the
normal host CPU.

> The semantics would need to match the exact same points fully agree, so
> wrt hooks the ingress one would have been a more close candidate for the
> provided example.
>
> With regards to why iptables and not nftables, definitely a legit question.
> Right now most of the infra still relies on iptables in one way or another,
> is there an answer for the existing applications that programmed against
> uapi directly for the long term?

As Pablo and Florian have commented, there are legacy
translators/wrappers that "look and feel like iptables" but in turn
translate the rules to nftables and then use the nftables netlink
interface to manage the rules in the kernel.  Is it a perfect 100.0%
translation?  No.  How close it is and what's missing is something they
will have to comment on.

> Meaning once the switch would be flipped to nftables entirely, would they
> stop working when old iptables gets removed?

The various translation/migration scenarios are described here:
https://wiki.nftables.org/wiki-nftables/index.php/Moving_from_iptables_to_nftables

> When would be the ETA for that?

I cannot comment on this.  The current netfilter core team will have to
respond to it.

> Not opposed at all for some sort of translation for the latter, just wondering
> since for the majority it would probably still take years to come to make
> a migration.

Migrations take some time, sure.
It will also take years before such an iptables-to-eBPF translator gets
completed, included in mainline, enabled by distributions, rolled out in
shipped kernel versions, ...

In general, I would argue that compatibility at the command line level
of {ip,ip6,arp}{tables,tables-{save,restore}} is the most important
level of compatibility.  This is what has been done in previous
migrations, e.g. ipchains->iptables.

Translators already exist for *tables -> nftables.  A (possibly slightly
dated) status can be found at
https://wiki.nftables.org/wiki-nftables/index.php/List_of_available_translations_via_iptables-translate_tool
and
https://wiki.nftables.org/wiki-nftables/index.php/Supported_features_compared_to_xtables

It would be an interesting test to see if e.g. docker would run on top
of the translator.  I have no idea if anyone has tried this.  It would
for sure be an interesting investigation.

I would much rather see effort spent on improving the existing
translators, or on helping those projects do the switch to nftables (or
any other new technology), than on introducing new technology bound to
18-year-old uapi interfaces that we all know have many problems.

Regards,
	Harald

-- 
- Harald Welte <laforge@xxxxxxxxxxxx>          http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)