Hi Daniel,

On Mon, Feb 19, 2018 at 01:03:17PM +0100, Daniel Borkmann wrote:
> Hi Harald,
>
> On 02/17/2018 01:11 PM, Harald Welte wrote:
> [...]
> >> As rule translation can potentially become very complex, this is performed
> >> entirely in user space. In order to ease deployment, request_module() code
> >> is extended to allow user mode helpers to be invoked. Idea is that user mode
> >> helpers are built as part of the kernel build and installed as traditional
> >> kernel modules with .ko file extension into distro specified location,
> >> such that from a distribution point of view, they are no different than
> >> regular kernel modules.
> >
> > That just blew my mind, sorry :)  This goes much beyond
> > netfilter/iptables, and adds quite a significant new piece of
> > kernel/userspace infrastructure.  To me, my apologies, it just sounds
> > like a quite strange hack.  But then, I may lack the vision of how this
> > might be useful in other contexts.
>
> Thought was that it would be more suitable to push all the complexity of
> such translation into user space [...]

Sure, you have no complaints from my side about that goal.  I'm just not
sure if turning the kernel module loader into a new mechanism to start
userspace processes is the right way to get there.  I guess that's a
question that the people involved with core kernel code and the module
loader have to answer.  To me it seems like a very loooong detour away
from the actual topic (packet filtering).

> Given normal user mode helpers make this rather painful since they
> need to be shipped as extra package by the various distros, the idea
> was that the module loader back end could treat umh similarly as
> kernel modules and hook them in through the request_module() approach
> while still operating out of user space. In any case, I could imagine
> this approach might be interesting and useful in general also for
> other subsystems requiring umh in one way or another.

I completely agree this approach has some logic to it.
I just think the approach taken is *very* different from what has been
traditionally done in the Linux world.  All sorts of userspace programs
to configure kernel features (iptables being one of them, iproute2,
etc.) have always been distributed as separate/independent application
programs, which are packaged separately, etc.

Making the kernel source tree build such userspace utilities, and
executing them in a new fashion via the kernel module loader, are to me
two quite large conceptual changes to how "Linux works", and I believe
you will have to "sell" this idea to many people outside the kernel
networking community, i.e. core kernel developers, people who do
packaging, etc.

I'm not saying I'm fundamentally opposed to it.  Will be curious to see
how the wider kernel community thinks of that architecture.

> Right, having a custom iptables, libiptc or LD_PRELOAD approach would work
> as well of course, but it still wouldn't address applications that have
> their own custom libs programmed against iptables uapi directly or those
> that reused a builtin or modified libiptc directly in their application.

How many such wide-spread applications are you aware of?  The two
projects you have pointed out (docker and kubernetes) don't do this.  As
the assumption that many such tools would need to be supported drives a
lot of the design decisions, I would argue one needs a solid empirical
basis for it.

Also, the LD_PRELOAD wrapper *would* work with all those programs.  Only
the iptables command line replacement wouldn't catch them.

> Such requests could only be covered transparently by having a small shim
> layer in kernel and it also wouldn't require any extra packages from distro
> side.

What is wrong with extra packages in distributions?  Distributions will
also have to update the kernel to include your new code, so they could
at the same time ship a new iptables (or $whatever) package.  This is
true for virtually all new kernel features.
Your userland needs to go along with the kernel if it wants to use those
new features.

> > Some of those can be implemented easily in BPF (like recomputing the
> > checksum or the like). Some others I would find much more difficult -
> > particularly if you want to off-load it to the NIC. They require access
> > to state that only the kernel has (like 'cgroup' or 'owner' matching).
>
> Yeah, when it comes to offloading, the latter two examples are heavily tied
> to upper layers of the (local) stack, so for cases like those it wouldn't
> make much sense, but e.g. matches, mangling or forwarding based on packet
> data are obvious candidates that can already be offloaded today in a
> flexible and programmable manner all with existing BPF infra, so for those
> it could definitely be highly interesting to make use of it.

While I believe you that there are many ways one can offload things
flexibly with eBPF, I still have a hard time understanding how you want
to merge this with the existing well-defined notion of when exactly a
given chain of a given table is executed:

* compatibility with iptables only makes sense if you use the legacy
  filter/nat/raw/mangle tables and the existing points in the stack,
  such as PRE_ROUTING/POST_ROUTING/LOCAL_IN/LOCAL_OUT.

* offloading those to a NIC seems rather hard to me, as at those points
  you have basically already traversed half of the Linux kernel stack
  and would then suddenly need to feed the packet back into your
  smart-nic, only to pull it back from there to continue traversing the
  Linux stack.  Forgive me if I'm not following recent developments
  closely enough and that problem has already been solved.

* as soon as you define new 'netfilter hooks' to which the
  ip/ip6/arp/... tables can bind, you can of course define such new
  tables with new built-in chains.
However, at that point I'm starting to wonder why you would then want to
go for "iptables compatibility" in the first place, as no existing
rulesets or programs could be used 1:1; they would need to redesign
their rulesets around those new tables/hooks/chains.

So conceptually, I think if you want semantic backwards compatibility,
all you can do is translate all rules attached to a given netfilter hook
into an eBPF program, and then execute that program instead of the
normal ip_tables traversal.  And that eBPF program would then run on the
normal host CPU.

> The semantics would need to match the exact same points fully agree, so
> wrt hooks the ingress one would have been a more close candidate for the
> provided example.
>
> With regards to why iptables and not nftables, definitely a legit question.
> Right now most of the infra still relies on iptables in one way or another,
> is there an answer for the existing applications that programmed against
> uapi directly for the long term?

As Pablo and Florian have commented, there are legacy
translators/wrappers that "look and feel like iptables" but in turn
translate the rules to nftables and then use the nftables netlink
interface to manage the rules in the kernel.  Is it a perfect 100.0%
translation?  No.  How close it is and what's missing is something they
will have to comment on.

> Meaning once the switch would be flipped to nftables entirely, would they
> stop working when old iptables gets removed?

The various translation/migration scenarios are described here:
https://wiki.nftables.org/wiki-nftables/index.php/Moving_from_iptables_to_nftables

> When would be the ETA for that?

I cannot comment on this.  The current netfilter core team will have to
respond to it.

> Not opposed at all for some sort of translation for the latter, just wondering
> since for the majority it would probably still take years to come to make
> a migration.

Migrations take some time, sure.
It will also take years before such an iptables-to-eBPF translator gets
completed, included in mainline, enabled by distributions, rolled out in
shipped kernel versions, ...

In general, I would argue that compatibility at the command line level
of {ip,ip6,arp}{tables,tables-{save,restore}} is the most important
level of compatibility.  This is what has been done in previous
migrations, e.g. ipchains->iptables.

Translators already exist for *tables -> nftables.  A (possibly slightly
dated) status can be found at
https://wiki.nftables.org/wiki-nftables/index.php/List_of_available_translations_via_iptables-translate_tool
and
https://wiki.nftables.org/wiki-nftables/index.php/Supported_features_compared_to_xtables

It would be an interesting test to see if e.g. docker would run on top
of the translator.  I have no idea if anyone has tried this.  It would
for sure be an interesting investigation.

I would much rather see effort spent on improving the existing
translators, or on helping those projects do the switch to nftables (or
any other new technology), than on introducing new technology bound to
18-year-old uapi interfaces that we all know have many problems.

Regards,
	Harald

-- 
- Harald Welte <laforge@xxxxxxxxxxxx>          http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)