Hi Harald,

On 02/17/2018 01:11 PM, Harald Welte wrote:
[...]
>> As rule translation can potentially become very complex, this is performed
>> entirely in user space. In order to ease deployment, request_module() code
>> is extended to allow user mode helpers to be invoked. Idea is that user mode
>> helpers are built as part of the kernel build and installed as traditional
>> kernel modules with .ko file extension into distro specified location,
>> such that from a distribution point of view, they are no different than
>> regular kernel modules.
>
> That just blew my mind, sorry :) This goes much beyond
> netfilter/iptables, and adds some quite significant new piece of
> kernel/userspace infrastructure. To me, my apologies, it just sounds
> like a quite strange hack. But then, I may lack the vision of how this
> might be useful in other contexts.

The thought was that it would be more suitable to push all the complexity
of such a translation into user space, which brings a couple of additional
advantages as well: the translation can become very complex, and keeping
it in user space puts all of it behind the syscall boundary, where the
natural path of loading the generated programs goes through the verifier.
Since the tool would reside in user space, development becomes easier and
testing can happen without recompiling the kernel. It would also allow all
the clang sanitizers to run there, and a comprehensive test suite could
verify and dry-test translations against traffic test patterns (e.g. the
BPF infrastructure already provides possibilities for this without a
complex setup). Since normal user mode helpers make deployment rather
painful, as they need to be shipped as an extra package by the various
distros, the idea was that the module loader back end could treat such
helpers similarly to kernel modules and hook them in through the
request_module() approach while still operating out of user space. In any
case, I could imagine this approach being interesting and useful in
general for other subsystems requiring a umh in one way or another.

> I'm trying to understand why exactly one would
> * use an 18 year old iptables userspace program with its equally old
>   setsockopt based interface between kernel and userspace
> * insert an entire table with many chains of rules into the kernel
> * re-eject that ruleset into another userspace program which then
>   compiles it into an eBPF program
> * insert that back into the kernel
>
> To me, this looks like some kind of legacy backwards compatibility
> mechanism that one would find in proprietary operating systems, but not
> in Linux. iptables, libiptc etc. are all free software. The source
> code can be edited, and you could just as well have a new version of
> iptables and/or libiptc which would pass the ruleset in userspace to
> your compiler, which would then insert the resulting eBPF program.
>
> You could even have a LD_PRELOAD wrapper doing the same. That one
> would even work with direct users of the iptables setsockopt interface.
>
> Why add quite comprehensive kernel infrastructure? What's the motivation
> here?

Right, a custom iptables, libiptc or LD_PRELOAD approach would work as
well, of course, but it still wouldn't address applications that have
their own custom libraries programmed against the iptables uapi directly,
or those that reused a built-in or modified libiptc in their application.
Such requests can only be covered transparently by a small shim layer in
the kernel, and that approach also doesn't require any extra packages
from the distro side.

[...]
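To make the LD_PRELOAD variant discussed above a bit more concrete, here
is a rough sketch of what such a wrapper would have to do. It is not part
of the patch set; translate_and_load_bpf() is only a hypothetical
stand-in for the user space translator:

/* ipt_shim.c - hypothetical LD_PRELOAD sketch, not an existing tool */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/netfilter_ipv4/ip_tables.h>

typedef int (*setsockopt_fn)(int, int, int, const void *, socklen_t);

/* Hypothetical placeholder: would compile the ipt_replace blob into an
 * eBPF program and attach it; here it only pretends to succeed. */
static int translate_and_load_bpf(const struct ipt_replace *repl)
{
	(void)repl;
	return 0;
}

int setsockopt(int fd, int level, int optname, const void *optval,
	       socklen_t optlen)
{
	static setsockopt_fn real_setsockopt;

	if (!real_setsockopt)
		real_setsockopt = (setsockopt_fn)dlsym(RTLD_NEXT, "setsockopt");

	/* iptables replaces a whole table via IPT_SO_SET_REPLACE on a raw
	 * IPv4 socket; divert that blob to the translator instead. */
	if (level == IPPROTO_IP && optname == IPT_SO_SET_REPLACE &&
	    optval && optlen >= sizeof(struct ipt_replace))
		return translate_and_load_bpf(optval);

	return real_setsockopt(fd, level, optname, optval, optlen);
}

Built as a shared object (e.g. gcc -shared -fPIC -o ipt_shim.so
ipt_shim.c -ldl, file name made up) and run as
LD_PRELOAD=$PWD/ipt_shim.so iptables ..., this would divert the table
replace. But anything that is statically linked or issues the syscall
directly bypasses the wrapper, which is exactly the gap the in-kernel
shim is meant to close.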
>> In the implemented proof of concept we show that simple /32 src/dst IPs
>> are translated in such manner.
>
> Of course this is the first thing that one starts with. However, as we all
> know, iptables was never very good or efficient about 5-tuple matching.
> If you want a fast implementation of this, you don't use iptables which
> does linear list iteration. The reason/rationale/use-case of iptables
> is its many (I believe more than 100 now?) extensions both on the area
> of matches and targets.
>
> Some of those can be implemented easily in BPF (like recomputing the
> checksum or the like). Some others I would find much more difficult -
> particularly if you want to off-load it to the NIC. They require access
> to state that only the kernel has (like 'cgroup' or 'owner' matching).

Yeah, when it comes to offloading, the latter two examples are heavily
tied to upper layers of the (local) stack, so for cases like those it
wouldn't make much sense. But matches, mangling or forwarding based on
packet data are obvious candidates that can already be offloaded today
in a flexible and programmable manner with the existing BPF
infrastructure, so for those it could definitely be interesting to make
use of it (a minimal hand-written sketch of such a packet-data match is
appended below the mail).

>> In the below example, we show that dumping, loading and offloading of
>> one or multiple simple rules work; we show the bpftool XDP dump of the
>> generated BPF instruction sequence as well as a simple functional ping
>> test to enforce policy in such way.
>
> Could you please clarify why the 'filter' table INPUT chain was used if
> you're using XDP? AFAICT they have completely different semantics.
>
> There is a well-conceived and generally understood notion of where
> exactly the filter/INPUT table processing happens. And that's not as
> early as in the NIC, but it's much later in the processing of the
> packet.
>
> I believe _if_ one wants to use the approach of "hiding" eBPF behind
> iptables, then either
>
> a) the eBPF programs must be executed at the exact same points in the
>    stack as the existing hooks of the built-in chains of the
>    filter/nat/mangle/raw tables, or
>
> b) you must introduce new 'tables', like an 'xdp' table which then has
>    the notion of processing very early in processing, way before the
>    normal filter table INPUT processing happens.

Fully agree that the semantics would need to match the exact same points
in the stack, so with regard to hooks, the ingress one would have been a
closer candidate for the provided example.

With regard to why iptables and not nftables: definitely a legitimate
question. Right now most of the infrastructure still relies on iptables
in one way or another. Is there a long-term answer for the existing
applications that programmed against the uapi directly? Meaning, once
the switch is flipped to nftables entirely, would they stop working when
the old iptables gets removed, and what would be the ETA for that? Not
opposed at all to some sort of translation for nftables as well, just
wondering, since for the majority it would probably still take years to
complete such a migration.

Thanks,
Daniel
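As referenced above, a minimal hand-written sketch of roughly what a
simple "-s <ip>/32 -j DROP" rule corresponds to at the XDP layer. This
uses current libbpf conventions, is not the generated output of the PoC,
and the 192.168.0.1 address is only an example; the hook-semantics caveat
raised above still applies, since XDP runs well before filter/INPUT:

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_drop_src(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;
	struct iphdr *iph;

	/* Bounds checks are mandatory, otherwise the verifier rejects
	 * the program.
	 */
	if ((void *)(eth + 1) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;

	/* Match on the /32 source address (network byte order). */
	if (iph->saddr == bpf_htonl(0xc0a80001)) /* 192.168.0.1 */
		return XDP_DROP;

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

Such an object can be built with clang -O2 -target bpf -c and attached
with ip link set dev <dev> xdp obj <file.o>; bpftool can then dump the
resulting instruction sequence as in the cover letter example.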