On Fri, Sep 16, 2016 at 12:57:29PM -0700, Sargun Dhillon wrote:
> On Wed, Sep 14, 2016 at 01:13:16PM +0200, Daniel Mack wrote:
> > Hi Pablo,
> >
> > On 09/13/2016 07:24 PM, Pablo Neira Ayuso wrote:
> > > On Tue, Sep 13, 2016 at 03:31:20PM +0200, Daniel Mack wrote:
> > >> On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
> > >>> On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
> > >>>> This is v5 of the patch set to allow eBPF programs for network
> > >>>> filtering and accounting to be attached to cgroups, so that they apply
> > >>>> to all sockets of all tasks placed in that cgroup. The logic also
> > >>>> allows to be extended for other cgroup based eBPF logic.
> > >>>
> > >>> 1) This infrastructure can only be useful to systemd, or any similar
> > >>> orchestration daemon. Look, you can only apply filtering policies
> > >>> to processes that are launched by systemd, so this only works
> > >>> for server processes.
> > >>
> > >> Sorry, but both statements aren't true. The eBPF policies apply to every
> > >> process that is placed in a cgroup, and my example program in 6/6 shows
> > >> how that can be done from the command line.
> > >
> > > Then you have to explain me how can anyone else than systemd use this
> > > infrastructure?
> >
> > I have no idea what makes you think this is limited to systemd. As I
> > said, I provided an example for userspace that works from the command
> > line. The same limitations apply as for all other users of cgroups.
> >
> So, at least in my work, we have Mesos, but on nearly every machine that Mesos
> runs, people also have systemd. Now, there's recently become a bit of a battle
> of ownership of things like cgroups on these machines. We can usually solve it
> by nesting under systemd cgroups, and thus so far we've avoided making too many
> systemd-specific concessions.
>
> The reason this works (mostly) is that everything we touch has a sense of
> nesting, where we can apply policy at a place lower in the hierarchy, and yet
> systemd's monitoring and policy still stays in place.
>
> Now, with this patch, we don't have that, but I think we can reasonably add some
> flag like "no override" when applying policies, or alternatively something like
> "no new privileges", to prevent children from applying policies that override
> top-level policy. I realize there is a speed concern as well, but I think for
> people who want nested policy, we're willing to make the tradeoff. The cost
> of traversing a few extra pointers is still lower than the overhead of
> network namespaces, iptables, etc. for many of us.
>
> What do you think, Daniel?
>
> > > My main point is that those processes *need* to be launched by the
> > > orchestrator, which is what I was referring to as 'server processes'.
> >
> > Yes, that's right. But as I said, this rule applies to many other kernel
> > concepts, so I don't see any real issue.
> >
> Also, cgroups have become such a big part of how applications are managed
> that many of us have solved this problem.
>
> > >> That's a limitation that applies to many more control mechanisms in the
> > >> kernel, and it's something that can easily be solved with fork+exec.
> > >
> > > As long as you have control to launch the processes yes, but this
> > > will not work in other scenarios. Just like cgroup net_cls and friends
> > > are broken for filtering for things that you have no control to
> > > fork+exec.
> >
> > Probably, but that's only solvable with rules that store the full cgroup
> > path then, and do a string comparison (!) for each packet flying by.
> >
> > >> That's just as transparent as SO_ATTACH_FILTER. What kind of
> > >> introspection mechanism do you have in mind?
> > >
> > > SO_ATTACH_FILTER is called from the process itself, so this is a local
> > > filtering policy that you apply to your own process.
> >
> > Not necessarily. You can as well do it the inetd way, and pass the
> > socket to a process that is launched on demand, but do SO_ATTACH_FILTER
> > + SO_LOCK_FILTER in the middle. What happens with payload on the socket
> > is not transparent to the launched binary at all. The proposed cgroup
> > eBPF solution implements a very similar behavior in that regard.
> >
> It would be nice to be able to see whether or not a filter is attached to a
> cgroup, but given this is going through syscalls, at least introspection
> is possible as opposed to something like netlink.
>
> > >> It's about filtering outgoing network packets of applications, and
> > >> providing them with L2 information for filtering purposes. I don't think
> > >> that's a very specific use-case.
> > >>
> > >> When the feature is not used at all, the added costs on the output path
> > >> are close to zero, due to the use of static branches.
> > >
> > > *You're proposing a socket filtering facility that hooks layer 2
> > > output path*!
> >
> > As I said, I'm open to discussing that. In order to make it work for L3,
> > the LL_OFF issues need to be solved, as Daniel explained. Daniel,
> > Alexei, any idea how much work that would be?
> >
> > > That is only a rough ~30 lines kernel patchset to support this in
> > > netfilter and only one extra input hook, with potential access to
> > > conntrack and better integration with other existing subsystems.
> >
> > Care to share the patches for that? I'd really like to have a look.
> >
> > And FWIW, I agree with Thomas - there is nothing wrong with having
> > multiple options to use for such use-cases.
> Right now, for containers, we have netfilter and network namespaces.
> There's a lot of performance overhead that comes with this. Not only
> that, but iptables doesn't really have a simple way of usage by
> automated infrastructure. We (firewalld, systemd, dockerd, mesos)
> end up fighting with one another for ownership over firewall rules.
>
> Although I have problems with this approach, I think that it's
> a good baseline where we can have top level owned by systemd,
> docker underneath that, and Mesos underneath that. We can add
> additional hooks for things like Checmate and Landlock, and
> with a little more work, we can do composition, solving
> all of our problems.
>
> >
> >
> > Thanks,
> > Daniel
>

Another thing -- It probably makes sense to make the warning in cgroup.c
highlight the fact that it disables these filters as well. Perhaps it
makes sense to make it so you can't disable it (boot flag, say?).
Alternatively, maybe it makes sense to introduce some exclusivity, so
that when you load a filter, it disables net_cls, and when you load
net_cls, it throws warnings.
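
To make the "no override" idea from earlier in this mail a bit more concrete,
here is a rough userspace model of the semantics I have in mind. It is only a
sketch: the `Cgroup` class, the `no_override` flag and the `allows()` walk are
all hypothetical names for illustration, and nothing here is part of the patch
set or of any kernel API. The assumption is that the nearest attached program
wins by default (a child overrides its ancestors), while an ancestor program
attached with `no_override` keeps running no matter what is attached below it.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A "program" is modelled as a callable: True lets the packet through.
Program = Callable[[bytes], bool]


@dataclass
class Cgroup:
    name: str
    parent: Optional["Cgroup"] = None
    program: Optional[Program] = None
    no_override: bool = False  # hypothetical flag, not in the patch set

    def attach(self, program: Program, no_override: bool = False) -> None:
        self.program = program
        self.no_override = no_override

    def allows(self, packet: bytes) -> bool:
        """Walk from this cgroup up to the root and apply the policy."""
        node = self
        overridden = False  # becomes True once the nearest program was found
        while node is not None:
            if node.program is not None:
                # The nearest program always runs. Programs further up run
                # only if they were attached with no_override.
                if not overridden or node.no_override:
                    if not node.program(packet):
                        return False
                overridden = True
            node = node.parent
        return True


def drop_telnet(packet: bytes) -> bool:
    # Toy top-level policy: drop anything that starts with b"telnet".
    return not packet.startswith(b"telnet")


def allow_all(packet: bytes) -> bool:
    # Toy per-task policy that would let everything through.
    return True


# Hierarchy from this thread: systemd on top, docker below, a Mesos task below.
root = Cgroup("systemd")
docker = Cgroup("docker", parent=root)
task = Cgroup("mesos-task", parent=docker)

root.attach(drop_telnet, no_override=True)
task.attach(allow_all)
```

With the flag set on the top-level program, the task's own `allow_all` cannot
open up what the top level drops. The extra cost is the walk up the parent
pointers, which is the tradeoff mentioned above.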