Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

Daniel Borkmann <daniel@xxxxxxxxxxxxx> · Tue, 13 Sep 2016 16:14:34 +0200

On 09/13/2016 03:31 PM, Daniel Mack wrote:
On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
This is v5 of the patch set to allow eBPF programs for network
filtering and accounting to be attached to cgroups, so that they apply
to all sockets of all tasks placed in that cgroup. The logic also
allows to be extendeded for other cgroup based eBPF logic.

1) This infrastructure can only be useful to systemd, or any similar
    orchestration daemon. Look, you can only apply filtering policies
    to processes that are launched by systemd, so this only works
    for server processes.

Sorry, but both statements aren't true. The eBPF policies apply to every
process that is placed in a cgroup, and my example program in 6/6 shows
how that can be done from the command line. Also, systemd is able to
control userspace processes just fine, and it not limited to 'server
processes'.

For client processes this infrastructure is
    *racy*, you have to add new processes in runtime to the cgroup,
    thus there will be time some little time where no filtering policy
    will be applied. For quality of service, this may be an acceptable
    race, but this is aiming to deploy a filtering policy.

That's a limitation that applies to many more control mechanisms in the
kernel, and it's something that can easily be solved with fork+exec.

2) This aproach looks uninfrastructured to me. This provides a hook
    to push a bpf blob at a place in the stack that deploys a filtering
    policy that is not visible to others.

That's just as transparent as SO_ATTACH_FILTER. What kind of
introspection mechanism do you have in mind?

We have interfaces that allows
    us to dump the filtering policy that is being applied, report events
    to enable cooperation between several processes with similar
    capabilities and so on.

Well, in practice, for netfilter, there can only be one instance in the
system that acts as central authoritative, otherwise you'll end up with
orphaned entries or with situation where some client deletes rules
behind the back of the one that originally installed it. So I really
think there is nothing wrong with demanding a single, privileged
controller to manage things.

After chatting with Daniel Borkmann and Alexei off-list, we concluded
that __dev_queue_xmit() is the place where the egress hooks should live
when eBPF programs need access to the L2 bits of the skb.

3) This egress hook is coming very late, the only reason I find to
    place it at __dev_queue_xmit() is that bpf naturally works with
    layer 2 information in place. But this new hook is placed in
    _everyone's output ath_ that only works for the very specific
    usecase I exposed above.

It's about filtering outgoing network packets of applications, and
providing them with L2 information for filtering purposes. I don't think
that's a very specific use-case.

When the feature is not used at all, the added costs on the output path
are close to zero, due to the use of static branches. If used somewhere
in the system but not for the packet in flight, costs are slightly
higher but acceptable. In fact, it's not even measurable in my tests
here. How is that different from the netfilter OUTPUT hook, btw?

That said, limiting it to L3 is still an option. It's just that we need
ingress and egress to be in sync, so both would be L3 then. So far, the
possible advantages for future use-cases having access to L2 outweighed
the concerns of putting the hook to dev_queue_xmit(), but I'm open to
discussing that.

While I fully disagree with Pablo's point 1) and 2), in the last set I
raised a similar concern as in point 3) wrt __dev_queue_xmit(). The set
as-is would indeed need the L2 info, since a filter could do a load via
LLVM built-ins such as asm("llvm.bpf.load.byte") et al, with BPF_LL_OFF,
where we're forced to do a load relative to skb_mac_header(). As stated
by Daniel already, it would be nice to see the full frame, so it comes
down to a trade-off, but the option of L3 onwards also exists and BPF can
work just fine with it, too. This just means it's placed in the local
output path and the verifier would need to disallow these built-ins during
bpf(2) load time. They are a rather cumbersome legacy anyway, so
bpf_skb_load_bytes() helper can be used instead, which is also easier
to use.

The main concern during the workshop was that a hook only for cgroups
is too specific, but this is actually even more specific than this.

This patch set merely implements an infrastructure that can accommodate
many more things as well in the future. We could, in theory, even add
hooks for forwarded packets specifically, or other eBPF programs, not
even for network filtering etc.

I have nothing against systemd or the needs for more
programmability/flexibility in the stack, but I think this needs to
fulfill some requirements to fit into the infrastructure that we have
in the right way.

Well, as I explained already, this patch set results from endless
discussions that went nowhere, about how such a thing can be achieved
with netfilter.

Thanks,
Daniel

--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html