On 10/29/2016 05:34 PM, Lorenzo Colitti wrote:
On Sat, Oct 29, 2016 at 3:24 PM, Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
it could be solved by swapping the order of cgroup_bpf_run_filter()
and NF_INET_POST_ROUTING in patch 5. It was proposed some time back, but
the current patch, I think, is more symmetrical.
cgroup+bpf runs after nf hook on rx and runs before it on tx.
imo it's more consistent.
I guess what I was trying to say was: what does doing this filtering
in ip_output give you over running this from the netfilter hooks?
Doing this filtering in netfilter is much more general because there
can be complex rules both before and after the filtering is applied. I
hadn't thought of the scalability issue you note below though.
For accounting you probably want to run after the hooks, both for
ingress and for egress, because the hooks can do all sorts of stuff
like drop packets, change packet sizes, reroute them to different
interfaces, etc. Do you see use cases where you want to run before the
hooks?
Fwiw, not sure if swapping brings much, even after netfilter there could
be complex processing that would potentially drop, mangle, redirect, etc
from tc layer (egress or from qdisc itself). But also at even lower layers
(although rather unlikely, but not impossible), for example in drivers or
shortly before passing skb to them during segmentation (GSO), etc.
Eventually, for that you'd need to monitor various things, and the cgroup
one is just at higher layers with different semantics.
Regardless of this choice... are you going to backport cgroupv2 to
android? Because this set is v2 only.
Certainly anything that can't easily be backported to, say,
android-4.4 is not really feasible in the short term. I don't think we
use network cgroups at all, so if v2 network cgroups can coexist with
v1 cgroups of other types (which what little I've read seems to
indicate) then that should be possible.
yes. that's certainly doable, but sooner or later such approach will hit
scalability issue when number of cgroups is large. Same issue we saw
with cls_bpf and bpf_skb_under_cgroup(). Hence this patch set was needed
that is centered around cgroups instead of hooks. Note, unlike, tc and nf
there is no way to attach to a hook. The bpf program is attached to a cgroup.
It's an important distinction vs everything that currently exists in the stack.
Ah, I see. Out of curiosity, what was the first scaling limitation you
hit? eBPF program length? eBPF map size?
The scalability issue is not really program length or map size from eBPF
side in this context. While for v1, you have the bpf_get_cgroup_classid()
helper available on egress (not ingress though) that can scale with larger
number of cgroups since it works on the user-defined net_cls tagging, but
for v2, bpf_skb_under_cgroup() was initially introduced, which can only test
whether the sk's v2 cgroup related to the skb is in the sub-hierarchy of
a specific cgroup that is provided via maps. Effectively, when you have a
larger number of v2 cgroups that boolean test will not scale and you need
to linearly test through various cgroups. It's good enough when need to
special case only few cgroups in the v2 hierarchy on egress. Idea was that
attaching to cgroup itself would resolve this from a different angle for
egress and also ingress in a complementary way, but also seems to open up
for various other use-cases at the same time as seen from various patches
on the list.
Cheers,
Daniel
--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html