Re: [PATCH bpf-next 3/7] bpf: add netns cookie and enable it for bpf cgroup hooks

Daniel Borkmann <daniel@xxxxxxxxxxxxx> · Sat, 28 Mar 2020 02:30:54 +0100

On 3/28/20 1:32 AM, Andrii Nakryiko wrote:
On Fri, Mar 27, 2020 at 8:59 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:

In Cilium we're mainly using BPF cgroup hooks today in order to implement
kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
between Cilium managed nodes. While this works in its current shape and avoids
packet-level NAT for inter Cilium managed node traffic, there is one major
limitation we're facing today, that is, lack of netns awareness.

In Kubernetes, the concept of Pods (which hold one or multiple containers)
has been built around network namespaces, so while we can use the global scope
of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
NodePort ports on loopback addresses), we also have the need to differentiate
between initial network namespaces and non-initial one. For example, ExternalIP
services mandate that non-local service IPs are not to be translated from the
host (initial) network namespace as one example. Right now, we have an ugly
work-around in place where non-local service IPs for ExternalIP services are
not xlated from connect() and friends BPF hooks but instead via less efficient
packet-level NAT on the veth tc ingress hook for Pod traffic.

On top of determining whether we're in initial or non-initial network namespace
we also have a need for a socket-cookie like mechanism for network namespaces
scope. Socket cookies have the nice property that they can be combined as part
of the key structure e.g. for BPF LRU maps without having to worry that the
cookie could be recycled. We are planning to use this for our sessionAffinity
implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
provide the cookie for the initial network namespace while passing the context
instead of NULL would provide the cookie from the application's network namespace.
We're using a hole, so no size increase; the assignment happens only once.
Therefore this allows for a comparison on initial namespace as well as regular
cookie usage as we have today with socket cookies. We could later on enable
this helper for other program types as well as we would see need.

   (*) Both externalTrafficPolicy={Local|Cluster} types
   [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c

Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
---
  include/linux/bpf.h            |  1 +
  include/net/net_namespace.h    | 10 +++++++++
  include/uapi/linux/bpf.h       | 16 ++++++++++++++-
  kernel/bpf/verifier.c          | 16 +++++++++------
  net/core/filter.c              | 37 ++++++++++++++++++++++++++++++++++
  net/core/net_namespace.c       | 15 ++++++++++++++
  tools/include/uapi/linux/bpf.h | 16 ++++++++++++++-
  7 files changed, 103 insertions(+), 8 deletions(-)


[...]

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 745f3cfdf3b2..cb30b34d1466 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3461,13 +3461,17 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
                 expected_type = CONST_PTR_TO_MAP;
                 if (type != expected_type)
                         goto err_type;
-       } else if (arg_type == ARG_PTR_TO_CTX) {
+       } else if (arg_type == ARG_PTR_TO_CTX ||
+                  arg_type == ARG_PTR_TO_CTX_OR_NULL) {
                 expected_type = PTR_TO_CTX;
-               if (type != expected_type)
-                       goto err_type;
-               err = check_ctx_reg(env, reg, regno);
-               if (err < 0)
-                       return err;
+               if (!(register_is_null(reg) &&
+                     arg_type == ARG_PTR_TO_CTX_OR_NULL)) {

Other parts of check_func_arg() have different pattern that doesn't
negate this complicated condition:

if (register_is_null(reg) && arg_type == ARG_PTR_TO_CTX_OR_NULL)
     ;
else {
     ...
}

It's marginally easier to follow, though still increases nestedness :(

Yeah, that looks a bit ugly tbh, but perhaps personal style/preference. I
tend to avoid such empty bodies.

+                       if (type != expected_type)
+                               goto err_type;
+                       err = check_ctx_reg(env, reg, regno);
+                       if (err < 0)
+                               return err;
+               }
         } else if (arg_type == ARG_PTR_TO_SOCK_COMMON) {
                 expected_type = PTR_TO_SOCK_COMMON;
                 /* Any sk pointer can be ARG_PTR_TO_SOCK_COMMON */

[...]

+static const struct bpf_func_proto bpf_get_netns_cookie_sock_addr_proto = {
+       .func           = bpf_get_netns_cookie_sock_addr,
+       .gpl_only       = false,
+       .ret_type       = RET_INTEGER,
+       .arg1_type      = ARG_PTR_TO_CTX_OR_NULL,

Just for completeness sake, have you considered using two helpers -
one accepting context and other accepting nothing instead of adding
ARG_PTR_TO_CTX_OR_NULL? Would it be too bad?

I haven't considered it since it feels a bit too much churn to the helper
side for just this simple addition. Both helpers would be doing the same
and I would need to have it duplicated for sock_addr and sock once again
given we need it in both connect et al and bind contexts.

+};
+

[...]

+static atomic64_t cookie_gen;
+
+u64 net_gen_cookie(struct net *net)
+{
+       while (1) {
+               u64 res = atomic64_read(&net->net_cookie);
+
+               if (res)
+                       return res;
+               res = atomic64_inc_return(&cookie_gen);
+               atomic64_cmpxchg(&net->net_cookie, 0, res);

you'll always do extra atomic64_read, even if you succeed on the first
try. Why not:

while (1) {
    u64 res = atomic64_read(&net->net_cookie);
    if (res)
        return res;
    res = atomic64_inc_return(&cookie_gen);
    if (atomic64_cmpxchg(&net->net_cookie, 0, res) == 0)
       return res;
}

Right, though it's on same CPU, so might not make too much of a noticeable
difference. I've used same scheme here as with socket cookies actually. I
thought about consolidating the generator into a separate function for both
as next step and reuse some of the ideas from get_next_ino() for batching to
optimize the atomic op cost, which should make a difference, but will do as
a separate patch.

+       }
+}
+

[...]