On Fri, Mar 27, 2020 at 8:59 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: > > In Cilium we're mainly using BPF cgroup hooks today in order to implement > kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*), > ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic > between Cilium managed nodes. While this works in its current shape and avoids > packet-level NAT for inter Cilium managed node traffic, there is one major > limitation we're facing today, that is, lack of netns awareness. > > In Kubernetes, the concept of Pods (which hold one or multiple containers) > has been built around network namespaces, so while we can use the global scope > of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing > NodePort ports on loopback addresses), we also have the need to differentiate > between initial network namespaces and non-initial one. For example, ExternalIP > services mandate that non-local service IPs are not to be translated from the > host (initial) network namespace as one example. Right now, we have an ugly > work-around in place where non-local service IPs for ExternalIP services are > not xlated from connect() and friends BPF hooks but instead via less efficient > packet-level NAT on the veth tc ingress hook for Pod traffic. > > On top of determining whether we're in initial or non-initial network namespace > we also have a need for a socket-cookie like mechanism for network namespaces > scope. Socket cookies have the nice property that they can be combined as part > of the key structure e.g. for BPF LRU maps without having to worry that the > cookie could be recycled. We are planning to use this for our sessionAffinity > implementation for services. Therefore, add a new bpf_get_netns_cookie() helper > which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would > provide the cookie for the initial network namespace while passing the context > instead of NULL would provide the cookie from the application's network namespace. > We're using a hole, so no size increase; the assignment happens only once. > Therefore this allows for a comparison on initial namespace as well as regular > cookie usage as we have today with socket cookies. We could later on enable > this helper for other program types as well as we would see need. > > (*) Both externalTrafficPolicy={Local|Cluster} types > [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c > > Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx> > --- > include/linux/bpf.h | 1 + > include/net/net_namespace.h | 10 +++++++++ > include/uapi/linux/bpf.h | 16 ++++++++++++++- > kernel/bpf/verifier.c | 16 +++++++++------ > net/core/filter.c | 37 ++++++++++++++++++++++++++++++++++ > net/core/net_namespace.c | 15 ++++++++++++++ > tools/include/uapi/linux/bpf.h | 16 ++++++++++++++- > 7 files changed, 103 insertions(+), 8 deletions(-) > [...] > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 745f3cfdf3b2..cb30b34d1466 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -3461,13 +3461,17 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, > expected_type = CONST_PTR_TO_MAP; > if (type != expected_type) > goto err_type; > - } else if (arg_type == ARG_PTR_TO_CTX) { > + } else if (arg_type == ARG_PTR_TO_CTX || > + arg_type == ARG_PTR_TO_CTX_OR_NULL) { > expected_type = PTR_TO_CTX; > - if (type != expected_type) > - goto err_type; > - err = check_ctx_reg(env, reg, regno); > - if (err < 0) > - return err; > + if (!(register_is_null(reg) && > + arg_type == ARG_PTR_TO_CTX_OR_NULL)) { Other parts of check_func_arg() have different pattern that doesn't negate this complicated condition: if (register_is_null(reg) && arg_type == ARG_PTR_TO_CTX_OR_NULL) ; else { ... } It's marginally easier to follow, though still increases nestedness :( > + if (type != expected_type) > + goto err_type; > + err = check_ctx_reg(env, reg, regno); > + if (err < 0) > + return err; > + } > } else if (arg_type == ARG_PTR_TO_SOCK_COMMON) { > expected_type = PTR_TO_SOCK_COMMON; > /* Any sk pointer can be ARG_PTR_TO_SOCK_COMMON */ [...] > +static const struct bpf_func_proto bpf_get_netns_cookie_sock_addr_proto = { > + .func = bpf_get_netns_cookie_sock_addr, > + .gpl_only = false, > + .ret_type = RET_INTEGER, > + .arg1_type = ARG_PTR_TO_CTX_OR_NULL, Just for completeness sake, have you considered using two helpers - one accepting context and other accepting nothing instead of adding ARG_PTR_TO_CTX_OR_NULL? Would it be too bad? > +}; > + [...] > +static atomic64_t cookie_gen; > + > +u64 net_gen_cookie(struct net *net) > +{ > + while (1) { > + u64 res = atomic64_read(&net->net_cookie); > + > + if (res) > + return res; > + res = atomic64_inc_return(&cookie_gen); > + atomic64_cmpxchg(&net->net_cookie, 0, res); you'll always do extra atomic64_read, even if you succeed on the first try. Why not: while (1) { u64 res = atomic64_read(&net->net_cookie); if (res) return res; res = atomic64_inc_return(&cookie_gen); if (atomic64_cmpxchg(&net->net_cookie, 0, res) == 0) return res; } > + } > +} > + [...]