Re: [PATCH bpf 1/2] bpf: fix unconnected udp hooks

Martin Lau <kafai@xxxxxx> · Wed, 5 Jun 2019 23:54:57 +0000

On Wed, Jun 05, 2019 at 10:40:49PM +0200, Daniel Borkmann wrote:
> Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
> to applications as also stated in original motivation in 7828f20e3779 ("Merge
> branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
> two hooks into Cilium to enable host based load-balancing with Kubernetes,
> I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
> typically sets up DNS as a service and is thus subject to load-balancing.
> 
> Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
> is currently insufficent and thus not usable as-is for standard applications
> shipped with most distros. To break down the issue we ran into with a simple
> example:
> 
>   # cat /etc/resolv.conf
>   nameserver 147.75.207.207
>   nameserver 147.75.207.208
> 
> For the purpose of a simple test, we set up above IPs as service IPs and
> transparently redirect traffic to a different DNS backend server for that
> node:
> 
>   # cilium service list
>   ID   Frontend            Backend
>   1    147.75.207.207:53   1 => 8.8.8.8:53
>   2    147.75.207.208:53   1 => 8.8.8.8:53
> 
> The attached BPF program is basically selecting one of the backends if the
> service IP/port matches on the cgroup hook. DNS breaks here, because the
> hooks are not transparent enough to applications which have built-in msg_name
> address checks:
> 
>   # nslookup 1.1.1.1
>   ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
>   ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
>   ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
>   [...]
>   ;; connection timed out; no servers could be reached
> 
>   # dig 1.1.1.1
>   ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
>   ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
>   ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
>   [...]
> 
>   ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
>   ;; global options: +cmd
>   ;; connection timed out; no servers could be reached
> 
> For comparison, if none of the service IPs is used, and we tell nslookup
> to use 8.8.8.8 directly it works just fine, of course:
> 
>   # nslookup 1.1.1.1 8.8.8.8
>   1.1.1.1.in-addr.arpa	name = one.one.one.one.
> 
> In order to fix this and thus act more transparent to the application,
> this needs reverse translation on recvmsg() side. A minimal fix for this
> API is to add similar recvmsg() hooks behind the BPF cgroups static key
> such that the program can track state and replace the current sockaddr_in{,6}
> with the original service IP. From BPF side, this basically tracks the
> service tuple plus socket cookie in an LRU map where the reverse NAT can
> then be retrieved via map value as one example. Side-note: the BPF cgroups
> static key should be converted to a per-hook static key in future.
> 
> Same example after this fix:
> 
>   # cilium service list
>   ID   Frontend            Backend
>   1    147.75.207.207:53   1 => 8.8.8.8:53
>   2    147.75.207.208:53   1 => 8.8.8.8:53
> 
> Lookups work fine now:
> 
>   # nslookup 1.1.1.1
>   1.1.1.1.in-addr.arpa    name = one.one.one.one.
> 
>   Authoritative answers can be found from:
> 
>   # dig 1.1.1.1
> 
>   ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
>   ;; global options: +cmd
>   ;; Got answer:
>   ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
>   ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
> 
>   ;; OPT PSEUDOSECTION:
>   ; EDNS: version: 0, flags:; udp: 512
>   ;; QUESTION SECTION:
>   ;1.1.1.1.                       IN      A
> 
>   ;; AUTHORITY SECTION:
>   .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
> 
>   ;; Query time: 17 msec
>   ;; SERVER: 147.75.207.207#53(147.75.207.207)
>   ;; WHEN: Tue May 21 12:59:38 UTC 2019
>   ;; MSG SIZE  rcvd: 111
> 
> And from an actual packet level it shows that we're using the back end
> server when talking via 147.75.207.20{7,8} front end:
> 
>   # tcpdump -i any udp
>   [...]
>   12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
>   12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
>   12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
>   12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
>   [...]
> 
> In order to be flexible and to have same semantics as in sendmsg BPF
> programs, we only allow return codes in [1,1] range. In the sendmsg case
> the program is called if msg->msg_name is present which can be the case
> in both, connected and unconnected UDP.
> 
> The former only relies on the sockaddr_in{,6} passed via connect(2) if
> passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
> way to call into the BPF program whenever a non-NULL msg->msg_name was
> passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
> that for TCP case, the msg->msg_name is ignored in the regular recvmsg
> path and therefore not relevant.
> 
> For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
> the hook is not called. This is intentional as it aligns with the same
> semantics as in case of TCP cgroup BPF hooks right now. This might be
> better addressed in future through a different bpf_attach_type such
> that this case can be distinguished from the regular recvmsg paths,
> for example.
LGTM.

They are new hooks.  Should it belong to bpf-next instead?

> 
> Fixes: 1cedee13d25a ("bpf: Hooks for sys_sendmsg")
> Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> Cc: Andrey Ignatov <rdna@xxxxxx>
> Cc: Martynas Pumputis <m@xxxxxxxxx>
> ---
>  include/linux/bpf-cgroup.h     |  8 ++++++++
>  include/uapi/linux/bpf.h       |  2 ++
>  kernel/bpf/syscall.c           |  8 ++++++++
>  kernel/bpf/verifier.c          | 12 ++++++++----
>  net/core/filter.c              |  2 ++
>  net/ipv4/udp.c                 |  4 ++++
>  net/ipv6/udp.c                 |  4 ++++
>  tools/bpf/bpftool/cgroup.c     |  5 ++++-
>  tools/include/uapi/linux/bpf.h |  2 ++
Should the bpf.h sync to tools/ be in a separate patch?

[ ... ]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 95f93544..d2c8a66 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -5361,9 +5361,12 @@ static int check_return_code(struct bpf_verifier_env *env)
>  	struct tnum range = tnum_range(0, 1);
>  
>  	switch (env->prog->type) {
> +	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
> +		if (env->prog->expected_attach_type == BPF_CGROUP_UDP4_RECVMSG ||
> +		    env->prog->expected_attach_type == BPF_CGROUP_UDP6_RECVMSG)
> +			range = tnum_range(1, 1);
>  	case BPF_PROG_TYPE_CGROUP_SKB:
>  	case BPF_PROG_TYPE_CGROUP_SOCK:
> -	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
>  	case BPF_PROG_TYPE_SOCK_OPS:
>  	case BPF_PROG_TYPE_CGROUP_DEVICE:
>  	case BPF_PROG_TYPE_CGROUP_SYSCTL:
> @@ -5380,16 +5383,17 @@ static int check_return_code(struct bpf_verifier_env *env)
>  	}
>  
>  	if (!tnum_in(range, reg->var_off)) {
> +		char tn_buf[48];
> +
>  		verbose(env, "At program exit the register R0 ");
>  		if (!tnum_is_unknown(reg->var_off)) {
> -			char tn_buf[48];
> -
>  			tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
>  			verbose(env, "has value %s", tn_buf);
>  		} else {
>  			verbose(env, "has unknown scalar value");
>  		}
> -		verbose(env, " should have been 0 or 1\n");
> +		tnum_strn(tn_buf, sizeof(tn_buf), range);
> +		verbose(env, " should have been in %s\n", tn_buf);
A heads up that it may have a confict with a bpf-next commit
5cf1e9145630 ("bpf: cgroup inet skb programs can return 0 to 3")

>  		return -EINVAL;
>  	}
>  	return 0;