Re: [RFC bpf-next 0/8] BPF 'force to MPTCP'

Stanislav Fomichev <sdf@xxxxxxxxxx> · Thu, 13 Jul 2023 16:09:34 -0700

On 07/13, Alexei Starovoitov wrote:
> On Thu, Jul 13, 2023 at 10:14 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> >
> > On 07/13, Geliang Tang wrote:
> > > On Thu, Jul 06, 2023 at 10:02:43AM -0700, Stanislav Fomichev wrote:
> > > > On 07/06, Geliang Tang wrote:
> > > > > As is described in the "How to use MPTCP?" section in MPTCP wiki [1]:
> > > > >
> > > > > "Your app can create sockets with IPPROTO_MPTCP as the proto:
> > > > > ( socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP); ). Legacy apps can be
> > > > > forced to create and use MPTCP sockets instead of TCP ones via the
> > > > > mptcpize command bundled with the mptcpd daemon."
> > > > >
> > > > > But the mptcpize (LD_PRELOAD technique) command has some limitations
> > > > > [2]:
> > > > >
> > > > >  - it doesn't work if the application is not using libc (e.g. GoLang
> > > > > apps)
> > > > >  - in some envs, it might not be easy to set env vars / change the way
> > > > > apps are launched, e.g. on Android
> > > > >  - mptcpize needs to be launched with all apps that want MPTCP: we could
> > > > > have more control from BPF to enable MPTCP only for some apps or all the
> > > > > ones of a netns or a cgroup, etc.
> > > > >  - it is not in BPF, we cannot talk about it at netdev conf.
> > > > >
> > > > > So this patchset attempts to use BPF to implement functions similer to
> > > > > mptcpize.
> > > > >
> > > > > The main idea is add a hook in sys_socket() to change the protocol id
> > > > > from IPPROTO_TCP (or 0) to IPPROTO_MPTCP.
> > > > >
> > > > > 1. This first solution [3] is using "cgroup/sock_create":
> > > > >
> > > > > Implement a new helper bpf_mptcpify() to change the protocol id:
> > > > >
> > > > > +BPF_CALL_1(bpf_mptcpify, struct sock *, sk)
> > > > > +{
> > > > > + if (sk && sk_fullsock(sk) && sk->sk_protocol == IPPROTO_TCP) {
> > > > > +         sk->sk_protocol = IPPROTO_MPTCP;
> > > > > +         return (unsigned long)sk;
> > > > > + }
> > > > > +
> > > > > + return (unsigned long)NULL;
> > > > > +}
> > > > > +
> > > > > +const struct bpf_func_proto bpf_mptcpify_proto = {
> > > > > + .func           = bpf_mptcpify,
> > > > > + .gpl_only       = false,
> > > > > + .ret_type       = RET_PTR_TO_BTF_ID_OR_NULL,
> > > > > + .ret_btf_id     = &btf_sock_ids[BTF_SOCK_TYPE_SOCK],
> > > > > + .arg1_type      = ARG_PTR_TO_CTX,
> > > > > +};
> > > > >
> > > > > Add a hook in "cgroup/sock_create" section, invoking bpf_mptcpify()
> > > > > helper in this hook:
> > > > >
> > > > > +SEC("cgroup/sock_create")
> > > > > +int sock(struct bpf_sock *ctx)
> > > > > +{
> > > > > + struct sock *sk;
> > > > > +
> > > > > + if (ctx->type != SOCK_STREAM)
> > > > > +         return 1;
> > > > > +
> > > > > + sk = bpf_mptcpify(ctx);
> > > > > + if (!sk)
> > > > > +         return 1;
> > > > > +
> > > > > + protocol = sk->sk_protocol;
> > > > > + return 1;
> > > > > +}
> > > > >
> > > > > But this solution doesn't work, because the sock_create section is
> > > > > hooked at the end of inet_create(). It's too late to change the protocol
> > > > > id.
> > > > >
> > > > > 2. The second solution [4] is using "fentry":
> > > > >
> > > > > Implement a bpf_mptcpify() helper:
> > > > >
> > > > > +BPF_CALL_1(bpf_mptcpify, struct socket_args *, args)
> > > > > +{
> > > > > + if (args->family == AF_INET &&
> > > > > +     args->type == SOCK_STREAM &&
> > > > > +     (!args->protocol || args->protocol == IPPROTO_TCP))
> > > > > +         args->protocol = IPPROTO_MPTCP;
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > +BTF_ID_LIST(bpf_mptcpify_btf_ids)
> > > > > +BTF_ID(struct, socket_args)
> > > > > +
> > > > > +static const struct bpf_func_proto bpf_mptcpify_proto = {
> > > > > + .func           = bpf_mptcpify,
> > > > > + .gpl_only       = false,
> > > > > + .ret_type       = RET_INTEGER,
> > > > > + .arg1_type      = ARG_PTR_TO_BTF_ID,
> > > > > + .arg1_btf_id    = &bpf_mptcpify_btf_ids[0],
> > > > > +};
> > > > >
> > > > > Add a new wrapper socket_create() in sys_socket() path, passing a
> > > > > pointer of struct socket_args (int family; int type; int protocol) to
> > > > > the wrapper.
> > > > >
> > > > > +int socket_create(struct socket_args *args, struct socket **res)
> > > > > +{
> > > > > + return sock_create(args->family, args->type, args->protocol, res);
> > > > > +}
> > > > > +EXPORT_SYMBOL(socket_create);
> > > > > +
> > > > >  /**
> > > > >   *       sock_create_kern - creates a socket (kernel space)
> > > > >   *       @net: net namespace
> > > > > @@ -1608,6 +1614,7 @@  EXPORT_SYMBOL(sock_create_kern);
> > > > >
> > > > >  static struct socket *__sys_socket_create(int family, int type, int protocol)
> > > > >  {
> > > > > + struct socket_args args = { 0 };
> > > > >   struct socket *sock;
> > > > >   int retval;
> > > > >
> > > > > @@ -1621,7 +1628,10 @@  static struct socket *__sys_socket_create(int family, int type, int protocol)
> > > > >           return ERR_PTR(-EINVAL);
> > > > >   type &= SOCK_TYPE_MASK;
> > > > >
> > > > > - retval = sock_create(family, type, protocol, &sock);
> > > > > + args.family = family;
> > > > > + args.type = type;
> > > > > + args.protocol = protocol;
> > > > > + retval = socket_create(&args, &sock);
> > > > >   if (retval < 0)
> > > > >           return ERR_PTR(retval);
> > > > >
> > > > > Add "fentry" hook to the newly added wrapper, invoking bpf_mptcpify()
> > > > > in the hook:
> > > > >
> > > > > +SEC("fentry/socket_create")
> > > > > +int BPF_PROG(trace_socket_create, void *args,
> > > > > +         struct socket **res)
> > > > > +{
> > > > > + bpf_mptcpify(args);
> > > > > + return 0;
> > > > > +}
> > > > >
> > > > > This version works. But it's just a work around version. Since the code
> > > > > to add a wrapper to sys_socket() is not very elegant indeed, and it
> > > > > shouldn't be accepted by upstream.
> > > > >
> > > > > 3. The last solution is this patchset itself:
> > > > >
> > > > > Introduce new program type BPF_PROG_TYPE_CGROUP_SOCKINIT and attach type
> > > > > BPF_CGROUP_SOCKINIT on cgroup basis.
> > > > >
> > > > > Define BPF_CGROUP_RUN_PROG_SOCKINIT() helper, and implement
> > > > > __cgroup_bpf_run_sockinit() helper to run a sockinit program:
> > > > >
> > > > > +#define BPF_CGROUP_RUN_PROG_SOCKINIT(family, type, protocol)                    \
> > > > > +({                                                                              \
> > > > > + int __ret = 0;                                                         \
> > > > > + if (cgroup_bpf_enabled(CGROUP_SOCKINIT))                               \
> > > > > +         __ret = __cgroup_bpf_run_sockinit(family, type, protocol,      \
> > > > > +                                           CGROUP_SOCKINIT);            \
> > > > > + __ret;                                                                 \
> > > > > +})
> > > > > +
> > > > >
> > > > > Invoke BPF_CGROUP_RUN_PROG_SOCKINIT() in __socket_create() to change
> > > > > the arguments:
> > > > >
> > > > > @@ -1469,6 +1469,12 @@  int __sock_create(struct net *net, int family, int type, int protocol,
> > > > >   struct socket *sock;
> > > > >   const struct net_proto_family *pf;
> > > > >
> > > > > + if (!kern) {
> > > > > +         err = BPF_CGROUP_RUN_PROG_SOCKINIT(&family, &type, &protocol);
> > > > > +         if (err)
> > > > > +                 return err;
> > > > > + }
> > > > > +
> > > > >   /*
> > > > >    *      Check protocol is in range
> > > > >    */
> > > > >
> > > > > Define BPF program in this newly added 'sockinit' SEC, so it will be
> > > > > hooked in BPF_CGROUP_RUN_PROG_SOCKINIT() in __socket_create().
> > > > >
> > > > > @@ -158,6 +158,11 @@  static struct sec_name_test tests[] = {
> > > > >           {0, BPF_PROG_TYPE_CGROUP_SOCKOPT, BPF_CGROUP_SETSOCKOPT},
> > > > >           {0, BPF_CGROUP_SETSOCKOPT},
> > > > >   },
> > > > > + {
> > > > > +         "cgroup/sockinit",
> > > > > +         {0, BPF_PROG_TYPE_CGROUP_SOCKINIT, BPF_CGROUP_SOCKINIT},
> > > > > +         {0, BPF_CGROUP_SOCKINIT},
> > > > > + },
> > > > >  };
> > > > >
> > > > > +SEC("cgroup/sockinit")
> > > > > +int mptcpify(struct bpf_sockinit_ctx *ctx)
> > > > > +{
> > > > > + if ((ctx->family == AF_INET || ctx->family == AF_INET6) &&
> > > > > +     ctx->type == SOCK_STREAM &&
> > > > > +     (!ctx->protocol || ctx->protocol == IPPROTO_TCP)) {
> > > > > +         ctx->protocol = IPPROTO_MPTCP;
> > > > > + }
> > > > > +
> > > > > + return 1;
> > > > > +}
> > > > >
> > > > > This version is the best solution we have found so far.
> > > > >
> > > > > [1]
> > > > > https://github.com/multipath-tcp/mptcp_net-next/wiki
> > > > > [2]
> > > > > https://github.com/multipath-tcp/mptcp_net-next/issues/79
> > > > > [3]
> > > > > https://patchwork.kernel.org/project/mptcp/cover/cover.1688215769.git.geliang.tang@xxxxxxxx/
> > > > > [4]
> > > > > https://patchwork.kernel.org/project/mptcp/cover/cover.1688366249.git.geliang.tang@xxxxxxxx/
> > > >
> > > > cgroup/sock_create being weird and triggering late and only for af_inet4/6
> > > > was the reason we added ability to attach to lsm hooks on a per-cgroup
> > > > basis:
> > > > https://lore.kernel.org/bpf/20220628174314.1216643-1-sdf@xxxxxxxxxx/
> > > >
> > > > Unfortunately, using it here won't help either :-( The following
> > > > hook triggers early enough but doesn't allow changing the arguments (I was
> > > > interested only in filtering based on the arguments, not changing them):
> > > >
> > > > LSM_HOOK(int, 0, socket_create, int family, int type, int protocol, int kern)
> > > >
> > > > So maybe another alternative might be to change its definition to int *family,
> > > > int *type, int *protocol and use lsm_cgroup/socket_create to rewrite the
> > >
> > > Thanks Stanislav, this lsm_cgroup/socket_create works. The test patch
> > > is attached.
> > >
> > > > protocol? (some verifier changes might needed to make those writable)
> > >
> > > But I got some verification errors here:
> > >
> > >    invalid mem access 'scalar'.
> > >
> > > I tried many times but couldn't solve it, so I simply skipped the
> > > verifier in the test patch (I marked TODO in front of this code).
> > > Could you please give me some suggestions for verification?
> >
> > Right, that's the core issue here, to add support to the verifier
> > to let it write into scalar pointer arguments. I don't have a good suggestion,
> > but we've discussed it multiple times elsewhere that maybe we
> > should do it :-)
> >
> > Can we use some new btf_type_tag("allow_write") to annotate places
> > where we know that it's safe to write to them? Then we can extend
> > the verifier to check for this condition hopefully. Maybe other BPF
> > folks have better suggestions here?
> >
> > We should also CC LSM people to make sure it's not a no-no to
> > change LSM_HOOK(s) in a way to allow changing the arguments. I'm
> > not sure how rigid those definitions are :-(
> 
> imo all 3 options including this 4th one are too hacky.
> I understand ld_preload limitations and desire to have it per-cgroup,
> but messing this much with user space feels a little bit too much.
> What side effects will it cause?

Maybe all that is really needed is some new per-netns sysctl to automatically
upgrade from IPPROTO_TCP to IPPROTO_MPTCP? Or is it too broad of a
brush?

I've also CC'd netdev for visibility...

> Meaning is this enough to just change the proto?
> Nothing in user space later on needs to be aware the protocol is so different?

IIUC, if you use IPPROTO_MPTCP, you just get regular TCP until you start
adding extra routes (via netlink). That's why their current
unconditional IPPROTO_TCP->IPPROTO_MPTCP rewrite via ld_preload also somewhat
works.

> I feel the consequences are too drastic to support such thing
> through an official/stable hook.
> We can consider an fmod_ret unstable hook somewhere in the kernel
> that bpf prog can attach to and tweak the ret value and/or args,
> but the production environment won't be using it.
> It will be a temporary gap until user space is properly converted to mptcp.

Asking every app to do s/IPPROTO_TCP/IPPROTO_MPTCP/ might be annoying
though? (don't have a horse in this race, but have some v4->v6 migration
vibes from this)