On 07/13, Alexei Starovoitov wrote: > On Thu, Jul 13, 2023 at 10:14 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote: > > > > On 07/13, Geliang Tang wrote: > > > On Thu, Jul 06, 2023 at 10:02:43AM -0700, Stanislav Fomichev wrote: > > > > On 07/06, Geliang Tang wrote: > > > > > As is described in the "How to use MPTCP?" section in MPTCP wiki [1]: > > > > > > > > > > "Your app can create sockets with IPPROTO_MPTCP as the proto: > > > > > ( socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP); ). Legacy apps can be > > > > > forced to create and use MPTCP sockets instead of TCP ones via the > > > > > mptcpize command bundled with the mptcpd daemon." > > > > > > > > > > But the mptcpize (LD_PRELOAD technique) command has some limitations > > > > > [2]: > > > > > > > > > > - it doesn't work if the application is not using libc (e.g. GoLang > > > > > apps) > > > > > - in some envs, it might not be easy to set env vars / change the way > > > > > apps are launched, e.g. on Android > > > > > - mptcpize needs to be launched with all apps that want MPTCP: we could > > > > > have more control from BPF to enable MPTCP only for some apps or all the > > > > > ones of a netns or a cgroup, etc. > > > > > - it is not in BPF, we cannot talk about it at netdev conf. > > > > > > > > > > So this patchset attempts to use BPF to implement functions similer to > > > > > mptcpize. > > > > > > > > > > The main idea is add a hook in sys_socket() to change the protocol id > > > > > from IPPROTO_TCP (or 0) to IPPROTO_MPTCP. > > > > > > > > > > 1. This first solution [3] is using "cgroup/sock_create": > > > > > > > > > > Implement a new helper bpf_mptcpify() to change the protocol id: > > > > > > > > > > +BPF_CALL_1(bpf_mptcpify, struct sock *, sk) > > > > > +{ > > > > > + if (sk && sk_fullsock(sk) && sk->sk_protocol == IPPROTO_TCP) { > > > > > + sk->sk_protocol = IPPROTO_MPTCP; > > > > > + return (unsigned long)sk; > > > > > + } > > > > > + > > > > > + return (unsigned long)NULL; > > > > > +} > > > > > + > > > > > +const struct bpf_func_proto bpf_mptcpify_proto = { > > > > > + .func = bpf_mptcpify, > > > > > + .gpl_only = false, > > > > > + .ret_type = RET_PTR_TO_BTF_ID_OR_NULL, > > > > > + .ret_btf_id = &btf_sock_ids[BTF_SOCK_TYPE_SOCK], > > > > > + .arg1_type = ARG_PTR_TO_CTX, > > > > > +}; > > > > > > > > > > Add a hook in "cgroup/sock_create" section, invoking bpf_mptcpify() > > > > > helper in this hook: > > > > > > > > > > +SEC("cgroup/sock_create") > > > > > +int sock(struct bpf_sock *ctx) > > > > > +{ > > > > > + struct sock *sk; > > > > > + > > > > > + if (ctx->type != SOCK_STREAM) > > > > > + return 1; > > > > > + > > > > > + sk = bpf_mptcpify(ctx); > > > > > + if (!sk) > > > > > + return 1; > > > > > + > > > > > + protocol = sk->sk_protocol; > > > > > + return 1; > > > > > +} > > > > > > > > > > But this solution doesn't work, because the sock_create section is > > > > > hooked at the end of inet_create(). It's too late to change the protocol > > > > > id. > > > > > > > > > > 2. The second solution [4] is using "fentry": > > > > > > > > > > Implement a bpf_mptcpify() helper: > > > > > > > > > > +BPF_CALL_1(bpf_mptcpify, struct socket_args *, args) > > > > > +{ > > > > > + if (args->family == AF_INET && > > > > > + args->type == SOCK_STREAM && > > > > > + (!args->protocol || args->protocol == IPPROTO_TCP)) > > > > > + args->protocol = IPPROTO_MPTCP; > > > > > + > > > > > + return 0; > > > > > +} > > > > > + > > > > > +BTF_ID_LIST(bpf_mptcpify_btf_ids) > > > > > +BTF_ID(struct, socket_args) > > > > > + > > > > > +static const struct bpf_func_proto bpf_mptcpify_proto = { > > > > > + .func = bpf_mptcpify, > > > > > + .gpl_only = false, > > > > > + .ret_type = RET_INTEGER, > > > > > + .arg1_type = ARG_PTR_TO_BTF_ID, > > > > > + .arg1_btf_id = &bpf_mptcpify_btf_ids[0], > > > > > +}; > > > > > > > > > > Add a new wrapper socket_create() in sys_socket() path, passing a > > > > > pointer of struct socket_args (int family; int type; int protocol) to > > > > > the wrapper. > > > > > > > > > > +int socket_create(struct socket_args *args, struct socket **res) > > > > > +{ > > > > > + return sock_create(args->family, args->type, args->protocol, res); > > > > > +} > > > > > +EXPORT_SYMBOL(socket_create); > > > > > + > > > > > /** > > > > > * sock_create_kern - creates a socket (kernel space) > > > > > * @net: net namespace > > > > > @@ -1608,6 +1614,7 @@ EXPORT_SYMBOL(sock_create_kern); > > > > > > > > > > static struct socket *__sys_socket_create(int family, int type, int protocol) > > > > > { > > > > > + struct socket_args args = { 0 }; > > > > > struct socket *sock; > > > > > int retval; > > > > > > > > > > @@ -1621,7 +1628,10 @@ static struct socket *__sys_socket_create(int family, int type, int protocol) > > > > > return ERR_PTR(-EINVAL); > > > > > type &= SOCK_TYPE_MASK; > > > > > > > > > > - retval = sock_create(family, type, protocol, &sock); > > > > > + args.family = family; > > > > > + args.type = type; > > > > > + args.protocol = protocol; > > > > > + retval = socket_create(&args, &sock); > > > > > if (retval < 0) > > > > > return ERR_PTR(retval); > > > > > > > > > > Add "fentry" hook to the newly added wrapper, invoking bpf_mptcpify() > > > > > in the hook: > > > > > > > > > > +SEC("fentry/socket_create") > > > > > +int BPF_PROG(trace_socket_create, void *args, > > > > > + struct socket **res) > > > > > +{ > > > > > + bpf_mptcpify(args); > > > > > + return 0; > > > > > +} > > > > > > > > > > This version works. But it's just a work around version. Since the code > > > > > to add a wrapper to sys_socket() is not very elegant indeed, and it > > > > > shouldn't be accepted by upstream. > > > > > > > > > > 3. The last solution is this patchset itself: > > > > > > > > > > Introduce new program type BPF_PROG_TYPE_CGROUP_SOCKINIT and attach type > > > > > BPF_CGROUP_SOCKINIT on cgroup basis. > > > > > > > > > > Define BPF_CGROUP_RUN_PROG_SOCKINIT() helper, and implement > > > > > __cgroup_bpf_run_sockinit() helper to run a sockinit program: > > > > > > > > > > +#define BPF_CGROUP_RUN_PROG_SOCKINIT(family, type, protocol) \ > > > > > +({ \ > > > > > + int __ret = 0; \ > > > > > + if (cgroup_bpf_enabled(CGROUP_SOCKINIT)) \ > > > > > + __ret = __cgroup_bpf_run_sockinit(family, type, protocol, \ > > > > > + CGROUP_SOCKINIT); \ > > > > > + __ret; \ > > > > > +}) > > > > > + > > > > > > > > > > Invoke BPF_CGROUP_RUN_PROG_SOCKINIT() in __socket_create() to change > > > > > the arguments: > > > > > > > > > > @@ -1469,6 +1469,12 @@ int __sock_create(struct net *net, int family, int type, int protocol, > > > > > struct socket *sock; > > > > > const struct net_proto_family *pf; > > > > > > > > > > + if (!kern) { > > > > > + err = BPF_CGROUP_RUN_PROG_SOCKINIT(&family, &type, &protocol); > > > > > + if (err) > > > > > + return err; > > > > > + } > > > > > + > > > > > /* > > > > > * Check protocol is in range > > > > > */ > > > > > > > > > > Define BPF program in this newly added 'sockinit' SEC, so it will be > > > > > hooked in BPF_CGROUP_RUN_PROG_SOCKINIT() in __socket_create(). > > > > > > > > > > @@ -158,6 +158,11 @@ static struct sec_name_test tests[] = { > > > > > {0, BPF_PROG_TYPE_CGROUP_SOCKOPT, BPF_CGROUP_SETSOCKOPT}, > > > > > {0, BPF_CGROUP_SETSOCKOPT}, > > > > > }, > > > > > + { > > > > > + "cgroup/sockinit", > > > > > + {0, BPF_PROG_TYPE_CGROUP_SOCKINIT, BPF_CGROUP_SOCKINIT}, > > > > > + {0, BPF_CGROUP_SOCKINIT}, > > > > > + }, > > > > > }; > > > > > > > > > > +SEC("cgroup/sockinit") > > > > > +int mptcpify(struct bpf_sockinit_ctx *ctx) > > > > > +{ > > > > > + if ((ctx->family == AF_INET || ctx->family == AF_INET6) && > > > > > + ctx->type == SOCK_STREAM && > > > > > + (!ctx->protocol || ctx->protocol == IPPROTO_TCP)) { > > > > > + ctx->protocol = IPPROTO_MPTCP; > > > > > + } > > > > > + > > > > > + return 1; > > > > > +} > > > > > > > > > > This version is the best solution we have found so far. > > > > > > > > > > [1] > > > > > https://github.com/multipath-tcp/mptcp_net-next/wiki > > > > > [2] > > > > > https://github.com/multipath-tcp/mptcp_net-next/issues/79 > > > > > [3] > > > > > https://patchwork.kernel.org/project/mptcp/cover/cover.1688215769.git.geliang.tang@xxxxxxxx/ > > > > > [4] > > > > > https://patchwork.kernel.org/project/mptcp/cover/cover.1688366249.git.geliang.tang@xxxxxxxx/ > > > > > > > > cgroup/sock_create being weird and triggering late and only for af_inet4/6 > > > > was the reason we added ability to attach to lsm hooks on a per-cgroup > > > > basis: > > > > https://lore.kernel.org/bpf/20220628174314.1216643-1-sdf@xxxxxxxxxx/ > > > > > > > > Unfortunately, using it here won't help either :-( The following > > > > hook triggers early enough but doesn't allow changing the arguments (I was > > > > interested only in filtering based on the arguments, not changing them): > > > > > > > > LSM_HOOK(int, 0, socket_create, int family, int type, int protocol, int kern) > > > > > > > > So maybe another alternative might be to change its definition to int *family, > > > > int *type, int *protocol and use lsm_cgroup/socket_create to rewrite the > > > > > > Thanks Stanislav, this lsm_cgroup/socket_create works. The test patch > > > is attached. > > > > > > > protocol? (some verifier changes might needed to make those writable) > > > > > > But I got some verification errors here: > > > > > > invalid mem access 'scalar'. > > > > > > I tried many times but couldn't solve it, so I simply skipped the > > > verifier in the test patch (I marked TODO in front of this code). > > > Could you please give me some suggestions for verification? > > > > Right, that's the core issue here, to add support to the verifier > > to let it write into scalar pointer arguments. I don't have a good suggestion, > > but we've discussed it multiple times elsewhere that maybe we > > should do it :-) > > > > Can we use some new btf_type_tag("allow_write") to annotate places > > where we know that it's safe to write to them? Then we can extend > > the verifier to check for this condition hopefully. Maybe other BPF > > folks have better suggestions here? > > > > We should also CC LSM people to make sure it's not a no-no to > > change LSM_HOOK(s) in a way to allow changing the arguments. I'm > > not sure how rigid those definitions are :-( > > imo all 3 options including this 4th one are too hacky. > I understand ld_preload limitations and desire to have it per-cgroup, > but messing this much with user space feels a little bit too much. > What side effects will it cause? Maybe all that is really needed is some new per-netns sysctl to automatically upgrade from IPPROTO_TCP to IPPROTO_MPTCP? Or is it too broad of a brush? I've also CC'd netdev for visibility... > Meaning is this enough to just change the proto? > Nothing in user space later on needs to be aware the protocol is so different? IIUC, if you use IPPROTO_MPTCP, you just get regular TCP until you start adding extra routes (via netlink). That's why their current unconditional IPPROTO_TCP->IPPROTO_MPTCP rewrite via ld_preload also somewhat works. > I feel the consequences are too drastic to support such thing > through an official/stable hook. > We can consider an fmod_ret unstable hook somewhere in the kernel > that bpf prog can attach to and tweak the ret value and/or args, > but the production environment won't be using it. > It will be a temporary gap until user space is properly converted to mptcp. Asking every app to do s/IPPROTO_TCP/IPPROTO_MPTCP/ might be annoying though? (don't have a horse in this race, but have some v4->v6 migration vibes from this)