On Sat, 2024-09-14 at 16:40 +0800, Geliang Tang wrote: > Hi Martin, Andrii, Matt, > > On Fri, 2024-09-13 at 17:41 -0700, Martin KaFai Lau wrote: > > On 9/13/24 1:57 PM, Andrii Nakryiko wrote: > > > > > > +__bpf_kfunc int bpf_iter_mptcp_subflow_new(struct > > > > > > bpf_iter_mptcp_subflow *it, > > > > > > + struct > > > > > > mptcp_sock > > > > > > *msk) > > > > > > +{ > > > > > > + struct bpf_iter_mptcp_subflow_kern *kit = (void > > > > > > *)it; > > > > > > + > > > > > > + kit->msk = msk; > > > > > > + if (!msk) > > > > > > + return -EINVAL; > > > > > > + > > > > > > + kit->pos = &msk->conn_list; > > > > > > + return 0; > > > > > > +} > > > > [ ... ] > > > > > > > > BTF_KFUNCS_START(bpf_mptcp_sched_kfunc_ids) > > > > > > +BTF_ID_FLAGS(func, bpf_iter_mptcp_subflow_new) > > > > > > > > > > I'm not 100% sure, but I suspect you might need to specify > > > > > KF_TRUSTED_ARGS here to ensure that `struct mptcp_sock *msk` > > > > > is > > > > > a > > > > +1 > > So we must add KF_TRUSTED_ARGS flag, right? > > > > > > > > > @@ -241,6 +286,8 @@ static int __init > > > > > > bpf_mptcp_kfunc_init(void) > > > > > > int ret; > > > > > > > > > > > > ret = > > > > > > register_btf_fmodret_id_set(&bpf_mptcp_fmodret_set); > > > > > > + ret = ret ?: > > > > > > register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, > > > > > > + > > > > > > &bpf_mptcp_sched_kfunc_set); > > > > This cannot be used in tracing. > > Actually, we don’t need to use mptcp_subflow bpf_iter in tracing. > > We plan to use it in MPTCP BPF packet schedulers, which are not > tracing, but "struct_ops" types. And they work well with > KF_TRUSTED_ARGS flag in bpf_iter_mptcp_subflow_new: > > BTF_ID_FLAGS(func, bpf_iter_mptcp_subflow_new, KF_ITER_NEW | > KF_TRUSTED_ARGS); > > An example of the scheduler is: > > SEC("struct_ops") > int BPF_PROG(bpf_first_get_subflow, struct mptcp_sock *msk, > struct mptcp_sched_data *data) > { > struct mptcp_subflow_context *subflow; > > bpf_rcu_read_lock(); > bpf_for_each(mptcp_subflow, subflow, msk) { > mptcp_subflow_set_scheduled(subflow, true); > break; > } > bpf_rcu_read_unlock(); > > return 0; > } > > SEC(".struct_ops") > struct mptcp_sched_ops first = { > .init = (void *)mptcp_sched_first_init, > .release = (void *)mptcp_sched_first_release, > .get_subflow = (void *)bpf_first_get_subflow, > .name = "bpf_first", > }; > > But BPF mptcp_sched_ops code has not been merged into bpf-next yet, > so > I simply test this bpf_for_each(mptcp_subflow) in tracing since I > noticed other bpf_iter selftests are using tracing too: > > progs/iters_task.c > SEC("fentry.s/" SYS_PREFIX "sys_getpgid") > > progs/iters_css.c > SEC("fentry.s/" SYS_PREFIX "sys_getpgid") > > If this bpf_for_each(mptcp_subflow) can only be used in struct_ops, I > will try to move the selftest into a struct_ops. > > > > > Going back to my earlier question in v1. How is the msk->conn_list > > protected? > > > > msk->conn_list is protected by msk socket lock. (@Matt, am I right?) > We > use this in kernel code: > > struct sock *sk = (struct sock *)msk; > > lock_sock(sk); > kfunc(&msk->conn_list); > release_sock(sk); > > If so, should we also use lock_sock/release_sock in > bpf_iter_mptcp_subflow_next()? bpf_for_each(mptcp_subflow) is expected to be used in the get_subflow() interface of mptcp_sched_ops to traverse all subflows and then pick one subflow to send data. This interface is invoked in mptcp_sched_get_send(), here the msk socket lock is hold already: int mptcp_sched_get_send(struct mptcp_sock *msk) { struct mptcp_subflow_context *subflow; struct mptcp_sched_data data; msk_owned_by_me(msk); ... ... mptcp_for_each_subflow(msk, subflow) { if (READ_ONCE(subflow->scheduled)) return 0; } data.reinject = false; if (msk->sched == &mptcp_sched_default || !msk->sched) return mptcp_sched_default_get_subflow(msk, &data); return msk->sched->get_subflow(msk, &data); } So no need to hold msk socket lock again in BPF schedulers. This means we can walk the conn_list without any lock. bpf_rcu_read_lock() and bpf_rcu_read_unlock() can be dropped in BPF schedulers too. Am I right? Thanks, -Geliang