On Wed, Jun 24, 2020 at 11:16 AM Jakub Sitnicki <jakub@xxxxxxxxxxxxxx> wrote: > > On Wed, Jun 24, 2020 at 07:47 PM CEST, Andrii Nakryiko wrote: > > On Wed, Jun 24, 2020 at 10:19 AM Jakub Sitnicki <jakub@xxxxxxxxxxxxxx> wrote: > >> > >> On Tue, Jun 23, 2020 at 12:34 PM CEST, Jakub Sitnicki wrote: > >> > Prepare for having multi-prog attachments for new netns attach types by > >> > storing programs to run in a bpf_prog_array, which is well suited for > >> > iterating over programs and running them in sequence. > >> > > >> > Because bpf_prog_array is dynamically resized, after this change a > >> > potentially blocking memory allocation in bpf(PROG_QUERY) callback can > >> > happen, in order to collect program IDs before copying the values to > >> > user-space supplied buffer. This forces us to adapt how we protect access > >> > to the attached program in the callback. As bpf_prog_array_copy_to_user() > >> > helper can sleep, we switch from an RCU read lock to holding a mutex that > >> > serializes updaters. > >> > > >> > To handle bpf(PROG_ATTACH) scenario when we are replacing an already > >> > attached program, we introduce a new bpf_prog_array helper called > >> > bpf_prog_array_replace_item that will exchange the old program with a new > >> > one. bpf-cgroup does away with such helper by computing an index into the > >> > array from a program position in an external list of attached > >> > programs/links. Such approach fails when a dummy prog is left in the array > >> > after a memory allocation failure on link release, but is necessary in > >> > bpf-cgroup case because the same BPF program can be present in the array > >> > multiple times due to inheritance, and attachment cannot be reliably > >> > identified by bpf_prog pointer comparison. > >> > > >> > No functional changes intended. > >> > > >> > Acked-by: Andrii Nakryiko <andriin@xxxxxx> > >> > Signed-off-by: Jakub Sitnicki <jakub@xxxxxxxxxxxxxx> > >> > --- > >> > include/linux/bpf.h | 3 + > >> > include/net/netns/bpf.h | 5 +- > >> > kernel/bpf/core.c | 20 ++++-- > >> > kernel/bpf/net_namespace.c | 137 +++++++++++++++++++++++++++---------- > >> > net/core/flow_dissector.c | 21 +++--- > >> > 5 files changed, 132 insertions(+), 54 deletions(-) > >> > > >> > >> [...] > >> > >> > diff --git a/kernel/bpf/net_namespace.c b/kernel/bpf/net_namespace.c > >> > index b951dab2687f..593523a22168 100644 > >> > --- a/kernel/bpf/net_namespace.c > >> > +++ b/kernel/bpf/net_namespace.c > >> > >> [...] > >> > >> > @@ -93,8 +108,16 @@ static int bpf_netns_link_update_prog(struct bpf_link *link, > >> > goto out_unlock; > >> > } > >> > > >> > + run_array = rcu_dereference_protected(net->bpf.run_array[type], > >> > + lockdep_is_held(&netns_bpf_mutex)); > >> > + if (run_array) > >> > + ret = bpf_prog_array_replace_item(run_array, link->prog, new_prog); > >> > >> Thinking about this some more, link update should fail with -EINVAL if > >> new_prog already exists in run_array. Same as PROG_ATTACH fails with > >> -EINVAL when trying to attach the same prog for the second time. > >> > >> Otherwise, LINK_UPDATE can lead to having same BPF prog present multiple > >> times in the prog_array, once attaching more than one prog gets enabled. > >> > >> Then we would we end up with the same challenge as bpf-cgroup, that is > >> how to find the program index into the prog_array in presence of > >> dummy_prog's. > > > > If you attach 5 different links having the same bpf_prog, it should be > > allowed and all five bpf_progs should be attached and called 5 times. > > They are independent links, that's the main thing. What specific BPF > > program is attached by the link (or later updated to) shouldn't be of > > any concern here (relative to other attached links/programs). > > > > Attaching the same *link* twice shouldn't be allowed, though. > > Thanks for clarifying. I need to change the approach then: > > 1) find the prog index into prog_array by iterating the list of links, > 2) adjust the index for any dummy progs in prog_array below the index. > > That might work for bpf-cgroup too. > I thought that's what bpf-cgroup already does... Except right now there could be no dummy progs, but if we do non-failing detachment, there might be. Then hierarchies can get out of sync and we need to handle that nicely. It's not super straightforward, that's why I said that it's a nice challenge to consider :) But here we don't have hierarchy, it's a happy place to be in :) > The only other alternative I can think of it to copy the prog array to > filter out dummy_progs, before replacing the prog on link update. Probably over-complication. I'd filter dummy progs only on new link attachment or detachment. Update can be kept very simple. > > > > >> > >> > + else > >> > + ret = -ENOENT; > >> > + if (ret) > >> > + goto out_unlock; > >> > + > >> > old_prog = xchg(&link->prog, new_prog); > >> > - rcu_assign_pointer(net->bpf.progs[type], new_prog); > >> > bpf_prog_put(old_prog); > >> > > >> > out_unlock: > >> > >> [...] > >> > >> > @@ -217,14 +249,25 @@ int netns_bpf_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog) > >> > if (ret) > >> > goto out_unlock; > >> > > >> > - attached = rcu_dereference_protected(net->bpf.progs[type], > >> > - lockdep_is_held(&netns_bpf_mutex)); > >> > + attached = net->bpf.progs[type]; > >> > if (attached == prog) { > >> > /* The same program cannot be attached twice */ > >> > ret = -EINVAL; > >> > goto out_unlock; > >> > } > >> > - rcu_assign_pointer(net->bpf.progs[type], prog); > >> > + > >> > + run_array = rcu_dereference_protected(net->bpf.run_array[type], > >> > + lockdep_is_held(&netns_bpf_mutex)); > >> > + if (run_array) { > >> > + ret = bpf_prog_array_replace_item(run_array, attached, prog); > >> > >> I didn't consider here that there can be a run_array with a dummy_prog > >> from a link release that failed to allocate memory. > >> > >> In such case bpf_prog_array_replace_item will fail, while we actually > >> want to replace the dummy_prog. > >> > >> The right thing to do is to replace the first item in prog array: > >> > >> if (run_array) { > >> WRITE_ONCE(run_array->items[0].prog, prog); > >> } else { > >> /* allocate a bpf_prog_array */ > >> } > >> > >> This leaves just one user of bpf_prog_array_replace_item(), so I think > >> I'm just going to fold it into its only caller, that is the update_prog > >> callback. > > > > That will change relative order of BPF programs, which I think is bad. > > So I agree that bpf_prog_array_replace_item is not all that useful and > > probably should be dropped. And the right way is to know the position > > of bpf_prog you are trying to replace/delete, just like cgroup case. > > Except cgroup case is even more complicated due to inheritance and > > hierarchy, which luckily you don't have to deal with here. > > I think we are on the same page. > > The write to first item could change the relative prog order, and even > replace the wrong program, but only if we had to deal with a prog_array > larger > 1. > > In this case, direct attachment with PROG_ATTACH to netns, the > prog_array size is always 1, if it has been allocated already. > > That is because I did not plan to enable multi-prog attachment for > flow_dissector with links. And only flow_dissector uses PROG_ATTACH to > netns. I was thinking in general for multi-link case, yeah. For this case, sure. > > > > >> > >> > + } else { > >> > + ret = bpf_prog_array_copy(NULL, NULL, prog, &run_array); > >> > + rcu_assign_pointer(net->bpf.run_array[type], run_array); > >> > + } > >> > + if (ret) > >> > + goto out_unlock; > >> > + > >> > + net->bpf.progs[type] = prog; > >> > if (attached) > >> > bpf_prog_put(attached); > >> > > >> > >> [...]