On Tue, Jun 2, 2020 at 2:30 AM Jakub Sitnicki <jakub@xxxxxxxxxxxxxx> wrote: > > On Tue, Jun 02, 2020 at 12:30 AM CEST, Andrii Nakryiko wrote: > > On Sun, May 31, 2020 at 1:29 AM Jakub Sitnicki <jakub@xxxxxxxxxxxxxx> wrote: > >> > >> Extend bpf() syscall subcommands that operate on bpf_link, that is > >> LINK_CREATE, LINK_UPDATE, OBJ_GET_INFO, to accept attach types tied to > >> network namespaces (only flow dissector at the moment). > >> > >> Link-based and prog-based attachment can be used interchangeably, but only > >> one can exist at a time. Attempts to attach a link when a prog is already > >> attached directly, and the other way around, will be met with -EEXIST. > >> Attempts to detach a program when link exists result in -EINVAL. > >> > >> Attachment of multiple links of same attach type to one netns is not > >> supported with the intention to lift the restriction when a use-case > >> presents itself. Because of that link create returns -E2BIG when trying to > >> create another netns link, when one already exists. > >> > >> Link-based attachments to netns don't keep a netns alive by holding a ref > >> to it. Instead links get auto-detached from netns when the latter is being > >> destroyed, using a pernet pre_exit callback. > >> > >> When auto-detached, link lives in defunct state as long there are open FDs > >> for it. -ENOLINK is returned if a user tries to update a defunct link. > >> > >> Because bpf_link to netns doesn't hold a ref to struct net, special care is > >> taken when releasing, updating, or filling link info. The netns might be > >> getting torn down when any of these link operations are in progress. That > >> is why auto-detach and update/release/fill_info are synchronized by the > >> same mutex. Also, link ops have to always check if auto-detach has not > >> happened yet and if netns is still alive (refcnt > 0). > >> > >> Signed-off-by: Jakub Sitnicki <jakub@xxxxxxxxxxxxxx> > >> --- > >> include/linux/bpf-netns.h | 8 ++ > >> include/linux/bpf_types.h | 3 + > >> include/net/netns/bpf.h | 1 + > >> include/uapi/linux/bpf.h | 5 + > >> kernel/bpf/net_namespace.c | 244 ++++++++++++++++++++++++++++++++- > >> kernel/bpf/syscall.c | 3 + > >> tools/include/uapi/linux/bpf.h | 5 + > >> 7 files changed, 267 insertions(+), 2 deletions(-) > >> > > > > [...] > > > >> + > >> +static int bpf_netns_link_update_prog(struct bpf_link *link, > >> + struct bpf_prog *new_prog, > >> + struct bpf_prog *old_prog) > >> +{ > >> + struct bpf_netns_link *net_link = > >> + container_of(link, struct bpf_netns_link, link); > >> + enum netns_bpf_attach_type type = net_link->netns_type; > >> + struct net *net; > >> + int ret = 0; > >> + > >> + if (old_prog && old_prog != link->prog) > >> + return -EPERM; > >> + if (new_prog->type != link->prog->type) > >> + return -EINVAL; > >> + > >> + mutex_lock(&netns_bpf_mutex); > >> + > >> + net = net_link->net; > >> + if (!net || !check_net(net)) { > > > > As is, this check_net() check looks very racy. Because if we do worry > > about net refcnt dropping to zero, then between check_net() and > > accessing net fields that can happen. So if that's a possiblity, you > > should probably instead do maybe_get_net() instead. > > > > But on the other hand, if we established that auto-detach taking a > > mutex protects us from net going away, then maybe we shouldn't worry > > at all about that, and thus check_net() is unnecessary and just > > unnecessarily confusing everything. > > > > I don't know enough overall net lifecycle, so I'm not sure which one > > it is. But the way it is right now still looks suspicious to me. > > The story behind the additional "!check_net(net)" test (in update_prog > and in fill_info) is that without it, user-space will see the link as > defunct only after some delay from the moment last ref to netns is gone > (for instance, last process left the netns). > > That is because netns is being destroyed from a workqueue [0]. Ok, so it's not strictly necessary in terms of correctness, but just nicer behaviro and it's still safe to access net, even if its refcnt drops to zero meanwhile. Ok, thanks for elaborating. > > This unpredictable delay makes testing the uAPI harder, and I expect > using it as well. Since we can do better and check the refcnt to detect > "early" that netns is no good any more, that's what we do. > > At least that was my thinking here. I was hoping the comment below the > check would be enough. Didn't mean to cause confusion. > > So there is no race, the locking scheme you suggested holds. > > [0] https://elixir.bootlin.com/linux/latest/source/net/core/net_namespace.c#L644 > > > > >> + /* Link auto-detached or netns dying */ > >> + ret = -ENOLINK; > >> + goto out_unlock; > >> + } > >> + > >> + old_prog = xchg(&link->prog, new_prog); > >> + rcu_assign_pointer(net->bpf.progs[type], new_prog); > >> + bpf_prog_put(old_prog); > >> + > >> +out_unlock: > >> + mutex_unlock(&netns_bpf_mutex); > >> + return ret; > >> +} > >> + > >> +static int bpf_netns_link_fill_info(const struct bpf_link *link, > >> + struct bpf_link_info *info) > >> +{ > >> + const struct bpf_netns_link *net_link = > >> + container_of(link, struct bpf_netns_link, link); > >> + unsigned int inum = 0; > >> + struct net *net; > >> + > >> + mutex_lock(&netns_bpf_mutex); > >> + net = net_link->net; > >> + if (net && check_net(net)) > >> + inum = net->ns.inum; > >> + mutex_unlock(&netns_bpf_mutex); > >> + > >> + info->netns.netns_ino = inum; > >> + info->netns.attach_type = net_link->type; > >> + return 0; > >> +} > >> + > >> +static void bpf_netns_link_show_fdinfo(const struct bpf_link *link, > >> + struct seq_file *seq) > >> +{ > >> + struct bpf_link_info info = {}; > > > > initialization here is probably not necessary, as long as you access > > only fields that fill_info initializes. > > True. I figured that better safe than leaking stack contents, if > bpf_netns_link_fill_info gets broken. > > > > >> + > >> + bpf_netns_link_fill_info(link, &info); > >> + seq_printf(seq, > >> + "netns_ino:\t%u\n" > >> + "attach_type:\t%u\n", > >> + info.netns.netns_ino, > >> + info.netns.attach_type); > >> +} > >> + > > > > [...]