On Tue, 11 Jun 2019 at 10:42, Ilya Maximets <i.maximets@xxxxxxxxxxx> wrote: > > On 11.06.2019 11:09, Björn Töpel wrote: > > On Mon, 10 Jun 2019 at 22:49, Jonathan Lemon <jonathan.lemon@xxxxxxxxx> wrote: > >> > >> On 10 Jun 2019, at 9:15, Ilya Maximets wrote: > >> > >>> Device that bound to XDP socket will not have zero refcount until the > >>> userspace application will not close it. This leads to hang inside > >>> 'netdev_wait_allrefs()' if device unregistering requested: > >>> > >>> # ip link del p1 > >>> < hang on recvmsg on netlink socket > > >>> > >>> # ps -x | grep ip > >>> 5126 pts/0 D+ 0:00 ip link del p1 > >>> > >>> # journalctl -b > >>> > >>> Jun 05 07:19:16 kernel: > >>> unregister_netdevice: waiting for p1 to become free. Usage count = 1 > >>> > >>> Jun 05 07:19:27 kernel: > >>> unregister_netdevice: waiting for p1 to become free. Usage count = 1 > >>> ... > >>> > >>> Fix that by implementing NETDEV_UNREGISTER event notification handler > >>> to properly clean up all the resources and unref device. > >>> > >>> This should also allow socket killing via ss(8) utility. > >>> > >>> Fixes: 965a99098443 ("xsk: add support for bind for Rx") > >>> Signed-off-by: Ilya Maximets <i.maximets@xxxxxxxxxxx> > >>> --- > >>> > >>> Version 3: > >>> > >>> * Declaration lines ordered from longest to shortest. > >>> * Checking of event type moved to the top to avoid unnecessary > >>> locking. > >>> > >>> Version 2: > >>> > >>> * Completely re-implemented using netdev event handler. > >>> > >>> net/xdp/xsk.c | 65 > >>> ++++++++++++++++++++++++++++++++++++++++++++++++++- > >>> 1 file changed, 64 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > >>> index a14e8864e4fa..273a419a8c4d 100644 > >>> --- a/net/xdp/xsk.c > >>> +++ b/net/xdp/xsk.c > >>> @@ -693,6 +693,57 @@ static int xsk_mmap(struct file *file, struct > >>> socket *sock, > >>> size, vma->vm_page_prot); > >>> } > >>> > >>> +static int xsk_notifier(struct notifier_block *this, > >>> + unsigned long msg, void *ptr) > >>> +{ > >>> + struct net_device *dev = netdev_notifier_info_to_dev(ptr); > >>> + struct net *net = dev_net(dev); > >>> + int i, unregister_count = 0; > >>> + struct sock *sk; > >>> + > >>> + switch (msg) { > >>> + case NETDEV_UNREGISTER: > >>> + mutex_lock(&net->xdp.lock); > >> > >> The call is under the rtnl lock, and we're not modifying > >> the list, so this mutex shouldn't be needed. > >> > > > > The list can, however, be modified outside the rtnl lock (e.g. at > > socket creation). AFAIK the hlist cannot be traversed lock-less, > > right? > > We could use 'rcu_read_lock' instead and iterate with 'sk_for_each_rcu', > but we'll not be able to synchronize inside. > > > > >> > >>> + sk_for_each(sk, &net->xdp.list) { > >>> + struct xdp_sock *xs = xdp_sk(sk); > >>> + > >>> + mutex_lock(&xs->mutex); > >>> + if (dev != xs->dev) { > >>> + mutex_unlock(&xs->mutex); > >>> + continue; > >>> + } > >>> + > >>> + sk->sk_err = ENETDOWN; > >>> + if (!sock_flag(sk, SOCK_DEAD)) > >>> + sk->sk_error_report(sk); > >>> + > >>> + /* Wait for driver to stop using the xdp socket. */ > >>> + xdp_del_sk_umem(xs->umem, xs); > >>> + xs->dev = NULL; > >>> + synchronize_net(); > >> Isn't this by handled by the unregister_count case below? > >> > > > > To clarify, setting dev to NULL and xdp_del_sk_umem() + sync makes > > sure that a driver doesn't touch the Tx and Rx rings. Nothing can be > > assumed about completion + fill ring (umem), until zero-copy has been > > disabled via ndo_bpf. > > > >>> + > >>> + /* Clear device references in umem. */ > >>> + xdp_put_umem(xs->umem); > >>> + xs->umem = NULL; > >> > >> This makes me uneasy. We need to unregister the umem from > >> the device (xdp_umem_clear_dev()) but this can remove the umem > >> pages out from underneath the xsk. > >> > > > > Yes, this is scary. The socket is alive, and userland typically has > > the fill/completion rings mmapped. Then the umem refcount is decreased > > and can potentially free the umem (fill rings etc.), as Jonathan says, > > underneath the xsk. Also, setting the xs umem/dev to zero, while the > > socket is alive, would allow a user to re-setup the socket, which we > > don't want to allow. > > > >> Perhaps what's needed here is the equivalent of an unbind() > >> call that just detaches the umem/sk from the device, but does > >> not otherwise tear them down. > >> > > > > Yeah, I agree. A detached/zombie state is needed during the socket lifetime. > > > I could try to rip the 'xdp_umem_release()' apart so the 'xdp_umem_clear_dev()' > could be called separately. This will allow to not tear down the 'umem'. > However, it seems that it'll not be easy to synchronize all parts. > Any suggestions are welcome. > Thanks for continuing to work on this, Ilya. What need to be done is exactly an "unbind()", i.e. returning the socket to the state prior bind, but disallowing any changes from userland (e.g. setsockopt/bind). So, unbind() + track that we're in "unbound" mode. :-) I think breaking up xdp_umem_release() is good way to go. > Also, there is no way to not clear the 'dev' as we have to put the reference. > Maybe we could add the additional check to 'xsk_bind()' for current device > state (dev->reg_state == NETREG_REGISTERED). This will allow us to avoid > re-setup of the socket. > Yes, and also make sure that the rest of the syscall implementations don't allow for re-setup. Björn > > > >> > >>> + mutex_unlock(&xs->mutex); > >>> + unregister_count++; > >>> + } > >>> + mutex_unlock(&net->xdp.lock); > >>> + > >>> + if (unregister_count) { > >>> + /* Wait for umem clearing completion. */ > >>> + synchronize_net(); > >>> + for (i = 0; i < unregister_count; i++) > >>> + dev_put(dev); > >>> + } > >>> + > >>> + break; > >>> + } > >>> + > >>> + return NOTIFY_DONE; > >>> +} > >>> + > >>> static struct proto xsk_proto = { > >>> .name = "XDP", > >>> .owner = THIS_MODULE, > >>> @@ -727,7 +778,8 @@ static void xsk_destruct(struct sock *sk) > >>> if (!sock_flag(sk, SOCK_DEAD)) > >>> return; > >>> > >>> - xdp_put_umem(xs->umem); > >>> + if (xs->umem) > >>> + xdp_put_umem(xs->umem); > >> Not needed - xdp_put_umem() already does a null check. > > Indeed. Thanks. > > >> -- > >> Jonathan > >> > >> > >>> > >>> sk_refcnt_debug_dec(sk); > >>> } > >>> @@ -784,6 +836,10 @@ static const struct net_proto_family > >>> xsk_family_ops = { > >>> .owner = THIS_MODULE, > >>> }; > >>> > >>> +static struct notifier_block xsk_netdev_notifier = { > >>> + .notifier_call = xsk_notifier, > >>> +}; > >>> + > >>> static int __net_init xsk_net_init(struct net *net) > >>> { > >>> mutex_init(&net->xdp.lock); > >>> @@ -816,8 +872,15 @@ static int __init xsk_init(void) > >>> err = register_pernet_subsys(&xsk_net_ops); > >>> if (err) > >>> goto out_sk; > >>> + > >>> + err = register_netdevice_notifier(&xsk_netdev_notifier); > >>> + if (err) > >>> + goto out_pernet; > >>> + > >>> return 0; > >>> > >>> +out_pernet: > >>> + unregister_pernet_subsys(&xsk_net_ops); > >>> out_sk: > >>> sock_unregister(PF_XDP); > >>> out_proto: > >>> -- > >>> 2.17.1 > > > >