On Mon, 3 Jul 2023 at 12:13, Ilya Maximets <i.maximets@xxxxxxx> wrote:
>
> On 7/3/23 12:06, Ilya Maximets wrote:
> > On 7/3/23 11:48, Magnus Karlsson wrote:
> >> On Fri, 30 Jun 2023 at 16:58, Ilya Maximets <i.maximets@xxxxxxx> wrote:
> >>>
> >>> Initial creation of an AF_XDP socket requires CAP_NET_RAW capability.
> >>> A privileged process might create the socket and pass it to a
> >>> non-privileged process for later use. However, that process will be
> >>> able to bind the socket to any network interface. Even though it will
> >>> not be able to receive any traffic without modification of the BPF map,
> >>> the situation is not ideal.
> >>>
> >>> Sockets already have a mechanism that can be used to restrict what
> >>> interface they can be attached to. That is SO_BINDTODEVICE.
> >>>
> >>> To change the binding the process will need CAP_NET_RAW.
> >>>
> >>> Make xsk_bind() honor the SO_BINDTODEVICE in order to allow safer
> >>> workflow when non-privileged process is using AF_XDP.
> >>
> >> Rebinding an AF_XDP socket is not allowed today. Any such attempt will
> >> return an error from bind. So if I understand the purpose of
> >> SO_BINDTODEVICE correctly, you could say that this option is always
> >> set for an AF_XDP socket and it is not possible to toggle it. The only
> >> way to "rebind" an AF_XDP socket is to close it and open a new one.
> >> This was a conscious design decision from day one as it would be very
> >> hard to support this, especially in zero-copy mode.
> >
> > Hi, Magnus.
> >
> > The purpose of this patch is not to allow re-binding. The use case is
> > following:
> >
> > 1. First process creates a bare socket with socket(AF_XDP, ...).
> > 2. First process loads the XSK program to the interface.
> > 3. First process adds the socket fd to a BPF map.
> > 4. First process sends socket fd to a second process.
> > 5. Second process allocates UMEM.
> > 6. Second process binds socket to the interface.
> > 7. Second process sends/receives the traffic.
:)
> >
>
> > The idea is that the first process will call SO_BINDTODEVICE before
> > sending socket fd to a second process, so the second process is limited
> > in to which interface it can bind the socket.
> >
> > Does that make sense?

Thanks for explaining this to me. Yes, that makes sense and seems
useful. Could you please send a v2 and include the flow (1-7) above in
your commit message? Would be good to add one step with the setsockopt
SO_BINDTODEVICE before step #4 just to be clear. With those changes
please feel free to include my ack:

Acked-by: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>

Thank you!

> > This workflow allows the second process to have no capabilities
> > as long as it has sufficient RLIMIT_MEMLOCK.
>
> Note that steps 1-7 are working just fine today. i.e. the umem
> registration, bind, ring mapping and traffic send/receive do not
> require any extra capabilities.
>
> We may restrict the bind() call to require CAP_NET_RAW and then
> allow it for sockets that had SO_BINDTODEVICE as an alternative.
> But restriction will break the current uAPI.
>
> >
> > Best regards, Ilya Maximets.
> >
> >>
> >>> Signed-off-by: Ilya Maximets <i.maximets@xxxxxxx>
> >>> ---
> >>>
> >>> Posting as an RFC for now to probably get some feedback.
> >>> Will re-post once the tree is open.
> >>>
> >>>  Documentation/networking/af_xdp.rst | 9 +++++++++
> >>>  net/xdp/xsk.c                       | 6 ++++++
> >>>  2 files changed, 15 insertions(+)
> >>>
> >>> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> >>> index 247c6c4127e9..1cc35de336a4 100644
> >>> --- a/Documentation/networking/af_xdp.rst
> >>> +++ b/Documentation/networking/af_xdp.rst
> >>> @@ -433,6 +433,15 @@ start N bytes into the buffer leaving the first N bytes for the
> >>>  application to use. The final option is the flags field, but it will
> >>>  be dealt with in separate sections for each UMEM flag.
> >>>
> >>> +SO_BINDTODEVICE setsockopt
> >>> +--------------------------
> >>> +
> >>> +This is a generic SOL_SOCKET option that can be used to tie AF_XDP
> >>> +socket to a particular network interface. It is useful when a socket
> >>> +is created by a privileged process and passed to a non-privileged one.
> >>> +Once the option is set, kernel will refuse attempts to bind that socket
> >>> +to a different interface. Updating the value requires CAP_NET_RAW.
> >>> +
> >>>  XDP_STATISTICS getsockopt
> >>>  -------------------------
> >>>
> >>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> >>> index 5a8c0dd250af..386ff641db0f 100644
> >>> --- a/net/xdp/xsk.c
> >>> +++ b/net/xdp/xsk.c
> >>> @@ -886,6 +886,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
> >>>          struct sock *sk = sock->sk;
> >>>          struct xdp_sock *xs = xdp_sk(sk);
> >>>          struct net_device *dev;
> >>> +        int bound_dev_if;
> >>>          u32 flags, qid;
> >>>          int err = 0;
> >>>
> >>> @@ -899,6 +900,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
> >>>                                XDP_USE_NEED_WAKEUP))
> >>>                  return -EINVAL;
> >>>
> >>> +        bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
> >>> +
> >>> +        if (bound_dev_if && bound_dev_if != sxdp->sxdp_ifindex)
> >>> +                return -EINVAL;
> >>> +
> >>>          rtnl_lock();
> >>>          mutex_lock(&xs->mutex);
> >>>          if (xs->state != XSK_READY) {
> >>> --
> >>> 2.40.1
> >>>
> >>>
> >
>