On 7/3/23 13:19, Magnus Karlsson wrote:
> On Mon, 3 Jul 2023 at 13:16, Ilya Maximets <i.maximets@xxxxxxx> wrote:
>>
>> On 7/3/23 12:24, Magnus Karlsson wrote:
>>> On Mon, 3 Jul 2023 at 12:13, Ilya Maximets <i.maximets@xxxxxxx> wrote:
>>>>
>>>> On 7/3/23 12:06, Ilya Maximets wrote:
>>>>> On 7/3/23 11:48, Magnus Karlsson wrote:
>>>>>> On Fri, 30 Jun 2023 at 16:58, Ilya Maximets <i.maximets@xxxxxxx> wrote:
>>>>>>>
>>>>>>> Initial creation of an AF_XDP socket requires CAP_NET_RAW capability.
>>>>>>> A privileged process might create the socket and pass it to a
>>>>>>> non-privileged process for later use. However, that process will be
>>>>>>> able to bind the socket to any network interface. Even though it will
>>>>>>> not be able to receive any traffic without modification of the BPF map,
>>>>>>> the situation is not ideal.
>>>>>>>
>>>>>>> Sockets already have a mechanism that can be used to restrict what
>>>>>>> interface they can be attached to. That is SO_BINDTODEVICE.
>>>>>>>
>>>>>>> To change the binding the process will need CAP_NET_RAW.
>>>>>>>
>>>>>>> Make xsk_bind() honor the SO_BINDTODEVICE in order to allow safer
>>>>>>> workflow when non-privileged process is using AF_XDP.
>>>>>>
>>>>>> Rebinding an AF_XDP socket is not allowed today. Any such attempt will
>>>>>> return an error from bind. So if I understand the purpose of
>>>>>> SO_BINDTODEVICE correctly, you could say that this option is always
>>>>>> set for an AF_XDP socket and it is not possible to toggle it. The only
>>>>>> way to "rebind" an AF_XDP socket is to close it and open a new one.
>>>>>> This was a conscious design decision from day one as it would be very
>>>>>> hard to support this, especially in zero-copy mode.
>>>>>
>>>>> Hi, Magnus.
>>>>>
>>>>> The purpose of this patch is not to allow re-binding. The use case is
>>>>> following:
>>>>>
>>>>>   1. First process creates a bare socket with socket(AF_XDP, ...).
>>>>>   2. First process loads the XSK program to the interface.
>>>>>   3. First process adds the socket fd to a BPF map.
>>>>>   4. First process sends socket fd to a second process.
>>>>>   5. Second process allocates UMEM.
>>>>>   6. Second process binds socket to the interface.
>>>>
>>>>   7. Second process sends/receives the traffic. :)
>>>>
>>>>>
>>>>> The idea is that the first process will call SO_BINDTODEVICE before
>>>>> sending socket fd to a second process, so the second process is limited
>>>>> in to which interface it can bind the socket.
>>>>>
>>>>> Does that make sense?
>>>
>>> Thanks for explaining this to me. Yes, that makes sense and seems
>>> useful. Could you please send a v2 and include the flow (1-7) above in
>>> your commit message? Would be good to add one step with the setsockopt
>>> SO_BINDTODEVICE before step #4 just to be clear. With those changes
>>> please feel free to include my ack:
>>>
>>> Acked-by: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>
>>
>> Thanks! I'll update the commit message with the steps above to make it
>> more clear.
>>
>> I was planning to send a non-RFC version of this patch once the tree is
>> open (in a week). Or are the rules for bpf-next different?
>
> Bpf-next is always open I believe.

OK. I'll update the commit message and send a formal patch then. Thanks!

>
>>>
>>> Thank you!
>>>
>>>>> This workflow allows the second process to have no capabilities
>>>>> as long as it has sufficient RLIMIT_MEMLOCK.
>>>>
>>>> Note that steps 1-7 are working just fine today. i.e. the umem
>>>> registration, bind, ring mapping and traffic send/receive do not
>>>> require any extra capabilities.
>>>>
>>>> We may restrict the bind() call to require CAP_NET_RAW and then
>>>> allow it for sockets that had SO_BINDTODEVICE as an alternative.
>>>> But restriction will break the current uAPI.
>>>>
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>>
>>>>>>
>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@xxxxxxx>
>>>>>>> ---
>>>>>>>
>>>>>>> Posting as an RFC for now to probably get some feedback.
>>>>>>> Will re-post once the tree is open.
>>>>>>>
>>>>>>>  Documentation/networking/af_xdp.rst | 9 +++++++++
>>>>>>>  net/xdp/xsk.c                       | 6 ++++++
>>>>>>>  2 files changed, 15 insertions(+)
>>>>>>>
>>>>>>> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
>>>>>>> index 247c6c4127e9..1cc35de336a4 100644
>>>>>>> --- a/Documentation/networking/af_xdp.rst
>>>>>>> +++ b/Documentation/networking/af_xdp.rst
>>>>>>> @@ -433,6 +433,15 @@ start N bytes into the buffer leaving the first N bytes for the
>>>>>>>  application to use. The final option is the flags field, but it will
>>>>>>>  be dealt with in separate sections for each UMEM flag.
>>>>>>>
>>>>>>> +SO_BINDTODEVICE setsockopt
>>>>>>> +--------------------------
>>>>>>> +
>>>>>>> +This is a generic SOL_SOCKET option that can be used to tie AF_XDP
>>>>>>> +socket to a particular network interface. It is useful when a socket
>>>>>>> +is created by a privileged process and passed to a non-privileged one.
>>>>>>> +Once the option is set, kernel will refuse attempts to bind that socket
>>>>>>> +to a different interface. Updating the value requires CAP_NET_RAW.
>>>>>>> +
>>>>>>>  XDP_STATISTICS getsockopt
>>>>>>>  -------------------------
>>>>>>>
>>>>>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>>>>>>> index 5a8c0dd250af..386ff641db0f 100644
>>>>>>> --- a/net/xdp/xsk.c
>>>>>>> +++ b/net/xdp/xsk.c
>>>>>>> @@ -886,6 +886,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>>>>>>>          struct sock *sk = sock->sk;
>>>>>>>          struct xdp_sock *xs = xdp_sk(sk);
>>>>>>>          struct net_device *dev;
>>>>>>> +        int bound_dev_if;
>>>>>>>          u32 flags, qid;
>>>>>>>          int err = 0;
>>>>>>>
>>>>>>> @@ -899,6 +900,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>>>>>>>                        XDP_USE_NEED_WAKEUP))
>>>>>>>                  return -EINVAL;
>>>>>>>
>>>>>>> +        bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
>>>>>>> +
>>>>>>> +        if (bound_dev_if && bound_dev_if != sxdp->sxdp_ifindex)
>>>>>>> +                return -EINVAL;
>>>>>>> +
>>>>>>>          rtnl_lock();
>>>>>>>          mutex_lock(&xs->mutex);
>>>>>>>          if (xs->state != XSK_READY) {
>>>>>>> --
>>>>>>> 2.40.1
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>