Re: [RFC bpf-next] xsk: honor SO_BINDTODEVICE on bind

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 7/3/23 13:19, Magnus Karlsson wrote:
> On Mon, 3 Jul 2023 at 13:16, Ilya Maximets <i.maximets@xxxxxxx> wrote:
>>
>> On 7/3/23 12:24, Magnus Karlsson wrote:
>>> On Mon, 3 Jul 2023 at 12:13, Ilya Maximets <i.maximets@xxxxxxx> wrote:
>>>>
>>>> On 7/3/23 12:06, Ilya Maximets wrote:
>>>>> On 7/3/23 11:48, Magnus Karlsson wrote:
>>>>>> On Fri, 30 Jun 2023 at 16:58, Ilya Maximets <i.maximets@xxxxxxx> wrote:
>>>>>>>
>>>>>>> Initial creation of an AF_XDP socket requires CAP_NET_RAW capability.
>>>>>>> A privileged process might create the socket and pass it to a
>>>>>>> non-privileged process for later use.  However, that process will be
>>>>>>> able to bind the socket to any network interface.  Even though it will
>>>>>>> not be able to receive any traffic without modification of the BPF map,
>>>>>>> the situation is not ideal.
>>>>>>>
>>>>>>> Sockets already have a mechanism that can be used to restrict what
>>>>>>> interface they can be attached to.  That is SO_BINDTODEVICE.
>>>>>>>
>>>>>>> To change the binding the process will need CAP_NET_RAW.
>>>>>>>
>>>>>>> Make xsk_bind() honor the SO_BINDTODEVICE in order to allow safer
>>>>>>> workflow when non-privileged process is using AF_XDP.
>>>>>>
>>>>>> Rebinding an AF_XDP socket is not allowed today. Any such attempt will
>>>>>> return an error from bind. So if I understand the purpose of
>>>>>> SO_BINDTODEVICE correctly, you could say that this option is always
>>>>>> set for an AF_XDP socket and it is not possible to toggle it. The only
>>>>>> way to "rebind" an AF_XDP socket is to close it and open a new one.
>>>>>> This was a conscious design decision from day one as it would be very
>>>>>> hard to support this, especially in zero-copy mode.
>>>>>
>>>>> Hi, Magnus.
>>>>>
>>>>> The purpose of this patch is not to allow re-binding.  The use case is
>>>>> following:
>>>>>
>>>>> 1. First process creates a bare socket with socket(AF_XDP, ...).
>>>>> 2. First process loads the XSK program to the interface.
>>>>> 3. First process adds the socket fd to a BPF map.
>>>>> 4. First process sends socket fd to a second process.
>>>>> 5. Second process allocates UMEM.
>>>>> 6. Second process binds socket to the interface.
>>>>
>>>> 7. Second process sends/receives the traffic. :)
>>>>
>>>>>
>>>>> The idea is that the first process will call SO_BINDTODEVICE before
>>>>> sending socket fd to a second process, so the second process is limited
>>>>> in to which interface it can bind the socket.
>>>>>
>>>>> Does that make sense?
>>>
>>> Thanks for explaining this to me. Yes, that makes sense and seems
>>> useful. Could you please send a v2 and include the flow (1-7) above in
>>> your commit message? Would be good to add one step with the setsockopt
>>> SO_BINDTODEVICE before step #4 just to be clear. With those changes
>>> please feel free to include my ack:
>>>
>>>  Acked-by: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>
>>
>> Thanks!  I'll update the commit message with the steps above to make it
>> more clear.
>>
>> I was planning to send a non-RFC version of this patch once the tree is
>> open (in a week).  Or are the rules for bpf-next different?
> 
> Bpf-next is always open I believe.

OK.  I'll update the commit message and send a formal patch then.  Thanks!

> 
>>>
>>> Thank you!
>>>
>>>>> This workflow allows the second process to have no capabilities
>>>>> as long as it has sufficient RLIMIT_MEMLOCK.
>>>>
>>>> Note that steps 1-7 are working just fine today.  i.e. the umem
>>>> registration, bind, ring mapping and traffic send/receive do not
>>>> require any extra capabilities.
>>>>
>>>> We may restrict the bind() call to require CAP_NET_RAW and then
>>>> allow it for sockets that had SO_BINDTODEVICE as an alternative.
>>>> But restriction will break the current uAPI.
>>>>
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>>
>>>>>>
>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@xxxxxxx>
>>>>>>> ---
>>>>>>>
>>>>>>> Posting as an RFC for now to probably get some feedback.
>>>>>>> Will re-post once the tree is open.
>>>>>>>
>>>>>>>  Documentation/networking/af_xdp.rst | 9 +++++++++
>>>>>>>  net/xdp/xsk.c                       | 6 ++++++
>>>>>>>  2 files changed, 15 insertions(+)
>>>>>>>
>>>>>>> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
>>>>>>> index 247c6c4127e9..1cc35de336a4 100644
>>>>>>> --- a/Documentation/networking/af_xdp.rst
>>>>>>> +++ b/Documentation/networking/af_xdp.rst
>>>>>>> @@ -433,6 +433,15 @@ start N bytes into the buffer leaving the first N bytes for the
>>>>>>>  application to use. The final option is the flags field, but it will
>>>>>>>  be dealt with in separate sections for each UMEM flag.
>>>>>>>
>>>>>>> +SO_BINDTODEVICE setsockopt
>>>>>>> +--------------------------
>>>>>>> +
>>>>>>> +This is a generic SOL_SOCKET option that can be used to tie AF_XDP
>>>>>>> +socket to a particular network interface.  It is useful when a socket
>>>>>>> +is created by a privileged process and passed to a non-privileged one.
>>>>>>> +Once the option is set, kernel will refuse attempts to bind that socket
>>>>>>> +to a different interface.  Updating the value requires CAP_NET_RAW.
>>>>>>> +
>>>>>>>  XDP_STATISTICS getsockopt
>>>>>>>  -------------------------
>>>>>>>
>>>>>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>>>>>>> index 5a8c0dd250af..386ff641db0f 100644
>>>>>>> --- a/net/xdp/xsk.c
>>>>>>> +++ b/net/xdp/xsk.c
>>>>>>> @@ -886,6 +886,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>>>>>>>         struct sock *sk = sock->sk;
>>>>>>>         struct xdp_sock *xs = xdp_sk(sk);
>>>>>>>         struct net_device *dev;
>>>>>>> +       int bound_dev_if;
>>>>>>>         u32 flags, qid;
>>>>>>>         int err = 0;
>>>>>>>
>>>>>>> @@ -899,6 +900,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>>>>>>>                       XDP_USE_NEED_WAKEUP))
>>>>>>>                 return -EINVAL;
>>>>>>>
>>>>>>> +       bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
>>>>>>> +
>>>>>>> +       if (bound_dev_if && bound_dev_if != sxdp->sxdp_ifindex)
>>>>>>> +               return -EINVAL;
>>>>>>> +
>>>>>>>         rtnl_lock();
>>>>>>>         mutex_lock(&xs->mutex);
>>>>>>>         if (xs->state != XSK_READY) {
>>>>>>> --
>>>>>>> 2.40.1
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>




[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux