On 11/05, Mina Almasry wrote:
> On Wed, Oct 30, 2024 at 8:07 AM Stanislav Fomichev <stfomichev@xxxxxxxxx> wrote:
> >
> > On 10/30, Mina Almasry wrote:
> > > On Wed, Oct 30, 2024 at 7:33 AM Stanislav Fomichev <stfomichev@xxxxxxxxx> wrote:
> > > >
> > > > On 10/29, Mina Almasry wrote:
> > > > > Check that we're going to free a reasonable number of frags in
> > > > > token_count before starting the loop, to prevent looping for too
> > > > > long.
> > > > >
> > > > > Also minor code cleanups:
> > > > > - Flip checks to reduce indentation.
> > > > > - Use sizeof(*tokens) everywhere for consistency.
> > > > >
> > > > > Cc: Yi Lai <yi1.lai@xxxxxxxxxxxxxxx>
> > > > >
> > > > > Signed-off-by: Mina Almasry <almasrymina@xxxxxxxxxx>
> > > > >
> > > > > ---
> > > > >  net/core/sock.c | 46 ++++++++++++++++++++++++++++------------------
> > > > >  1 file changed, 28 insertions(+), 18 deletions(-)
> > > > >
> > > > > diff --git a/net/core/sock.c b/net/core/sock.c
> > > > > index 7f398bd07fb7..8603b8d87f2e 100644
> > > > > --- a/net/core/sock.c
> > > > > +++ b/net/core/sock.c
> > > > > @@ -1047,11 +1047,12 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
> > > > >
> > > > >  #ifdef CONFIG_PAGE_POOL
> > > > >
> > > > > -/* This is the number of tokens that the user can SO_DEVMEM_DONTNEED in
> > > > > +/* This is the number of frags that the user can SO_DEVMEM_DONTNEED in
> > > > >   * 1 syscall. The limit exists to limit the amount of memory the kernel
> > > > > - * allocates to copy these tokens.
> > > > > + * allocates to copy these tokens, and to prevent looping over the frags for
> > > > > + * too long.
> > > > >   */
> > > > > -#define MAX_DONTNEED_TOKENS 128
> > > > > +#define MAX_DONTNEED_FRAGS 1024
> > > > >
> > > > >  static noinline_for_stack int
> > > > >  sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> > > > > @@ -1059,43 +1060,52 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> > > > >  	unsigned int num_tokens, i, j, k, netmem_num = 0;
> > > > >  	struct dmabuf_token *tokens;
> > > > >  	netmem_ref netmems[16];
> > > > > +	u64 num_frags = 0;
> > > > >  	int ret = 0;
> > > > >
> > > > >  	if (!sk_is_tcp(sk))
> > > > >  		return -EBADF;
> > > > >
> > > > > -	if (optlen % sizeof(struct dmabuf_token) ||
> > > > > -	    optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
> > > > > +	if (optlen % sizeof(*tokens) ||
> > > > > +	    optlen > sizeof(*tokens) * MAX_DONTNEED_FRAGS)
> > > > >  		return -EINVAL;
> > > > >
> > > > > -	tokens = kvmalloc_array(optlen, sizeof(*tokens), GFP_KERNEL);
> > > > > +	num_tokens = optlen / sizeof(*tokens);
> > > > > +	tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
> > > > >  	if (!tokens)
> > > > >  		return -ENOMEM;
> > > > >
> > > > > -	num_tokens = optlen / sizeof(struct dmabuf_token);
> > > > >  	if (copy_from_sockptr(tokens, optval, optlen)) {
> > > > >  		kvfree(tokens);
> > > > >  		return -EFAULT;
> > > > >  	}
> > > > >
> > > > > +	for (i = 0; i < num_tokens; i++) {
> > > > > +		num_frags += tokens[i].token_count;
> > > > > +		if (num_frags > MAX_DONTNEED_FRAGS) {
> > > > > +			kvfree(tokens);
> > > > > +			return -E2BIG;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > >  	xa_lock_bh(&sk->sk_user_frags);
> > > > >  	for (i = 0; i < num_tokens; i++) {
> > > > >  		for (j = 0; j < tokens[i].token_count; j++) {
> > > > >  			netmem_ref netmem = (__force netmem_ref)__xa_erase(
> > > > >  				&sk->sk_user_frags, tokens[i].token_start + j);
> > > > >
> > > > > -			if (netmem &&
> > > > > -			    !WARN_ON_ONCE(!netmem_is_net_iov(netmem))) {
> > > > > -				netmems[netmem_num++] = netmem;
> > > > > -				if (netmem_num == ARRAY_SIZE(netmems)) {
> > > > > -					xa_unlock_bh(&sk->sk_user_frags);
> > > > > -					for (k = 0; k < netmem_num; k++)
> > > > > -						WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> > > > > -					netmem_num = 0;
> > > > > -					xa_lock_bh(&sk->sk_user_frags);
> > > > > -				}
> > > > > -				ret++;
> > > >
> > > > [..]
> > > >
> > > > > +			if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
> > > > > +				continue;
> > > >
> > > > Any reason we are not returning an explicit error to the callers here?
> > > > That probably needs some mechanism to signal which particular one failed
> > > > so the users can restart?
> > >
> > > Only because I can't think of a simple way to return an array of frags
> > > that failed to DONTNEED to the user.
> >
> > I'd expect the call to return as soon as it hits the invalid frag
> > entry (plus the number of entries that it successfully refilled up to
> > the invalid one). But too late I guess.
> >
> > > Also, this error should be extremely rare or never hit, really. I don't
> > > know how we end up not finding a netmem here or how the netmem is a page.
> > > The only way is if the user is malicious (messing with the token ids
> > > passed to the kernel) or if a kernel bug is happening.
> >
> > I do hit this error with 1500 MTU, so it would've been nice to
> > understand which particular token triggered that. It might be
> > something buggy on the driver side, I need to investigate. (It's
> > super low prio because 1500.)
>
> Hmm, I've never seen this, in production (code is similar to
> upstreamed, but I guess not exactly the same), and in my ncdevmem
> upstream testing.
>
> FWIW leaked frags are extremely bad, because there is no opportunity
> to reap them until the entire dmabuf has been rebound. You will need
> to root cause this if you're seeing it and are interested in using
> devmem tcp in prod.
>
> sk_user_frags is only really touched in:
> - sock_devmem_dontneed(), where they are removed from the xarray.
> - tcp_recvmsg_dmabuf(), where they are added to the xarray.
> - tcp_v4_destroy_sock(), where they are freed (but not removed from
>   the xarray?!).
>
> The only root causes for this bug I see are:
>
> 1. You're racing tcp_v4_destroy_sock() with sock_devmem_dontneed(), so
> somehow you're trying to release a frag already released in that loop?
> Or,
> 2. You're releasing a frag that was never added by
> tcp_recvmsg_dmabuf(). I.e. there is a bug in tcp_recvmsg_dmabuf() where
> it put_cmsg()s the frag id to userspace but never adds it to
> sk_user_frags. That should be accompanied by an ncdevmem validation
> error.
>
> The way to debug #2 is really to test with the ncdevmem validation. I
> got the sense from reviewing the test series that you don't like to
> use it, but it's how I root cause such issues. You should familiarize
> yourself with it if you want to root cause such issues as well. To use
> it:
>
> client: yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | tr \\n \\0 \
>         | head -c 1G | nc <server ip> -p 5224
> server: ncdevmem -s <server IP> -c <client IP> -f eth1 -l -p 5224 -v 7
>
> If you see a validation error with your missing frag, send me the
> logs, I may be able to guess what's wrong.

Ack, let's put this discussion on the stack and I'll resurrect it once
we have something upstream that's mimicking whatever I have on my side
in terms of generic devmem tx + device drivers :-)

> > > Also, the information is useless to the user. If the user sees 'frag
> > > 128 failed to free', there is nothing really the user can do to
> > > recover at runtime. The only usefulness that could come of it is for
> > > the user to log the error. We already WARN_ON_ONCE on the error,
> > > which the user would not be able to trigger.
> >
> > I'm thinking from the PoV of the user application. It might have bugs
> > as well and try to refill something that should not have been refilled.
> > Having info about which particular token has failed (even just for
> > logging purposes) might have been nice.
>
> Yeah, it may have been nice. On the flip side, it complicates calling
> sock_devmem_dontneed(). The userspace needs to count the freed frags in
> its input, remove them, skip the leaked one, and re-call the syscall.
> On the other hand, the userspace gets to know the id of the frag that
> leaked, but the usefulness of that information is slightly questionable
> to me. :shrug:

Right, because I was gonna suggest for this patch, instead of having a
separate extra loop that returns -E2BIG (since this loop is basically
mostly cycles wasted, assuming most of the calls are gonna be well
behaved), can we keep a running num_frags count of the frags freed as
we go, and stop and return once we reach MAX_DONTNEED_FRAGS?

for (i = 0; i < num_tokens; i++) {
	for (j ...) {
		netmem_ref netmem ...
		...
	}

	num_frags += tokens[i].token_count;
	if (num_frags > MAX_DONTNEED_FRAGS)
		return ret;
}

Or do you still find it confusing because userspace has to handle that?