On 11/05, Mina Almasry wrote:
> On Wed, Oct 30, 2024 at 8:07 AM Stanislav Fomichev <stfomichev@xxxxxxxxx> wrote:
> >
> > On 10/30, Mina Almasry wrote:
> > > On Wed, Oct 30, 2024 at 7:33 AM Stanislav Fomichev <stfomichev@xxxxxxxxx> wrote:
> > > >
> > > > On 10/29, Mina Almasry wrote:
> > > > > Check that we're going to free a reasonable number of frags in
> > > > > token_count before starting the loop, to prevent looping for too
> > > > > long.
> > > > >
> > > > > Also minor code cleanups:
> > > > > - Flip checks to reduce indentation.
> > > > > - Use sizeof(*tokens) everywhere for consistency.
> > > > >
> > > > > Cc: Yi Lai <yi1.lai@xxxxxxxxxxxxxxx>
> > > > >
> > > > > Signed-off-by: Mina Almasry <almasrymina@xxxxxxxxxx>
> > > > >
> > > > > ---
> > > > >  net/core/sock.c | 46 ++++++++++++++++++++++++++++------------------
> > > > >  1 file changed, 28 insertions(+), 18 deletions(-)
> > > > >
> > > > > diff --git a/net/core/sock.c b/net/core/sock.c
> > > > > index 7f398bd07fb7..8603b8d87f2e 100644
> > > > > --- a/net/core/sock.c
> > > > > +++ b/net/core/sock.c
> > > > > @@ -1047,11 +1047,12 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
> > > > >
> > > > >  #ifdef CONFIG_PAGE_POOL
> > > > >
> > > > > -/* This is the number of tokens that the user can SO_DEVMEM_DONTNEED in
> > > > > +/* This is the number of frags that the user can SO_DEVMEM_DONTNEED in
> > > > >   * 1 syscall. The limit exists to limit the amount of memory the kernel
> > > > > - * allocates to copy these tokens.
> > > > > + * allocates to copy these tokens, and to prevent looping over the frags for
> > > > > + * too long.
> > > > >   */
> > > > > -#define MAX_DONTNEED_TOKENS 128
> > > > > +#define MAX_DONTNEED_FRAGS 1024
> > > > >
> > > > >  static noinline_for_stack int
> > > > >  sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> > > > > @@ -1059,43 +1060,52 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> > > > >  	unsigned int num_tokens, i, j, k, netmem_num = 0;
> > > > >  	struct dmabuf_token *tokens;
> > > > >  	netmem_ref netmems[16];
> > > > > +	u64 num_frags = 0;
> > > > >  	int ret = 0;
> > > > >
> > > > >  	if (!sk_is_tcp(sk))
> > > > >  		return -EBADF;
> > > > >
> > > > > -	if (optlen % sizeof(struct dmabuf_token) ||
> > > > > -	    optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
> > > > > +	if (optlen % sizeof(*tokens) ||
> > > > > +	    optlen > sizeof(*tokens) * MAX_DONTNEED_FRAGS)
> > > > >  		return -EINVAL;
> > > > >
> > > > > -	tokens = kvmalloc_array(optlen, sizeof(*tokens), GFP_KERNEL);
> > > > > +	num_tokens = optlen / sizeof(*tokens);
> > > > > +	tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
> > > > >  	if (!tokens)
> > > > >  		return -ENOMEM;
> > > > >
> > > > > -	num_tokens = optlen / sizeof(struct dmabuf_token);
> > > > >  	if (copy_from_sockptr(tokens, optval, optlen)) {
> > > > >  		kvfree(tokens);
> > > > >  		return -EFAULT;
> > > > >  	}
> > > > >
> > > > > +	for (i = 0; i < num_tokens; i++) {
> > > > > +		num_frags += tokens[i].token_count;
> > > > > +		if (num_frags > MAX_DONTNEED_FRAGS) {
> > > > > +			kvfree(tokens);
> > > > > +			return -E2BIG;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > >  	xa_lock_bh(&sk->sk_user_frags);
> > > > >  	for (i = 0; i < num_tokens; i++) {
> > > > >  		for (j = 0; j < tokens[i].token_count; j++) {
> > > > >  			netmem_ref netmem = (__force netmem_ref)__xa_erase(
> > > > >  				&sk->sk_user_frags, tokens[i].token_start + j);
> > > > >
> > > > > -			if (netmem &&
> > > > > -			    !WARN_ON_ONCE(!netmem_is_net_iov(netmem))) {
> > > > > -				netmems[netmem_num++] = netmem;
> > > > > -				if (netmem_num == ARRAY_SIZE(netmems)) {
> > > > > -					xa_unlock_bh(&sk->sk_user_frags);
> > > > > -					for (k = 0; k < netmem_num; k++)
> > > > > -						WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> > > > > -					netmem_num = 0;
> > > > > -					xa_lock_bh(&sk->sk_user_frags);
> > > > > -				}
> > > > > -				ret++;
> > > >
> > > > [..]
> > > >
> > > > > +			if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
> > > > > +				continue;
> > > >
> > > > Any reason we are not returning an explicit error to the callers here?
> > > > That probably needs some mechanism to signal which particular one failed
> > > > so the users can restart?
> > >
> > > Only because I can't think of a simple way to return an array of frags
> > > that failed to DONTNEED to the user.
> >
> > I'd expect the call to return as soon as it hits the invalid frag
> > entry (plus the number of entries that it successfully refilled up to
> > the invalid one). But too late I guess.
> >
> > > Also, this error should be extremely rare or never hit, really. I don't
> > > know how we end up not finding a netmem here or how the netmem is a page.
> > > The only way is if the user is malicious (messing with the token ids
> > > passed to the kernel) or if a kernel bug is happening.
> >
> > I do hit this error with 1500 MTU, so it would've been nice to
> > understand which particular token triggered that. It might be
> > something buggy on the driver side, I need to investigate. (It's
> > super low prio because 1500.)
>
> Hmm, I've never seen this, in production (code is similar to
> upstreamed, but I guess not exactly the same), and in my ncdevmem
> upstream testing.
>
> FWIW leaked frags are extremely bad, because there is no opportunity
> to reap them until the entire dmabuf has been rebound. You will need
> to root cause this if you're seeing it and are interested in using
> devmem tcp in prod.
>
> sk_user_frags is only really touched in:
> - sock_devmem_dontneed(), where they are removed from the xarray.
> - tcp_recvmsg_dmabuf(), where they are added to the xarray.
> - tcp_v4_destroy_sock(), where they are freed (but not removed from
>   the xarray?!).
>
> The only root causes for this bug I see are:
>
> 1. You're racing tcp_v4_destroy_sock() with sock_devmem_dontneed(), so
> somehow you're trying to release a frag already released in that loop?
> Or,
> 2. You're releasing a frag that was never added by
> tcp_recvmsg_dmabuf(). I.e. there is a bug in tcp_recvmsg_dmabuf() where
> it put_cmsg()s the frag id to userspace but never adds it to
> sk_user_frags. That should be accompanied by an ncdevmem validation
> error.
>
> The way to debug #2 is really to test with the ncdevmem validation. I
> got the sense from reviewing the test series that you don't like to
> use it, but it's how I root cause such issues. You should familiarize
> yourself with it if you want to root cause such issues as well. To use
> it:
>
> client: yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | tr \\n \\0 \
>         | head -c 1G | nc <server ip> -p 5224
> server: ncdevmem -s <server IP> -c <client IP> -f eth1 -l -p 5224 -v 7
>
> If you see a validation error with your missing frag, send me the
> logs, I may be able to guess what's wrong.

Ack, let's put this discussion on the stack and I'll resurrect it once
we have something upstream that's mimicking whatever I have on my side
in terms of generic devmem tx + device drivers :-)

> > > Also, the information is useless to the user. If the user sees 'frag
> > > 128 failed to free', there is nothing really the user can do to
> > > recover at runtime. The only usefulness that could come of it is for
> > > the user to log the error. We already WARN_ON_ONCE on the error,
> > > which the user would not be able to trigger.
> >
> > I'm thinking from the PoV of the user application. It might have bugs
> > as well and try to refill something that should not have been refilled.
> > Having info about which particular token has failed (even just for
> > logging purposes) might have been nice.
>
> Yeah, it may have been nice. On the flip side, it complicates calling
> sock_devmem_dontneed(). The userspace needs to count the freed frags in
> its input, remove them, skip the leaked one, and re-call the syscall.
> On the other hand, the userspace gets to know the id of the frag that
> leaked, but the usefulness of that information is slightly questionable
> to me. :shrug:

Right, because I was gonna suggest for this patch, instead of having a
separate extra loop that returns -E2BIG (since this loop is basically
mostly cycles wasted, assuming most of the calls are gonna be well
behaved), can we keep a running num_frags count of the frags freed as
we go, and stop and return once we reach MAX_DONTNEED_FRAGS?

for (i = 0; i < num_tokens; i++) {
	for (j ...) {
		netmem_ref netmem ...
		...
	}

	num_frags += tokens[i].token_count;
	if (num_frags > MAX_DONTNEED_FRAGS)
		return ret;
}

Or do you still find it confusing because userspace has to handle that?