Zijian Zhang wrote: > > On 11/7/24 8:04 PM, Cong Wang wrote: > > On Thu, Oct 17, 2024 at 12:57:42AM +0000, zijianzhang@xxxxxxxxxxxxx wrote: > >> From: Zijian Zhang <zijianzhang@xxxxxxxxxxxxx> > >> > >> Although we sk_rmem_schedule and add sk_msg to the ingress_msg of sk_redir > >> in bpf_tcp_ingress, we do not update sk_rmem_alloc. As a result, except > >> for the global memory limit, the rmem of sk_redir is nearly unlimited. > >> > >> Thus, add sk_rmem_alloc related logic to limit the recv buffer. > >> > >> Signed-off-by: Zijian Zhang <zijianzhang@xxxxxxxxxxxxx> > >> --- > >> include/linux/skmsg.h | 11 ++++++++--- > >> net/core/skmsg.c | 6 +++++- > >> net/ipv4/tcp_bpf.c | 4 +++- > >> 3 files changed, 16 insertions(+), 5 deletions(-) > >> > >> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h > >> index d9b03e0746e7..2cbe0c22a32f 100644 > >> --- a/include/linux/skmsg.h > >> +++ b/include/linux/skmsg.h > >> @@ -317,17 +317,22 @@ static inline void sock_drop(struct sock *sk, struct sk_buff *skb) > >> kfree_skb(skb); > >> } > >> > >> -static inline void sk_psock_queue_msg(struct sk_psock *psock, > >> +static inline bool sk_psock_queue_msg(struct sk_psock *psock, > >> struct sk_msg *msg) > >> { > >> + bool ret; > >> + > >> spin_lock_bh(&psock->ingress_lock); > >> - if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) > >> + if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) { > >> list_add_tail(&msg->list, &psock->ingress_msg); > >> - else { > >> + ret = true; > >> + } else { > >> sk_msg_free(psock->sk, msg); > >> kfree(msg); > >> + ret = false; > >> } > >> spin_unlock_bh(&psock->ingress_lock); > >> + return ret; > >> } > >> > >> static inline struct sk_msg *sk_psock_dequeue_msg(struct sk_psock *psock) > >> diff --git a/net/core/skmsg.c b/net/core/skmsg.c > >> index b1dcbd3be89e..110ee0abcfe0 100644 > >> --- a/net/core/skmsg.c > >> +++ b/net/core/skmsg.c > >> @@ -445,8 +445,10 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg, > >> if (likely(!peek)) { > >> sge->offset += copy; > >> sge->length -= copy; > >> - if (!msg_rx->skb) > >> + if (!msg_rx->skb) { > >> sk_mem_uncharge(sk, copy); > >> + atomic_sub(copy, &sk->sk_rmem_alloc); > >> + } > >> msg_rx->sg.size -= copy; > >> > >> if (!sge->length) { > >> @@ -772,6 +774,8 @@ static void __sk_psock_purge_ingress_msg(struct sk_psock *psock) > >> > >> list_for_each_entry_safe(msg, tmp, &psock->ingress_msg, list) { > >> list_del(&msg->list); > >> + if (!msg->skb) > >> + atomic_sub(msg->sg.size, &psock->sk->sk_rmem_alloc); > >> sk_msg_free(psock->sk, msg); > > > > Why not calling this atomic_sub() in sk_msg_free_elem()? > > > > Thanks. > > sk_msg_free_elem called by sk_msg_free or sk_msg_free_no_charge will > be invoked in multiple locations including TX/RX/Error and etc. > > We should call atomic_sub(&sk->sk_rmem_alloc) for sk_msgs that have > been atomic_add before. In other words, we need to call atomic_sub > only for sk_msgs in ingress_msg. > > As for "!msg->skb" check here, I want to make sure the sk_msg is not > from function sk_psock_skb_ingress_enqueue, because these sk_msgs' > rmem accounting has already handled by skb_set_owner_r in function > sk_psock_skb_ingress. > Assuming I read above correct this is only an issue when doing a redirect to ingress of a socket? The other path where we do a SK_PASS directly from the verdict to hit the ingress of the current socket is OK because all account is done already through skb. Basically that is what the above explanation for !msg->skb is describing? Can we add this to the description so we don't forget it and/or have to look up the mailing list in the future. Thanks, John