On Mon, Mar 27, 2023 at 10:54 AM -07, John Fastabend wrote:
> We noticed some rare sk_buffs were stepping past the queue when system was
> under memory pressure. The general theory is to skip enqueueing
> sk_buffs when its not necessary which is the normal case with a system
> that is properly provisioned for the task, no memory pressure and enough
> cpu assigned.
>
> But, if we can't allocate memory due to an ENOMEM error when enqueueing
> the sk_buff into the sockmap receive queue we push it onto a delayed
> workqueue to retry later. When a new sk_buff is received we then check
> if that queue is empty. However, there is a problem with simply checking
> the queue length. When a sk_buff is being processed from the ingress queue
> but not yet on the sockmap msg receive queue its possible to also recv
> a sk_buff through normal path. It will check the ingress queue which is
> zero and then skip ahead of the pkt being processed.
>
> Previously we used sock lock from both contexts which made the problem
> harder to hit, but not impossible.
>
> To fix also check the 'state' variable where we would cache partially
> processed sk_buff. This catches the majority of cases. But, we also
> need to use the mutex lock around this check because we can't have both
> codes running and check sensibly. We could perhaps do this with atomic
> bit checks, but we are already here due to memory pressure so slowing
> things down a bit seems OK and simpler to just grab a lock.
>
> To reproduce issue we run NGINX compliance test with sockmap running and
> observe some flakes in our testing that we attributed to this issue.
>
> Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> Tested-by: William Findlay <will@xxxxxxxxxxxxx>
> Signed-off-by: John Fastabend <john.fastabend@xxxxxxxxx>
> ---

I've got an idea to try, but it'd be a bigger change.

skb_dequeue is lock, skb_peek, skb_unlink, unlock, right?
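To make the window concrete, here is a toy userspace model of the two
orderings: unlinking the head before it is processed (what a dequeue does)
versus unlinking it only afterwards. This is a sketch, not kernel code. The
types and helpers (struct item, struct queue, queue_*, empty_seen_*) are
hypothetical stand-ins for sk_buff and sk_buff_head, and the queue lock is
left out because the model is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for sk_buff / sk_buff_head; not the kernel types. */
struct item {
	struct item *next;
	int id;
};

struct queue {
	struct item *head;
	struct item *tail;
};

static void queue_init(struct queue *q)
{
	q->head = NULL;
	q->tail = NULL;
}

/* Append an item at the tail. */
static void queue_tail(struct queue *q, struct item *it)
{
	it->next = NULL;
	if (q->tail)
		q->tail->next = it;
	else
		q->head = it;
	q->tail = it;
}

static bool queue_empty(const struct queue *q)
{
	return q->head == NULL;
}

/* Look at the head without removing it. */
static struct item *queue_peek(const struct queue *q)
{
	return q->head;
}

/* Remove the head; the caller must pass the current head. */
static void queue_unlink_head(struct queue *q, struct item *it)
{
	assert(q->head == it);
	q->head = it->next;
	if (!q->head)
		q->tail = NULL;
	it->next = NULL;
}

/* Dequeue = (lock,) peek, unlink, (unlock); locking omitted in this model. */
static struct item *queue_dequeue(struct queue *q)
{
	struct item *it = queue_peek(q);

	if (it)
		queue_unlink_head(q, it);
	return it;
}

/*
 * Ordering 1: remove the head first, then "process" it.
 * Returns what an emptiness check made during processing would observe.
 */
static bool empty_seen_dequeue_first(struct queue *q)
{
	struct item *it = queue_dequeue(q);
	bool seen = queue_empty(q);	/* check made while item is in flight */

	(void)it;			/* processing would happen here */
	return seen;
}

/*
 * Ordering 2: peek, "process", and unlink only afterwards.
 * Returns what an emptiness check made during processing would observe.
 */
static bool empty_seen_unlink_last(struct queue *q)
{
	struct item *it = queue_peek(q);
	bool seen = queue_empty(q);	/* check made while item is in flight */

	if (it)
		queue_unlink_head(q, it);	/* publish removal after processing */
	return seen;
}
```

With a single queued item, empty_seen_dequeue_first() reports the queue as
empty while the item is still in flight, whereas empty_seen_unlink_last()
keeps it visible until processing is done. Both leave the queue empty
afterwards.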
What if we split up the skb_dequeue in sk_psock_backlog to publish the
change to the ingress_skb queue only once an skb has been processed?

static void sk_psock_backlog(struct work_struct *work)
{
        ...
        while ((skb = skb_peek_locked(&psock->ingress_skb))) {
                ...
                skb_unlink(skb, &psock->ingress_skb);
        }
        ...
}

Even more - if we hold off the unlinking until an skb has been fully
processed, that perhaps opens up the way to get rid of keeping state in
sk_psock_work_state. We could just skb_pull the processed data instead.

It's just an idea and I don't want to block a tested fix that LGTM so
consider this:

Reviewed-by: Jakub Sitnicki <jakub@xxxxxxxxxxxxxx>