Cong Wang wrote:
> From: Cong Wang <cong.wang@xxxxxxxxxxxxx>
>
> Yucong noticed we can't poll() sockets in sockmap even
> when they are the destination sockets of redirections.
> This is because we never poll any psock queues in ->poll().
> We can not overwrite ->poll() as it is in struct proto_ops,
> not in struct proto.
>
> So introduce sk_msg_poll() to poll psock ingress_msg queue
> and let sockets which support sockmap invoke it directly.
>
> Reported-by: Yucong Sun <sunyucong@xxxxxxxxx>
> Cc: John Fastabend <john.fastabend@xxxxxxxxx>
> Cc: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> Cc: Jakub Sitnicki <jakub@xxxxxxxxxxxxxx>
> Cc: Lorenz Bauer <lmb@xxxxxxxxxxxxxx>
> Signed-off-by: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> ---
>  include/linux/skmsg.h |  6 ++++++
>  net/core/skmsg.c      | 15 +++++++++++++++
>  net/ipv4/tcp.c        |  2 ++
>  net/ipv4/udp.c        |  2 ++
>  net/unix/af_unix.c    |  5 +++++
>  5 files changed, 30 insertions(+)
>
> [...] struct sk_buff *skb)
>  {
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e8b48df73c85..2eb1a87ba056 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -280,6 +280,7 @@
>  #include <linux/uaccess.h>
>  #include <asm/ioctls.h>
>  #include <net/busy_poll.h>
> +#include <linux/skmsg.h>
>
>  /* Track pending CMSGs. */
>  enum {
> @@ -563,6 +564,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
>
>  		if (tcp_stream_is_readable(sk, target))
>  			mask |= EPOLLIN | EPOLLRDNORM;
> +		mask |= sk_msg_poll(sk);
>
>  		if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
>  			if (__sk_stream_is_writeable(sk, 1)) {

For TCP we implement the stream_memory_read() hook, which lives in
tcp_bpf.c as tcp_bpf_stream_read(). It just checks the
psock->ingress_msg list, which should cover any redirect from skmsg
into the ingress side of another socket. And the tcp_poll() logic uses
tcp_stream_is_readable(), which checks for
sk->sk_prot->stream_memory_read() and then calls it.

The straight receive path, i.e. not redirected from a sender, should be
covered by the normal tcp_epollin_ready() checks, because this would be
after TCP does the normal updates to rcv_nxt, copied_seq, etc.

So the above is not needed in the TCP case, by my reading. Did I miss a
case? We have also done tests with Envoy, which I thought were polling,
so I'll check on that as well.

Thanks,
John
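
For reference, a rough user-space sketch of the TCP readability check as described above. It is not the kernel code: every model_* name is hypothetical, rcv_avail is a simplified stand-in for the rcv_nxt/copied_seq bookkeeping, and the real tcp_epollin_ready() has more conditions than a plain comparison. It only illustrates the ordering: normal receive path first, then the optional stream_memory_read() hook that looks at the psock ingress queue.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sk_psock: only counts queued ingress msgs. */
struct model_psock {
	int ingress_msgs;
};

struct model_sock;

/* Hypothetical stand-in for struct proto: carries the optional hook. */
struct model_proto {
	bool (*stream_memory_read)(const struct model_sock *sk);
};

struct model_sock {
	const struct model_proto *prot;
	struct model_psock *psock;	/* NULL when the socket is not in a sockmap */
	int rcv_avail;			/* simplified model of rcv_nxt - copied_seq */
};

/* Models the tcp_bpf.c hook: readable if the psock ingress queue is non-empty. */
static bool model_bpf_stream_read(const struct model_sock *sk)
{
	return sk->psock != NULL && sk->psock->ingress_msgs > 0;
}

/* Models the normal receive-path check (assumption: a plain threshold test). */
static bool model_epollin_ready(const struct model_sock *sk, int target)
{
	return sk->rcv_avail >= target;
}

/* Models the order of checks: regular TCP data first, then the hook if set. */
static bool model_stream_is_readable(const struct model_sock *sk, int target)
{
	if (model_epollin_ready(sk, target))
		return true;
	if (sk->prot->stream_memory_read)
		return sk->prot->stream_memory_read(sk);
	return false;
}

/* Plain TCP: no hook installed. */
static const struct model_proto model_tcp_prot = {
	.stream_memory_read = NULL,
};

/* TCP with sockmap/BPF attached: hook installed. */
static const struct model_proto model_tcp_bpf_prot = {
	.stream_memory_read = model_bpf_stream_read,
};
```

In this model, a socket using model_tcp_bpf_prot with one queued ingress message and rcv_avail == 0 is reported readable, which is the redirect case in question; a socket using model_tcp_prot depends only on rcv_avail.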