On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
> A socket in a sockmap may have different combinations of programs
> attached depending on configuration. There can be no programs, in
> which case the socket acts as a sink only. There can be a TX program,
> in which case a BPF program is attached to the sending side but no RX
> program is attached. There can be an RX program only, where sends
> have no BPF program attached but receives are hooked with BPF. And
> finally, both TX and RX programs may be attached. Giving us the
> permutations:
>
>    None, Tx, Rx, and TxRx
>
> To date most of our use cases have been the TX case, used as a fast
> datapath to directly copy between a local application and a userspace
> proxy, or the Rx and TxRx cases for applications operating an
> in-kernel proxy. The traffic in the first case, where we hook
> applications into a userspace application, looks like this:
>
>     AppA  redirect   AppB
>      Tx <-----------> Rx
>      |                |
>      +                +
>     TCP <--> lo <--> TCP
>
> In this case all traffic from AppA (after 3whs) is copied into the
> AppB ingress queue and no traffic is ever on the TCP receive_queue.
>
> In the second case the application never receives traffic on the
> actual userspace socket, except in some rare error cases. Instead
> the send happens in the kernel.
>
>     AppProxy       socket pool
>         sk0 ------------->{sk1,sk2, skn}
>          ^                     |
>          |                     |
>          |                     v
>       ingress               lb egress
>         TCP                   TCP
>
> Here, because traffic is never read off the socket with the
> userspace recv() APIs, there is only ever one reader on the sk
> receive_queue, namely the BPF programs.
>
> However, we've started to introduce a third configuration where the
> BPF program on receive should process the data, but then the normal
> case is to push the data into the receive queue of AppB.
>
>              AppB
>             recv()          (userspace)
>    -----------------------
>    tcp_bpf_recvmsg()        (kernel)
>      |              |
>      |              |
>      |              |
>    ingress_msgQ     |
>      |              |
>    RX_BPF           |
>      |              |
>      v              v
>    sk->receive_queue
>
> This is different from the App{A,B} redirect because traffic is
> first received on the sk->receive_queue.
>
> Now for the issue. The tcp_bpf_recvmsg() handler first checks the
> ingress_msg queue for any data handled by the RX BPF program and
> returned with the PASS code, so that it was enqueued on the ingress
> msg queue. Then, if no data exists on that queue, it checks the
> socket receive queue. Unfortunately, this is the same receive_queue
> the BPF program is reading data off of. So we get a race. It's
> possible for the recvmsg() hook to pull data off the receive_queue
> before the BPF hook has a chance to read it. It typically happens
> when an application is banging on recv() and getting EAGAINs, until
> it manages to race with the RX BPF program.
>
> To fix this, note that before this patch, at attach time when the
> socket is loaded into the map, we check if it needs a TX program or
> just the base set of proto BPF hooks. It then uses the above general
> RX hook regardless of whether a BPF program is attached at RX or
> not. This patch extends that check to handle all cases enumerated
> above: Tx, Rx, TxRx, and none. And to fix the above race, when an RX
> program is attached we use a new hook that is nearly identical to
> the old one, except now we do not let the recv() call skip the RX
> BPF program. Now only the BPF program pulls data from
> sk->receive_queue and recv() only pulls data from the ingress msgQ
> after BPF program handling.
>
> With this resolved, our AppB from above has been up and running for
> many hours without detecting any errors. We verify this by
> correlating counters in RX BPF events and in AppB to ensure data
> never skips the BPF program. Selftests were not able to detect this
> because we only run them for a short period of time on well-ordered
> send/recvs, so we don't get any of the noise we see in real
> application environments.
>
> Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> Signed-off-by: John Fastabend <john.fastabend@xxxxxxxxx>
> ---

Acked-by: Jakub Sitnicki <jakub@xxxxxxxxxxxxxx>
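
For anyone following along, below is a minimal sketch of the kind of RX
verdict program the commit message describes. It is illustrative only and
not code from this patch; the map name, program name, and map size are
made up.

  // Minimal sketch of an sk_skb stream verdict (RX) program for a sockmap.
  // Not from the patch under review; names and sizes are illustrative.
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_SOCKMAP);
          __uint(max_entries, 16);
          __type(key, __u32);
          __type(value, __u64);
  } sock_map SEC(".maps");

  SEC("sk_skb/stream_verdict")
  int rx_verdict(struct __sk_buff *skb)
  {
          /* SK_PASS: the data is queued on the socket's ingress msg queue,
           * which tcp_bpf_recvmsg() then serves to userspace recv(). A real
           * program would parse and filter here before deciding.
           */
          return SK_PASS;
  }

  char _license[] SEC("license") = "GPL";

Userspace would attach such a program to the map with
bpf_prog_attach(prog_fd, map_fd, BPF_SK_SKB_STREAM_VERDICT, 0) and insert
established sockets via bpf_map_update_elem(). With an RX program like this
attached, the new hook from the patch makes recv() read only from the
ingress msg queue, leaving sk->receive_queue to the BPF program.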