John Fastabend wrote: > Applications can be confused slightly because we do not always return the > same error code as expected, e.g. what the TCP stack normally returns. For > example on a sock err sk->sk_err instead of returning the sock_error we > return EAGAIN. This usually means the application will 'try again' > instead of aborting immediately. Another example, when a shutdown event > is received we should immediately abort instead of waiting for data when > the user provides a timeout. > > These tend to not be fatal, applications usually recover, but introduces > bogus errors to the user or introduces unexpected latency. Before > 'c5d2177a72a16' we fell back to the TCP stack when no data was available > so we managed to catch many of the cases here, although with the extra > latency cost of calling tcp_msg_wait_data() first. > > To fix lets duplicate the error handling in TCP stack into tcp_bpf so > that we get the same error codes. > > These were found in our CI tests that run applications against sockmap > and do longer lived testing, at least compared to test_sockmap that > does short-lived ping/pong tests, and in some of our test clusters > we deploy. > > Its non-trivial to do these in a shorter form CI tests that would be > appropriate for BPF selftests, but we are looking into it so we can > ensure this keeps working going forward. As a preview one idea is to > pull in the packetdrill testing which catches some of this. > > Fixes: c5d2177a72a16 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self") > Signed-off-by: John Fastabend <john.fastabend@xxxxxxxxx> > --- Forgot to add a note, I marked this for bpf-next given we are in rc8. It is a fix though, but assume we only want critical things at this point. Anyways it applies against bpf and bpf-next so can be applied in either place. Thanks, John