Re: Sockmap's parser/verdict programs and epoll notifications

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Tue, 3 Oct 2023 15:18:27 -0700

On Mon, Oct 2, 2023 at 10:16 PM John Fastabend <john.fastabend@xxxxxxxxx> wrote:
>
> John Fastabend wrote:
> > Andrii Nakryiko wrote:
> > > On Mon, Sep 11, 2023 at 7:43 AM John Fastabend <john.fastabend@xxxxxxxxx> wrote:
> > > >
> > > > Andrii Nakryiko wrote:
> > > > > On Sun, Jul 16, 2023 at 9:37 PM John Fastabend <john.fastabend@xxxxxxxxx> wrote:
> > > > > >
> > > > > > Andrii Nakryiko wrote:
> > > > > > > Hey John,
> > > > > >
> > > > > > Sorry missed this while I was on PTO that week.
> > > > >
> > > > > yeah, vacations tend to cause missing things :)
> > > > >
> > > > > >
> > > > > > >
> > > > > > > We've been recently experimenting with using BPF_SK_SKB_STREAM_PARSER
> > > > > > > and BPF_SK_SKB_STREAM_VERDICT with sockmap/sockhash to perform
> > > > > > > in-kernel parsing of RSocket frames. A very simple format ([0]) where
> > > > > > > the first 3 bytes specify the size of the frame payload. The idea was
> > > > > > > to collect the entire frame in the kernel before notifying user-space
> > > > > > > that data is available. This is meant to minimize unnecessary wakeups
> > > > > > > due to incomplete logical frames, saving CPU.
> > > > > >
> > > > > > Nice.
> > > > > >
> > > > > > >
> > > > > > > You can find the BPF source code I've used at [1], it has lots of
> > > > > > > extra logging and stuff, but the idea is to read the first 3 bytes of
> > > > > > > each logical frame, and return the expected full frame size from the
> > > > > > > parser program. The verdict program always just returns SK_PASS.
> > > > > > >
> > > > > > > This seems to work exactly as expected in manual simulations of
> > > > > > > various packet size distributions, and even for a bunch of
> > > > > > > > ping/pong-like benchmarks (which are very sensitive to correct frame
> > > > > > > length determination, so I'm reasonably confident we don't screw that
> > > > > > > up much). And yet, when benchmarking sending multiple logical RPC
> > > > > > > streams over the same single socket (so many interleaving RSocket
> > > > > > > frames on single socket, but in terms of logical frames nothing should
> > > > > > > change), we often see that while full frame hasn't been accumulated in
> > > > > > > socket receive buffer yet, epoll_wait() for that socket would return
> > > > > > > with success notifying user space that there is data on socket.
> > > > > > > Subsequent recvfrom() call would immediately return -EAGAIN and no
> > > > > > > data, and our benchmark would go on this loop of useless
> > > > > > > epoll_wait()+recvfrom() calls back to back, many times over.
> > > > > >
> > > > > > Aha yes this sounds bad.
> > > > > >
> > > > > > >
> > > > > > > So I have a few questions:
> > > > > > >   - is the above use case something that was meant to be handled by
> > > > > > > sockmap+parser/verdict?
> > > > > >
> > > > > > We shouldn't wake up user space if there is nothing to read. So
> > > > > > yes this seems like a valid use case to me.
> > > > > >
> > > > > > >   - is it correct to assume that epoll won't wake up until amount of
> > > > > > > bytes requested by parser program is accumulated (this seems to be the
> > > > > > > case from manually experimenting with various "packet delays");
> > > > > >
> > > > > > Seems there is some bug that races and causes it to wake up
> > > > > > user space. I'm aware of a couple bugs in the stream parser
> > > > > > that I wanted to fix. Not sure I can get to them this week
> > > > > > but should have time next week. We have a couple more fixes
> > > > > > to resolve a few HTTPS server compliance tests as well.
> > > > > >
> > > > > > >   - is there some known bug or race in how sockmap and strparser
> > > > > > > framework interacts with epoll subsystem that could cause this weird
> > > > > > > epoll_wait() behavior?
> > > > > >
> > > > > > Yes I know of some races in strparser. I'll elaborate later
> > > > > > probably with patches as I don't recall them readily at the
> > > > > > moment.
> > > > >
> > > > > So I missed a good chunk of BPF mailing list traffic while I was on my
> > > > > PTO. Did you end up getting to these bugs in strparser logic? Should I
> > > > > try running the latest bpf-next/net-next on our production workload to
> > > > > see if this is still happening?
> > > >
> > > > You will likely still hit the error; I haven't got it out of my queue
> > > > yet. I just knocked off a couple things last week so could probably
> > > > take a look at flushing my queue this week. Then it would make sense
> > > > to retest to see if it's something new or not.
> > > >
> > > > I'll at least send an RFC with the idea even if I don't get to testing
> > > > it yet.
> > >
> > > Sounds good, thanks a lot!
> > >
> > > >
> > > > Thanks,
> > > > John
> >
> > Hi Andrii,
> >
> > Finally got around to thinking about this. And also I believe we have
> > the verdict programs mostly fixed up to handle polling correctly now.
> > The problem was incorrectly handling the tcp_sock copied_seq var
> > which is used by tcp_epollin_ready() to wake up the application. It's
> > also used to calculate responses to some ioctl we found servers using
> > to decide when to actually do a recv, e.g. they wait on the ioctl until
> > enough bytes are received.
> >
> > The trick is to ensure we only update copied_seq when the bytes are
> > in fact actually ready to read from socket queue. The sockmap verdict
> > program code was incrementing this before running the verdict prog
> > so we raced with userspace. It kind of works in many cases because
> > we are holding the sock lock in many cases so we block the user space
> > recvmsg.
> >
> > Now to your problem as I understand it. You are trying to use the
> > parser program to hold some N bytes where N is the message block.
> > At which point it will get pushed to a verdict prog and finally
> > queued in the msg receive queue so a syscall to recv*() can
> > actually read it. The parser program, unlike if you just have
> > a verdict prog, causes the skb to run through the stream parser to
> > collect bytes and then run the verdict program. The stream parser
> > is using tcp_read_sock() which increments copied_seq immediately,
> > even before the verdict prog is run, so I expect the odd behavior
> > you see is when that race completes. It likely mostly works because
> > we have the sock lock for lots of the code, making the race window
> > smaller than it might otherwise appear. I didn't do a full analysis
> > but it might just be when we hit an ENOMEM condition and need to
> > backoff. Which might explain why you only see the issue when you
> > run with larger envs.
> >
> > It feels a bit suboptimal in your case to run two BPF programs and
> > parser logic compared to a single verdict program. Could we just
> > add a bpf helper we can run from the verdict program to only wake
> > up the user space after N bytes. To mirror the sk_msg programs we
> > might call it bpf_skb_cork_bytes(skb, bytes, flags). We could use
> > flags to decide if we need to call the prog again with the new
> > full sized skb or if we can just process it directly without the
> > extra BPF call.
> >
> > This with the other piece we want from our side to allow running
> > verdict and sk_msg programs on sockets without having them in a
> > sockmap/sockhash it would seem like a better system to me. The
> > idea to drop the sockmap/sockhash is because we never remove progs
> > once they are added and we add them from sockops side. The filter
> > to sockets is almost always the port + metadata related to the
> > process or environment. This simplifies having to manage the
> > sockmap/sockhash and guess what size it should be. Sometimes we
> > overrun these maps and have to kill connections until we can
> > get more space.
> >
> > For your case I would expect it to be (a) simpler, just a single
> > program to manage instead of two and a map and (b) more efficient
> > to call one prog in datapath vs two.
> >
> > WDYT?
> >

Avoiding the need to maintain sockmap/sockhash is a win for sure, and
you are right that normally, once you attach such a special
verdict/parser program (usually by port number, which typically
identifies the service, right?), you don't detach it until the socket
is closed. So yes, absolutely, this seems like a simplification.

> > Thanks,
> > John
>
> On second thought I'll also fix the existing stream parser code here
> shortly. It's a bit broken if I just leave it as is, but I still like
> the idea of a new helper.

Yep, no matter what's the new and better approach, it would be nice to
have existing stuff behave less erratically :) Thanks for taking care
of this!
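
For anyone following along without the BPF source handy, the
frame-length logic described at the top of the thread (read the first 3
bytes, return the expected full frame size) boils down to roughly the
following. This is a plain C sketch, not the actual BPF program: the
function name and the RSOCKET_LEN_FIELD macro are made up for
illustration, and it assumes the 3-byte prefix is a big-endian payload
length that does not count the prefix itself. Returning 0 when the
prefix is incomplete mirrors the strparser convention of asking for
more data.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: size in bytes of the frame-length prefix. */
#define RSOCKET_LEN_FIELD 3

/*
 * Return the total size (length prefix + payload) of the frame that
 * starts at buf, or 0 if fewer than 3 bytes are available so far.
 *
 * Assumption: the prefix is an unsigned 24-bit big-endian integer that
 * holds the payload length only.
 */
static uint32_t rsocket_frame_size(const uint8_t *buf, size_t avail)
{
	uint32_t payload_len;

	if (avail < RSOCKET_LEN_FIELD)
		return 0; /* not enough data to know the frame size yet */

	payload_len = ((uint32_t)buf[0] << 16) |
		      ((uint32_t)buf[1] << 8) |
		      (uint32_t)buf[2];

	return RSOCKET_LEN_FIELD + payload_len;
}
```

In the parser program the same computation would run against the skb
data (with the usual bounds checks for the verifier), and the result is
what the parser returns as the expected frame size.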