Re: [PATCH v2 bpf-next 00/14] xsk: stop NAPI Rx processing on full XSK RQ

Maxim Mikityanskiy <maximmi@xxxxxxxxxx> · Thu, 25 Aug 2022 14:42:25 +0000

On Tue, 2022-08-23 at 22:18 -0700, John Fastabend wrote:
> Maxim Mikityanskiy wrote:
> > On Tue, 2022-08-23 at 13:24 +0200, Maciej Fijalkowski wrote:
> > > On Tue, Aug 23, 2022 at 09:49:43AM +0000, Maxim Mikityanskiy wrote:
> > > > Anyone from Intel? Maciej, Björn, Magnus?
> > > 
> > > Hey Maxim,
> > > 
> > > how about keeping it simple and going with option 1? This behavior was
> > > even proposed in the v1 submission of the patch set we're talking about.
> > 
> > Yeah, I know it was the behavior in v1. It was me who suggested not
> > dropping that packet, and I didn't realize back then that it had this
> > undesired side effect - sorry for that!
> 
> Just want to reiterate what was said originally, you'll definately confuse
> our XDP programs if they ever saw the same pkt twice. It would confuse
> metrics and any "tap" and so on.
> 
> > 
> > Option 1 sounds good to me as the first remedy, we can start with that.
> > 
> > However, it's not perfect: when NAPI and the application are pinned to
> > the same core, if the fill ring is bigger than the RX ring (which makes
> > sense in case of multiple sockets on the same UMEM), the driver will
> > constantly get into this condition, drop one packet, yield to
> > userspace, the application will of course clean up the RX ring, but
> > then the process will repeat.
> 
> Maybe dumb question haven't followed the entire thread or at least
> don't recall it. Could you yield when you hit a high water mark at
> some point before pkt drop?

That would be an ideal solution, but it doesn't sound feasible to me.
There may be multiple XSK sockets on the same RX queue, and each socket
has its own RX ring, which can be full or have some space. We don't
know in advance which RX ring will be chosen by the XDP program (if any
at all; the XDP program can still drop, pass or retransmit the packet),
so we can't check the watermark on a specific ring before running XDP.

It may be possible to iterate over all sockets and stop the processing
if any socket's RX ring is full, but that would be a heavy thing to do
per packet. It should be possible to optimize it to run once per NAPI
cycle, checking that each RX ring has enough space to fit the whole
NAPI budget, but I'm still not sure of performance implications, and it
sounds overly protective.

> 
> > 
> > That means, we'll always have a small percentage of packets dropped,
> > which may trigger the congestion control algorithms on the other side,
> > slowing down the TX to unacceptable speeds (because packet drops won't
> > disappear after slowing down just a little).
> > 
> > Given the above, we may need a more complex solution for the long term.
> > What do you think?
> > 
> > Also, if the application uses poll(), this whole logic (either v1 or
> > v2) seems not needed, because poll() returns to the application when
> > something becomes available in the RX ring, but I guess the reason for
> > adding it was that fantastic 78% performance improvement mentioned in
> > the cover letter?