Re: [PATCH bpf-next 00/10] xsk: stop softirq processing on full XSK Rx queue

Maxim Mikityanskiy <maximmi@xxxxxxxxxx> · Fri, 8 Apr 2022 15:48:44 +0300

On 2022-04-08 12:08, Maciej Fijalkowski wrote:
On Thu, Apr 07, 2022 at 01:49:02PM +0300, Maxim Mikityanskiy wrote:
On 2022-04-05 14:06, Maciej Fijalkowski wrote:
Hi!

This is a revival of Bjorn's idea [0] to break NAPI loop when XSK Rx
queue gets full which in turn makes it impossible to keep on
successfully producing descriptors to XSK Rx ring. By breaking out of
the driver side immediately we will give the user space opportunity for
consuming descriptors from XSK Rx ring and therefore provide room in the
ring so that HW Rx -> XSK Rx redirection can be done.

Maxim asked and Jesper agreed on simplifying Bjorn's original API used
for detecting the event of interest, so let's just simply check for
-ENOBUFS within Intel's ZC drivers after an attempt to redirect a buffer
to XSK Rx. No real need for redirect API extension.

Hey Maxim!

I believe some of the other comments under the old series [0] might be still
relevant:

1. need_wakeup behavior. If need_wakeup is disabled, the expected behavior
is busy-looping in NAPI, you shouldn't break out early, as the application
does not restart NAPI, and the driver restarts it itself, leading to a less
efficient loop. If need_wakeup is enabled, it should be set on ENOBUFS - I
believe this is the case here, right?

Good point. We currently set need_wakeup flag for -ENOBUFS case as it is
being done for failure == true. You are right that we shouldn't be
breaking the loop on -ENOBUFS if need_wakeup flag is not set on xsk_pool,
will fix!

2. 50/50 AF_XDP and XDP_TX mix usecase. By breaking out early, you prevent
further packets from being XDP_TXed, leading to unnecessary latency
increase. The new feature should be opt-in, otherwise such usecases suffer.

Anyone performing a lot of XDP_TX (or XDP_PASS, etc) should be using the
regular copy-mode driver, while the zero-copy driver should be used when most
packets are sent up to user-space.

You generalized that easily, but how can you be so sure that all mixed 
use cases can live with the much slower copy mode? Also, how do you 
apply your rule of thumb to the 75/25 AF_XDP/XDP_TX use case? It's both 
"a lot of XDP_TX" and "most packets are sent up to user-space" at the 
same time.

At the moment, the application is free to decide whether it wants 
zerocopy XDP_TX or zerocopy AF_XDP, depending on its needs. After your 
patchset the only valid XDP verdict on zerocopy AF_XDP basically becomes 
"XDP_REDIRECT to XSKMAP". I don't think it's valid to break an entire 
feature to speed up some very specific use case.

Moreover, in early days of AF_XDP there was an attempt to implement 
zerocopy XDP_TX on AF_XDP queues, meaning both XDP_TX and AF_XDP could 
be zerocopy. The implementation suffered from possible overflows in 
driver queues, thus wasn't upstreamed, but it's still a valid idea that 
potentially could be done if overflows are worked around somehow.

For the zero-copy driver, this opt in is not
necessary. But it sounds like a valid option for copy mode, though could we
think about the proper way as a follow up to this work?

My opinion is that the knob has to be part of initial submission, and 
the new feature should be disabled by default, otherwise we have huge 
issues with backward compatibility (if we delay it, the next update 
changes the behavior, breaking some existing use cases, and there is no 
way to work around it).

3. When the driver receives ENOBUFS, it has to drop the packet before
returning to the application. It would be better experience if your feature
saved all N packets from being dropped, not just N-1.

Sure, I'll re-run tests and see if we can omit freeing the current
xdp_buff and ntc bump, so that we would come back later on to the same
entry.

4. A slow or malicious AF_XDP application may easily cause an overflow of
the hardware receive ring. Your feature introduces a mechanism to pause the
driver while the congestion is on the application side, but no symmetric
mechanism to pause the application when the driver is close to an overflow.
I don't know the behavior of Intel NICs on overflow, but in our NICs it's
considered a critical error, that is followed by a recovery procedure, so
it's not something that should happen under normal workloads.

I'm not sure I follow on this one. Feature is about overflowing the XSK
receive ring, not the HW one, right?

Right. So we have this pipeline of buffers:

NIC--> [HW RX ring] --NAPI--> [XSK RX ring] --app--> consumes packets

Currently, when the NIC puts stuff in HW RX ring, NAPI always runs and 
drains it either to XSK RX ring or to /dev/null if XSK RX ring is full. 
The driver fulfills its responsibility to prevent overflows of HW RX 
ring. If the application doesn't consume quick enough, the frames will 
be leaked, but it's only the application's issue, the driver stays 
consistent.

After the feature, it's possible to pause NAPI from the userspace 
application, effectively disrupting the driver's consistency. I don't 
think an XSK application should have this power.

Driver picks entries from fill ring
that were produced by app, so if app is slow on producing those I believe
this would be rather an underflow of ring, we would simply receive less
frames. For HW Rx ring actually being full, I think that HW would be
dropping the incoming frames, so I don't see the real reason to treat this
as critical error that needs to go through recovery.

I'll double check regarding the hardware behavior, but it is what it is. 
If an overflow moves the queue to the fault state and requires a 
recovery, there is nothing I can do about that.

A few more thoughts I just had: mlx5e shares the same NAPI instance to 
serve all queues in a channel, that includes the XSK RQ and the regular 
RQ. The regular and XSK traffic can be configured to be isolated to 
different channels, or they may co-exist on the same channel. If they 
co-exist, and XSK asks to pause NAPI, the regular traffic will still run 
NAPI and drop 1 XSK packet per NAPI cycle, unless point 3 is fixed. It 
can also be reproduced if NAPI is woken up by XSK TX. Besides, (correct 
me if I'm wrong) your current implementation introduces extra latency to 
XSK TX if XSK RX asked to pause NAPI, because NAPI will be restarted 
anyway (by TX wakeup), and it could have been rescheduled by the kernel.

Am I missing something? Maybe I have just misunderstood you.

One might ask why it is still relevant even after having proper busy
poll support in place - here is the justification.

For xdpsock that was:
- run for l2fwd scenario,
- app/driver processing took place on the same core in busy poll
    with 2048 budget,
- HW ring sizes Tx 256, Rx 2048,

this work improved throughput by 78% and reduced Rx queue full statistic
bump by 99%.

For testing ice, make sure that you have [1] present on your side.

This set, besides the work described above, also carries also
improvements around return codes in various XSK paths and lastly a minor
optimization for xskq_cons_has_entries(), a helper that might be used
when XSK Rx batching would make it to the kernel.

Regarding error codes, I would like them to be consistent across all
drivers, otherwise all the debuggability improvements are not useful enough.
Your series only changed Intel drivers. Here also applies the backward
compatibility concern: the same error codes than were in use have been
repurposed, which may confuse some of existing applications.

I'll double check if ZC drivers are doing something unusual with return
values from xdp_do_redirect(). Regarding backward comp, I suppose you
refer only to changes in ndo_xsk_wakeup() callbacks as others are not
exposed to user space? They're not crucial to me, but it improved my
debugging experience.

Sorry if I wasn't clear enough. Yes, I meant the wakeup error codes. We 
aren't doing anything unusual with xdp_do_redirect codes (can't say for 
other drivers, though).

Last time I wanted to improve error codes returned from some BPF helpers 
(make the errors more distinguishable), my patch was blocked because of 
backward compatibility concerns. To be on the safe side (i.e. to avoid 
further bug reports from someone who actually relied on specific codes), 
you might want to use a new error code, rather than repurposing the 
existing ones.

I personally don't have objections about changing the error codes the 
way you did if you keep them consistent across all drivers, not only 
Intel ones.

FYI, your mail landed in my junk folder

That has to be something with your email server. I just sent an email to 
gmail, and it arrived to inbox. If anyone else (other than @intel.com) 
can't receive my emails, please tell me, I'll open a support ticket then.

and the links [0] [1] are messed up in
the reply you sent. And this is true even on lore.kernel.org. They suddenly
refer to "nam11.safelinks.protection.outlook.com".

I'm afraid I can't do anything with these links. Our outlook server 
mangles them in all incoming letters as soon as they arrive. The letter 
I received from you already has them "sanitized".

Maybe something worth
looking into if you have this problem in the future.

Thanks!
MF

[0]: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fbpf%2F20200904135332.60259-1-bjorn.topel%40gmail.com%2F&amp;data=04%7C01%7Cmaximmi%40nvidia.com%7Cc9cefaa3a1cd465ccdb908da16f45eaf%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637847536077594100%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=sLpTcgboo9YU55wtUtaY1%2F%2FbeiYxeWP5ubk%2FQ6X8vB8%3D&amp;reserved=0
[1]: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fnetdev%2F20220317175727.340251-1-maciej.fijalkowski%40intel.com%2F&amp;data=04%7C01%7Cmaximmi%40nvidia.com%7Cc9cefaa3a1cd465ccdb908da16f45eaf%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637847536077594100%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=OWXeZhc2Nouz%2FTMWBxvtTYbw%2FS8HWQfbfEqnVc5478k%3D&amp;reserved=0

Björn Töpel (1):
    xsk: improve xdp_do_redirect() error codes

Maciej Fijalkowski (9):
    xsk: diversify return codes in xsk_rcv_check()
    ice: xsk: terminate NAPI when XSK Rx queue gets full
    i40e: xsk: terminate NAPI when XSK Rx queue gets full
    ixgbe: xsk: terminate NAPI when XSK Rx queue gets full
    ice: xsk: diversify return values from xsk_wakeup call paths
    i40e: xsk: diversify return values from xsk_wakeup call paths
    ixgbe: xsk: diversify return values from xsk_wakeup call paths
    ice: xsk: avoid refilling single Rx descriptors
    xsk: drop ternary operator from xskq_cons_has_entries

   .../ethernet/intel/i40e/i40e_txrx_common.h    |  1 +
   drivers/net/ethernet/intel/i40e/i40e_xsk.c    | 27 +++++++++------
   drivers/net/ethernet/intel/ice/ice_txrx.h     |  1 +
   drivers/net/ethernet/intel/ice/ice_xsk.c      | 34 ++++++++++++-------
   .../ethernet/intel/ixgbe/ixgbe_txrx_common.h  |  1 +
   drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 29 ++++++++++------
   net/xdp/xsk.c                                 |  4 +--
   net/xdp/xsk_queue.h                           |  4 +--
   8 files changed, 64 insertions(+), 37 deletions(-)