Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Tue, 11 Jul 2023 19:37:23 -0700

On Tue, Jul 11, 2023 at 5:32 PM Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
>
> On Tue, 11 Jul 2023 15:56:57 -0700 Alexei Starovoitov wrote:
> > I think this proves my point: csum is not generalizable even across veth and mlx5.
> > Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
> > into HW that has different ideas about csum-ing.
> >
> > Here is what mlx5 does:
> > mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> >                             struct mlx5e_accel_tx_state *accel,
> >                             struct mlx5_wqe_eth_seg *eseg)
> > {
> >         if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
> >                 return;
> >
> >         if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
> >                 eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
> >                 if (skb->encapsulation) {
>
> This should be irrelevant today, as LCO exists?

Hmm. Maybe. But LCO is an example that prog devs have to be aware of
and use it properly.
Meaning for certain protocols compute outer csum LCO way and
let inner go through HW csuming.
In this case I have no idea what these mlx5 flags do.
I hope this part of the code was tested with udp tunnels.

> >                         eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
> >                                           MLX5_ETH_WQE_L4_INNER_CSUM;
> >                         sq->stats->csum_partial_inner++;
> >                 } else {
> >                         eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
> >                         sq->stats->csum_partial++;
> >                 }
> >
> > How would you generalize that into csum api that will work across NICs ?
> >
> > My answer stands: you cannot.
> >
> > My proposal again:
> > add driver specifc kfuncs and hooks for things like csum.
> >
> > Kuba,
> > since you nacked driver specific stuff please suggest a way to unblock this stalemate.
>
> I hope I'm not misremembering but I think I suggested at the beginning
> to create a structure describing packet geometry and requested offloads,
> and for the prog fill that in.

hmm. but that's what skb is for. skb == packet geometry ==
layout of headers, payload, inner vs outer, csum partial, gso params.

bpf at tc layer supposed to interact with that correctly.
If the packet is modified skb geometry should be adjusted accordingly.
Like BPF_F_RECOMPUTE_CSUM flag in bpf_skb_store_bytes().

>
> All operating systems I know end up doing that, we'll end up doing
> that as well. The question is whether we're willing to learn from
> experience or prefer to go on a wild ride first...

I don't follow. This thread was aimed to add xdp layer knobs.
To me XDP is a driver level. 'struct xdp_md' along with
BPF_F_XDP_HAS_FRAGS is the best abstraction we could do generalizing
dma-buffers (page and multi-page) that drivers operate on.
Everything else at driver level is too unique to generalize.
skb layer is already doing its job.

In that sense "generic XDP" is a feature for testing only.
Trying to make "generic XDP" fast is missing the point of XDP.

AF_XDP is a different concept. Exposing timestamp,
csum, TSO to AF_XDP users is a different design challenge.
I'm all for doing that, but trying to combine "timestamp in xdp tx"
and "timestamp in AF_XDP" will lead to bad trade-off-s for both.
Which I think this patchset demonstrates.