Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs

Willem de Bruijn <willemdebruijn.kernel@xxxxxxxxx> · Wed, 12 Jul 2023 15:11:23 -0400

On Wed, Jul 12, 2023 at 3:03 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Jul 12, 2023 at 11:16:04AM -0400, Willem de Bruijn wrote:
> > On Wed, Jul 12, 2023 at 1:36 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Jul 11, 2023 at 9:59 PM Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Jul 11, 2023 at 8:29 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> > > > >
> > > > >
> > > > > This will slow things down, but not to the point where it's on par
> > > > > with doing sw checksum. At least in theory.
> > > > > We can't stay at skb when using AF_XDP. AF_XDP would benefit from having
> > > > > the offloads.
> > > >
> > > > To clarify: yes, AF_XDP needs generalized HW offloads.
> > >
> > > Great! To reiterate, I'm mostly interested in af_xdp wrt tx
> > > timestamps. So if the consensus is not to mix xdp-tx and af_xdp-tx,
> > > I'm fine with switching to adding some fixed af_xdp descriptor format
> > > to enable offloads on tx.
>
> since af_xdp is a primary user let's figure out what is the best api for that.
> If any code can be salvaged for xdp tx, great, but let's not start with xdp tx
> as prerequisite.
>
> > >
> > > > I just don't see how xdp tx offloads are moving the needle in that direction.
> > >
> > > Let me try to explain how both might be similar, maybe I wasn't clear
> > > enough on that.
> > > For af_xdp tx packet, the userspace puts something in the af_xdp frame
> > > metadata area (headroom) which then gets executed/interpreted by the
> > > bpf program at devtx (which calls kfuncs to enable particular
> > > offloads).
> > > IOW, instead of defining some fixed layout for the tx offloads, the
> > > userspace and bpf program have some agreement on the layout (and bpf
> > > program "applies" the offloads by calling the kfuncs).
> > > Also (in theory) the same hooks can be used for xdp-tx.
> > > Does it make sense? But, again, happy to scratch that whole idea if
> > > we're fine with a fixed layout for af_xdp.
>
> So instead of defining csum offload format in xsk metadata we'll
> be defining it as a set of arguments to a kfunc and tx-side xsk prog
> will just copy the args from metadata into kfunc args ?
> Seems like an unnecessary step. Such xsk prog won't be doing
> anything useful. Just copying from one place to another.
> It seems the only purpose of such bpf prog is to side step uapi exposure.
> bpf is not used to program anything. There won't be any control flow.
> Just odd intermediate copy step.
> Instead we can define a metadata struct for csum nic offload
> outside of uapi/linux/if_xdp.h with big 'this is not an uapi' warning.
> User space can request it via setsockopt.
> And probably feature query the nic via getsockopt.
>
> Error handling is critical here. With xsk tx prog the errors
> are messy. What to do when kfunc returns error? Store it back into
> packet metadata ? and then user space needs to check every single
> packet for errors? Not practical imo.
>
> Feature query via getsockopt would be done once instead and
> user space will fill in "csum offload struct" in packet metadata
> and won't check per-packet error. If driver said the csum feature
> is available it had better work for every packet.
> Notice mlx5e_txwqe_build_eseg_csum() returns void.
>
> >
> > Checksum offload is an important demonstrator too.
> >
> > It is admittedly a non-trivial one. Checksum offload has often been
> > discussed as a pain point ("protocol ossification").
> >
> > In general, drivers can accept every CHECKSUM_PARTIAL skb that
> > matches their advertised feature NETIF_F_[HW|IP|IPV6]_CSUM. I don't
> > see why this would be different for kfuncs for packets coming from
> > userspace.
> >
> > The problematic drivers are the ones that do not implement
> > CHECKSUM_PARTIAL as intended, but ignore this simple
> > protocol-independent hint in favor of parsing from scratch, possibly
> > zeroing the field, computing multiple layers, etc.
> >
> > All of which is unnecessary with LCO. An AF_XDP user can be expected
> > to apply LCO and only request checksum insertion for the innermost
> > checksum.
> >
> > The biggest problem with these devices that parse in hardware (and
> > possibly also in the driver to identify and fix up hardware
> > limitations) is that they will fail if they encounter an unknown
> > protocol. Which brings us to advertising limited typed support:
> > NETIF_F_HW_CSUM vs NETIF_F_IP_CSUM.
> >
> > The fact that some devices that deviate from industry best practices
> > cannot support more advanced packet formats is unfortunate, but not a
> > reason to hold others back. No different from current kernel path. The
> > BPF program can fall back to software checksumming on these devices,
> > like the kernel path. Perhaps we do need to pass along with csum_start
> > and csum_off a csum_type that matches the existing
> > NETIF_F_[HW|IP|IPV6]_CSUM, to let drivers return -EOPNOTSUPP
> > quickly if they cannot handle the generic case.
> >
> > For implementation in essence it is just reordering driver code that
> > already exists for the skb case. I think the ice patch series to
> > support rx timestamping is a good indication of what it takes to
> > support XDP kfuncs: not so much new code, but reordering the driver
> > logic.
> >
> > Which also indicates to me that the driver *is* the right place to
> > implement this logic, rather than reimplement it in a BPF library. It
> > avoids both code duplication and dependency hell, if the library ships
> > independent from the driver.
>
> Agree with all of the above.
> I think defining CHECKSUM_PARTIAL struct request for af_xdp is doable and
> won't require many changes in the drivers.
> If we do it for more than one driver from the start there is a chance it
> will work for other drivers too. imo ice+gve+mlx5 would be enough.

Basically, add to AF_XDP what we already have for its predecessor
AF_PACKET: setsockopt PACKET_VNET_HDR?
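
For reference, a minimal sketch of what that looks like on AF_PACKET today.
The helper names (enable_vnet_hdr, fill_vnet_hdr_csum) and the
UDP-over-IPv4 offsets are illustration only; the setsockopt option and
struct virtio_net_hdr are the existing uapi. Fields are written in host
byte order here, which I believe is what the packet socket expects.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/virtio_net.h>

#ifndef SOL_PACKET
#define SOL_PACKET 263
#endif

/* Ask the packet socket to expect a struct virtio_net_hdr in front of
 * every frame. Needs an AF_PACKET socket, hence CAP_NET_RAW. */
static int enable_vnet_hdr(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_PACKET, PACKET_VNET_HDR, &one, sizeof(one));
}

/* Fill the header with a CHECKSUM_PARTIAL-style request: checksum from
 * csum_start to the end of the packet, store the result at
 * csum_start + csum_offset. No segmentation offload is requested. */
static void fill_vnet_hdr_csum(struct virtio_net_hdr *hdr,
			       uint16_t csum_start, uint16_t csum_offset)
{
	memset(hdr, 0, sizeof(*hdr));
	hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
	hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
	hdr->csum_start = csum_start;
	hdr->csum_offset = csum_offset;
}
```

For UDP over IPv4 without options on Ethernet that would be
fill_vnet_hdr_csum(&hdr, 14 + 20, 6): the checksum starts at the UDP
header and lands in its check field.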

Possibly with a separate new struct rather than virtio_net_hdr, as
that has dependencies on other drivers, notably virtio and its
specification process.
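
To make that concrete, a rough userspace sketch of such a struct, stored
in the frame headroom directly before the packet. Everything here is
hypothetical: the struct name, its two fields, the placement in headroom
and the helper names are assumptions for discussion, not an existing or
proposed kernel layout.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* HYPOTHETICAL per-packet tx checksum request, mirroring the
 * CHECKSUM_PARTIAL pair of offsets. Not a kernel structure. */
struct xsk_tx_csum_req {
	uint16_t csum_start;	/* offset where checksumming begins */
	uint16_t csum_offset;	/* offset from csum_start to the 16-bit field */
};

/* The 16-bit checksum field must lie entirely inside the packet. */
static bool xsk_tx_csum_req_valid(const struct xsk_tx_csum_req *req,
				  size_t pkt_len)
{
	return (size_t)req->csum_start + req->csum_offset +
	       sizeof(uint16_t) <= pkt_len;
}

/* Write the request into the headroom that ends where the packet
 * starts (frame + headroom). Returns false if the headroom is too
 * small or the offsets do not fit the packet. */
static bool xsk_tx_csum_req_write(uint8_t *frame, size_t headroom,
				  size_t pkt_len, uint16_t csum_start,
				  uint16_t csum_offset)
{
	struct xsk_tx_csum_req req = {
		.csum_start = csum_start,
		.csum_offset = csum_offset,
	};

	if (headroom < sizeof(req) || !xsk_tx_csum_req_valid(&req, pkt_len))
		return false;

	memcpy(frame + headroom - sizeof(req), &req, sizeof(req));
	return true;
}
```

The only per-packet check is done in userspace before queueing, which
fits the idea of a one-time feature query instead of per-packet error
reporting.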