On Wed, Jul 12, 2023 at 3:03 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Jul 12, 2023 at 11:16:04AM -0400, Willem de Bruijn wrote:
> > On Wed, Jul 12, 2023 at 1:36 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Jul 11, 2023 at 9:59 PM Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Jul 11, 2023 at 8:29 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> > > > >
> > > > >
> > > > > This will slow things down, but not to the point where it's on par
> > > > > with doing sw checksum. At least in theory.
> > > > > We can't stay at skb when using AF_XDP. AF_XDP would benefit from having
> > > > > the offloads.
> > > >
> > > > To clarify: yes, AF_XDP needs generalized HW offloads.
> > >
> > > Great! To reiterate, I'm mostly interested in af_xdp wrt tx
> > > timestamps. So if the consensus is not to mix xdp-tx and af_xdp-tx,
> > > I'm fine with switching to adding some fixed af_xdp descriptor format
> > > to enable offloads on tx.
>
> since af_xdp is a primary user let's figure out what is the best api for that.
> If any code can be salvaged for xdp tx, great, but let's not start with xdp tx
> as prerequisite.
>
> > >
> > > > I just don't see how xdp tx offloads are moving a needle in that direction.
> > >
> > > Let me try to explain how both might be similar, maybe I wasn't clear
> > > enough on that.
> > > For af_xdp tx packet, the userspace puts something in the af_xdp frame
> > > metadata area (headroom) which then gets executed/interpreted by the
> > > bpf program at devtx (which calls kfuncs to enable particular
> > > offloads).
> > > IOW, instead of defining some fixed layout for the tx offloads, the
> > > userspace and bpf program have some agreement on the layout (and bpf
> > > program "applies" the offloads by calling the kfuncs).
> > > Also (in theory) the same hooks can be used for xdp-tx.
> > > Does it make sense?
> > > But, again, happy to scratch that whole idea if
> > > we're fine with a fixed layout for af_xdp.
>
> So instead of defining csum offload format in xsk metadata we'll be
> defining it as a set of arguments to a kfunc and tx-side xsk prog
> will just copy the args from metadata into kfunc args?
> Seems like an unnecessary step. Such xsk prog won't be doing
> anything useful. Just copying from one place to another.
> It seems the only purpose of such bpf prog is to side step uapi exposure.
> bpf is not used to program anything. There won't be any control flow.
> Just odd intermediate copy step.
> Instead we can define a metadata struct for csum nic offload
> outside of uapi/linux/if_xdp.h with big 'this is not an uapi' warning.
> User space can request it via setsockopt.
> And probably feature query the nic via getsockopt.
>
> Error handling is critical here. With xsk tx prog the errors
> are messy. What to do when kfunc returns error? Store it back into
> packet metadata? And then user space needs to check every single
> packet for errors? Not practical imo.
>
> Feature query via getsockopt would be done once instead and
> user space will fill in "csum offload struct" in packet metadata
> and won't check per-packet error. If driver said the csum feature
> is available it had better work for every packet.
> Notice mlx5e_txwqe_build_eseg_csum() returns void.
>
> >
> > Checksum offload is an important demonstrator too.
> >
> > It is admittedly a non-trivial one. Checksum offload has often been
> > discussed as a pain point ("protocol ossification").
> >
> > In general, drivers can accept every CHECKSUM_COMPLETE skb that
> > matches their advertised feature NETIF_F_[HW|IP|IPV6]_CSUM. I don't
> > see why this would be different for kfuncs for packets coming from
> > userspace.
> >
> > The problematic drivers are the ones that do not implement
> > CHECKSUM_COMPLETE as intended, but ignore this simple
> > protocol-independent hint in favor of parsing from scratch, possibly
> > zeroing the field, computing multiple layers, etc.
> >
> > All of which is unnecessary with LCO. An AF_XDP user can be expected
> > to apply LCO and only request checksum insertion for the innermost
> > checksum.
> >
> > The biggest problem with these devices that parse in hardware (and
> > possibly also in the driver to identify and fix up hardware
> > limitations) is that they will fail if encountering an unknown
> > protocol. Which brings us to advertising limited typed support:
> > NETIF_F_HW_CSUM vs NETIF_F_IP_CSUM.
> >
> > The fact that some devices that deviate from industry best practices
> > cannot support more advanced packet formats is unfortunate, but not a
> > reason to hold others back. No different from current kernel path. The
> > BPF program can fall back onto software checksumming on these devices,
> > like the kernel path. Perhaps we do need to pass along with csum_start
> > and csum_off a csum_type that matches the existing
> > NETIF_F_[HW|IP|IPV6]_CSUM, to let drivers return with -EOPNOTSUPP
> > quickly if for the generic case.
> >
> > For implementation in essence it is just reordering driver code that
> > already exists for the skb case. I think the ice patch series to
> > support rx timestamping is a good indication of what it takes to
> > support XDP kfuncs: not so much new code, but reordering the driver
> > logic.
> >
> > Which also indicates to me that the driver *is* the right place to
> > implement this logic, rather than reimplement it in a BPF library. It
> > avoids both code duplication and dependency hell, if the library ships
> > independent from the driver.
>
> Agree with all of the above.
> I think defining CHECKSUM_PARTIAL struct request for af_xdp is doable and
> won't require many changes in the drivers.
> If we do it for more than one driver from the start there is a chance it
> will work for other drivers too. imo ice+gve+mlx5 would be enough.

Basically, add to AF_XDP what we already have for its predecessor
AF_PACKET: setsockopt PACKET_VNET_HDR?

Possibly with a separate new struct, rather than virtio_net_hdr. As
that has dependencies on other drivers, notably virtio and its
specification process.
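
To make the CHECKSUM_PARTIAL-style request concrete, here is a rough
userspace model in Python of what such a fixed metadata struct would ask
the device to do. It is only a sketch under stated assumptions: the
two-field layout (csum_start, csum_offset) merely mirrors the fields named
in this thread and is not an agreed format, and TX_CSUM_REQ, user_prepare,
device_insert_csum and sw_udp_csum are hypothetical names made up for
illustration. The sender seeds the UDP checksum field with the
pseudo-header sum, and the "device" sums from csum_start to the end of the
frame and stores the folded complement at csum_start + csum_offset,
without parsing any protocol headers.

```python
import struct

# Hypothetical per-packet request placed in the frame metadata area:
# two u16 fields, csum_start and csum_offset (NOT an agreed layout).
TX_CSUM_REQ = struct.Struct("<HH")

def ones_sum(data: bytes, acc: int = 0) -> int:
    """16-bit ones' complement sum of data, starting from acc."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    for (word,) in struct.iter_unpack("!H", data):
        acc += word
    while acc >> 16:  # fold carries back into the low 16 bits
        acc = (acc & 0xFFFF) + (acc >> 16)
    return acc

def user_prepare(src: bytes, dst: bytes, sport: int, dport: int,
                 payload: bytes):
    """User side: build the frame, seed the checksum field, fill the request."""
    udp_len = 8 + len(payload)
    pseudo = src + dst + struct.pack("!BBH", 0, 17, udp_len)
    seed = ones_sum(pseudo)  # uncomplemented pseudo-header sum
    udp = struct.pack("!HHHH", sport, dport, udp_len, seed) + payload
    ip_hdr = bytes(20)  # placeholder IPv4 header, contents irrelevant here
    frame = bytearray(ip_hdr + udp)
    meta = TX_CSUM_REQ.pack(len(ip_hdr), 6)  # UDP checksum is at offset 6
    return meta, frame

def device_insert_csum(meta: bytes, frame: bytearray) -> None:
    """Device side: protocol-independent checksum insertion."""
    start, offset = TX_CSUM_REQ.unpack(meta)
    csum = ~ones_sum(bytes(frame[start:])) & 0xFFFF
    # A computed zero is sent as 0xFFFF, as for UDP in the skb path.
    struct.pack_into("!H", frame, start + offset, csum or 0xFFFF)

def sw_udp_csum(src: bytes, dst: bytes, segment: bytes) -> int:
    """Reference: full software UDP checksum with the field zeroed."""
    pseudo = src + dst + struct.pack("!BBH", 0, 17, len(segment))
    zeroed = segment[:6] + b"\x00\x00" + segment[8:]
    return (~ones_sum(zeroed, ones_sum(pseudo)) & 0xFFFF) or 0xFFFF
```

Calling user_prepare() and then device_insert_csum() on the returned
frame should leave the same value in the UDP checksum field that
sw_udp_csum() computes. The point of the model is that the device step
needs nothing beyond the two offsets, which is why a one-time feature
query plus a fixed struct can stand in for per-packet error handling.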