Stanislav Fomichev <sdf@xxxxxxxxxx> writes: > On Thu, Nov 17, 2022 at 8:59 AM Alexei Starovoitov > <alexei.starovoitov@xxxxxxxxx> wrote: >> >> On Thu, Nov 17, 2022 at 3:32 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote: >> > >> > Stanislav Fomichev <sdf@xxxxxxxxxx> writes: >> > >> > >> > Doesn't look like the descriptors are as nice as you're trying to >> > >> > paint them (with clear hash/csum fields) :-) So not sure how much >> > >> > CO-RE would help. >> > >> > At least looking at mlx4 rx_csum, the driver consults three different >> > >> > sets of flags to figure out the hash_type. Or am I just unlucky with >> > >> > mlx4? >> > >> >> > >> Which part are you talking about ? >> > >> hw_checksum = csum_unfold((__force __sum16)cqe->checksum); >> > >> is trivial enough for bpf prog to do if it has access to 'cqe' pointer >> > >> which is what John is proposing (I think). >> > > >> > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum. >> > > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch >> > > I'm assuming we want to have hash_type available to the progs? >> > >> > I agree we should expose the hash_type, but that doesn't actually look >> > to be that complicated, see below. >> > >> > > But also, check_csum handles other corner cases: >> > > - short_frame: we simply force all those small frames to skip checksum complete >> > > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from >> > > IPv6 header >> > > - get_fixed_ipv4_csum: Although the stack expects checksum which >> > > doesn't include the pseudo header, the HW adds it >> > > >> > > So it doesn't look like we can just unconditionally use cqe->checksum? >> > > The driver does a lot of massaging around that field to make it >> > > palatable. >> > >> > Poking around a bit in the other drivers, AFAICT it's only a subset of >> > drivers that support CSUM_COMPLETE at all; for instance, the Intel >> > drivers just set CHECKSUM_UNNECESSARY for TCP/UDP/SCTP. I think the >> > CHECKSUM_UNNECESSARY is actually the most important bit we'd want to >> > propagate? >> > >> > AFAICT, the drivers actually implementing CHECKSUM_COMPLETE need access >> > to other data structures than the rx descriptor to determine the status >> > of the checksum (mlx4 looks at priv->flags, mlx5 checks rq->state), so >> > just exposing the rx descriptor to BPF as John is suggesting does not >> > actually give the XDP program enough information to act on the checksum >> > field on its own. We could still have a separate kfunc to just expose >> > the hw checksum value (see below), but I think it probably needs to be >> > paired with other kfuncs to be useful. >> > >> > Looking at the mlx4 code, I think the following mapping to kfuncs (in >> > pseudo-C) would give the flexibility for XDP to access all the bits it >> > needs, while inlining everything except getting the full checksum for >> > non-TCP/UDP traffic. An (admittedly cursory) glance at some of the other >> > drivers (mlx5, ice, i40e) indicates that this would work for those >> > drivers as well. >> > >> > >> > bpf_xdp_metadata_rx_hash_supported() { >> > return dev->features & NETIF_F_RXHASH; >> > } >> > >> > bpf_xdp_metadata_rx_hash() { >> > return be32_to_cpu(cqe->immed_rss_invalid); >> > } >> > >> > bpf_xdp_metdata_rx_hash_type() { >> > if (likely(dev->features & NETIF_F_RXCSUM) && >> > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | MLX4_CQE_STATUS_UDP)) && >> > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) && >> > cqe->checksum == cpu_to_be16(0xffff)) >> > return PKT_HASH_TYPE_L4; >> > >> > return PKT_HASH_TYPE_L3; >> > } >> > >> > bpf_xdp_metadata_rx_csum_supported() { >> > return dev->features & NETIF_F_RXCSUM; >> > } >> > >> > bpf_xdp_metadata_rx_csum_level() { >> > if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | >> > MLX4_CQE_STATUS_UDP)) && >> > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) && >> > cqe->checksum == cpu_to_be16(0xffff)) >> > return CHECKSUM_UNNECESSARY; >> > >> > if (!(priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP && >> > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IP_ANY))) && >> > !short_frame(len)) >> > return CHECKSUM_COMPLETE; /* we could also omit this case entirely */ >> > >> > return CHECKSUM_NONE; >> > } >> > >> > /* this one could be called by the metadata_to_skb code */ >> > bpf_xdp_metadata_rx_csum_full() { >> > return check_csum() /* BPF_CALL this after refactoring so it is skb-agnostic */ >> > } >> > >> > /* this one would be for people like John who want to re-implement >> > * check_csum() themselves */ >> > bpf_xdp_metdata_rx_csum_raw() { >> > return cqe->checksum; >> > } >> >> Are you proposing a bunch of per-driver kfuncs that bpf prog will call. >> If so that works, but bpf prog needs to pass dev and cqe pointers >> into these kfuncs, so they need to be exposed to the prog somehow. >> Probably through xdp_md ? No, I didn't mean we should call per-driver kfuncs; the examples above were meant to be examples of what the mlx4 driver would unrolls those kfuncs to. Sorry that that wasn't clear. > So far I'm doing: > > struct mlx4_xdp_buff { > struct xdp_buff xdp; > struct mlx4_cqe *cqe; > struct mlx4_en_dev *mdev; > } > > And then the kfuncs get ctx (aka xdp_buff) as a sole argument and can > find cqe/mdev via container_of. > > If we really need these to be exposed to the program, can we use > Yonghong's approach from [0]? I don't think we should expose them to the BPF program; I like your approach of stuffing them next to the CTX pointer and de-referencing that. This makes it up to the driver which extra objects it needs, and the caller doesn't have to know/care. I'm not vehemently opposed to *also* having the rx-desc pointer directly accessible (in which case Yonghong's kfunc approach is probably fine). However, as mentioned in my previous email, I doubt how useful that descriptor itself will be... >> This way we can have both: bpf prog reading cqe fields directly >> and using kfuncs to access things. >> Inlining of kfuncs should be done generically. >> It's not a driver job to convert native asm into bpf asm. > > Ack. I can replace the unrolling with something that just resolves > "generic" kfuncs to the per-driver implementation maybe? That would at > least avoid netdev->ndo_kfunc_xxx indirect calls at runtime.. As stated above, I think we should keep the unrolling. If we end up with an actual CALL instruction for every piece of metadata that's going to suck performance-wise; unrolling is how we keep this fast enough! :) -Toke