Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp

Stanislav Fomichev <sdf@xxxxxxxxxx> · Thu, 17 Nov 2022 09:51:46 -0800

On Wed, Nov 16, 2022 at 10:55 PM John Fastabend
<john.fastabend@xxxxxxxxx> wrote:
>
> Stanislav Fomichev wrote:
> > On Wed, Nov 16, 2022 at 6:59 PM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Wed, Nov 16, 2022 at 6:53 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov
> > > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > Stanislav Fomichev wrote:
> > > > > > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
> > > > > > > > <john.fastabend@xxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > Toke Høiland-Jørgensen wrote:
> > > > > > > > > > Martin KaFai Lau <martin.lau@xxxxxxxxx> writes:
> > > > > > > > > >
> > > > > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > > > > > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > > > > > > > > > >>>>>>> +                           struct bpf_patch *patch)
> > > > > > > > > > >>>>>>> +{
> > > > > > > > > > >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > > > > > > > > >>>>>>> +             /* return true; */
> > > > > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > > > > > > > > >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > > > > > > > > > >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> > > > > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > > > > > > > > > >>>>>>> +     }
> > > > > > > > > > >>>>>>> +}
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples
> > > > > > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > > > > > > > > > >>>>>> (with those helper wrappers we were discussing before).
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > > > > > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > > > > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > > > > > > > > > >>>>> point?
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the
> > > > > > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > > > > > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > > > > > > > > > >>>> can manage that complexity (via helpers) before we commit to this :)
> > > > > > > > > > >>>
> > > > > > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > > > > > > > > > >>> whether that complexity is manageable or not :-)
> > > > > > > > > > > >>> Do you want me to add those wrappers back without any real users?
> > > > > > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > > > > > > > > > >>> objections; I can maybe bring some of this back gated by some
> > > > > > > > > > >>> static_branch to avoid the fastpath cost?
> > > > > > > > > > >>
> > > > > > > > > > >> I missed the context a bit what did you mean "would be good to see some
> > > > > > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > > > > > > > > > >> function"? In this case do you mean BPF code directly without the call?
> > > > > > > > > > >>
> > > > > > > > > > >> Early on I thought we should just expose the rx_descriptor which would
> > > > > > > > > > >> be roughly the same right? (difference being code embedded in driver vs
> > > > > > > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexes
> > > > > > > > > > > >> which wasn't as straightforward as the simpler just read it from
> > > > > > > > > > >> the descriptor. For example in mlx getting the ts would be easy from
> > > > > > > > > > >> BPF with the mlx4_cqe struct exposed
> > > > > > > > > > >>
> > > > > > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > > > > > > > > > >> {
> > > > > > > > > > >>          u64 hi, lo;
> > > > > > > > > > >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > > > > > > > > > >>
> > > > > > > > > > >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > > > > > > > > > >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > > > > > > > > > >>
> > > > > > > > > > >>          return hi | lo;
> > > > > > > > > > >> }
> > > > > > > > > > >>
> > > > > > > > > > >> but converting that to nsec is a bit annoying,
> > > > > > > > > > >>
> > > > > > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > > > > > > > > > >>                              struct skb_shared_hwtstamps *hwts,
> > > > > > > > > > >>                              u64 timestamp)
> > > > > > > > > > >> {
> > > > > > > > > > >>          unsigned int seq;
> > > > > > > > > > >>          u64 nsec;
> > > > > > > > > > >>
> > > > > > > > > > >>          do {
> > > > > > > > > > >>                  seq = read_seqbegin(&mdev->clock_lock);
> > > > > > > > > > >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > > > > > > > > > >>          } while (read_seqretry(&mdev->clock_lock, seq));
> > > > > > > > > > >>
> > > > > > > > > > >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > > > > > > > > > >>          hwts->hwtstamp = ns_to_ktime(nsec);
> > > > > > > > > > >> }
> > > > > > > > > > >>
> > > > > > > > > > >> I think the nsec is what you really want.
> > > > > > > > > > >>
> > > > > > > > > > >> With all the drivers doing slightly different ops we would have
> > > > > > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> > > > > > > > > > >> at least the mlx and ice drivers it looks like we would need some
> > > > > > > > > > >> more BPF primitives/helpers. Looks like some more work is needed
> > > > > > > > > > >> on ice driver though to get rx tstamps on all packets.
> > > > > > > > > > >>
> > > > > > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL
> > > > > > > > > > >> and not BPF insns directly.
> > > > > > > > > > >
> > > > > > > > > > > Some of the mlx5 path looks like this:
> > > > > > > > > > >
> > > > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > > > > > > > > > >
> > > > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > > > > > > > > > >                                                u64 timestamp)
> > > > > > > > > > > {
> > > > > > > > > > >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > > > > > > > > > >
> > > > > > > > > > >          return ns_to_ktime(time);
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better.
> > > > > > > > > >
> > > > > > > > > > Sure, but if we end up having a full function call for every field in
> > > > > > > > > > the metadata, that will end up having a significant performance impact
> > > > > > > > > > on the XDP data path (thinking mostly about the skb metadata case here,
> > > > > > > > > > which will collect several bits of metadata).
> > > > > > > > > >
> > > > > > > > > > > csum may have a better chance to inline?
> > > > > > > > > >
> > > > > > > > > > Yup, I agree. Including that also makes it possible to benchmark this
> > > > > > > > > > series against Jesper's; which I think we should definitely be doing
> > > > > > > > > > before merging this.
> > > > > > > > >
> > > > > > > > > Good point I got sort of singularly focused on timestamp because I have
> > > > > > > > > a use case for it now.
> > > > > > > > >
> > > > > > > > > Also hash is often sitting in the rx descriptor.
> > > > > > > >
> > > > > > > > Ack, let me try to add something else (that's more inline-able) on the
> > > > > > > > rx side for a v2.
> > > > > > >
> > > > > > > If you go with in-kernel BPF kfunc approach (vs user space side) I think
> > > > > > > you also need to add CO-RE to be friendly for driver developers? Otherwise
> > > > > > > they have to keep that read in sync with the descriptors? Also need to
> > > > > > > handle versioning of descriptors where depending on specific options
> > > > > > > and firmware and chip being enabled the descriptor might be moving
> > > > > > > around. Of course can push this all to developer, but seems not so
> > > > > > > nice when we have the machinery to do this and we handle it for all
> > > > > > > other structures.
> > > > > > >
> > > > > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and
> > > > > > > expect CO-RE sorts it out based on currently running btf_id of the
> > > > > > > descriptor. If you go through normal path you get this for free of
> > > > > > > course.
> > > > > >
> > > > > > Doesn't look like the descriptors are as nice as you're trying to
> > > > > > paint them (with clear hash/csum fields) :-) So not sure how much
> > > > > > CO-RE would help.
> > > > > > At least looking at mlx4 rx_csum, the driver consults three different
> > > > > > sets of flags to figure out the hash_type. Or am I just unlucky with
> > > > > > mlx4?
> > > > >
> > > > > Which part are you talking about ?
> > > > >         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> > > > > is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> > > > > which is what John is proposing (I think).
>
> Yeah this is what I've been considering. If you just get the 'cqe' pointer
> walking the check_sum and rxhash should be straightforward.
>
> There seems to be a real difference between timestamps and most other
> fields IMO. Timestamps require some extra logic to turn into ns when
> using the NIC hw clock. And the hw clock requires some coordination to
> keep in sync and stop from overflowing and may be done through other
> protocols like PTP in my use case. In some cases I think the clock is
> also shared amongst multiple phys. Seems mlx has a seqlock_t to protect
> it and I'm less sure about this but seems intel nic maybe has a sideband
> control channel.
>
> Then there is everything else that I can think up that is per packet data
> and requires no coordination with the driver. It's just reading fields in
> the completion queue. This would be the csum, check_sum, vlan_header and
> so on. Sure we could kfunc each one of those things, but could also just
> write that directly in BPF and remove some abstractions and kernel
> dependency by doing it directly in the BPF program. Whether you like that
> abstraction seems to be the point of contention; my opinion is the cost
> of kernel dependency is high and I can abstract it with a user library
> anyways, so burying it in the kernel only causes me support issues and
> backwards compat problems.
>
> Hopefully, my position is more clear.

Yeah, I see your point, I'm somewhat in the same position here wrt
legacy kernels :-)
Exposing raw descriptors seems fine, but imo it shouldn't be the go-to
mechanism for the metadata, but rather a fallback whenever the
driver implementation is missing/buggy. Unless, as you mention below,
there are some libraries in the future to abstract that.
But at least it seems that we agree that there needs to be some other
non-raw-descriptor way to access lock-protected things like the timestamp?
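
To make the timestamp case concrete, here is a rough userspace model of
the read_seqbegin/read_seqretry loop from the mlx4_en_fill_hwtstamps
snippet quoted earlier. Everything in it is made up for illustration
(struct hw_clock, hw_clock_update, hw_clock_cyc2ns, the mult/shift
fields); it uses C11 atomics rather than the kernel's seqlock_t, and the
cycles-to-ns math is a simplified stand-in for timecounter_cyc2time. The
only point is that the conversion needs driver state that the descriptor
itself doesn't carry:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model of a NIC clock guarded by a sequence counter. */
struct hw_clock {
	atomic_uint seq;             /* odd while a writer is mid-update */
	_Atomic uint64_t cycle_last; /* cycle count at the last sync */
	_Atomic uint64_t base_ns;    /* nanoseconds at the last sync */
	uint32_t mult;               /* constant after init */
	uint32_t shift;              /* constant after init */
};

/* Writer side: what a periodic clock-sync worker would do. */
static void hw_clock_update(struct hw_clock *c, uint64_t cycles, uint64_t ns)
{
	unsigned int s = atomic_load_explicit(&c->seq, memory_order_relaxed);

	atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&c->cycle_last, cycles, memory_order_relaxed);
	atomic_store_explicit(&c->base_ns, ns, memory_order_relaxed);
	atomic_store_explicit(&c->seq, s + 2, memory_order_release);
}

/* Reader side: retry until a consistent (cycle_last, base_ns) pair is seen. */
static uint64_t hw_clock_cyc2ns(struct hw_clock *c, uint64_t cycles)
{
	unsigned int s0, s1;
	uint64_t last, base;

	assert(c->shift < 64);
	do {
		s0 = atomic_load_explicit(&c->seq, memory_order_acquire);
		last = atomic_load_explicit(&c->cycle_last, memory_order_relaxed);
		base = atomic_load_explicit(&c->base_ns, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s1 = atomic_load_explicit(&c->seq, memory_order_relaxed);
	} while ((s0 & 1) || s0 != s1);

	/* Simplified: assumes cycles >= last and no counter wraparound. */
	return base + (((cycles - last) * c->mult) >> c->shift);
}
```

A prog that only has the raw cqe gets the cycles argument but none of
the struct hw_clock state, which is why I'd still expect a call into the
driver for this one.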

> > > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
> > > > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
> > > > I'm assuming we want to have hash_type available to the progs?
> > > >
> > > > But also, check_csum handles other corner cases:
> > > > - short_frame: we simply force all those small frames to skip checksum complete
> > > > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
> > > > IPv6 header
> > > > - get_fixed_ipv4_csum: Although the stack expects checksum which
> > > > doesn't include the pseudo header, the HW adds it
> > >
> > > Of course, but kfunc won't be doing them either.
> > > We're talking XDP fast path.
> > > The mlx4 hw is old and incapable.
> > > No amount of sw can help.
>
> Doesn't this lend itself to letting the XDP BPF program write the
> BPF code to read it out? Maybe someone cares about these details
> for some cpumap thing, but the rest of us won't care; we might just
> want to read check_csum. Maybe we have an IPv6 only network or
> IPv4 network so can make further shortcuts. If a driver dev does
> this they will be forced to do the catch-all version because
> they have no way to know such details.
>
> > > > So it doesn't look like we can just unconditionally use cqe->checksum?
> > > > The driver does a lot of massaging around that field to make it
> > > > palatable.
> > >
> > > Of course we can. cqe->checksum is still usable. the bpf prog
> > > would need to know what it's reading.
> > > There should be no attempt to present a unified state of hw bits.
> > > That's what skb is for. XDP layer should not hide such hw details.
> > > Otherwise it will become a mini skb layer with all that overhead.
> >
> > I was hoping the kfunc could at least parse the flags and return some
> > pkt_hash_types-like enum to indicate what this csum covers.
> > So the users won't have to find the hardware manuals (not sure they
> > are even available?) to decipher what numbers they've got.
> > Regarding old mlx4: true, but mlx5's mlx5e_handle_csum doesn't look
> > that much different :-(
>
> The driver developers could still provide and ship the BPF libs
> with their drivers. I think if someone is going to use their NIC
> and lots of them and requires XDP it will get done. We could put
> them by the driver code mlx4.bpf or something.
>
> >
> > But going back a bit: I'm probably missing what John has been
> > suggesting. How is CO-RE relevant for kfuncs? kfuncs are already doing
> > a CO-RE-like functionality by rewriting some "public api" (kfunc) into
> > the bytecode to access the relevant data.
>
> This was maybe a bit of an aside. What I was pondering a bit out
> loud perhaps is my recollection is there may be a few different
> descriptor layouts depending on features enabled, exact device loaded
> and such. So in this case if I was a driver writer I might not want
> to hardcode the offset of the check_sum field. If I could use CO-RE
> then I don't have to care if in one version is the Nth field and later on
> someone makes it the Mth field just like any normal kernel struct.
> But through the kfunc interface I couldn't see how to get that.
> So instead of having a bunch of kfunc implementations you could just
> have one for all your device classes because you always name the
> field the same thing.
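
If I'm reading the idea right, it is roughly "resolve the field by name
against whatever layout is live, instead of hardcoding an offset". Below
is a plain C simulation of that, not real CO-RE: the two descriptor
layouts, struct field_reloc and read_field() are all hypothetical, and
the name lookup here happens at run time, whereas actual CO-RE would
patch the offset once at load time from BTF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Two hypothetical descriptor layouts for the same device family. */
struct rx_desc_v1 { uint32_t hash; uint16_t csum; uint16_t vlan; };
struct rx_desc_v2 { uint64_t ts; uint16_t vlan; uint16_t csum; uint32_t hash; };

/* Stand-in for the per-layout type info that BTF would provide. */
struct field_reloc { const char *name; size_t off; size_t size; };

static const struct field_reloc layout_v1[] = {
	{ "hash", offsetof(struct rx_desc_v1, hash), sizeof(uint32_t) },
	{ "csum", offsetof(struct rx_desc_v1, csum), sizeof(uint16_t) },
	{ NULL, 0, 0 },
};

static const struct field_reloc layout_v2[] = {
	{ "hash", offsetof(struct rx_desc_v2, hash), sizeof(uint32_t) },
	{ "csum", offsetof(struct rx_desc_v2, csum), sizeof(uint16_t) },
	{ NULL, 0, 0 },
};

/* Read a field by name from whichever layout is in use; -1 if unknown. */
static int read_field(const struct field_reloc *layout, const void *desc,
		      const char *name, uint64_t *out)
{
	assert(layout && desc && name && out);
	for (; layout->name; layout++) {
		const unsigned char *p = (const unsigned char *)desc + layout->off;
		uint16_t v16;
		uint32_t v32;

		if (strcmp(layout->name, name) != 0)
			continue;
		switch (layout->size) {
		case 2: memcpy(&v16, p, sizeof(v16)); *out = v16; return 0;
		case 4: memcpy(&v32, p, sizeof(v32)); *out = v32; return 0;
		default: return -1;
		}
	}
	return -1;
}
```

The same read_field(layout, desc, "csum", &v) call works for both
layouts, which I think is the "one implementation for all device
classes" property being described.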