Stanislav Fomichev wrote:
> On Wed, Nov 16, 2022 at 10:55 PM John Fastabend
> <john.fastabend@xxxxxxxxx> wrote:
> >
> > Stanislav Fomichev wrote:
> > > On Wed, Nov 16, 2022 at 6:59 PM Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > >
> > > > On Wed, Nov 16, 2022 at 6:53 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov
> > > > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@xxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > Stanislav Fomichev wrote:
> > > > > > > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
> > > > > > > > > <john.fastabend@xxxxxxxxx> wrote:
> > > > > > > > > >
> > > > > > > > > > Toke Høiland-Jørgensen wrote:
> > > > > > > > > > > Martin KaFai Lau <martin.lau@xxxxxxxxx> writes:
> > > > > > > > > > >
> > > > > > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > > > > > > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > > > > > > > > > > >>>>>>> +			       struct bpf_patch *patch)
> > > > > > > > > > > >>>>>>> +{
> > > > > > > > > > > >>>>>>> +	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > > > > > > > > > >>>>>>> +		/* return true; */
> > > > > > > > > > > >>>>>>> +		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > > > > > > > > > >>>>>>> +	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > > > > > > > > > > >>>>>>> +		/* return ktime_get_mono_fast_ns(); */
> > > > > > > > > > > >>>>>>> +		bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > > > > > > > > > > >>>>>>> +	}
> > > > > > > > > > > >>>>>>> +}
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples
> > > > > > > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > > > > > > > > > > >>>>>> (with those helper wrappers we were discussing before).
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > > > > > > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > > > > > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > > > > > > > > > > >>>>> point?
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the
> > > > > > > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > > > > > > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > > > > > > > > > > >>>> can manage that complexity (via helpers) before we commit to this :)
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > > > > > > > > > > >>> whether that complexity is manageable or not :-)
> > > > > > > > > > > >>> Do you want me to add those wrappers you've back without any real users?
> > > > > > > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > > > > > > > > > > >>> objections; I can maybe bring some of this back gated by some
> > > > > > > > > > > >>> static_branch to avoid the fastpath cost?
> > > > > > > > > > > >>
> > > > > > > > > > > >> I missed the context a bit what did you mean "would be good to see some
> > > > > > > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > > > > > > > > > > >> function"? In this case do you mean BPF code directly without the call?
> > > > > > > > > > > >>
> > > > > > > > > > > >> Early on I thought we should just expose the rx_descriptor which would
> > > > > > > > > > > >> be roughly the same right? (difference being code embedded in driver vs
> > > > > > > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> > > > > > > > > > > >> which wasn't as straight forward as the simpler just read it from
> > > > > > > > > > > >> the descriptor. For example in mlx getting the ts would be easy from
> > > > > > > > > > > >> BPF with the mlx4_cqe struct exposed
> > > > > > > > > > > >>
> > > > > > > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > > > > > > > > > > >> {
> > > > > > > > > > > >> 	u64 hi, lo;
> > > > > > > > > > > >> 	struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > > > > > > > > > > >>
> > > > > > > > > > > >> 	lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > > > > > > > > > > >> 	hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > > > > > > > > > > >>
> > > > > > > > > > > >> 	return hi | lo;
> > > > > > > > > > > >> }
> > > > > > > > > > > >>
> > > > > > > > > > > >> but converting that to nsec is a bit annoying,
> > > > > > > > > > > >>
> > > > > > > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > > > > > > > > > > >> 			    struct skb_shared_hwtstamps *hwts,
> > > > > > > > > > > >> 			    u64 timestamp)
> > > > > > > > > > > >> {
> > > > > > > > > > > >> 	unsigned int seq;
> > > > > > > > > > > >> 	u64 nsec;
> > > > > > > > > > > >>
> > > > > > > > > > > >> 	do {
> > > > > > > > > > > >> 		seq = read_seqbegin(&mdev->clock_lock);
> > > > > > > > > > > >> 		nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > > > > > > > > > > >> 	} while (read_seqretry(&mdev->clock_lock, seq));
> > > > > > > > > > > >>
> > > > > > > > > > > >> 	memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > > > > > > > > > > >> 	hwts->hwtstamp = ns_to_ktime(nsec);
> > > > > > > > > > > >> }
> > > > > > > > > > > >>
> > > > > > > > > > > >> I think the nsec is what you really want.
> > > > > > > > > > > >>
> > > > > > > > > > > >> With all the drivers doing slightly different ops we would have
> > > > > > > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> > > > > > > > > > > >> at least the mlx and ice drivers it looks like we would need some
> > > > > > > > > > > >> more BPF primitives/helpers. Looks like some more work is needed
> > > > > > > > > > > >> on ice driver though to get rx tstamps on all packets.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL
> > > > > > > > > > > >> and not BPF insns directly.
> > > > > > > > > > > >
> > > > > > > > > > > > Some of the mlx5 path looks like this:
> > > > > > > > > > > >
> > > > > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > > > > > > > > > > >
> > > > > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > > > > > > > > > > > 					      u64 timestamp)
> > > > > > > > > > > > {
> > > > > > > > > > > > 	u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > > > > > > > > > > >
> > > > > > > > > > > > 	return ns_to_ktime(time);
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better.
> > > > > > > > > > >
> > > > > > > > > > > Sure, but if we end up having a full function call for every field in
> > > > > > > > > > > the metadata, that will end up having a significant performance impact
> > > > > > > > > > > on the XDP data path (thinking mostly about the skb metadata case here,
> > > > > > > > > > > which will collect several bits of metadata).
> > > > > > > > > > >
> > > > > > > > > > > > csum may have a better chance to inline?
> > > > > > > > > > >
> > > > > > > > > > > Yup, I agree. Including that also makes it possible to benchmark this
> > > > > > > > > > > series against Jesper's; which I think we should definitely be doing
> > > > > > > > > > > before merging this.
> > > > > > > > > >
> > > > > > > > > > Good point I got sort of singularly focused on timestamp because I have
> > > > > > > > > > a use case for it now.
> > > > > > > > > >
> > > > > > > > > > Also hash is often sitting in the rx descriptor.
> > > > > > > > >
> > > > > > > > > Ack, let me try to add something else (that's more inline-able) on the
> > > > > > > > > rx side for a v2.
> > > > > > > >
> > > > > > > > If you go with in-kernel BPF kfunc approach (vs user space side) I think
> > > > > > > > you also need to add CO-RE to be friendly for driver developers? Otherwise
> > > > > > > > they have to keep that read in sync with the descriptors? Also need to
> > > > > > > > handle versioning of descriptors where depending on specific options
> > > > > > > > and firmware and chip being enabled the descriptor might be moving
> > > > > > > > around. Of course can push this all to developer, but seems not so
> > > > > > > > nice when we have the machinery to do this and we handle it for all
> > > > > > > > other structures.
> > > > > > > >
> > > > > > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and
> > > > > > > > expect CO-RE sorts it out based on currently running btf_id of the
> > > > > > > > descriptor. If you go through normal path you get this for free of
> > > > > > > > course.
> > > > > > >
> > > > > > > Doesn't look like the descriptors are as nice as you're trying to
> > > > > > > paint them (with clear hash/csum fields) :-) So not sure how much
> > > > > > > CO-RE would help.
> > > > > > > At least looking at mlx4 rx_csum, the driver consults three different
> > > > > > > sets of flags to figure out the hash_type. Or am I just unlucky with
> > > > > > > mlx4?
> > > > > >
> > > > > > Which part are you talking about ?
> > > > > > hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> > > > > > is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> > > > > > which is what John is proposing (I think).
> >
> > Yeah this is what I've been considering. If you just get the 'cqe' pointer
> > walking the check_sum and rxhash should be straightforward.
> >
> > There seems to be a real difference between timestamps and most other
> > fields IMO. Timestamps require some extra logic to turn into ns when
> > using the NIC hw clock. And the hw clock requires some coordination to
> > keep in sync and stop from overflowing and may be done through other
> > protocols like PTP in my use case. In some cases I think the clock is
> > also shared amongst multiple phys. Seems mlx has a seqlock_t to protect
> > it and I'm less sure about this but seems intel nic maybe has a sideband
> > control channel.
> >
> > Then there is everything else that I can think up that is per packet data
> > and requires no coordination with the driver. Its just reading fields in
> > the completion queue. This would be the csum, check_sum, vlan_header and
> > so on. Sure we could kfunc each one of those things, but could also just
> > write that directly in BPF and remove some abstractions and kernel
> > dependency by doing it directly in the BPF program. If you like that
> > abstraction seems to be the point of contention my opinion is the cost
> > of kernel depency is high and I can abstract it with a user library
> > anyways so burying it in the kernel only causes me support issues and
> > backwards compat problems.
> >
> > Hopefully, my position is more clear.
>
> Yeah, I see your point, I'm somewhat in the same position here wrt to
> legacy kernels :-)
> Exposing raw descriptors seems fine, but imo it shouldn't be the goto
> mechanism for the metadata; but rather as a fallback whenever the
> driver implementation is missing/buggy. Unless, as you mention below,
> there are some libraries in the future to abstract that.
> But at least it seems that we agree that there needs to be some other
> non-raw-descriptor way to access spinlocked things like the timestamp?
>

Yeah, for timestamps I think a kfunc to get the timestamp makes sense; it
could also be done with a kfunc to read the hw clock. Either way it seems
hard to do that in BPF code directly, so a kfunc feels right to me here.

By the way, I think if you have the completion queue (rx descriptor) in the
xdp_buff and we use Yonghong's patch to cast the ctx as a BTF type, then we
should be able to directly read all the fields as well. I see you noted this
in the response to Alexei, so let's see what he thinks.
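
To make the timestamp side concrete, below is roughly the shape I have in
mind -- just a sketch, not something from this series. xdp_buff_to_mdev()
and xdp_buff_to_cqe() are made-up accessors standing in for however the
driver ends up stashing the device and descriptor pointers, and the retry
loop is just the existing mlx4_en_fill_hwtstamps() logic quoted above:

/* Sketch only, helper names are hypothetical. The point is that the
 * seqlock/timecounter dance stays inside the driver and the BPF program
 * just gets nanoseconds back from the kfunc.
 */
noinline u64 bpf_xdp_rx_timestamp(struct xdp_buff *xdp)
{
	struct mlx4_en_dev *mdev = xdp_buff_to_mdev(xdp); /* hypothetical */
	struct mlx4_cqe *cqe = xdp_buff_to_cqe(xdp);      /* hypothetical */
	u64 cycles = mlx4_en_get_cqe_ts(cqe);
	unsigned int seq;
	u64 nsec;

	do {
		seq = read_seqbegin(&mdev->clock_lock);
		nsec = timecounter_cyc2time(&mdev->clock, cycles);
	} while (read_seqretry(&mdev->clock_lock, seq));

	return nsec;
}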
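
For the fields that are just sitting in the descriptor, assuming the ctx
cast works out and assuming a made-up ctx->rx_desc member that hands the
program a BTF-typed struct mlx4_cqe pointer, the BPF side could look
something like this (again only a sketch, not tied to this series):

/* ctx->rx_desc is a made-up field. The cast relies on the verifier
 * treating it as a BTF-typed mlx4_cqe pointer, so plain loads are
 * allowed and CO-RE keeps the field offsets in sync with the running
 * kernel's descriptor layout.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int read_rx_meta(struct xdp_md *ctx)
{
	struct mlx4_cqe *cqe = (struct mlx4_cqe *)(long)ctx->rx_desc; /* hypothetical */
	__u16 csum;
	__u32 hash;

	if (!cqe)
		return XDP_PASS;

	/* On mlx4 the rx hash sits in immed_rss_invalid and the hw checksum
	 * in checksum; both are plain reads, no locking needed.
	 */
	csum = bpf_ntohs(cqe->checksum);
	hash = bpf_ntohl(cqe->immed_rss_invalid);

	bpf_printk("rx csum=%u hash=%u", csum, hash);
	return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";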