Re: [RFC bpf-next 5/5] selftests/bpf: Test rx_timestamp metadata in xskxceiver

Stanislav Fomichev <sdf@xxxxxxxxxx> · Tue, 1 Nov 2022 13:12:55 -0700

On Tue, Nov 1, 2022 at 6:18 AM Jesper Dangaard Brouer
<jbrouer@xxxxxxxxxx> wrote:
>
>
>
> On 31/10/2022 18.00, Stanislav Fomichev wrote:
> > On Mon, Oct 31, 2022 at 7:22 AM Alexander Lobakin
> > <alexandr.lobakin@xxxxxxxxx> wrote:
> >>
> >> From: Stanislav Fomichev <sdf@xxxxxxxxxx>
> >> Date: Fri, 28 Oct 2022 11:46:14 -0700
> >>
> >>> On Fri, Oct 28, 2022 at 3:37 AM Jesper Dangaard Brouer
> >>> <jbrouer@xxxxxxxxxx> wrote:
> >>>>
> >>>>
> >>>> On 28/10/2022 08.22, Martin KaFai Lau wrote:
> >>>>> On 10/27/22 1:00 PM, Stanislav Fomichev wrote:
> >>>>>> Example on how the metadata is prepared from the BPF context
> >>>>>> and consumed by AF_XDP:
> >>>>>>
> >>>>>> - bpf_xdp_metadata_have_rx_timestamp to test whether it's supported;
> >>>>>>     if not, I'm assuming verifier will remove this "if (0)" branch
> >>>>>> - bpf_xdp_metadata_rx_timestamp returns a _copy_ of metadata;
> >>>>>>     the program has to bpf_xdp_adjust_meta+memcpy it;
> >>>>>>     maybe returning a pointer is better?
> >>>>>> - af_xdp consumer grabs it from data-<expected_metadata_offset> and
> >>>>>>     makes sure timestamp is not empty
> >>>>>> - when loading the program, we pass BPF_F_XDP_HAS_METADATA+prog_ifindex
> >>>>>>
> >>>>>> Cc: Martin KaFai Lau <martin.lau@xxxxxxxxx>
> >>>>>> Cc: Jakub Kicinski <kuba@xxxxxxxxxx>
> >>>>>> Cc: Willem de Bruijn <willemb@xxxxxxxxxx>
> >>>>>> Cc: Jesper Dangaard Brouer <brouer@xxxxxxxxxx>
> >>>>>> Cc: Anatoly Burakov <anatoly.burakov@xxxxxxxxx>
> >>>>>> Cc: Alexander Lobakin <alexandr.lobakin@xxxxxxxxx>
> >>>>>> Cc: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>
> >>>>>> Cc: Maryam Tahhan <mtahhan@xxxxxxxxxx>
> >>>>>> Cc: xdp-hints@xxxxxxxxxxxxxxx
> >>>>>> Cc: netdev@xxxxxxxxxxxxxxx
> >>>>>> Signed-off-by: Stanislav Fomichev <sdf@xxxxxxxxxx>
> >>>>>> ---
> >>>>>>    .../testing/selftests/bpf/progs/xskxceiver.c  | 22 ++++++++++++++++++
> >>>>>>    tools/testing/selftests/bpf/xskxceiver.c      | 23 ++++++++++++++++++-
> >>>>>>    2 files changed, 44 insertions(+), 1 deletion(-)
> >>
> >> [...]
> >>
> >>>> IMHO sizeof() should come from a struct describing data_meta area see:
> >>>>
> >>>> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62
> >>>
> >>> I guess I should've used pointers for the return type instead, something like:
> >>>
> >>> extern __u64 *bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
> >>>
> >>> {
> >>>     ...
> >>>      __u64 *rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> >>>      if (rx_timestamp) {
> >>>          bpf_xdp_adjust_meta(ctx, -(int)sizeof(*rx_timestamp));
> >>>          __builtin_memcpy(data_meta, rx_timestamp, sizeof(*rx_timestamp));
> >>>      }
> >>> }
> >>>
> >>> Does that look better?
> >>
> >> I guess it will then be resolved to a direct store, right?
> >> I mean, to smth like
> >>
> >>          if (rx_timestamp)
> >>                  *(u32 *)data_meta = *rx_timestamp;
> >>
> >> , where *rx_timestamp points directly to the Rx descriptor?
> >
> > Right. I should've used that form from the beginning, that memcpy is
> > confusing :-(
> >
> >>>
> >>>>>> +        if (ret != 0)
> >>>>>> +            return XDP_DROP;
> >>>>>> +
> >>>>>> +        data = (void *)(long)ctx->data;
> >>>>>> +        data_meta = (void *)(long)ctx->data_meta;
> >>>>>> +
> >>>>>> +        if (data_meta + sizeof(__u32) > data)
> >>>>>> +            return XDP_DROP;
> >>>>>> +
> >>>>>> +        rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> >>>>>> +        __builtin_memcpy(data_meta, &rx_timestamp, sizeof(__u32));
> >>>>
> >>>> So, this approach first stores hints on some other memory location, and
> >>>> then need to copy over information into data_meta area. That isn't good
> >>>> from a performance perspective.
> >>>>
> >>>> My idea is to store it in the final data_meta destination immediately.
> >>>
> >>> This approach doesn't have to store the hints in the other memory
> >>> location. xdp_buff->priv can point to the real hw descriptor and the
> >>> kfunc can have a bytecode that extracts the data from the hw
> >>> descriptors. For this particular RFC, we can think that 'skb' is that
> >>> hw descriptor for veth driver.
>
> Once you point xdp_buff->priv to the real hw descriptor, then we also
> need to have some additional data/pointers to NIC hardware info + HW
> setup state. You will hit some of the same challenges as John, like
> hardware/firmware revisions and chip models, that Jakub pointed out.
> Because your approach stays with the driver code, I guess it will be a
> bit easier code wise. Maybe we can store data/pointer needed for this in
> xdp_rxq_info (xdp->rxq).
>
> I would need to see some code that juggling this HW NCI state from the
> kfunc expansion to be convinced this is the right approach.

I've replied to Martin in another thread, but that's a good point. We
might need to do locks while parsing the descriptors and converting to
kernel time, so maybe assuming that everything can be unrolled won't
stand 100%. OTOH, it seems like unrolled bytecode can (with some
quirks) always call into some driver specific c function...

I'm trying to convert a couple of drivers (without really testing the
implementation) and see whether there are any other big issues.

> >>
> >> I really do think intermediate stores can be avoided with this
> >> approach.
> >> Oh, and BTW, if we plan to use a particular Hint in the BPF program
> >> only, there's no need to place it in the metadata area at all, is
> >> there? You only want to get it in your code, so just retrieve it to
> >> the stack and that's it. data_meta is only for cases when you want
> >> hints to appear in AF_XDP.
> >
> > Correct.
>
> It is not *only* AF_XDP that need data stored in data_meta.
>
> The stores data_meta are also needed for veth and cpumap, because the HW
> descriptor is "out-of-scope" and thus no-longer available.
>
>
> >
> >>>> Do notice that in my approach, the existing ethtool config setting and
> >>>> socket options (for timestamps) still apply.  Thus, each individual
> >>>> hardware hint are already configurable. Thus we already have a config
> >>>> interface. I do acknowledge, that in-case a feature is disabled it still
> >>>> takes up space in data_meta areas, but importantly it is NOT stored into
> >>>> the area (for performance reasons).
> >>>
> >>> That should be the case with this rfc as well, isn't it? Worst case
> >>> scenario, that kfunc bytecode can explicitly check ethtool options and
> >>> return false if it's disabled?
> >>
> >> (to Jesper)
> >>
> >> Once again, Ethtool idea doesn't work. Let's say you have roughly
> >> 50% of frames to be consumed by XDP, other 50% go to skb path and
> >> the stack. In skb, I want as many HW data as possible: checksums,
> >> hash and so on. Let's say in XDP prog I want only timestamp. What's
> >> then? Disable everything but stamp and kill skb path? Enable
> >> everything and kill XDP path?
> >> Stanislav's approach allows you to request only fields you need from
> >> the BPF prog directly, I don't see any reasons for adding one more
> >> layer "oh no I won't give you checksum because it's disabled via
> >> Ethtool".
> >> Maybe I get something wrong, pls explain then :P
> >
> > Agree, good point.
>
> Stanislav's (and John's proposal) is definitely focused and addressing
> something else than my patchset.
>
> I optimized the XDP-hints population (for i40e) down to 6 nanosec (on
> 3.6 GHz CPU = 21 cycles).  Plus, I added an ethtool switch to turn it
> off for those XDP users that cannot live with this overhead.  Hoping
> this would be fast-enough such that we didn't have to add this layer.
> (If XDP returns XDP_PASS then this decoded info can be used for the SKB
> creation. Thus, this is essentially just moving decoding RX-desc a bit
> earlier in the the driver).
>
> One of my use-cases is getting RX-checksum support in xdp_frame's and
> transferring this to SKB creation time.  I have done a number of
> measurements[1] to find out how much we can gain of performance for UDP
> packets (1500 bytes) with/without RX-checksum.  Initial result showed I
> saved 91 nanosec, but that was avoiding to touching data.  Doing full
> userspace UDP delivery with a copy (or copy+checksum) showed the real
> save was 54 nanosec.  In this context, the 6 nanosec was very small.
> Thus, I didn't choose to pursue a BPF layer for individual fields.
>
>   [1]
> https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_frame01_checksum.org
>
> Sure it is super cool if we can create this BPF layer that programmable
> selects individual fields from the descriptor, and maybe we ALSO need that.
> Could this layer could still be added after my patchset(?), as one could
> disable the XDP-hints (via ethtool) and then use kfuncs/kptr to extract
> only fields need by the specific XDP-prog use-case.
> Could they also co-exist(?), kfuncs/kptr could extend the
> xdp_hints_rx_common struct (in data_meta area) with more advanced
> offload-hints and then update the BTF-ID (yes, BPF can already resolve
> its own BTF-IDs from BPF-prog code).
>
> Great to see all the discussions and different oppinons :-)
> --Jesper
>