Re: [PATCH RFCv2 bpf-next 00/18] XDP-hints: XDP gaining access to HW offload hints via BTF

"Burakov, Anatoly" <anatoly.burakov@xxxxxxxxx> · Wed, 5 Oct 2022 14:14:35 +0100

On 04-Oct-22 10:29 AM, Jesper Dangaard Brouer wrote:

On 04/10/2022 01.55, sdf@xxxxxxxxxx wrote:
On 09/07, Jesper Dangaard Brouer wrote:
This patchset expose the traditional hardware offload hints to XDP and
rely on BTF to expose the layout to users.

Main idea is that the kernel and NIC drivers simply defines the struct
layouts they choose to use for XDP-hints. These XDP-hints structs gets
naturally and automatically described via BTF and implicitly exported to
users. NIC drivers populate and records their own BTF ID as the last
member in XDP metadata area (making it easily accessible by AF_XDP
userspace at a known negative offset from packet data start).

Naming conventions for the structs (xdp_hints_*) is used such that
userspace can find and decode the BTF layout and match against the
provided BTF IDs. Thus, no new UAPI interfaces are needed for exporting
what XDP-hints a driver supports.

The patch "i40e: Add xdp_hints_union" introduce the idea of creating a
union named "xdp_hints_union" in every driver, which contains all
xdp_hints_* struct this driver can support. This makes it easier/quicker
to find and parse the relevant BTF types.  (Seeking input before fixing
up all drivers in patchset).

The main different from RFC-v1:
  - Drop idea of BTF "origin" (vmlinux, module or local)
  - Instead to use full 64-bit BTF ID that combine object+type ID

I've taken some of Alexandr/Larysa's libbpf patches and integrated
those.

Patchset exceeds netdev usually max 15 patches rule. My excuse is three
NIC drivers (i40e, ixgbe and mvneta) gets XDP-hints support and which
required some refactoring to remove the SKB dependencies.

Hey Jesper,

I took a quick look at the series. 
Appreciate that! :-)

Do we really need the enum with the flags?

The primary reason for using enum is that these gets exposed as BTF.
The proposal is that userspace/BTF need to obtain the flags via BTF,
such that they don't become UAPI, but something we can change later.

We might eventually hit that "first 16 bits are reserved" issue?

Instead of exposing enum with the flags, why not solve it as follows:
a. We define UAPI struct xdp_rx_hints with _all_ possible hints

How can we know _all_ possible hints from the beginning(?).

UAPI + central struct dictating all possible hints, will limit innovation.

b. Each device defines much denser <device>_xdp_rx_hints struct with the
    metadata that it supports

Thus, the NIC device is limited to what is defined in UAPI struct
xdp_rx_hints.  Again this limits innovation.

c. The subset of fields in <device>_xdp_rx_hints should match the ones 
from
    xdp_rx_hints (we essentially standardize on the field names/sizes)
d. We expose <device>_xdp_rx_hints btf id via netlink for each device

For this proposed design you would still need more than one BTF ID or
<device>_xdp_rx_hints struct's, because not all packets contains all
hints. The most common case is HW timestamping, which some HW only
supports for PTP frames.

Plus, I don't see a need to expose anything via netlink, as we can just
use the existing BTF information from the module.  Thus, avoiding to
creating more UAPI.

e. libbpf will query and do offset relocations for
    xdp_rx_hints -> <device>_xdp_rx_hints at load time

Would that work? Then it seems like we can replace bitfields with the 

I used to be a fan of bitfields, until I discovered that they are bad
for performance, because compilers cannot optimize these.

following:

   if (bpf_core_field_exists(struct xdp_rx_hints, vlan_tci)) {
     /* use that hint */

Fairly often a VLAN will not be set in packets, so we still have to read
and check a bitfield/flag if the VLAN value is valid. (Guess it is
implicit in above code).

   }

All we need here is for libbpf to, again, do xdp_rx_hints ->
<device>_xdp_rx_hints translation before it evaluates 
bpf_core_field_exists()?

Thoughts? Any downsides? Am I missing something?

Well, the downside is primarily that this design limits innovation.

Each time a NIC driver want to introduce a new hardware hint, they have
to update the central UAPI xdp_rx_hints struct first.

The design in the patchset is to open for innovation.  Driver can extend
their own xdp_hints_<driver>_xxx struct(s).  They still have to land
their patches upstream, but avoid mangling a central UAPI struct. As
upstream we review driver changes and should focus on sane struct member
naming(+size) especially if this "sounds" like a hint/feature that more
driver are likely to support.  With help from BTF relocations, a new
driver can support same hint/feature if naming(+size) match (without
necessary the same offset in the struct).

Also, about the TX side: I feel like the same can be applied there,
the program works with xdp_tx_hints and libbpf will rewrite to
<device>_xdp_tx_hints. xdp_tx_hints might have fields like 
"has_tx_vlan:1";
those, presumably, can be relocatable by libbpf as well?

Good to think ahead for TX-side, even-though I think we should focus on
landing RX-side first.

I notice your naming xdp_rx_hints vs. xdp_tx_hints.  I have named the
common struct xdp_hints_common, without a RX/TX direction indication.
Maybe this is wrong of me, but my thinking was that most of the common
hints can be directly used as TX-side hints.  I'm hoping TX-side
xdp-hints will need to do little-to-non adjustment, before using the
hints as TX "instruction".  I'm hoping that XDP-redirect will just work
and xmit driver can use XDP-hints area.

Please correct me if I'm wrong.
The checksum fields hopefully translates to similar TX offload "actions".
The VLAN offload hint should translate directly to TX-side.

Like I indicated in another response, not necessarily. Rx checksum 
typically indicates that the checksumming was completed and checksum was 
good/bad, but for Tx we actually supply offsets (possibly multiple ones, 
depending on L2/L3/L4 packet, plus there's also a need to distinguish 
between packet types as different NICs will have different offload bits 
for different ptypes) in the metadata. So, while VLAN offload may or may 
not translate directly to the Tx side of things, checksumming probably 
won't.

I can easily be convinced we should name it xdp_hints_rx_common from the
start, but then I will propose that xdp_hints_tx_common have the
checksum and VLAN fields+flags at same locations, such that we don't
take any performance hint for moving them to "TX-side" hints, making
XDP-redirect just work.

--Jesper

--
Thanks,
Anatoly