Re: XDP-hints: Storing some bits in XDP metadata and others in xdp_frame/xdp-buff (was: Checksum for xdp_frame)

Magnus Karlsson <magnus.karlsson@xxxxxxxxx> · Wed, 26 May 2021 10:58:08 +0200

On Wed, May 26, 2021 at 10:29 AM Jesper Dangaard Brouer
<brouer@xxxxxxxxxx> wrote:
>
>
> (Cc. upstream bpf-list as Magnus brought up a good question and
> important AF_XDP view-point)
>
> On Tue, 25 May 2021 12:40:03 +0000
> "Karlsson, Magnus" <magnus.karlsson@xxxxxxxxx> wrote:
>
> > > -----Original Message-----
> > > From: Jesper Dangaard Brouer <brouer@xxxxxxxxxx>
> > > Subject: Fwd: Checksum for xdp_frame
> > >
> > > The important part is this commit by Ahern:
> > > [1] https://github.com/dsahern/linux/commit/b6b4d4ef9562383d8b407a873d30
> > >
> > > The patch use bitfields, which we now know is a bad idea for performance,
> > > so that needs to change.
>
> We were discussing above patch [1].  That implement CHECKSUM_UNNECESSARY
> HW indication, by storing two-bits in xdp_buff, that is later
> transferred to xdp_frame, which use them to populate skb->ip_summed.
> (Plan is for Lorenzo to continue this work).
>
> Measurements[2] show a need for this.
>  [2] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_frame01_checksum.org
>
> > [...]
> >
> > Some initial thoughts from an AF_XDP point of view. For every one of
> > these "metadata" items being it IP checksumed, Rx/Tx timestamps,
> > packet continues in next buffer, VLAN id, RSS hash, etc. we can pick
> > one of these methods:
> >
> > 1: Put it in the XDP metadata section before the packet. Good since
> > it requires no intervention to convert this to AF_XDP and it is
> > large. AF_XDP uses exactly the same metadata section as XDP. The
> > drawback here might be that we touch a new cache line unless we play
> > with headroom so that it is located on the same cache line as the
> > start of the packet (and use that too).
> >
> > 2: We put it in the struct xdp_buff which is invisible to user space
> > and therefore must be copied out to either the XDP metadata section
> > or to the __u32 options field in the AF_XDP Rx descriptor. As we add
> > stuff to the xdp_buff, this will become a scalability problem and we
> > will create a mini skb. Moreover, we only have space for 32 bits of
> > information in the AF_XDP Rx options field and in contrast to the
> > xdp_buff, we can never increase this, so AF_XDP needs to put things
> > in the XDP metadata section sooner or later. The only advantage with
> > this approach is that if we put the item in the options field, this
> > will be fast since it will very likely be in the L1 cache. But since
> > it is only 32 bits, we have to pick what goes in the options field
> > very carefully.
> >
> > One thing I think should go into the options field is the
> > multi-buffer flag in Lorenzos multi-buffer patch set since that has
> > to be checked all the time in multi-buffer mode and it has to do with
> > frames/descriptors composing a packet. (multi_buffer is a property on
> > each descriptor/frame, while ip_checksummed is a property on the
> > packet but not each descriptor.) But for all the rest, I think we
> > should use the XDP metadata field. I have not read David's mail, but
> > what is the argument for having ip_checksummed in the xdp_buff? Why
> > not any of the other metadata items that could be equally or more
> > important for my app? Putting it in the XDP metadata requires a lot
> > of plumbing before we can realize #1 so that is one good short-term
> > argument for #2. But I think we need to take the step towards XDP
> > metadata now. #2 is not a scalable approach, not even for the
> > xdp_buff. Opinions? What am I missing?
>
> Thanks for framing the issue/dilemma very accurately.
>
> My perspective is converting xdp_frame into SKB, where we *also* need
> some of these HW-hints.  This is very easy with method #2, where we
> simply extend the C-structs to contain more info, but these are fixed
> fields that add a small constant/fixed overhead.  One could argue that
> it is naturally limited by what the SKB have fields for, but AF_XDP
> also need visibility into these fields.  I'm all for going in method #1
> direction, but I don't fully know/understand how the kernel C-code can
> access fields in the BTF described XDP metadata area?
>
> In my opinion we could/can "allow" the HW-checksum "ok" indication to
> use method #2, as shown in Ahern's patch[1].  The argument is that
> almost all hardware provide this.

Agree with that. There should be a small set of standard items that
all vendors have in their NICs such as in your examples above. I might
argue that timestamp and a "least common denominator" RSS hash would
also qualify to this list of standard items. (If available, a larger
RSS hash could be accessed using the NIC specific features below.)
Regardless of the list, it makes sense to have a flags field in the
AF_XDP Rx descriptor and in the xdp_buff that quickly can be tested
for anomalies and values that are present in the XDP metadata. If this
flags field is 0, everything is fine and there is no data for this
packet in the XDP metadata section, so no need to check and lose
performance. If the field is non-zero, then we need to act. For
ip_checksummed it means that something is wrong with the checksum but
no need to check XDP metadata for that (unless we want to save one bit
and not use 2 in the flags field). For a timestamp, it means that a
timestamp is present for this packet and I would need to fetch it in
the metadata section. And so on. So I am all for a flags field (max
32-bits please so it can be put in the currently unused options field
of the AF_XDP Rx descriptor) that can be quickly tested to improve
performance.

I think it is really important that all the metadata items are
optional, that they are all off unless explicitly and individually
turned on. These items are not for free. It costs to check and fetch
them from the HW, to populate the flags field and if needed the XDP
metadata field. It might even cost to have them turned on in the HW.
If I do not need ip_checksummed, multi-buffer, timestamp, or whatever,
I should not have to pay for it. Assuming everything will be turned on
(or just one big on/off switch) is not scalable. We do not want to end
up with a "mini skb" where everything must be populated. Keep XDP fast
;-)!

> The next natural field for method #2 seems to be "rxhash32" (32-bit
> RSS-hash).  This is also something we know almost all hardware provide,
> but IMHO is would be a mistake to use method #2.  First of all OVS
> AF_XDP (vmware Cc William) have RFC-patches[3] that AF_XDP need access
> to this.  Second, keeping this 32-bit is limiting hardware, as some NIC
> hardware (Mellanox and Napatech) support a 64-bit hash that is uniquely
> identify flows.  Mellanox also support using this RX-descriptor field
> for containing the skb->mark.  Thus, the flexibility of method #1 is
> preferred. (But how do I access this during xdp_frame to SKB creation?).
>
> As Magnus also said, I (also) think we need to take the step towards XDP
> metadata (method #1) with BTF... but we need some help from BTF experts.
>
>
> [3] http://patchwork.ozlabs.org/project/openvswitch/patch/1614882425-52800-1-git-send-email-u9012063@xxxxxxxxx/
> - -
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>
> (Just to remind myself: Cache-line details about method #2 is that
> xdp_buff lives on call-stack and is cache-hot.  During redirect
> xdp_buff is converted to xdp_frame, which imply copying info into new
> memory area. The xdp_frame memory is located in top of data-frame.  The
> xdp_frame mem is prefetched in drivers to hide this cache-line miss).
>