Re: [PATCH v4 bpf-next 15/22] xsk: add multi-buffer documentation

Magnus Karlsson <magnus.karlsson@xxxxxxxxx> · Wed, 21 Jun 2023 10:06:00 +0200

On Tue, 20 Jun 2023 at 19:34, Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>
> Maciej Fijalkowski <maciej.fijalkowski@xxxxxxxxx> writes:
>
> > From: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>
> >
> > Add AF_XDP multi-buffer support documentation including two
> > pseudo-code samples.
> >
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>
> > ---
> >  Documentation/networking/af_xdp.rst | 177 ++++++++++++++++++++++++++++
> >  1 file changed, 177 insertions(+)
> >
> > diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> > index 247c6c4127e9..2b583f58967b 100644
> > --- a/Documentation/networking/af_xdp.rst
> > +++ b/Documentation/networking/af_xdp.rst
> > @@ -453,6 +453,93 @@ XDP_OPTIONS getsockopt
> >  Gets options from an XDP socket. The only one supported so far is
> >  XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
> >
> > +Multi-Buffer Support
> > +--------------------
> > +
> > +With multi-buffer support, programs using AF_XDP sockets can receive
> > +and transmit packets consisting of multiple buffers both in copy and
> > +zero-copy mode. For example, a packet can consist of two
> > +frames/buffers, one with the header and the other one with the data,
> > +or a 9K Ethernet jumbo frame can be constructed by chaining together
> > +three 4K frames.
> > +
> > +Some definitions:
> > +
> > +* A packet consists of one or more frames
> > +
> > +* A descriptor in one of the AF_XDP rings always refers to a single
> > +  frame. In the case the packet consists of a single frame, the
> > +  descriptor refers to the whole packet.
> > +
> > +To enable multi-buffer support for an AF_XDP socket, use the new bind
> > +flag XDP_USE_SG. If this is not provided, all multi-buffer packets
> > +will be dropped just as before. Note that the XDP program loaded also
> > +needs to be in multi-buffer mode. This can be accomplished by using
> > +"xdp.frags" as the section name of the XDP program used.
> > +
> > +To represent a packet consisting of multiple frames, a new flag called
> > +XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
> > +descriptors. If it is true (1) the packet continues with the next
> > +descriptor and if it is false (0) it means this is the last descriptor
> > +of the packet. Why the reverse logic of end-of-packet (eop) flag found
> > +in many NICs? Just to preserve compatibility with non-multi-buffer
> > +applications that have this bit set to false for all packets on Rx,
> > +and the apps set the options field to zero for Tx, as anything else
> > +will be treated as an invalid descriptor.
> > +
> > +These are the semantics for producing packets onto AF_XDP Tx ring
> > +consisting of multiple frames:
> > +
> > +* When an invalid descriptor is found, all the other
> > +  descriptors/frames of this packet are marked as invalid and not
> > +  completed. The next descriptor is treated as the start of a new
> > +  packet, even if this was not the intent (because we cannot guess
> > +  the intent). As before, if your program is producing invalid
> > +  descriptors you have a bug that must be fixed.
> > +
> > +* Zero length descriptors are treated as invalid descriptors.
> > +
> > +* For copy mode, the maximum supported number of frames in a packet is
> > +  equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
> > +  descriptors accumulated so far are dropped and treated as
> > +  invalid. To produce an application that will work on any system
> > +  regardless of this config setting, limit the number of frags to 18,
> > +  as the minimum value of the config is 17.
> > +
> > +* For zero-copy mode, the limit is up to what the NIC HW
> > +  supports. Usually at least five on the NICs we have checked. We
> > +  consciously chose to not enforce a rigid limit (such as
> > +  CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
> > +  resulted in copy actions under the hood to fit into what limit
> > +  the NIC supports. Kind of defeats the purpose of zero-copy mode.
>
> How is an application supposed to discover the actual limit for a given
> NIC/driver?

Thanks for your comments Toke. I will add an example here of how to
discover this. Basically you can send a packet with N frags (N = 2 to
start with), check the error stats through the getsockopt. If no
invalid_tx_desc error, increase N with one and send this new packet.
If you get an error, then the max number of frags is N-1.

> > +* The ZC batch API guarantees that it will provide a batch of Tx
> > +  descriptors that ends with full packet at the end. If not, ZC
> > +  drivers would have to gather the full packet on their side. The
> > +  approach we picked makes ZC drivers' life much easier (at least on
> > +  Tx side).
>
> This seems like it implies some constraint on how an application can use
> the APIs, but it's not quite clear to me what those constraints are, nor
> what happens if an application does something different. This should
> probably be spelled out...
>
> > +On the Rx path in copy-mode, the xsk core copies the XDP data into
> > +multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
> > +detailed before. Zero-copy mode works the same, though the data is not
> > +copied. When the application gets a descriptor with the XDP_PKT_CONTD
> > +flag set to one, it means that the packet consists of multiple buffers
> > +and it continues with the next buffer in the following
> > +descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
> > +means that this is the last buffer of the packet. AF_XDP guarantees
> > +that only a complete packet (all frames in the packet) is sent to the
> > +application.
>
> In light of the comment on batch size below, I think it would be useful
> to spell out what this means exactly. IIUC correctly, it means that the
> kernel will check the ringbuffer before starting to copy the data, and
> if there are not enough descriptors available, it will drop the packet
> instead of doing a partial copy, right? And this is the case for both ZC
> and copy mode?

I will make this paragraph and the previous one clearer. And yes, copy
mode and zc mode have the same behaviour.

> > +If application reads a batch of descriptors, using for example the libxdp
> > +interfaces, it is not guaranteed that the batch will end with a full
> > +packet. It might end in the middle of a packet and the rest of the
> > +buffers of that packet will arrive at the beginning of the next batch,
> > +since the libxdp interface does not read the whole ring (unless you
> > +have an enormous batch size or a very small ring size).
> > +
> > +An example program each for Rx and Tx multi-buffer support can be found
> > +later in this document.
> > +
> >  Usage
> >  =====
> >
> > @@ -532,6 +619,96 @@ like this:
> >  But please use the libbpf functions as they are optimized and ready to
> >  use. Will make your life easier.
> >
> > +Usage Multi-Buffer Rx
> > +=====================
> > +
> > +Here is a simple Rx path pseudo-code example (using libxdp interfaces
> > +for simplicity). Error paths have been excluded to keep it short:
> > +
> > +.. code-block:: c
> > +
> > +    void rx_packets(struct xsk_socket_info *xsk)
> > +    {
> > +        static bool new_packet = true;
> > +        u32 idx_rx = 0, idx_fq = 0;
> > +        static char *pkt;
> > +
> > +        int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
> > +
> > +        xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
> > +
> > +        for (int i = 0; i < rcvd; i++) {
> > +            struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
> > +            char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
> > +            bool eop = !(desc->options & XDP_PKT_CONTD);
> > +
> > +        if (new_packet)
> > +            pkt = frag;
> > +        else
> > +            add_frag_to_pkt(pkt, frag);
> > +
> > +        if (eop)
> > +            process_pkt(pkt);
> > +
> > +        new_packet = eop;
> > +
> > +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
>
> Indentation is off here...

Will fix.

>
> -Toke
>