On Tue, 20 Jun 2023 at 19:34, Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote: > > Maciej Fijalkowski <maciej.fijalkowski@xxxxxxxxx> writes: > > > From: Magnus Karlsson <magnus.karlsson@xxxxxxxxx> > > > > Add AF_XDP multi-buffer support documentation including two > > pseudo-code samples. > > > > Signed-off-by: Magnus Karlsson <magnus.karlsson@xxxxxxxxx> > > --- > > Documentation/networking/af_xdp.rst | 177 ++++++++++++++++++++++++++++ > > 1 file changed, 177 insertions(+) > > > > diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst > > index 247c6c4127e9..2b583f58967b 100644 > > --- a/Documentation/networking/af_xdp.rst > > +++ b/Documentation/networking/af_xdp.rst > > @@ -453,6 +453,93 @@ XDP_OPTIONS getsockopt > > Gets options from an XDP socket. The only one supported so far is > > XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not. > > > > +Multi-Buffer Support > > +-------------------- > > + > > +With multi-buffer support, programs using AF_XDP sockets can receive > > +and transmit packets consisting of multiple buffers both in copy and > > +zero-copy mode. For example, a packet can consist of two > > +frames/buffers, one with the header and the other one with the data, > > +or a 9K Ethernet jumbo frame can be constructed by chaining together > > +three 4K frames. > > + > > +Some definitions: > > + > > +* A packet consists of one or more frames > > + > > +* A descriptor in one of the AF_XDP rings always refers to a single > > + frame. In the case the packet consists of a single frame, the > > + descriptor refers to the whole packet. > > + > > +To enable multi-buffer support for an AF_XDP socket, use the new bind > > +flag XDP_USE_SG. If this is not provided, all multi-buffer packets > > +will be dropped just as before. Note that the XDP program loaded also > > +needs to be in multi-buffer mode. This can be accomplished by using > > +"xdp.frags" as the section name of the XDP program used. > > + > > +To represent a packet consisting of multiple frames, a new flag called > > +XDP_PKT_CONTD is introduced in the options field of the Rx and Tx > > +descriptors. If it is true (1) the packet continues with the next > > +descriptor and if it is false (0) it means this is the last descriptor > > +of the packet. Why the reverse logic of end-of-packet (eop) flag found > > +in many NICs? Just to preserve compatibility with non-multi-buffer > > +applications that have this bit set to false for all packets on Rx, > > +and the apps set the options field to zero for Tx, as anything else > > +will be treated as an invalid descriptor. > > + > > +These are the semantics for producing packets onto AF_XDP Tx ring > > +consisting of multiple frames: > > + > > +* When an invalid descriptor is found, all the other > > + descriptors/frames of this packet are marked as invalid and not > > + completed. The next descriptor is treated as the start of a new > > + packet, even if this was not the intent (because we cannot guess > > + the intent). As before, if your program is producing invalid > > + descriptors you have a bug that must be fixed. > > + > > +* Zero length descriptors are treated as invalid descriptors. > > + > > +* For copy mode, the maximum supported number of frames in a packet is > > + equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all > > + descriptors accumulated so far are dropped and treated as > > + invalid. To produce an application that will work on any system > > + regardless of this config setting, limit the number of frags to 18, > > + as the minimum value of the config is 17. > > + > > +* For zero-copy mode, the limit is up to what the NIC HW > > + supports. Usually at least five on the NICs we have checked. We > > + consciously chose to not enforce a rigid limit (such as > > + CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have > > + resulted in copy actions under the hood to fit into what limit > > + the NIC supports. Kind of defeats the purpose of zero-copy mode. > > How is an application supposed to discover the actual limit for a given > NIC/driver? Thanks for your comments Toke. I will add an example here of how to discover this. Basically you can send a packet with N frags (N = 2 to start with), check the error stats through the getsockopt. If no invalid_tx_desc error, increase N with one and send this new packet. If you get an error, then the max number of frags is N-1. > > +* The ZC batch API guarantees that it will provide a batch of Tx > > + descriptors that ends with full packet at the end. If not, ZC > > + drivers would have to gather the full packet on their side. The > > + approach we picked makes ZC drivers' life much easier (at least on > > + Tx side). > > This seems like it implies some constraint on how an application can use > the APIs, but it's not quite clear to me what those constraints are, nor > what happens if an application does something different. This should > probably be spelled out... > > > +On the Rx path in copy-mode, the xsk core copies the XDP data into > > +multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as > > +detailed before. Zero-copy mode works the same, though the data is not > > +copied. When the application gets a descriptor with the XDP_PKT_CONTD > > +flag set to one, it means that the packet consists of multiple buffers > > +and it continues with the next buffer in the following > > +descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it > > +means that this is the last buffer of the packet. AF_XDP guarantees > > +that only a complete packet (all frames in the packet) is sent to the > > +application. > > In light of the comment on batch size below, I think it would be useful > to spell out what this means exactly. IIUC correctly, it means that the > kernel will check the ringbuffer before starting to copy the data, and > if there are not enough descriptors available, it will drop the packet > instead of doing a partial copy, right? And this is the case for both ZC > and copy mode? I will make this paragraph and the previous one clearer. And yes, copy mode and zc mode have the same behaviour. > > +If application reads a batch of descriptors, using for example the libxdp > > +interfaces, it is not guaranteed that the batch will end with a full > > +packet. It might end in the middle of a packet and the rest of the > > +buffers of that packet will arrive at the beginning of the next batch, > > +since the libxdp interface does not read the whole ring (unless you > > +have an enormous batch size or a very small ring size). > > + > > +An example program each for Rx and Tx multi-buffer support can be found > > +later in this document. > > + > > Usage > > ===== > > > > @@ -532,6 +619,96 @@ like this: > > But please use the libbpf functions as they are optimized and ready to > > use. Will make your life easier. > > > > +Usage Multi-Buffer Rx > > +===================== > > + > > +Here is a simple Rx path pseudo-code example (using libxdp interfaces > > +for simplicity). Error paths have been excluded to keep it short: > > + > > +.. code-block:: c > > + > > + void rx_packets(struct xsk_socket_info *xsk) > > + { > > + static bool new_packet = true; > > + u32 idx_rx = 0, idx_fq = 0; > > + static char *pkt; > > + > > + int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx); > > + > > + xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq); > > + > > + for (int i = 0; i < rcvd; i++) { > > + struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++); > > + char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr); > > + bool eop = !(desc->options & XDP_PKT_CONTD); > > + > > + if (new_packet) > > + pkt = frag; > > + else > > + add_frag_to_pkt(pkt, frag); > > + > > + if (eop) > > + process_pkt(pkt); > > + > > + new_packet = eop; > > + > > + *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr; > > Indentation is off here... Will fix. > > -Toke >