Andy Gospodarek <andrew.gospodarek@xxxxxxxxxxxx> writes:

> On Thu, Jan 05, 2023 at 04:43:28PM +0100, Toke Høiland-Jørgensen wrote:
>> Tariq Toukan <ttoukan.linux@xxxxxxxxx> writes:
>>
>> > On 04/01/2023 14:28, Toke Høiland-Jørgensen wrote:
>> >> Lorenzo Bianconi <lorenzo@xxxxxxxxxx> writes:
>> >>
>> >>>> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote:
>> >>>>> Hmm, good question! I don't think we've ever explicitly documented any
>> >>>>> assumptions one way or the other. My own mental model has certainly
>> >>>>> always assumed the first frag would continue to be the same size as in
>> >>>>> non-multi-buf packets.
>> >>>>
>> >>>> Interesting! :) My mental model was closer to GRO by frags
>> >>>> so the linear part would have no data, just headers.
>> >>>
>> >>> That is my assumption as well.
>> >>
>> >> Right, okay, so how many headers? Only Ethernet, or all the way up to
>> >> L4 (TCP/UDP)?
>> >>
>> >> I do seem to recall a discussion around the header/data split for TCP
>> >> specifically, but I think I mentally put that down as "something people
>> >> may want to do at some point in the future", which is why it hasn't made
>> >> it into my own mental model (yet?) :)
>> >>
>> >> -Toke
>> >>
>> >
>> > I don't think that all the different GRO layers assume having their
>> > headers/data in the linear part. IMO they will just perform better if
>> > these parts are already there. Otherwise, the GRO flow manages, and
>> > pulls the needed amount into the linear part.
>> > As examples, see calls to gro_pull_from_frag0 in net/core/gro.c, and the
>> > call to pskb_may_pull() from skb_gro_header_slow().
>> >
>> > This resembles the bpf_xdp_load_bytes() API used here in the xdp prog.
>>
>> Right, but that is kernel code; what we end up doing with the API here
>> affects how many programs need to make significant changes to work with
>> multibuf, and how many can just set the frags flag and continue working.
>> Which also has a performance impact, see below.
>>
>> > The context of my questions is that I'm looking for the right memory
>> > scheme for adding xdp-mb support to mlx5e striding RQ.
>> > In striding RQ, the RX buffer consists of "strides" of a fixed size set
>> > by the driver. An incoming packet is written to the buffer starting from
>> > the beginning of the next available stride, consuming as many strides as
>> > needed.
>> >
>> > Due to the need for headroom and tailroom, there's no easy way of
>> > building the xdp_buf in place (around the packet), so it should go to a
>> > side buffer.
>> >
>> > By using a 0-length linear part in a side buffer, I can address two
>> > challenging issues: (1) save the in-driver headers memcpy (a copy might
>> > still exist in the xdp program though), and (2) conform to the
>> > "fragments of the same size" requirement/assumption in xdp-mb.
>> > Otherwise, if we pull from frag[0] into the linear part, frag[0] becomes
>> > smaller than the next fragments.
>>
>> Right, I see.
>>
>> So my main concern would be that if we "allow" this, the only way to
>> write an interoperable XDP program will be to use bpf_xdp_load_bytes()
>> for every packet access. Which will be slower than DPA, so we may end up
>> inadvertently slowing down all of the XDP ecosystem, because no one is
>> going to bother with writing two versions of their programs.
>> Whereas if you can rely on packet headers always being in the linear
>> part, you can write a lot of the "look at headers and make a decision"
>> type programs using just DPA, and they'll work for multibuf as well.
>
> The question I would have is what is really the 'slow down' for
> bpf_xdp_load_bytes() vs DPA? I know you and Jesper can tell me how many
> instructions each use. :)

I can try running some benchmarks to compare the two, sure! A rough
sketch of the two access patterns is included below for reference.

-Toke
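
For reference, a minimal, hypothetical sketch of the two access patterns
being compared (illustrative only, not code from this thread): direct
packet access against ctx->data / ctx->data_end, which only works while
the headers sit in the linear part, and bpf_xdp_load_bytes(), which copies
the bytes out regardless of which fragment they live in. It assumes a
frags-aware program (libbpf's "xdp.frags" section, i.e.
BPF_F_XDP_HAS_FRAGS) on a recent kernel; the program name and its
pass/drop policy are made up for the example.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp.frags")
int check_eth_proto(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	struct ethhdr copy;

	/* Direct packet access: only valid if the Ethernet header lies
	 * entirely within the linear part, i.e. within [data, data_end). */
	if ((void *)(eth + 1) <= data_end)
		return eth->h_proto == bpf_htons(ETH_P_IP) ? XDP_PASS : XDP_DROP;

	/* Fallback: bpf_xdp_load_bytes() copies across the linear part and
	 * the fragments, so it also works with an empty linear part, at the
	 * cost of an extra copy per access. */
	if (bpf_xdp_load_bytes(ctx, 0, &copy, sizeof(copy)) < 0)
		return XDP_DROP;

	return copy.h_proto == bpf_htons(ETH_P_IP) ? XDP_PASS : XDP_DROP;
}

char _license[] SEC("license") = "GPL";

Built with something like "clang -O2 -g -target bpf -c" and loaded via
libbpf, the first branch is what a DPA benchmark would exercise, while
forcing the second branch (e.g. with a 0-length linear part) would measure
the bpf_xdp_load_bytes() path.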