> On 2023/5/17 6:52, Lorenzo Bianconi wrote:
> >> On Mon, May 15, 2023 at 01:24:20PM +0200, Lorenzo Bianconi wrote:
> >>>> On 2023/5/12 21:08, Lorenzo Bianconi wrote:
> >>>>> In order to reduce page_pool memory footprint, rely on the
> >>>>> page_pool_dev_alloc_frag routine and reduce the buffer size
> >>>>> (VETH_PAGE_POOL_FRAG_SIZE) to PAGE_SIZE / 2 in order to consume one page
> >>>>
> >>>> Is there any performance improvement besides the memory saving? As it
> >>>> should reduce TLB misses, I wonder if the TLB miss reduction can even
> >>>> out the cost of the extra frag reference count handling for the
> >>>> frag support?
> >>>
> >>> Reducing the requested headroom to 192 (from 256) gives a nice improvement in
> >>> the 1500B frame case, while it is mostly the same in the paged skb case
> >>> (e.g. MTU 8000B).
> >>
> >> Can you define 'nice improvement'? ;)
> >> Show us numbers or the improvement in %.
> >
> > I am testing this RFC patch in the scenario reported below:
> >
> > iperf tcp tx --> veth0 --> veth1 (xdp_pass) --> iperf tcp rx
> >
> > - 6.4.0-rc1 net-next:
> >   MTU 1500B: ~ 7.07 Gbps
> >   MTU 8000B: ~ 14.7 Gbps
> >
> > - 6.4.0-rc1 net-next + page_pool frag support in veth:
> >   MTU 1500B: ~ 8.57 Gbps
> >   MTU 8000B: ~ 14.5 Gbps
> >
> 
> Thanks for sharing the data.
> Maybe using the new frag interface introduced in [1] brings
> back the performance for the MTU 8000B case.
> 
> 1. https://patchwork.kernel.org/project/netdevbpf/cover/20230516124801.2465-1-linyunsheng@xxxxxxxxxx/
> 
> I drafted a patch for veth to use the new frag interface; maybe that
> will show how veth can make use of it. Would you give it a try to see
> if there is any performance improvement for the MTU 8000B case? Thanks.
> 
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -737,8 +737,8 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>  	    skb_shinfo(skb)->nr_frags ||
>  	    skb_headroom(skb) < XDP_PACKET_HEADROOM) {
>  		u32 size, len, max_head_size, off;
> +		struct page_pool_frag *pp_frag;
>  		struct sk_buff *nskb;
> -		struct page *page;
>  		int i, head_off;
>  
>  		/* We need a private copy of the skb and data buffers since
> @@ -752,14 +752,20 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>  		if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
>  			goto drop;
>  
> +		size = min_t(u32, skb->len, max_head_size);
> +		size += VETH_XDP_HEADROOM;
> +		size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +
>  		/* Allocate skb head */
> -		page = page_pool_dev_alloc_pages(rq->page_pool);
> -		if (!page)
> +		pp_frag = page_pool_dev_alloc_frag(rq->page_pool, size);
> +		if (!pp_frag)
>  			goto drop;
>  
> -		nskb = napi_build_skb(page_address(page), PAGE_SIZE);
> +		nskb = napi_build_skb(page_address(pp_frag->page) + pp_frag->offset,
> +				      pp_frag->truesize);
>  		if (!nskb) {
> -			page_pool_put_full_page(rq->page_pool, page, true);
> +			page_pool_put_full_page(rq->page_pool, pp_frag->page,
> +						true);
>  			goto drop;
>  		}
>  
> @@ -782,16 +788,18 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>  		len = skb->len - off;
>  
>  		for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
> -			page = page_pool_dev_alloc_pages(rq->page_pool);
> -			if (!page) {
> +			size = min_t(u32, len, PAGE_SIZE);
> +
> +			pp_frag = page_pool_dev_alloc_frag(rq->page_pool, size);
> +			if (!pp_frag) {
>  				consume_skb(nskb);
>  				goto drop;
>  			}
>  
> -			size = min_t(u32, len, PAGE_SIZE);
> -			skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
> -			if (skb_copy_bits(skb, off, page_address(page),
> -					  size)) {
> +			skb_add_rx_frag(nskb, i, pp_frag->page, pp_frag->offset,
> +					size, pp_frag->truesize);
> +			if (skb_copy_bits(skb, off, page_address(pp_frag->page) +
> +					  pp_frag->offset, size)) {
>  				consume_skb(nskb);
>  				goto drop;
>  			}
> @@ -1047,6 +1055,8 @@ static int veth_create_page_pool(struct veth_rq *rq)
>  		return err;
>  	}

IIUC, with the code here we are using a variable length for the linear part (at most one page), while we will always use a full page for the paged area (except for the last fragment), correct?

I have not tested it yet, but I do not think we will get a significant improvement: in my tests with MTU set to 8000B the throughput is mostly the same (14.5 Gbps vs 14.7 Gbps) whether we use page_pool fragments or full page_pool pages. Am I missing something?

What Jesper and I are discussing is trying to allocate an order-3 page from the pool and relying on page_pool fragments, similar to what page_frag_cache does. I will look into it if there are no strong 'red flags'.

Regards,
Lorenzo

> 
> +	page_pool_set_max_frag_size(rq->page_pool, PAGE_SIZE / 2);
> +
>  	return 0;
>  }
> 
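
For illustration only (this is not part of the RFC nor of Yunsheng's draft above), a minimal sketch of the order-3 idea mentioned in the reply, built on the frag API already in the tree (page_pool_dev_alloc_frag() taking an offset pointer), could look roughly like the code below. The pool size, frag size and helper names are made up.

/* Sketch only: carve fixed-size buffers out of an order-3 page_pool page,
 * similar to what page_frag_cache does with its 32KB allocation. Helper
 * names and sizes are illustrative, not taken from the RFC patch.
 */
#include <linux/mm.h>
#include <net/page_pool.h>

#define VETH_PP_FRAG_SIZE	2048	/* assumed per-buffer size */

static struct page_pool *veth_create_frag_pool(struct device *dev)
{
	struct page_pool_params pp_params = {
		.flags		= PP_FLAG_PAGE_FRAG,	/* allow sub-page frags */
		.order		= 3,			/* 32KB chunks with 4KB PAGE_SIZE */
		.pool_size	= 256,
		.nid		= NUMA_NO_NODE,
		.dev		= dev,
	};

	return page_pool_create(&pp_params);
}

/* Allocate one rx buffer as a fragment of the order-3 page; the pool
 * keeps handing out offsets into the same compound page until it is
 * fully consumed.
 */
static void *veth_alloc_frag(struct page_pool *pool, struct page **page,
			     unsigned int *offset)
{
	*page = page_pool_dev_alloc_frag(pool, offset, VETH_PP_FRAG_SIZE);
	if (!*page)
		return NULL;

	return page_address(*page) + *offset;
}

With PP_FLAG_PAGE_FRAG the compound page is only recycled once all of its fragments have been released, which is where the extra reference-count handling mentioned earlier in the thread comes from.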