> > Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
> > enables sending/receiving jumbo ethernet frames up to the theoretical
> > maxiumum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
> > to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
> > XDP_COPY mode is usuable pending future driver work.
>
> nit: useable

Fixed in v2.

> > For consistency, check for HugeTLB pages during UMEM registration. This
> > implies that hugepages are required for XDP_COPY mode despite DMA not
> > being used. This restriction is desirable since it ensures user software
> > can take advantage of future driver support.
> >
> > Even in HugeTLB mode, continue to do page accounting using order-0
> > (4 KiB) pages. This minimizes the size of this change and reduces the
> > risk of impacting driver code. Taking full advantage of hugepages for
> > accounting should improve XDP performance in the general case.
>
> Thank you Kal for working on this. Interesting stuff.
>
> First some general comments and questions on the patch set:
>
> * Please document this new feature in Documentation/networking/af_xdp.rst

Fixed in v2.

> * Have you verified the SKB path for Rx? Tx was exercised by running l2fwd.

This patchset allows sending/receiving 9000 MTU packets with xdpsock
(slightly modified). The benchmark numbers show the results for
rxdrop (-r).

> * Have you checked that an XDP program can access the full >4K packet?
> The xdp_buff has no problem with this as the buffer is consecutive,
> but just wondering if there is some other check or limit in there?
> Jesper and Toke will likely know, so roping them in.

Yes, the full packet can be accessed from a SEC("xdp") BPF program
(only tested in SKB mode).

> * Would be interesting to know your thoughts about taking this to
> zero-copy mode too. It would be good if you could support all modes
> from the get go, instead of partial support for some unknown amount of
> time and then zero-copy support. Partial support makes using the
> feature more cumbersome when an app is deployed on various systems.
> The continuity checking code you have at the end of the patch is a
> step in the direction of zero-copy support, it seems.

I think this patchset is enough to support zero-copy as long as the
driver allows it. Currently, no drivers will work out of the box AFAIK
since they all validate the chunk_size or the MTU size. I would
absolutely love for drivers to support this. Hopefully this patchset is
enough inspiration? :-)

Do you think it's absolutely necessary to have driver ZC support ready
to land this?

> * What happens if I try to run this in zero-copy mode with a chunk_size > 4K?

AFAIK drivers check for this and throw an error. Maybe there are some
drivers that don't check this properly?

> * There are some compilation errors to fix from the kernel test robot

Fixed in v2.

> require_hugetlb would be a clearer name

Fixed in v2. Renamed to `need_hugetlb`.

> next_mapping? n as a name is not very descriptive.

Fixed in v2. Renamed to `stride`.
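For context, this is roughly what the userspace side looks like with this
series applied. It is an illustrative sketch only: the chunk count, chunk
size and error handling are made up, and the ring setup and bind() steps
are omitted. It just shows a HugeTLB-backed, hugepage-aligned UMEM being
registered with a chunk_size larger than PAGE_SIZE (hugepages must be
reserved beforehand, e.g. via vm.nr_hugepages):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef SOL_XDP
#define SOL_XDP 283
#endif

/* Illustrative sizes: 16 KiB chunks are enough for a 9000 MTU frame. */
#define NUM_CHUNKS	4096
#define CHUNK_SIZE	(16 * 1024)

static void *register_umem(int xsk_fd)
{
	size_t len = (size_t)NUM_CHUNKS * CHUNK_SIZE;
	struct xdp_umem_reg reg;
	void *umem;

	/* MAP_HUGETLB gives a hugepage-aligned HugeTLB VMA, as required
	 * for chunk_size > PAGE_SIZE. */
	umem = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (umem == MAP_FAILED)
		return NULL;

	memset(&reg, 0, sizeof(reg));
	reg.addr = (uintptr_t)umem;
	reg.len = len;
	reg.chunk_size = CHUNK_SIZE;	/* rejected by kernels without this series */
	reg.headroom = 0;

	if (setsockopt(xsk_fd, SOL_XDP, XDP_UMEM_REG, &reg, sizeof(reg))) {
		munmap(umem, len);
		return NULL;
	}
	return umem;
}
```

The socket fd would come from socket(AF_XDP, SOCK_RAW, 0); the
fill/completion and Rx/Tx ring setup and bind() are unchanged from the
existing AF_XDP flow.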
> >  	u32 i;
> >
> > -	for (i = 0; i < dma_map->dma_pages_cnt - 1; i++) {
> > -		if (dma_map->dma_pages[i] + PAGE_SIZE == dma_map->dma_pages[i + 1])
> > +	for (i = 0; i + n < dma_map->dma_pages_cnt; i++) {
> > +		if (dma_map->dma_pages[i] + page_size == dma_map->dma_pages[i + n])
> >  			dma_map->dma_pages[i] |= XSK_NEXT_PG_CONTIG_MASK;
> >  		else
> >  			dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
> >  	}
> > +	for (; i < dma_map->dma_pages_cnt; i++)
> > +		dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
>
> Is this not too conservative? If your umem consists of two huge page
> mappings but they are not mapped consecutively in physical memory, you
> are going to mark all the chunks as non-consecutive. Would it not be
> better to just look chunk_size ahead of you instead of page_size
> above? The only thing you care about is that the chunk you are in is
> in consecutive physical memory, and that is strictly only true for
> zero-copy mode. So this seems to be in preparation for zero-copy mode.

It is slightly too conservative. I have updated the logic a bit in v2.
If the packet doesn't cross a page boundary, then this array is not
read anyway.

Thanks!
Kal
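P.S. For anyone following along, the loop above only records per-page
contiguity. A hypothetical consumer of those bits (the helper below is
mine, not from the patch set, and is modeled on how the pool code checks
page-boundary crossings today) would have to see the bit set on every
boundary a descriptor crosses before treating a > PAGE_SIZE chunk as one
DMA region:

```c
/* Sketch only: addr is the descriptor's byte offset into the UMEM, len its
 * length, and dma_pages the per-order-0-page DMA address array marked above.
 */
static bool desc_dma_is_contig(const dma_addr_t *dma_pages, u64 addr, u32 len)
{
	u64 pg = addr >> PAGE_SHIFT;
	u64 last_pg = (addr + len - 1) >> PAGE_SHIFT;

	for (; pg < last_pg; pg++) {
		/* Bit set means the next order-0 page follows this one in
		 * DMA address space. */
		if (!(dma_pages[pg] & XSK_NEXT_PG_CONTIG_MASK))
			return false;
	}
	return true;
}
```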