On Wed, Nov 23, 2022 at 11:52:12AM -0800, sdf@xxxxxxxxxx wrote: > On 11/23, Jakub Kicinski wrote: > > On Wed, 23 Nov 2022 10:26:41 -0800 Stanislav Fomichev wrote: > > > > This embedding trick works for drivers that put xdp_buff on the stack, > > > > but mlx5 supports XSK zerocopy, which uses the xsk_buff_pool for > > > > allocating them. This makes it a bit awkward to do the same thing > > there; > > > > and since it's probably going to be fairly common to do something like > > > > this, how about we just add a 'void *drv_priv' pointer to struct > > > > xdp_buff that the drivers can use? The xdp_buff already takes up a > > full > > > > cache line anyway, so any data stuffed after it will spill over to a > > new > > > > one; so I don't think there's much difference performance-wise. > > > > > > I guess the alternative is to extend xsk_buff_pool with some new > > > argument for xdp_buff tailroom? (so it can kmalloc(sizeof(xdp_buff) + > > > xdp_buff_tailroom)) > > > But it seems messy because there is no way of knowing what the target > > > device's tailroom is, so it has to be a user setting :-/ > > > I've started with a priv pointer in xdp_buff initially, it seems fine > > > to go back. I'll probably convert veth/mlx4 to the same mode as well > > > to avoid having different approaches in different places.. > > > Can we not do this please? Add 16B of "private driver space" after > > the xdp_buff in xdp_buff_xsk (we have 16B to full cacheline), the It is time to jump the hints train I guess:D We have 8 bytes left in the cacheline that xdp_buff occupies - pahole output below shows that cb spans through two cachelines. Did you mean something else though? > > drivers decide how they use it. Drivers can do BUILD_BUG_ON() for their > > expected size and cast that to whatever struct they want. This is how > > various offloads work, the variable size tailroom would be an over > > design IMO. > > > And this way non XSK paths can keep its normal typing. > > Good idea, prototyped below, lmk if it that's not what you had in mind. > > struct xdp_buff_xsk { > struct xdp_buff xdp; /* 0 56 */ > u8 cb[16]; /* 56 16 */ > /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */ > dma_addr_t dma; /* 72 8 */ > dma_addr_t frame_dma; /* 80 8 */ > struct xsk_buff_pool * pool; /* 88 8 */ > u64 orig_addr; /* 96 8 */ > struct list_head free_list_node; /* 104 16 */ > > /* size: 120, cachelines: 2, members: 7 */ > /* last cacheline: 56 bytes */ > }; > > Toke, I can try to merge this into your patch + keep your SoB (or feel free > to try this and retest yourself, whatever works). > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h > b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h > index bc2d9034af5b..837bf103b871 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h > @@ -44,6 +44,11 @@ > (MLX5E_XDP_INLINE_WQE_MAX_DS_CNT * MLX5_SEND_WQE_DS - \ > sizeof(struct mlx5_wqe_inline_seg)) > > +struct mlx5_xdp_cb { > + struct mlx5_cqe64 *cqe; > + struct mlx5e_rq *rq; > +}; > + > struct mlx5e_xsk_param; > int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param > *xsk); > bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page, > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c > b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c > index c91b54d9ff27..84d23b2da7ce 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c > @@ -5,6 +5,7 @@ > #include "en/xdp.h" > #include <net/xdp_sock_drv.h> > #include <linux/filter.h> > +#include <linux/build_bug.h> > > /* RX data path */ > > @@ -286,8 +287,14 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct > mlx5e_rq *rq, > u32 cqe_bcnt) > { > struct xdp_buff *xdp = wi->au->xsk; > + struct mlx5_xdp_cb *cb; > struct bpf_prog *prog; > > + BUILD_BUG_ON(sizeof(struct mlx5_xdp_cb) > XSKB_CB_SIZE); > + cb = xp_get_cb(xdp); > + cb->cqe = NULL /*cqe*/; > + cb->rq = rq; I believe that these could be set once at a setup time within a pool - take a look at xsk_pool_set_rxq_info(). This will save us cycles so that we will skip assignments per each processed xdp_buff. AF_XDP ZC performance comes in a major part from the fact that thanks to xsk_buff_pool we have less work to do per each processed buffer. > + > /* wi->offset is not used in this function, because xdp->data and the > * DMA address point directly to the necessary place. Furthermore, the > * XSK allocator allocates frames per packet, instead of pages, so > diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h > index f787c3f524b0..b298590429e7 100644 > --- a/include/net/xsk_buff_pool.h > +++ b/include/net/xsk_buff_pool.h > @@ -19,8 +19,11 @@ struct xdp_sock; > struct device; > struct page; > > +#define XSKB_CB_SIZE 16 > + > struct xdp_buff_xsk { > struct xdp_buff xdp; > + u8 cb[XSKB_CB_SIZE]; /* Private area for the drivers to use. */ > dma_addr_t dma; > dma_addr_t frame_dma; > struct xsk_buff_pool *pool; > @@ -143,6 +146,11 @@ static inline dma_addr_t xp_get_frame_dma(struct > xdp_buff_xsk *xskb) > return xskb->frame_dma; > } > > +static inline void *xp_get_cb(struct xdp_buff *xdp) > +{ > + return (void *)xdp + offsetof(struct xdp_buff_xsk, cb); > +} This should have a wrapper in include/net/xdp_sock_drv.h that drivers will call. Generally I think this should fly but I'm not sure about cb being 16 bytes. > + > void xp_dma_sync_for_cpu_slow(struct xdp_buff_xsk *xskb); > static inline void xp_dma_sync_for_cpu(struct xdp_buff_xsk *xskb) > { > > > > > I'll send my patch to add support to mlx5 (using the drv_priv pointer > > > > approach) separately. > > > > > > Saw them, thanks! Will include them in v3+.