From: Simon Horman <horms@xxxxxxxxxx> Date: Sun, 9 Feb 2025 11:03:44 +0000 > On Thu, Feb 06, 2025 at 07:26:29PM +0100, Alexander Lobakin wrote: >> Currently, when your driver supports XSk Tx metadata and you want to >> send an XSk frame, you need to do the following: >> >> * call external xsk_buff_raw_get_dma(); >> * call inline xsk_buff_get_metadata(), which calls external >> xsk_buff_raw_get_data() and then do some inline checks. >> >> This effectively means that the following piece: >> >> addr = pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr; >> >> is done twice per frame, plus you have 2 external calls per frame, plus >> this: >> >> meta = pool->addrs + addr - pool->tx_metadata_len; >> if (unlikely(!xsk_buff_valid_tx_metadata(meta))) >> >> is always inlined, even if there's no meta or it's invalid. >> >> Add xsk_buff_raw_get_ctx() (xp_raw_get_ctx() to be precise) to do that >> in one go. It returns a small structure with 2 fields: DMA address, >> filled unconditionally, and metadata pointer, non-NULL only if it's >> present and valid. The address correction is performed only once and >> you also have only 1 external call per XSk frame, which does all the >> calculations and checks outside of your hotpath. You only need to >> check `if (ctx.meta)` for the metadata presence. >> To not copy any existing code, derive address correction and getting >> virtual and DMA address into small helpers. bloat-o-meter reports no >> object code changes for the existing functionality. >> >> Signed-off-by: Alexander Lobakin <aleksander.lobakin@xxxxxxxxx> > > Hi Alexander, > > I think that this patch needs to be accompanied by at least one > patch that uses xsk_buff_raw_get_ctx() in a driver. This mini-series is the final part of my Chapter III, which was all about prereqs in order to add libeth_xdp and then XDP for idpf. This helper will be used in the next series (Chapter IV) I'll send once this lands. > > Also, as this seems to be an optimisation, some performance data would > be nice too. -1 Kb of object code which has an unrolled-by-8 loop which used this function each iteration. I don't remember the perf numbers since it was a year ago and since then I've been using this helper only, but it was something around a couple procent (which is several hundred Kpps when it comes to XSk). > > Which brings me to my last point. I'd always understood that > returning a struct was discouraged due to performance implications. Rather stack usage, not perf implications. Compound returns are used heavily throughout the kernel code when sizeof(result) <= 16 bytes. Here it's also 16 bytes. Just the same as one __u128. Plus this function doesn't recurse, so the stack won't blow up. > Perhaps that information is out of date, doesn't apply because > the returned struct is so small in this case, or just plain wrong. > But I'd appreciate it if you could add some colour to this. Moreover, the function is global, not inline, so passing a pointer here instead of returning a struct may even behave worse in this case. (and we'll save basically only 8 bytes on the stack, which I don't believe is worth it). Thanks, Olek