Maciej Fijalkowski <maciej.fijalkowski@xxxxxxxxx> writes:

> On Fri, Jan 22, 2021 at 03:46:47PM +0800, Hangbin Liu wrote:
>> From: Jesper Dangaard Brouer <brouer@xxxxxxxxxx>
>>
>> This changes the devmap XDP program support to run the program when the
>> bulk queue is flushed instead of before the frame is enqueued. This has
>> a couple of benefits:
>>
>> - It "sorts" the packets by destination devmap entry, and then runs the
>>   same BPF program on all the packets in sequence. This ensures that we
>>   keep the XDP program and destination device properties hot in I-cache.
>>
>> - It makes the multicast implementation simpler because it can just
>>   enqueue packets using bq_enqueue() without having to deal with the
>>   devmap program at all.
>>
>> The drawback is that if the devmap program drops the packet, the enqueue
>> step is redundant. However, arguably this is mostly visible in a
>> micro-benchmark, and with more mixed traffic the I-cache benefit should
>> win out. The performance impact of just this patch is as follows:
>>
>> Using xdp_redirect_map (with a 2nd xdp_prog patch[1]) in samples/bpf and
>> sending pkts via the pktgen cmd:
>> ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64
>>
>> There is about +/- 0.1M deviation for native testing; the performance
>> improved for the base case, but dropped back somewhat with the xdp
>> devmap prog attached.
>>
>> Version          | Test                        | Generic | Native | Native + 2nd xdp_prog
>> 5.10 rc6         | xdp_redirect_map i40e->i40e | 2.0M    | 9.1M   | 8.0M
>> 5.10 rc6         | xdp_redirect_map i40e->veth | 1.7M    | 11.0M  | 9.7M
>> 5.10 rc6 + patch | xdp_redirect_map i40e->i40e | 2.0M    | 9.5M   | 7.5M
>> 5.10 rc6 + patch | xdp_redirect_map i40e->veth | 1.7M    | 11.6M  | 9.1M
>>
>> [1] https://lore.kernel.org/bpf/20210122025007.2968381-1-liuhangbin@xxxxxxxxx
>>
>> Signed-off-by: Jesper Dangaard Brouer <brouer@xxxxxxxxxx>
>> Signed-off-by: Hangbin Liu <liuhangbin@xxxxxxxxx>
>>
>> ---
>> v16:
>> a) refactor bq_xmit_all logic and remove error label
>>
>> v15:
>> a) do not use unlikely when checking bq->xdp_prog
>> b) return sent frames for dev_map_bpf_prog_run()
>>
>> v14: no update, only rebase the code
>> v13: pass in xdp_prog through __xdp_enqueue()
>> v2-v12: no such patch
>> ---
>>  kernel/bpf/devmap.c | 136 ++++++++++++++++++++++++++------------------
>>  1 file changed, 81 insertions(+), 55 deletions(-)
>>
>> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
>> index f6e9c68afdd4..c24fcffbbfad 100644
>> --- a/kernel/bpf/devmap.c
>> +++ b/kernel/bpf/devmap.c
>> @@ -57,6 +57,7 @@ struct xdp_dev_bulk_queue {
>>  	struct list_head flush_node;
>>  	struct net_device *dev;
>>  	struct net_device *dev_rx;
>> +	struct bpf_prog *xdp_prog;
>>  	unsigned int count;
>>  };
>>
>> @@ -327,46 +328,95 @@ bool dev_map_can_have_prog(struct bpf_map *map)
>>  	return false;
>>  }
>>
>> +static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,
>> +				struct xdp_frame **frames, int n,
>> +				struct net_device *dev)
>> +{
>> +	struct xdp_txq_info txq = { .dev = dev };
>> +	struct xdp_buff xdp;
>> +	int i, nframes = 0;
>> +
>> +	for (i = 0; i < n; i++) {
>> +		struct xdp_frame *xdpf = frames[i];
>> +		u32 act;
>> +		int err;
>> +
>> +		xdp_convert_frame_to_buff(xdpf, &xdp);
>> +		xdp.txq = &txq;
>> +
>> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);
>> +		switch (act) {
>> +		case XDP_PASS:
>> +			err = xdp_update_frame_from_buff(&xdp, xdpf);
>> +			if (unlikely(err < 0))
>> +				xdp_return_frame_rx_napi(xdpf);
>> +			else
>> +				frames[nframes++] = xdpf;
>> +			break;
>> +		default:
>> +			bpf_warn_invalid_xdp_action(act);
>> +			fallthrough;
>> +		case XDP_ABORTED:
>> +			trace_xdp_exception(dev, xdp_prog, act);
>> +			fallthrough;
>> +		case XDP_DROP:
>> +			xdp_return_frame_rx_napi(xdpf);
>> +			break;
>> +		}
>> +	}
>> +	return nframes; /* sent frames count */
>> +}
>> +
>>  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>>  {
>>  	struct net_device *dev = bq->dev;
>> -	int sent = 0, drops = 0, err = 0;
>> +	unsigned int cnt = bq->count;
>> +	int drops = 0, err = 0;
>> +	int to_sent = cnt;
>
> Hmm, if I were to be super picky, I'd like to have this variable named
> "to_send", as we spoke. Could you change that?
>
> With that, you can add my:
>
> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@xxxxxxxxx>
>
> to the next revision.
>
>> +	int sent = cnt;
>>  	int i;
>>
>> -	if (unlikely(!bq->count))
>> +	if (unlikely(!cnt))
>>  		return;
>>
>> -	for (i = 0; i < bq->count; i++) {
>> +	for (i = 0; i < cnt; i++) {
>>  		struct xdp_frame *xdpf = bq->q[i];
>>
>>  		prefetch(xdpf);
>>  	}
>>
>> -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
>> +	if (bq->xdp_prog) {
>> +		to_sent = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
>> +		if (!to_sent) {
>> +			sent = 0;
>> +			goto out;
>> +		}
>> +		drops = cnt - to_sent;
>> +	}
>> +
>> +	sent = dev->netdev_ops->ndo_xdp_xmit(dev, to_sent, bq->q, flags);
>>  	if (sent < 0) {
>>  		err = sent;
>>  		sent = 0;
>> -		goto error;
>> +
>> +		/* If ndo_xdp_xmit fails with an errno, no frames have been
>> +		 * xmit'ed and it's our responsibility to free them all.
>> +		 */
>> +		for (i = 0; i < cnt - drops; i++) {
>> +			struct xdp_frame *xdpf = bq->q[i];
>> +
>> +			xdp_return_frame_rx_napi(xdpf);
>> +		}
>>  	}
>> -	drops = bq->count - sent;
>>  out:
>> +	drops = cnt - sent;
>>  	bq->count = 0;
>>
>>  	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err);
>>  	bq->dev_rx = NULL;
>> +	bq->xdp_prog = NULL;
>
> One more question: do you really have to do that on each bq_xmit_all()
> call? Couldn't you clear it in __dev_flush()?
>
> Or IOW - what's the rationale behind storing xdp_prog in
> xdp_dev_bulk_queue? Why can't you propagate dst->xdp_prog and rely on
> that without the local pointer?
>
> You probably have an answer for that, so maybe include it in the commit
> message.
>
> BTW, same question for clearing dev_rx. To me this will be the same for
> all the bq_xmit_all() calls that happen within the same napi poll.

I think you're right: when bq_xmit_all() is called from bq_enqueue(),
another packet will always be enqueued immediately afterwards, so
clearing out all of those things in bq_xmit_all() is redundant. This
also includes the list_del on bq->flush_node, BTW.

And while we're getting into micro-optimisations: in bq_enqueue() we
have two checks:

	if (!bq->dev_rx)
		bq->dev_rx = dev_rx;

	bq->q[bq->count++] = xdpf;

	if (!bq->flush_node.prev)
		list_add(&bq->flush_node, flush_list);

Those two if() checks can be collapsed into one, since the list and the
dev_rx field are only ever modified together. This will also be the
case for bq->xdp_prog, so putting all three under the same check in
bq_enqueue() and only clearing them in __dev_flush() would be a win, I
suppose - nice catch! :)

-Toke
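
P.S. To make that concrete, here is a rough, compile-untested sketch of
what I mean, assuming bq_enqueue() already receives the xdp_prog pointer
(as this patch does via __xdp_enqueue()):

	/* In bq_enqueue(): dev_rx, xdp_prog and flush_node are only ever
	 * set together, on the first frame after a flush, so a single
	 * check covers all three.
	 */
	if (!bq->dev_rx) {
		bq->dev_rx = dev_rx;
		bq->xdp_prog = xdp_prog;
		list_add(&bq->flush_node, flush_list);
	}

	bq->q[bq->count++] = xdpf;

	/* In __dev_flush(): clear the per-queue state once per flush
	 * instead of in every bq_xmit_all() call.
	 */
	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
		bq_xmit_all(bq, XDP_XMIT_FLUSH);
		bq->dev_rx = NULL;
		bq->xdp_prog = NULL;
		__list_del_clearprev(&bq->flush_node);
	}

With that, bq_xmit_all() only needs to reset bq->count, since it can
still be invoked mid-NAPI from bq_enqueue() when the queue fills up, and
in that case the flush-list membership and dev_rx/xdp_prog stay valid.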