On Mon, 26 Apr 2021 14:01:17 +0800
Hangbin Liu <liuhangbin@xxxxxxxxx> wrote:

> On Sat, Apr 24, 2021 at 09:01:29AM +0200, Jesper Dangaard Brouer wrote:
> > > > > >> @@ -3942,7 +3960,12 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> > > > > >>  	case BPF_MAP_TYPE_DEVMAP:
> > > > > >>  		fallthrough;
> > > > > >>  	case BPF_MAP_TYPE_DEVMAP_HASH:
> > > > > >> -		err = dev_map_enqueue(fwd, xdp, dev);
> > > > > >> +		map = xchg(&ri->map, NULL);
> > > > > >
> > > > > > Hmm, this looks dangerous for performance to have on this fast-path.
> > > > > > The xchg call can be expensive, AFAIK this is an atomic operation.
> > > > >
> > > > > Ugh, you're right. That's my bad, I suggested replacing the
> > > > > READ_ONCE()/WRITE_ONCE() pair with the xchg() because an exchange is
> > > > > what it's doing, but I failed to consider the performance implications
> > > > > of the atomic operation. Sorry about that, Hangbin! I guess this should
> > > > > be changed to:
> > > > >
> > > > > +		map = READ_ONCE(ri->map);
> > > > > +		if (map) {
> > > > > +			WRITE_ONCE(ri->map, NULL);
> > > > > +			err = dev_map_enqueue_multi(xdp, dev, map,
> > > > > +						    ri->flags & BPF_F_EXCLUDE_INGRESS);
> > > > > +		} else {
> > > > > +			err = dev_map_enqueue(fwd, xdp, dev);
> > > > > +		}
> > > >
> > > > This is highly sensitive fast-path code; as you saw, Bjørn has been
> > > > hunting nanoseconds in this area. The above code implicitly has "map"
> > > > as the likely option, which I don't think it is.
> > >
> > > Hi Jesper,
> > >
> > > From the performance data, there is only a slight impact. Do we still
> > > need to block the whole patch on this? Or do you have a better solution?
> >
> > I'm basically just asking you to add an unlikely() annotation:
> >
> >  map = READ_ONCE(ri->map);
> >  if (unlikely(map)) {
> >  	WRITE_ONCE(ri->map, NULL);
> >  	err = dev_map_enqueue_multi(xdp, dev, map, [...]
> >
> > For XDP, performance is the single most important factor! You say your
> > performance data shows only a slight impact, but there must be ZERO
> > impact (when the added feature is not in use).
> >
> > Your data:
> >  Version          | Test                      | Generic | Native
> >  5.12 rc4         | redirect_map   i40e->i40e |    1.9M |   9.6M
> >  5.12 rc4 + patch | redirect_map   i40e->i40e |    1.9M |   9.3M
> >
> > The performance difference 9.6M -> 9.3M is a slowdown of 3.36 nanosec.
> > Bjørn and others have been working really hard to optimize this code,
> > removing overheads as small as 1.5 nanosec. Thus, introducing 3.36
> > nanosec of added overhead on the fast-path is significant.
>
> I re-checked the performance data. The data
>
> Version          | Test                      | Generic | Native
> 5.12 rc4         | redirect_map   i40e->i40e |    1.9M |   9.6M
> 5.12 rc4 + patch | redirect_map   i40e->i40e |    1.9M |   9.3M
>
> was from version 5 of the patchset.
>
> Today I re-did the test on version 10, with xchg() changed to
> READ_ONCE()/WRITE_ONCE(). Here is the new data (the Generic column is
> omitted as there is no change):
>
> Version          | Test                                | Native
> 5.12 rc4         | redirect_map       i40e->i40e       |   9.7M
> 5.12 rc4         | redirect_map       i40e->veth       |  11.8M
>
> 5.12 rc4 + patch | redirect_map       i40e->i40e       |   9.6M

Great to see that the baseline redirect_map (i40e->i40e) has almost no
impact, only 1.07 ns ((1/9.6 - 1/9.7) * 1000), which is what we want to
see. (It might be zero, as measurements can fluctuate when the diff is
below 2 ns.)
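For reference, this is how I read the complete DEVMAP case in
xdp_do_redirect() with the unlikely() annotation applied. Note this is
only a sketch assembled from the snippets quoted above, not the actual
hunk from the series (dev_map_enqueue_multi() and BPF_F_EXCLUDE_INGRESS
are names introduced by this patchset):

	case BPF_MAP_TYPE_DEVMAP:
		fallthrough;
	case BPF_MAP_TYPE_DEVMAP_HASH:
		/* ri->map is only non-NULL when the bpf prog requested a
		 * broadcast, so tell the compiler this is the cold branch
		 * and keep the single-target enqueue path tight.
		 */
		map = READ_ONCE(ri->map);
		if (unlikely(map)) {
			WRITE_ONCE(ri->map, NULL);
			err = dev_map_enqueue_multi(xdp, dev, map,
						    ri->flags & BPF_F_EXCLUDE_INGRESS);
		} else {
			err = dev_map_enqueue(fwd, xdp, dev);
		}
		break;

This way the normal single-target redirect only pays for one READ_ONCE()
plus a branch the CPU should predict as not-taken.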
> 5.12 rc4 + patch | redirect_map       i40e->veth       |  11.6M

What XDP program are you running on the inner veth?

> 5.12 rc4 + patch | redirect_map multi i40e->i40e       |   9.5M

I'm very surprised to see redirect_map multi being so fast (9.5M vs.
9.6M for normal map-redir). I was expecting a larger overhead, as
dev_map_enqueue_clone() would clone the packet in xdpf_clone() by
allocating a new page (dev_alloc_page) and then doing a memcpy().

Looking closer at this patchset, I realize that the test
'redirect_map-multi' is testing an optimization, and will never call
dev_map_enqueue_clone() + xdpf_clone().

IMHO trying to optimize 'redirect_map-multi' to be just as fast as the
base 'redirect_map' doesn't make much sense. If the 'broadcast' call
only sends a single packet, then there isn't any reason to call the
'multi' variant.

Does 'selftests/bpf' make sure to activate the code path that does the
cloning?

> 5.12 rc4 + patch | redirect_map multi i40e->veth       |  11.5M
> 5.12 rc4 + patch | redirect_map multi i40e->mlx4+veth  |   3.9M
>
> And after adding unlikely() in the check path, the new data looks like:
>
> Version          | Test                                | Native
> 5.12 rc4 + patch | redirect_map       i40e->i40e       |   9.6M
> 5.12 rc4 + patch | redirect_map       i40e->veth       |  11.7M
> 5.12 rc4 + patch | redirect_map multi i40e->i40e       |   9.4M
> 5.12 rc4 + patch | redirect_map multi i40e->veth       |  11.4M
> 5.12 rc4 + patch | redirect_map multi i40e->mlx4+veth  |   3.8M
>
> So with unlikely(), redirect_map is up slightly, while the redirect_map
> broadcast tests show a small drawback. But looking at the totals, this
> time there is not much of a gap for redirect_map compared with a kernel
> without this patch.
>
> Do you think we still need the unlikely() in the check path?

Yes. The call to redirect_map multi is allowed (and expected) to be
slower, because when using it to broadcast packets we expect that
dev_map_enqueue_clone() + xdpf_clone() will get activated, and that
will be the dominating overhead.
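To spell out where that dominating overhead comes from: each additional
target costs a full page allocation plus a copy of the frame. From
memory, the clone step in the patchset looks roughly like the sketch
below (paraphrased, details may differ from the actual patch):

	/* Sketch of xdpf_clone(): give each extra target device its
	 * own copy of the frame in a freshly allocated page.
	 */
	struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
	{
		unsigned int headroom, totalsize;
		struct xdp_frame *nxdpf;
		struct page *page;
		void *addr;

		headroom  = xdpf->headroom + sizeof(*xdpf);
		totalsize = headroom + xdpf->len;
		if (unlikely(totalsize > PAGE_SIZE))
			return NULL;

		page = dev_alloc_page();	/* one full page per clone */
		if (!page)
			return NULL;
		addr = page_to_virt(page);

		/* Copy frame metadata, headroom and packet data in one go */
		memcpy(addr, xdpf, totalsize);

		nxdpf = addr;
		nxdpf->data = addr + headroom;
		nxdpf->frame_sz = PAGE_SIZE;
		nxdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
		nxdpf->mem.id = 0;

		return nxdpf;
	}

A page allocation plus a memcpy() of the whole frame per extra target
is far more work than the ~1 nanosec branch we discussed above, which
is why the broadcast path is expected to be slower.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer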