Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Alexei, thank you for taking a look!

On 2019/08/14 10:44, Alexei Starovoitov wrote:
On Tue, Aug 13, 2019 at 09:05:44PM +0900, Toshiaki Makita wrote:
This is a rough PoC for an idea to offload TC flower to XDP.
...
  xdp_flow  TC        ovs kmod
  --------  --------  --------
  4.0 Mpps  1.1 Mpps  1.1 Mpps

Is xdp_flow limited to 4 Mpps due to veth or something else?

Looking at perf, accumulation of each layer's overhead resulted in the number.
With XDP prog which only redirects packets and does not touch the data,
the drop rate is 10 Mpps. In this case the main overhead is XDP's redirect processing
and handling of 2 XDP progs (in veth and i40e).
In the xdp_flow test the overhead additionally includes flow key parse in XDP prog
and hash table lookup (including jhash calculation) which resulted in 4 Mpps.

So xdp_flow drop rate is roughly 4x faster than software TC or ovs kmod.

OTOH the time to add a flow increases with xdp_flow.

ping latency of first packet when veth1 does XDP_PASS instead of DROP:

  xdp_flow  TC        ovs kmod
  --------  --------  --------
  25ms      12ms      0.6ms

xdp_flow does a lot of work to emulate TC behavior including UMH
transaction and multiple bpf map update from UMH which I think increases
the latency.

make sense, but why vanilla TC is so slow ?

No ideas. At least TC requires additional syscall to insert a flow compared to ovs kmod,
but 12 ms looks too long for that.

* Implementation

xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
bpfilter. The difference is that xdp_flow does not generate the eBPF
program dynamically but a prebuilt program is embedded in UMH. This is
mainly because flow insertion is considerably frequent. If we generate
and load an eBPF program on each insertion of a flow, the latency of the
first packet of ping in above test will incease, which I want to avoid.

I think UMH approach is a good fit for this.
Clearly the same algorithm can be done as kernel code or kernel module, but
bpfilter-like UMH is a safer approach.

- patch 9
  Add tc-offload-xdp netdev feature and hooks to call xdp_flow kmod in
  TC flower offload code.

The hook into UMH from TC looks simple. Do you expect the same interface to be
reused from OVS ?

Do you mean openvswitch kernel module by OVS?
If so, no, at this point. TC hook is simple because I reused flow offload mechanism.
OVS kmod does not have offload interface and ovs-vswitchd is using TC for offload.
I wanted to reuse this mechanism for offloading to XDP, so using TC.

* About alternative userland (ovs-vswitchd etc.) implementation

Maybe a similar logic can be implemented in ovs-vswitchd offload
mechanism, instead of adding code to kernel. I just thought offloading
TC is more generic and allows wider usage with direct TC command.

For example, considering that OVS inserts a flow to kernel only when
flow miss happens in kernel, we can in advance add offloaded flows via
tc filter to avoid flow insertion latency for certain sensitive flows.
TC flower usage without using OVS is also possible.

Also as written above nftables can be offloaded to XDP with this
mechanism as well.

Makes sense to me.

   bpf, hashtab: Compare keys in long

3Mpps vs 4Mpps just from this patch ?
or combined with i40 prefech patch ?

Combined.

  drivers/net/ethernet/intel/i40e/i40e_txrx.c  |    1 +

Could you share "perf report" for just hash tab optimization
and for i40 ?

Sure, I'll get some more data and post them.

I haven't seen memcmp to be bottle neck in hash tab.
What is the the of the key?

typo of "size of the key"? IIRC 64 bytes.

Toshiaki Makita



[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux