On Thu, Sep 19, 2019 at 7:45 PM Matt Cover <werekraken@xxxxxxxxx> wrote:
>
> On Thu, Sep 19, 2019 at 6:42 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:
> >
> >
> > On 2019/9/20 8:05 AM, Matt Cover wrote:
> > > On Thu, Sep 19, 2019 at 3:45 PM Matthew Cover <werekraken@xxxxxxxxx> wrote:
> > >> WORK IN PROGRESS:
> > >> * bpf program loading works!
> > >> * txq steering via bpf program return code works!
> > >> * bpf program unloading not working.
> > >> * bpf program attached query not working.
> > >>
> > >> This patch set provides a bpf hookpoint with goals similar to, but a
> > >> more generic implementation than, TUNSETSTEERINGEBPF: userspace-supplied
> > >> tx queue selection policy.
> >
> >
> > One reason I introduced TUNSETSTEERINGEBPF instead of using a generic
> > mechanism like cls/act bpf is that I needed to make sure to have a
> > consistent API with macvtap.
> >
> > In the case of macvtap, TX means transmit from userspace to the kernel,
> > but for TUN it means transmit from the kernel to userspace.
> >
>
> Ah, ok. I'll have to check that out at some point.
>
> > >>
> > >> TUNSETSTEERINGEBPF is a useful bpf hookpoint, but it has some
> > >> drawbacks.
> > >>
> > >> First, it only works on tun/tap devices.
> > >>
> > >> Second, there is no way in the current TUNSETSTEERINGEBPF
> > >> implementation to bail out or load a noop bpf prog and fall back to
> > >> the no-prog tx queue selection method.
> >
> >
> > I believe it expects that eBPF should take care of all the parts (even
> > the fallback part).
> >
>
> This would be easy to change in the existing TUNSETSTEERINGEBPF
> implementation if desired. We'd just need a negative return from the
> bpf prog to result in falling back to tun_automq_select_queue(). If
> that behavior sounds reasonable to you, I can look into it as a
> separate patch.
>
> > >>
> > >> Third, the TUNSETSTEERINGEBPF interface seems to require possession
> > >> of existing, or creation of new, queues/fds.
> >
> >
> > That's the way TUN has worked for the past 10+ years, because ioctl is
> > the only way to do configuration and it requires an fd to carry that.
> > David suggested implementing netlink, but nobody did that.
> >
>
> I see.
>
> > >>
> > >> This most naturally fits in the "wire" implementation, since
> > >> possession of fds is ensured. However, it also means the various
> > >> "wire" implementations (e.g. qemu) all have to be made aware of
> > >> TUNSETSTEERINGEBPF and expose an interface to load/unload a bpf prog
> > >> (or provide a mechanism to pass an fd to another program).
> >
> >
> > Loading/unloading the eBPF program is the standard bpf() syscall; the
> > ioctl just attaches it to TUN. This idea is borrowed from packet
> > sockets, where the bpf program is attached through setsockopt().
> >
>
> Yeah, it doesn't take much code to load a prog. I wrote one earlier
> this week, in fact, which spins up an extra fd and detaches right
> after.
>
> > >>
> > >> Alternatively, you can spin up an extra queue and immediately
> > >> disable it via IFF_DETACH_QUEUE, but this seems unsafe; packets
> > >> could be enqueued to this extra file descriptor, which is part of
> > >> our bpf prog loader, not our "wire".
> >
> >
> > You can use your "wire" queue to do the ioctl, but we could invent
> > another API.
> >
>
> It might be cool to provide a way to create an already-detached fd
> (not sure if this is non-trivial for some reason). Switching over to
> netlink could be the longer-term goal.
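For reference, the "extra fd" dance mentioned above doesn't need more
than something like the following. This is a minimal sketch, untested
as pasted here: error handling is omitted, tap0 is assumed to already
exist as a multi-queue device, and the hardcoded prog is a trivial
"always txq 0".

/* steering_loader.c - sketch of the "extra fd" approach: attach a
 * throwaway queue to an existing multi-queue tap0, load a trivial
 * "always txq 0" steering prog via the bpf() syscall, attach it with
 * TUNSETSTEERINGEBPF, then detach the queue again so no packets can
 * be enqueued to this fd.  Error handling omitted for brevity. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/bpf.h>
#include <linux/if.h>
#include <linux/if_tun.h>

int main(void)
{
	/* BPF_MOV64_IMM(R0, 0); BPF_EXIT_INSN(); */
	struct bpf_insn insns[] = {
		{ .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0 },
		{ .code = BPF_JMP | BPF_EXIT },
	};
	union bpf_attr attr;
	struct ifreq ifr;
	int tap_fd, prog_fd;

	memset(&attr, 0, sizeof(attr));
	attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
	attr.insns = (__u64)(unsigned long)insns;
	attr.insn_cnt = sizeof(insns) / sizeof(insns[0]);
	attr.license = (__u64)(unsigned long)"GPL";
	prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));

	/* spin up the extra queue on the existing device... */
	tap_fd = open("/dev/net/tun", O_RDWR);
	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "tap0", IFNAMSIZ);
	ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE;
	ioctl(tap_fd, TUNSETIFF, &ifr);

	/* ...attach the steering prog to the device... */
	ioctl(tap_fd, TUNSETSTEERINGEBPF, &prog_fd);

	/* ...and detach the queue right after, so nothing is ever
	 * enqueued to this loader-owned fd */
	memset(&ifr, 0, sizeof(ifr));
	ifr.ifr_flags = IFF_DETACH_QUEUE;
	ioctl(tap_fd, TUNSETQUEUE, &ifr);

	close(tap_fd);
	return 0;
}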
> > >
> > >> Placing this in the XPS code and leveraging iproute2 and rtnetlink
> > >> to provide our bpf prog loader, in a similar manner to xdp, gives us
> > >> a nice way to separate the tap "wire" from the loading of tx queue
> > >> selection policy. It also lets us use this hookpoint for any device
> > >> traversing XPS.
> > >>
> > >> This patch only introduces the new hookpoint to the XPS code; it
> > >> will not yet be used by tun/tap devices using the in-tree tun.ko
> > >> (which implements an .ndo_select_queue and does not traverse the
> > >> XPS code).
> > >>
> > >> In a future patch set, we can optionally refactor tun.ko to traverse
> > >> this call to bpf_prog_run_clear_cb() and the bpf prog storage.
> > >> tun/tap devices could then leverage iproute2 as a generic loader.
> > >> The TUNSETSTEERINGEBPF interface could at that point optionally be
> > >> deprecated/removed.
> >
> >
> > As described above, we need it for macvtap, and what you propose here
> > cannot work for that.
> >
> > I'm not against this proposal, just want to clarify some considerations
> > from developing TUNSETSTEERINGEBPF. The main goal is for a VM to
> > implement a sophisticated steering policy, like RSS, without touching
> > the kernel.
> >
>
> Very cool. Thank you for your comments, Jason; they have added clarity
> to some things.
>
> I'm still interested in adding this hookpoint, community willing. I
> believe it provides value beyond xps_cpus/xps_rxqs.
>
> I also plan to look into adding a similar hookpoint in the rps code.
> That will unlock additional possibilities for this xps hookpoint (e.g.
> rfs implemented via bpf maps, but only on a subset of traffic [high
> priority or especially resource costly] rather than all of it).
>
> I've had (so far casual) chats with a couple NIC vendors about various
> "SmartNICs" supporting custom entropy fields for RSS. I'm playing with
> the idea of an "rpsoffload" prog loaded into the NIC being the way
> custom entropy is configured. Being able to configure RSS to generate
> a hash based on fields of an inner packet, or a packet-type-specific
> field like the GRE key, would be super nice for NFV workloads.
>

Turns out the RSS part is already being done via XDP!
https://github.com/Netronome/bpf-samples/tree/master/programmable_rss

> Perhaps even an "rpsdrv" or "rpsoffload" hookpoint could leverage bpf
> helpers for the RSS hash algorithm (e.g. bpf_rss_hash_toeplitz(),
> bpf_rss_hash_crc(), bpf_rss_hash_xor(), etc.).
>
> The ideas on how things would look for receive are still early, but I
> think there is a lot of potential for making things more flexible by
> leveraging ebpf in this area.
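Another aside: those bpf_rss_hash_*() helper names are hypothetical,
but the Toeplitz variant at least is well defined. A plain software
sketch of the standard RSS Toeplitz hash (i.e. what a hypothetical
bpf_rss_hash_toeplitz() helper would presumably compute):

/* Standard Toeplitz hash as used for RSS: for every set bit of the
 * input, XOR in the 32-bit window of the secret key that starts at
 * that bit position.  key must be at least len + 4 bytes long (the
 * usual RSS key is 40 bytes, the usual input at most 36). */
#include <stddef.h>
#include <stdint.h>

uint32_t toeplitz_hash(const uint8_t *key, const uint8_t *input, size_t len)
{
	uint32_t window = (uint32_t)key[0] << 24 | (uint32_t)key[1] << 16 |
			  (uint32_t)key[2] << 8  | (uint32_t)key[3];
	uint32_t hash = 0;

	for (size_t i = 0; i < len; i++) {
		for (int b = 7; b >= 0; b--) {
			if (input[i] >> b & 1)
				hash ^= window;
			/* slide the key window forward one bit */
			window = window << 1 | (key[i + 4] >> b & 1);
		}
	}
	return hash;
}

Feeding it a device's 40-byte RSS key and e.g. the 12-byte IPv4
{saddr, daddr, sport, dport} tuple reproduces the hash a NIC computes
today; a custom-entropy prog would simply get to choose which bytes
form the input.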
>
> Thanks
>
> > >> Both patches in this set have been tested using a rebuilt tun.ko
> > >> with no .ndo_select_queue.
> > >>
> > >> sed -i '/\.ndo_select_queue.*=/d' drivers/net/tun.c
> > >>
> > >> The tap device was instantiated using tap_mq_pong.c, the supporting
> > >> scripts, and the wrapping service found here:
> > >>
> > >> https://github.com/stackpath/rxtxcpu/tree/v1.2.6/helpers
> > >>
> > >> The bpf prog source and test scripts can be found here:
> > >>
> > >> https://github.com/werekraken/xps_ebpf
> > >>
> > >> In nstxq, netsniff-ng using PACKET_FANOUT_QM is leveraged to check
> > >> the queue_mapping.
> > >>
> > >> With no prog loaded, tx queue selection adheres to our xps_cpus
> > >> configuration.
> > >>
> > >> [vagrant@localhost ~]$ grep . /sys/class/net/tap0/queues/tx-*/xps_cpus; ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe;
> > >> /sys/class/net/tap0/queues/tx-0/xps_cpus:1
> > >> /sys/class/net/tap0/queues/tx-1/xps_cpus:2
> > >> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.146 ms
> > >> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.121 ms
> > >> cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >>
> > >> With a return 0 bpf prog, our tx queue is 0 (despite xps_cpus).
> > >>
> > >> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello0.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
> > >> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.160 ms
> > >> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.124 ms
> > >> cpu1: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> ping-4852 [000] .... 2691.633260: 0: xps (RET 0): Hello, World!
> > >> ping-4869 [001] .... 2695.753588: 0: xps (RET 0): Hello, World!
> > >>
> > >> With a return 1 bpf prog, our tx queue is 1.
> > >>
> > >> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello1.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
> > >> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.193 ms
> > >> cpu0: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.135 ms
> > >> cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> ping-4894 [000] .... 2710.652080: 0: xps (RET 1): Hello, World!
> > >> ping-4911 [001] .... 2714.774608: 0: xps (RET 1): Hello, World!
> > >>
> > >> With a return 2 bpf prog, our tx queue is 0 (we only have 2 tx queues).
> > >>
> > >> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello2.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
> > >> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=1.20 ms
> > >> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.986 ms
> > >> cpu1: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> ping-4936 [000] .... 2729.442668: 0: xps (RET 2): Hello, World!
> > >> ping-4953 [001] .... 2733.614558: 0: xps (RET 2): Hello, World!
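Interjecting with a sketch: the return-code semantics these outputs
demonstrate amount to roughly the following on the XPS path. This is
illustrative only, not the actual patch code, and dev->xps_prog is a
made-up field name.

/* Sketch of the hookpoint semantics, NOT the actual patch code:
 * a non-negative prog return picks the txq (wrapped into range),
 * a negative return falls through to the xps_cpus/xps_rxqs maps.
 * Caller is assumed to hold rcu_read_lock(), as on the real
 * transmit path. */
#include <linux/filter.h>
#include <linux/netdevice.h>

static int xps_prog_pick_tx(struct net_device *dev, struct sk_buff *skb)
{
	struct bpf_prog *prog = rcu_dereference(dev->xps_prog);
	int ret;

	if (!prog)
		return -1;	/* no prog loaded: use the xps maps */

	/* the prog's return value is interpreted as signed */
	ret = bpf_prog_run_clear_cb(prog, skb);
	if (ret < 0)
		return -1;	/* bail out: use the xps maps */

	/* "RET 2" with 2 tx queues landed on txq 0 above */
	return ret % dev->real_num_tx_queues;
}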
> > >>
> > >> With a return -1 bpf prog, our tx queue selection is once again
> > >> determined by xps_cpus. Any negative return should work the same,
> > >> providing a nice mechanism to bail out or load a noop bpf prog at
> > >> this hookpoint.
> > >>
> > >> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello_neg1.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
> > >> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.628 ms
> > >> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.322 ms
> > >> cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> ping-4981 [000] .... 2763.510760: 0: xps (RET -1): Hello, World!
> > >> ping-4998 [001] .... 2767.632583: 0: xps (RET -1): Hello, World!
> > >>
> > >> bpf prog unloading is not yet working, and `ip link show` does not
> > >> yet report when an "xps" bpf prog is attached. This is my first time
> > >> touching iproute2 or rtnetlink, so it may be something obvious to
> > >> those more familiar.

Adding Jason... sorry for missing that the first time.
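P.S. For anyone following along without cloning xps_ebpf, the hello
progs above are essentially of this shape. This is a sketch, not the
exact source (that lives in the repo); it assumes an skb-style context
and uses the `sec hello` section name from the commands above.

/* hello1.c - print a marker to trace_pipe and steer everything to
 * txq 1; a negative return would mean "fall back to xps_cpus".
 * Build: clang -O2 -target bpf -c hello1.c -o hello1.o */
#include <linux/bpf.h>

/* bpf_trace_printk() helper, declared by hand to stay self-contained */
static long (*bpf_trace_printk)(const char *fmt, unsigned int fmt_size,
				...) = (void *)BPF_FUNC_trace_printk;

__attribute__((section("hello"), used))
int steer(struct __sk_buff *skb)
{
	char fmt[] = "xps (RET 1): Hello, World!\n";

	bpf_trace_printk(fmt, sizeof(fmt));
	return 1;
}

char _license[] __attribute__((section("license"), used)) = "GPL";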