On 4/28/17 3:58 AM, Jesper Dangaard Brouer wrote:
On Thu, 27 Apr 2017 16:31:14 -0700 Alexei Starovoitov <ast@xxxxxx> wrote:
On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
When registering/attaching an XDP/bpf program, we would just send the file-descriptor for this port-map along (like we do with the bpf_prog FD), plus the ingress-port number this program has in the port-map.
It is not clear to me in which data structure on the kernel side we store this reference to the port-map and ingress-port, as today we only have the "raw" struct bpf_prog pointer. I see several options:
1. Create a new xdp_prog struct that contains the existing bpf_prog, a port-map pointer and the ingress-port. (IMHO the easiest solution)
2. Just create a new pointer to the port-map and store it in the driver rx-ring struct (like the existing bpf_prog), but this creates a race challenge when replacing (cmpxchg) the program (or perhaps it's not a problem, as it runs under rcu and the RTNL-lock).
3. Extend bpf_prog to store this port-map and ingress-port, and have a fast way to access it. I assume it will be accessible via bpf_prog->bpf_prog_aux->used_maps[X], but it will be too slow for XDP.
I'm not sure I completely follow the 3 proposals. Are you suggesting to have only one netdev_array per program?
Yes, but I can see you have a more clever idea below.
Why not allow any number, like we do for tailcall+prog_array, etc.? We can teach the verifier to allow a new helper bpf_tx_port(netdev_array, port_num); to only be used with the netdev_array map type. It will fetch the netdevice pointer from netdev_array[port_num] and will tx the packet into it.
I love it. I just don't like the "netdev" part of the name "netdev_array", as one of the basic ideas of a port table is that a port can be anything that can consume an xdp_buff packet. This generalization allows us to move code out of the drivers. We might be on the same page, as I do imagine that netdev_array or port_array is just a struct bpf_map pointer, and the bpf_map->map_type will tell us that this bpf_map contains net_device pointers.
Thus, when later introducing a new type of redirect (like to a socket or remote-CPU) then we just add a new bpf_map_type for this, without needing to change anything in the drivers, right?
In theory, yes, but in practice I doubt it will be so easy. We probably shouldn't allow very different types of netdev into the same netdev_array or port_array (whatever the name). They need to be similar enough, otherwise we'd have to do run-time checks. If they're all the same, these checks can be done at insertion time instead.
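To make the insertion-time versus run-time distinction concrete, here is a minimal userspace sketch. Everything in it (struct port_array, port_array_insert(), port_array_lookup(), the port_kind enum) is a hypothetical name made up for illustration, not an existing kernel API. It assumes the map is created for one kind of port, so a mismatch is rejected when the entry is added and the per-packet lookup needs no type check.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical port kinds, for illustration only. */
enum port_kind { PORT_NONE = 0, PORT_NETDEV, PORT_SOCKET };

struct port {
	enum port_kind kind;
	void *target;		/* e.g. a net_device pointer for PORT_NETDEV */
};

#define PORT_ARRAY_MAX 64

struct port_array {
	enum port_kind kind;	/* fixed when the map is created */
	struct port slots[PORT_ARRAY_MAX];
};

static void port_array_init(struct port_array *pa, enum port_kind kind)
{
	memset(pa, 0, sizeof(*pa));
	pa->kind = kind;
}

/* Slow path (control plane): all type checking happens here. */
static int port_array_insert(struct port_array *pa, unsigned int idx,
			     struct port p)
{
	if (idx >= PORT_ARRAY_MAX)
		return -E2BIG;
	if (p.kind != pa->kind || p.target == NULL)
		return -EINVAL;
	pa->slots[idx] = p;
	return 0;
}

/* Fast path (per packet): bounds check only, every entry has the same kind. */
static void *port_array_lookup(const struct port_array *pa, unsigned int idx)
{
	if (idx >= PORT_ARRAY_MAX)
		return NULL;
	return pa->slots[idx].target;
}
```

With this shape, trying to add a PORT_SOCKET entry to a PORT_NETDEV array fails with -EINVAL at insertion, and the lookup never has to look at the kind field.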
Do you imagine that the bpf-side bpf_tx_port() returns XDP_REDIRECT? Or does it return whether the call was successful (e.g. validating that port_num exists in the map)?
don't know :) we need to brainstorm pros and cons.
On the kernel side, we need to receive the "port_array" and "port_num" info. Given you don't pass an xdp_buff/ctx to the call, I assume you want the per-CPU temp-store solution. Then, during the XDP_REDIRECT action, we call a core redirect function that does a lookup based on the bpf_map_type and finds the net_device ptr.
hmm. didn't think that far either :) indeed makes sense to pass 'ctx' into such helper as well, so it's easier to deal with original netdev.
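For reference, one of the options being discussed (the helper takes 'ctx', stashes the target in a temp store, and returns a redirect verdict) could be sketched in userspace roughly as below. This is only an illustration under assumptions: all names (sketch_tx_port(), sketch_do_redirect(), struct redirect_scratch, the SK_* verdicts) are hypothetical, the single static variable stands in for a per-CPU variable, and the return-value question is still open in this thread.

```c
#include <stddef.h>

/* Hypothetical verdicts, named differently from the real XDP actions. */
enum sketch_action { SK_ABORTED = 0, SK_DROP, SK_PASS, SK_TX, SK_REDIRECT };

struct sketch_ctx {
	void *ingress_dev;	/* device the packet arrived on */
};

#define SKETCH_PORTS 8

struct sketch_port_map {
	void *dev[SKETCH_PORTS];	/* NULL means empty slot */
};

/* Stands in for a per-CPU temp store filled by the helper. */
struct redirect_scratch {
	const struct sketch_port_map *map;
	unsigned int port;
	void *ingress_dev;
};

static struct redirect_scratch scratch;

/* Helper called from the program: validate, remember target, return verdict. */
static enum sketch_action sketch_tx_port(const struct sketch_ctx *ctx,
					 const struct sketch_port_map *map,
					 unsigned int port)
{
	if (port >= SKETCH_PORTS || map->dev[port] == NULL)
		return SK_ABORTED;
	scratch.map = map;
	scratch.port = port;
	scratch.ingress_dev = ctx->ingress_dev;
	return SK_REDIRECT;
}

/* Core function the driver would call when it sees SK_REDIRECT. */
static void *sketch_do_redirect(void)
{
	void *dev = NULL;

	if (scratch.map != NULL)
		dev = scratch.map->dev[scratch.port];
	scratch.map = NULL;	/* consume the pending redirect */
	return dev;
}
```

The driver only needs to recognise the verdict and call the core function; it never interprets the map itself, which is the property that would let new map types be added without driver changes.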
We can make it similar to bpf_tail_call(), so that the program will finish on a successful bpf_tx_port(), or make it into a 'delayed' tx which will be executed when the program finishes. Not sure which approach is better.
I know you are talking about something slightly different, about delaying TX. But I want to mention (as I've done before) that it is important (for me) that we get bulking working/integrated. I imagine the driver will call a function that delays the TX/redirect action and, at the end of the NAPI cycle, a function that flushes packets, bulked per destination port. I was wondering where to store these delayed TX packets, but now that we have an associated bpf_map data structure (netdev_array), I'm thinking about storing packets (ordered by port) inside that. And then have a bpf_tx_flush(netdev_array) call in the driver (for every port-table-map seen, which will likely be a small number).
makes sense to me as well. Ideally we should try to design the api such that batching or no batching can be the kernel's choice. The program will just do xdp_tx_port(...something here...) and the kernel does what is best for performance. If it needs to delay the result to do batching, the api should allow that transparently. Like right now an xdp program does 'return XDP_TX;' and on the kernel side we can either do an immediate xmit (like we do today) or change it to do batching, and programs don't need to change.
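A rough userspace sketch of the per-port bulking idea, with the flush at the end of the poll cycle, might look like the following. The names (struct bulk_map, bulk_enqueue(), bulk_flush(), BULK_SIZE) are hypothetical, and the "transmit" is only simulated with counters. The point is that the caller of bulk_enqueue() cannot tell whether the packet went out immediately or was held back, so batching stays the kernel's choice.

```c
#include <stddef.h>

#define BULK_PORTS 4
#define BULK_SIZE 16

/* Packets waiting to be sent out of one destination port. */
struct bulk_queue {
	void *pkts[BULK_SIZE];
	unsigned int count;
};

/* Hypothetical map that carries per-port queues next to the port entries. */
struct bulk_map {
	struct bulk_queue q[BULK_PORTS];
	unsigned long sent[BULK_PORTS];		/* packets "transmitted" */
	unsigned long xmit_calls[BULK_PORTS];	/* number of bulk xmits */
};

/* Simulated bulk transmit of everything queued for one port. */
static void bulk_xmit(struct bulk_map *m, unsigned int port)
{
	struct bulk_queue *q = &m->q[port];

	if (q->count == 0)
		return;
	m->sent[port] += q->count;
	m->xmit_calls[port]++;
	q->count = 0;
}

/* Called per redirected packet: queue it, xmit early only if the queue is full. */
static int bulk_enqueue(struct bulk_map *m, unsigned int port, void *pkt)
{
	struct bulk_queue *q;

	if (port >= BULK_PORTS)
		return -1;
	q = &m->q[port];
	if (q->count == BULK_SIZE)
		bulk_xmit(m, port);
	q->pkts[q->count++] = pkt;
	return 0;
}

/* Called once at the end of the poll cycle for every map that was used. */
static void bulk_flush(struct bulk_map *m)
{
	unsigned int port;

	for (port = 0; port < BULK_PORTS; port++)
		bulk_xmit(m, port);
}
```

Setting BULK_SIZE to 1 in this sketch degenerates into near-immediate transmit without changing the caller, which is the transparency property asked for above.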
We can also extend this netdev_array into broadcast/multicast. Like a bpf_tx_allports(&netdev_array); call from the program will xmit the packet to all netdevices in that 'netdev_array' map type.
When broadcasting, you often don't want to broadcast the packet out of the incoming interface. How can you support this? Normally you would know your ingress port, and then exclude that port from the broadcast. But with many netdev_arrays, how does the program know its own ingress port?
absolutely! bpf_tx_allports() should somehow exclude the port packet arrived on. What you're proposing about passing 'ctx' into this helper, should solve it, I guess.
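As a sketch of how passing 'ctx' could solve the exclusion problem: if the context carries the device the packet arrived on, the helper can skip that device by pointer comparison, so the program does not need to know its own index in each map. The names below (sketch_tx_allports(), struct bcast_ctx, struct bcast_map) are hypothetical and the transmit is simulated with a counter.

```c
#include <stddef.h>

#define BCAST_PORTS 8

struct bcast_ctx {
	void *ingress_dev;	/* device the packet arrived on */
};

struct bcast_map {
	void *dev[BCAST_PORTS];			/* NULL means empty slot */
	unsigned int tx_count[BCAST_PORTS];	/* simulated transmits */
};

/* Send to every populated port except the one the packet came in on.
 * Returns the number of ports the packet was sent to. */
static unsigned int sketch_tx_allports(const struct bcast_ctx *ctx,
				       struct bcast_map *map)
{
	unsigned int port, sent = 0;

	for (port = 0; port < BCAST_PORTS; port++) {
		if (map->dev[port] == NULL)
			continue;
		if (map->dev[port] == ctx->ingress_dev)
			continue;	/* never reflect out the ingress device */
		map->tx_count[port]++;
		sent++;
	}
	return sent;
}
```

Because the comparison is on the device rather than the port number, the same program can broadcast into several maps in which its ingress device sits at different indexes, or is absent altogether.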
Thanks a lot for all this input, I got a much more clear picture of how I can/should implement this :-)
awesome :) Let's brainstorm more and get John's opinion on it as well, since it sounds like he'll be a heavy user of such an api.