On Wed, 26 Apr 2017 16:55:44 -0400 Andy Gospodarek <andy@xxxxxxxxxxxxx> wrote: > On Wed, Apr 26, 2017 at 10:58:45AM -0700, Alexei Starovoitov wrote: > > On 4/26/17 9:35 AM, John Fastabend wrote: > > > > > > > As Alexei also mentioned before, ifindex vs port makes no real > > > > difference seen from the bpf program side. It is userspace's > > > > responsibility to add ifindex/port's to the bpf-maps, according to how > > > > the bpf program "policy" want to "connect" these ports. The > > > > port-table system add one extra step, of also adding this port to the > > > > port-table (which lives inside the kernel). > > > > > > > > > > I'm not sure I understand the "lives inside the kernel" bit. I assumed > > > the 'map' should be a bpf map and behave like any other bpf map. > > > > > > I wanted a new map to be defined, something like this from the bpf programmer > > > side. > > > > > > struct bpf_map_def SEC("maps") port_table = > > > .type = BPF_MAP_TYPE_PORT_CONNECTION, > > > .key_size = sizeof(u32), > > > .value_size = BPF_PORT_CONNECTION_SIZE, > > > .max_entries = 256, > > > }; > > > > I like the idea. > > We have prog_array, perf_event_array, cgroup_array map specializations. > > This one can be new netdev_array with some new bpf_redirect-like helper > > accessing it. > > > > > > When loading the XDP program, we also need to pass along a port table > > > > "id" this XDP program is associated with (and if it doesn't exists you > > > > create it). And your userspace "control-plane" application also need > > > > to know this port table "id", when adding a new port. > > > > > > So the user space application that is loading the program also needs > > > to handle this map. This seems correct to me. But I don't see the > > > value in making some new port table when we already have well understood > > > framework for maps. > > > > +1 > > > > > > > > > > The concept of having multiple port tables is key. As this implies we > > > > can have several simultaneous "data-planes" that is *isolated* from > > > > each-other. Think about how network-namespaces/containers want > > > > isolation. A subtle thing I'm afraid to mention, is that oppose to the > > > > ifindex model, a port table with mapping to a net_device pointer, would > > > > allow (faster) delivery into the container's inner net_device, which > > > > sort of violates the isolation, but I would argue it is not a problem > > > > as this net_device pointer could only be added from a process within the > > > > namespace. I like this feature, but it could easily be disallowed via > > > > port insertion-time validation. > > > > > > > > > > I think the above optimization should be allowed. And agree multiple port > > > tables (maps?) is needed. Again all this points to using standard maps > > > logic in my mind. For permissions and different domains, which I think > > > you were starting to touch on, it looks like we could extend the pinning API. > > > At the moment it does an inode_permission(inode, MAY_WRITE) check but I > > > presume this could be extended. None of this would be needed in v1 and > > > could be added subsequently. read-only maps seems doable. > > > > this is great idea. Once BPF_MAP_TYPE_NETDEV_ARRAY is populated > > the user space can make it readonly to prevent further changes. > > > > From user space it can be done similar to perf_events/cgroups as well. > > bpf_map_update_elem(&netdev_array, &port_num, &ifindex) > > should work. > > For bpf_map_lookup_elem() from such netdev_array we can return > > ifindex back. > > The bpf_map_show_fdinfo() can be customized as well to pretty print > > ifindexes of netdevs stored in there. > > > > I agree with both of you on all of these points. Having the port > redirection in a new type of map and/or array seems like the way to go. > > I understood Jesper's perspecitive when thinking about a way to pass a > port-table id down, but I think the idea that the userspace loader code > defining the maps is going to be the one making this link is the right > idea and handling things like ifindex changes (rather than identifiers > that perform lookups in other tables) is going to have to be yet another > exercise left up to the...user. :-) > I love this idea. Integrating the port table closer with the bpf-maps infrastructure makes sense. This gives me a place to hook the code into, instead of (re)inventing a new infrastructure for this port table, and the interface will be more natural from a bpf-API point of view. When registering/attaching a XDP/bpf program, we would just send the file-descriptor for this port-map along (like we do with the bpf_prog FD). Plus, it own ingress-port number this program is in the port-map. It is not clear to me, in-which-data-structure on the kernel-side we store this reference to the port-map and ingress-port. As today we only have the "raw" struct bpf_prog pointer. I see several options: 1. Create a new xdp_prog struct that contains existing bpf_prog, a port-map pointer and ingress-port. (IMHO easiest solution) 2. Just create a new pointer to port-map and store it in driver rx-ring struct (like existing bpf_prog), but this create a race-challenge replacing (cmpxchg) the program (or perhaps it's not a problem as it runs under rcu and RTNL-lock). 3. Extend bpf_prog to store this port-map and ingress-port, and have a fast-way to access it. I assume it will be accessible via bpf_prog->bpf_prog_aux->used_maps[X] but it will be too slow for XDP. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer