On 4/26/17 9:35 AM, John Fastabend wrote:
As Alexei also mentioned before, ifindex vs port makes no real difference seen from the bpf program side. It is userspace's responsibility to add ifindex/port's to the bpf-maps, according to how the bpf program "policy" want to "connect" these ports. The port-table system add one extra step, of also adding this port to the port-table (which lives inside the kernel).I'm not sure I understand the "lives inside the kernel" bit. I assumed the 'map' should be a bpf map and behave like any other bpf map. I wanted a new map to be defined, something like this from the bpf programmer side. struct bpf_map_def SEC("maps") port_table = .type = BPF_MAP_TYPE_PORT_CONNECTION, .key_size = sizeof(u32), .value_size = BPF_PORT_CONNECTION_SIZE, .max_entries = 256, };
I like the idea. We have prog_array, perf_event_array, cgroup_array map specializations. This one can be new netdev_array with some new bpf_redirect-like helper accessing it.
When loading the XDP program, we also need to pass along a port table "id" this XDP program is associated with (and if it doesn't exists you create it). And your userspace "control-plane" application also need to know this port table "id", when adding a new port.So the user space application that is loading the program also needs to handle this map. This seems correct to me. But I don't see the value in making some new port table when we already have well understood framework for maps.
+1
The concept of having multiple port tables is key. As this implies we can have several simultaneous "data-planes" that is *isolated* from each-other. Think about how network-namespaces/containers want isolation. A subtle thing I'm afraid to mention, is that oppose to the ifindex model, a port table with mapping to a net_device pointer, would allow (faster) delivery into the container's inner net_device, which sort of violates the isolation, but I would argue it is not a problem as this net_device pointer could only be added from a process within the namespace. I like this feature, but it could easily be disallowed via port insertion-time validation.I think the above optimization should be allowed. And agree multiple port tables (maps?) is needed. Again all this points to using standard maps logic in my mind. For permissions and different domains, which I think you were starting to touch on, it looks like we could extend the pinning API. At the moment it does an inode_permission(inode, MAY_WRITE) check but I presume this could be extended. None of this would be needed in v1 and could be added subsequently. read-only maps seems doable.
this is great idea. Once BPF_MAP_TYPE_NETDEV_ARRAY is populated the user space can make it readonly to prevent further changes. From user space it can be done similar to perf_events/cgroups as well. bpf_map_update_elem(&netdev_array, &port_num, &ifindex) should work. For bpf_map_lookup_elem() from such netdev_array we can return ifindex back. The bpf_map_show_fdinfo() can be customized as well to pretty print ifindexes of netdevs stored in there.