On Mon, Aug 25, 2014 at 8:43 AM, Nicolas Dichtel <nicolas.dichtel@xxxxxxxxx> wrote:
> Le 25/08/2014 16:04, Andy Lutomirski a écrit :
>
>> On Aug 25, 2014 6:30 AM, "Nicolas Dichtel" <nicolas.dichtel@xxxxxxxxx>
>> wrote:
>>>>
>>>> CRIU wants to save the complete state of a namespace and then restore
>>>> it. For that to work, any information exposed to things in the
>>>> namespace *cannot* be globally unique or unique per boot, since CRIU
>>>> needs to arrange for that information to match whatever it was when
>>>> CRIU saved it.
>>>
>>> How are ifindexes of network devices managed? These ifindexes are
>>> unique per boot, and thus can change depending on the order in which
>>> netdevs are created. These ifindexes are unique per boot and exposed
>>> to userspace ...
>>
>> This does not appear to be true.
>>
>> $ sudo unshare --net
>> # ip link add veth0 type veth peer name veth1
>> # ip link
>> 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
>>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>> 2: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>>     link/ether 06:0d:59:c7:a6:a8 brd ff:ff:ff:ff:ff:ff
>> 3: veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>>     link/ether b2:5c:8b:f2:12:28 brd ff:ff:ff:ff:ff:ff
>> # logout
>> $ ip link
>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
>>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>> 3: em1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN qlen 1000
>>
> I've probably misunderstood what you're trying to say. ifindexes are
> unique per boot and per netns.

I think we both misunderstood each other. The ifindexes are unique *per
netns*, which means that, if you're unprivileged in a netns, global
information doesn't leak to you. I think this is good.

>> Let me try again, with emphasis in the right place.
>>
>> I think that *code running in a namespace* has no business even
>> knowing a unique identity of *that namespace* from the perspective of
>> the host.
>>
>> In your example, if there's a veth device between netns A and netns B,
>> then code *in netns A* has no business knowing the identity of its
>> veth peer if its peer (B) is a sibling or ancestor. It also IMO has
>> no business knowing the identity of its own netns (A) other than as
>> "my netns".
>
> I do not agree (see the example below).
>
>> If A and B are siblings, then their parent needs to know where that
>> veth device goes, but I think this is already the case to a sufficient
>> extent today.
>
> I'm not aware of a hierarchy between netns. A daemon should be able to
> get the full network configuration, even if it's started when this
> configuration is already applied, i.e. even if it doesn't know what
> happened before it started.

I don't know exactly which namespaces have an explicit hierarchy, but
there is certainly a hierarchy of *user* namespaces, and network
namespaces live in user namespaces, so they at least have somewhat of a
hierarchy.

>> I feel like this discussion is falling into a common trap of new API
>> discussions. Can one of you who wants this API please articulate,
>> with a reasonably precise example, what it is that you want to do, why
>> you can't easily do it already, and how this API helps?
>> I currently understand how the API creates problems, but I don't
>> understand how it solves any problems, and I will NAK it (and I suspect
>> that Eric will, too, which is pretty much fatal) unless that changes.
>
> What I'm trying to solve is to have full info in the netlink messages
> sent by the kernel, and thus be able to identify a peer netns (and this
> is close to what the audit guys are trying to have). Theoretically,
> messages sent by the kernel can be reused as-is to reproduce the same
> configuration. This is not the case with x-netns devices. Here is an
> example, with ip tunnels:
>
> $ ip netns add 1
> $ ip link add ipip1 type ipip remote 10.16.0.121 local 10.16.0.249 dev eth0
> $ ip -d link ls ipip1
> 8: ipip1@eth0: <POINTOPOINT,NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default
>     link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
>     ipip remote 10.16.0.121 local 10.16.0.249 dev eth0 ttl inherit pmtudisc
> $ ip link set ipip1 netns 1
> $ ip netns exec 1 ip -d link ls ipip1
> 8: ipip1@tunl0: <POINTOPOINT,NOARP,M-DOWN> mtu 1480 qdisc noop state DOWN mode DEFAULT group default
>     link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
>     ipip remote 10.16.0.121 local 10.16.0.249 dev tunl0 ttl inherit pmtudisc
>
> Now the information reported by 'ip link' is wrong and incomplete:
>  - the link dev is now tunl0 instead of eth0, because we only get an
>    ifindex from the kernel, without any netns information.
>  - the encapsulation addresses are not part of this netns, but the user
>    doesn't know that (again because the netns info is missing). These
>    IPv4 addresses may also exist in this netns.
>  - it's not possible to recreate the same netdevice from this
>    information.
>

Aha. That's a genuine problem. Perhaps we need a concept of which
netnses should be able to see each other.

I think I would be okay with a somewhat different outcome from your
example:

$ ip netns exec 1 ip -d link ls ipip1
8: ipip1@[unknown device in another namespace]: <POINTOPOINT,NOARP,M-DOWN> mtu 1480 qdisc noop state DOWN

I think this outcome is mandatory if netns 1 lives in a subsidiary user
namespace. Certainly, if you do the 'ip link' in the original namespace,
I agree that this should work.

For most namespace types, this all works transparently, since everything
has a real identity all the way up the hierarchy. Network namespaces are
different.

I don't think that exposing serial numbers in /proc is a good solution,
both for the reasons already described and because I don't think that
iproute2 should need to muck around with /proc to function correctly.

Eric, any clever ideas here? Do we need fancier netlink messages for
this?

--Andy
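
P.S. To make the ambiguity concrete, here is a minimal, illustrative
sketch (not anything from iproute2 or from an existing patch, just a
throwaway dumper) that requests an RTM_GETLINK dump and prints each
device's IFLA_LINK attribute. The point is that IFLA_LINK is a bare
ifindex with no netns qualifier, so a tool running in netns 1 can only
resolve it against its own namespace, which is how 'ipip1@eth0' turns
into 'ipip1@tunl0' above.

/*
 * Illustrative only: dump RTM_GETLINK over NETLINK_ROUTE and print each
 * device's own ifindex plus its IFLA_LINK attribute.  Error handling and
 * seq/pid checks are deliberately minimal.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <net/if.h>

int main(void)
{
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	struct {
		struct nlmsghdr nlh;
		struct ifinfomsg ifi;
	} req = {
		.nlh = {
			.nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
			.nlmsg_type  = RTM_GETLINK,
			.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
			.nlmsg_seq   = 1,
		},
		.ifi = { .ifi_family = AF_UNSPEC },
	};
	if (send(fd, &req, req.nlh.nlmsg_len, 0) < 0) {
		perror("send");
		return 1;
	}

	char buf[16384];
	int done = 0;
	while (!done) {
		ssize_t len = recv(fd, buf, sizeof(buf), 0);
		if (len <= 0)
			break;

		for (struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
		     NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
			if (nlh->nlmsg_type == NLMSG_DONE) {
				done = 1;
				break;
			}
			if (nlh->nlmsg_type != RTM_NEWLINK)
				continue;

			struct ifinfomsg *ifi = NLMSG_DATA(nlh);
			int alen = nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
			int link = 0;
			char name[IF_NAMESIZE] = "?";

			for (struct rtattr *rta = IFLA_RTA(ifi);
			     RTA_OK(rta, alen); rta = RTA_NEXT(rta, alen)) {
				if (rta->rta_type == IFLA_IFNAME)
					strncpy(name, RTA_DATA(rta),
						sizeof(name) - 1);
				else if (rta->rta_type == IFLA_LINK)
					link = *(int *)RTA_DATA(rta);
			}

			/*
			 * IFLA_LINK is just an integer.  A listener resolves
			 * it in its *current* netns, which is wrong when the
			 * link device actually lives in another namespace.
			 */
			printf("%d: %s  IFLA_LINK=%d\n",
			       ifi->ifi_index, name, link);
		}
	}
	close(fd);
	return 0;
}

Compile it with gcc and run it under 'ip netns exec 1' to see the raw
ifindex that 'ip' ends up mis-resolving in the example above.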