On Fri, Oct 09, 2009 at 07:44:01PM -0700, Stephen Hemminger wrote:
> Maybe I'm dense but can't see why having a useless /dev/net/ symlinks
> is a good interface choice. Perhaps you should explain the race between
> PCI scan and udev in more detail, and why solving it in either of those
> places won't work. As it stands you are proposing yet another wart to
> the already complex set of network interface API's which has implications
> for security as well as increasing the number of possible bugs.

The fundamental challenge is that system administrators, particularly those of server-class hardware with multiple network ports present (some on the motherboard, some on add-in cards), have the not-so-unreasonable expectation that there is a deterministic mapping between those ports and the name one uses to address them. The fundamental roadblock is that enumeration != naming - except that for network devices enumeration _is_ naming, and we keep changing the enumeration order. Today, port naming is completely nondeterministic. If you have but one NIC, there's little chance of getting the name wrong (it'll be eth0). If you have more than one NIC, the chances of getting it wrong grow.

The complexity arises at multiple levels.

First, device driver load order. In the 2.4 kernel days, and even most of the early 2.6 kernel days, the order in which network drivers loaded determined the names of their devices: drivers loaded first got their devices named first. If I had two types of devices, say an e100-driven NIC and a tg3-driven NIC, I could ensure that eth0=e100 and eth1=tg3 by setting the load order in /etc/modules.conf (later modprobe.conf), as sketched below. If I wanted the other order, fine - just switch it around in modules.conf and reboot. OS installers, being the first running instance of Linux, before modprobe.conf existed to set that ordering, had to have other mechanisms to load drivers (often manual; even when done programmatically, as from a kickstart or autoyast file, the order was still essentially fixed). With the advent of modaliases + udev, modprobe.conf doesn't contain this ordering anymore, and udev loads the drivers. So while the old scheme wasn't perfect, it was better than nothing, and it's gone now. It gets even worse: to speed up boot time, modprobes can run in parallel, and even within an individual driver the NICs get initialized (and named) in parallel. Further confusing things, some devices need firmware loaded into them from userspace before names are assigned, and those loads race each other.

Second, PCI device list order. In the 2.4 kernel days, the PCI device list was scanned breadth-first (for each bus; for each device; for each function; do load...). FWIW, Windows still does this. It gives the BIOS, which assigns PCI bus numbers, a chance to put LOMs at a lower bus number than add-in cards. Module load order still mattered, but at least if you had, say, two e1000 ports as LOMs and two e1000 ports on add-in cards, you pretty much knew the ordering: eth0 at the lowest BDF on the motherboard, eth1 at the next BDF on the motherboard, and eth2 and eth3 as the add-in cards in ascending slot order. With the advent of PCI hotplug in the 2.5 kernel series, the breadth-first ordering became depth-first (for each bus; for each device; if the device is a bridge, scan the buses behind it). This caused NICs at bus 0 device 5 and bus 1 device 3 (eth0 and eth1 respectively) to be enumerated in the opposite order, because the bridge from bus 0 to bus 1 at 0:4 is descended before bus 0 device 5 is ever reached. A toy model of both scan orders follows below.
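For concreteness on the first point, the modules.conf-era ordering looked roughly like this (a from-memory sketch of the old syntax, using the e100/tg3 example above):

    # /etc/modules.conf (later /etc/modprobe.conf)
    # Bind names to drivers: eth0 is always the e100 port, eth1 always
    # the tg3 port.  Swap the two lines (and reboot) to swap the names.
    alias eth0 e100
    alias eth1 tg3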
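And here is a toy model of the second point's scan-order flip - a standalone sketch, not kernel code, using the example topology above (a bridge at 00:04.0 leading to bus 1, NICs at 00:05.0 and 01:03.0):

    /* Toy model of the two PCI scan orders described above; not kernel
     * code.  Topology mimics the example in the text: a bridge at
     * 00:04.0 leading to bus 1, a NIC at 00:05.0, a NIC at 01:03.0. */
    #include <stdio.h>

    struct pcidev {
        const char *bdf;    /* bus:device.function */
        int secondary;      /* bus behind this bridge, or -1 for a NIC */
    };

    /* Devices on each bus, in slot order. */
    static const struct pcidev bus0[] = { { "00:04.0", 1 }, { "00:05.0", -1 } };
    static const struct pcidev bus1[] = { { "01:03.0", -1 } };
    static const struct pcidev *buses[] = { bus0, bus1 };
    static const int ndevs[] = { 2, 1 };

    static int nic;

    static void name_nic(const char *bdf)
    {
        printf("  eth%d = %s\n", nic++, bdf);
    }

    /* 2.4-style breadth-first: finish every device on a bus before
     * moving to the next bus number. */
    static void breadth_first(void)
    {
        for (int b = 0; b < 2; b++)
            for (int d = 0; d < ndevs[b]; d++)
                if (buses[b][d].secondary < 0)
                    name_nic(buses[b][d].bdf);
    }

    /* 2.5/2.6-style depth-first: descend through a bridge the moment
     * it is found, before finishing the current bus. */
    static void depth_first(int b)
    {
        for (int d = 0; d < ndevs[b]; d++) {
            if (buses[b][d].secondary >= 0)
                depth_first(buses[b][d].secondary);
            else
                name_nic(buses[b][d].bdf);
        }
    }

    int main(void)
    {
        puts("breadth-first (2.4, Windows):");
        breadth_first();
        nic = 0;
        puts("depth-first (2.6):");
        depth_first(0);
        return 0;
    }

Run it and the names swap: breadth-first yields eth0 = 00:05.0, depth-first yields eth0 = 01:03.0, because the bridge at slot 4 is walked before slot 5.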
My crude hack of pci=bfsort, with some DMI strings to match and auto-enable, at least reverted this back to the ordering the 2.4 kernel and Windows used. But now we have to keep adding systems to this DMI list (Dell has a number of systems on the list today; HP has even more). And it doesn't completely solve the problem, just masks it. So, to address the ordering problem, I placed a constraint on our server hardware teams, forcing them to lay out their boards and assign PCIe lanes and bus numbers such that at least the designated "first" LOM would be found first in either depth-first or breadth-first order. Our 10G and 11G servers have this restriction in place, though it wasn't easy. And it's gotten even harder, as PCIe switches expand the number of lanes available: we no longer have the traditional tiered-bus architecture, but for this purpose the PCI layer thinks we do. I need to remove this constraint on the hardware teams - it has become impossible to lay out the chipset lanes efficiently under it. All of the above just papered over the enumeration != naming problem.

Third, stateless computing is becoming more and more commonplace. The Field Replaceable Unit is the server itself. Got a bad server? Pull it out, move the disks to an identical unit, insert the new server, and go. Fix the bad server offline and bring it back. In this model, using MAC addresses as the mechanism providing determinism (/etc/mactab or udev persistent naming rules; one such rule is sketched below) breaks, because the MAC addresses of the ports on the new server won't match those of the old one. HP even has a technology to solve _this_ problem (in their blade chassis) - Virtual Connect: the MACs get assigned by the chassis to the blades at POST, and are fixed to the slot. Slick, and Dell has an even more flexible similar feature, FlexAddress. This doesn't solve the OS installer problem of "which of these NICs should I use to do an install?", but it does recognize the problem space and tries to overcome it.

Fourth, for OS installers, choosing which NIC to use at install time, when all the NICs are plugged in, can be difficult. PXE environments using pxelinux and its IPAPPEND 2 option (a config sketch follows below) will append "BOOTIF=xx:xx:xx:xx:xx:xx" to the kernel command line, containing the MAC address of the NIC used for PXE. Neat trick. Yes, we then had to teach the OS installers to recognize and use this. But it only works if you PXE boot, and only for that one NIC.

Fifth, network devices can have only a single name: eth0. If we look at disks, we see udev manages a tree of symlinks in /dev/disk/by-label, /dev/disk/by-path, and /dev/disk/by-uuid. And as a system admin, if I wanted to also create a udev rule for /dev/disk/by-function (boot, swap, mattsstorage), it's trivial to do so (an example rule appears below). Why can't we have this flexibility for network devices too?

So, how do we get deterministic naming for all the NICs in a system? That's what I'm going for. Picture a network switch with several blades and several ports on each blade. The network admin addresses each port as, say, 1/16 (the 16th port on blade 1, clearly labeled). The parallel on servers is the chassis label printed on the outside (say, "Gb1"). But due to all of the above, there is no guarantee, and in fact little chance, that Gb1 will be consistently named eth0 - it may vary from boot to boot. That's full of fail.
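The kind of MAC-keyed rule the third point is about looks roughly like this (a simplified sketch of what udev generates - real rules carry a few more match keys, and these MAC addresses are made up):

    # /etc/udev/rules.d/70-persistent-net.rules
    # Pin names to MAC addresses.  Move the disks to a replacement
    # chassis and neither rule matches, so naming falls back to
    # nondeterministic first-come-first-served.
    SUBSYSTEM=="net", DRIVERS=="?*", ATTR{address}=="00:1e:c9:aa:bb:01", NAME="eth0"
    SUBSYSTEM=="net", DRIVERS=="?*", ATTR{address}=="00:1e:c9:aa:bb:02", NAME="eth1"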
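The fourth point's pxelinux setup is roughly this (kernel and initrd filenames are illustrative):

    # pxelinux.cfg/default
    DEFAULT install
    LABEL install
      KERNEL vmlinuz
      APPEND initrd=initrd.img
      # IPAPPEND 2 makes pxelinux add the booting NIC's MAC address
      # as a BOOTIF= argument on the kernel command line, which the
      # installer can then match against its NICs.
      IPAPPEND 2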
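And the fifth point's trivial by-function rule for disks might look like this (a hypothetical local rule; the serial number is made up):

    # /etc/udev/rules.d/99-by-function.rules
    # Name a disk by what it's for, not by where it landed.
    SUBSYSTEM=="block", ENV{ID_SERIAL}=="WD-WMAP41234567", SYMLINK+="disk/by-function/mattsstorage"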
For a concrete example: with the 4 bnx2 chips in my PowerEdge R610 on a current 2.6 kernel, loading only one driver, the ports get assigned names in nondeterministic order on each boot. Given that the ifcfg-eth* files, netfilter rules, and the rest all expect deterministic naming, massive failure ensues unless some form of determinism is brought back.

The idea of using a character device node to expose the ifindex value, with udev managing a tree of symlinks to it, really follows the model used today for disks. It gives us deterministic names for devices (albeit as symlinks), and multiple names per device (through multiple symlink rules). That some people want to use the char device to call ioctl() and read()/write(), as is possible on the BSDs, would just be gravy IMHO.

It does require a change in behavior from the system administrator. Instead of hard-coding 'eth0' into her scripts, she uses '/dev/net/by-function/boot' or somesuch. But then that name is guaranteed to always refer to the "right" NIC. Every admin I've spoken to is willing to make this kind of change, as long as they get the consistent, deterministic naming they expect but don't have today. It also requires patching userspace apps to accept either a kernel device name or a path, and to resolve the path to a device name or ifindex. We wrote libnetdevname (really, one function), and have patches for several userspace apps to use it, to prove it can be done.

One alternative would be to do something with the ifindex value already exported in sysfs, e.g. /sys/devices/pci0000:00/0000:00:05.0/0000:05:00.0/0000:06:07.0/net/eth0/ifindex, though we have never had symlinks from /dev into /sys before (which doesn't mean we couldn't). In that case, udev would grow to manage /dev/net/by-chassis-label/Embedded_NIC_1 -> /sys/devices/.../net/eth0, and libnetdevname would be used to follow the symlink in applications. This approach could solve my problem with few (or no?) kernel changes, but wouldn't help those who want to do ioctl/read/write on a devnode. A sketch of the one function such a scheme needs follows below.

Given the problem, I really do need a solution. I've proposed one method and an alternative, but I can't afford to let the problem stay unaddressed any longer, and need a clear direction to be chosen. The char device gives me what I need, and others what they want too.

Thanks for listening to the diatribe. For more examples, and the workarounds we've been telling our customers about for several years, see the Network Interface Card Naming whitepaper at http://linux.dell.com/papers.shtml.
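Under the sysfs-symlink alternative, that one function would be roughly this - an illustrative sketch, not the actual libnetdevname source, with error handling trimmed:

    /* Sketch of a libnetdevname-style resolver for the sysfs-symlink
     * alternative: illustrative only, not the actual library source. */
    #include <stdio.h>
    #include <string.h>
    #include <limits.h>
    #include <unistd.h>
    #include <net/if.h>

    /* Resolve either a plain interface name ("eth0") or a symlink such
     * as /dev/net/by-chassis-label/Embedded_NIC_1 (pointing at
     * /sys/devices/.../net/eth0) to the kernel interface name. */
    static int netdev_resolve(const char *path, char *ifname, size_t len)
    {
        char target[PATH_MAX];
        ssize_t n = readlink(path, target, sizeof(target) - 1);

        if (n < 0) {
            /* Not a symlink: treat the argument as the name itself. */
            snprintf(ifname, len, "%s", path);
            return 0;
        }
        target[n] = '\0';

        /* The interface name is the final path component,
         * e.g. ".../net/eth0" -> "eth0". */
        const char *base = strrchr(target, '/');
        snprintf(ifname, len, "%s", base ? base + 1 : target);
        return 0;
    }

    int main(int argc, char **argv)
    {
        char ifname[IF_NAMESIZE];

        if (argc != 2) {
            fprintf(stderr, "usage: %s <name-or-path>\n", argv[0]);
            return 1;
        }
        netdev_resolve(argv[1], ifname, sizeof(ifname));
        printf("%s -> ifindex %u\n", ifname, if_nametoindex(ifname));
        return 0;
    }

With something like that in place, both 'eth0' and '/dev/net/by-function/boot' can name the same port, and only the latter survives a reordering.

--
Matt Domsch
Technology Strategist, Dell Office of the CTO
linux.dell.com & www.dell.com/linux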