On Mon, Apr 13, 2020 at 04:37:16PM -0400, Tejun Heo wrote:
> Hello,
>
> On Mon, Apr 13, 2020 at 09:59:15PM +0200, Christian Brauner wrote:
> > Right, pid namespaces deal with a single random identifier about which
> > userspace makes no assumptions other than that it's a positive number so
> > generating aliases is fine. In addition pid namespaces are nicely
>
> I don't see any fundamental differences between pids and device numbers. One
> of the reasons pid namespace does aliasing instead of just showing subsets is
> because applications can have expectations on what the specific numbers should
> be - e.g. for checkpoint-restarting.

One difference is that ownership is hierarchical in a pid namespace.
This becomes clear when looking at the parent-child relationship when
creating new processes in nested pid namespaces. All processes created
in the innermost pid namespace are owned by that pid namespace's init
process. If that pid namespace's init/subreaper process dies, all
processes get zapped and autoreaped. In essence, unless the ancestor pid
namespace has setns()ed a process in there, ownership of that process is
clearly defined. I don't think that model is transferable to a device.

What seems most important to me here is that a pid namespace completely
defines ownership of a process. But there's not necessarily a single
namespace that guarantees ownership for all device types. Network
devices, imho, are a good example of that. Their full ownership is
actually network namespace + user namespace. You could easily envision
other device classes where a combination of namespaces would make sense.

>
> > hierarchical. I fear that we might introduce unneeded complexity if we
> > go this way and start generating aliases for devices that userspace
>
> It adds complexity for sure but the other side of the scale is losing
> visiblity into what devices are on the system, which can become really nasty
> in practice, so I do think it can probably justify some additional complexity
> especially if it's something which can be used by different devices. Even just
> for block, if we end up expanding ns support to regular block devices for some
> reason, it's kinda dreadful to think a situation where all devices can't be
> discovered at the system level.

Hm, it is already the case that we can't see all devices at the system
level. That includes network devices and also binderfs devices (the
latter don't have sysfs entries though, which is what this is about).
And for virtual devices such as loop, binder, and network devices this
is fine imho. They are not physically attached to your computer. Actual
disk devices where this would matter wouldn't be namespaced anyway imho.

We also need to consider that it is potentially dangerous for a
namespace to trigger a device event tricking the host into performing an
action on it. If e.g. the creation of a network device were to propagate
into all namespaces and there'd be a rule matching it, you could trick
the host into performing privileged actions on it. So it also isn't
obviously safe to propagate devices out of their namespace. (I fixed
something similar to this just recently in a sysfs series.) In addition,
the file ownership permissions would propagate from the inner to all
outer sysfs instances as well, which would mean you'd suddenly have
100000:100000 entries in your host's sysfs in the initial namespace.

>
> > already knows about and has expectations of. We also still face some of
> > the other problems I mentioned.
> > I do think that what you say might make sense to explore in more detail
> > for a new device class (or type under a given class) that userspace does
> > not yet know about and were we don't regress anything.
>
> I don't quite follow why adding namespace support would break existing users.
> Wouldn't namespace usage be opt-in?

For sysfs, this change is opt-in per device type and it only applies to
loop devices here, i.e. if you don't use e.g. loopfs nothing changes for
you at all. If you use it, all that you get is correct ownership for
sysfs entries for those loop devices accounted to you, in addition to
all the other entries that have always been there. This way we can
handle legacy workloads cleanly, which we really want for our use-case.

Your model would effectively require a new version of sysfs where you
e.g. mount it with a new option that zaps all device entries that don't
belong to non-initial user namespaces. Which would mean most major tools
in containers will break completely. We can still totally try to bring
up a change like this independent of this patchset. This patchset
doesn't rule this out at all.

Christian