On Fri, 2020-02-28 at 09:35 +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:43 AM Ian Kent <raven@xxxxxxxxxx> wrote:
>
> > > I'm not sure about sysfs/, you need somehow resolve namespaces,
> > > order of the mount entries (which one is the last one), etc. IMHO
> > > translate mountpoint path to sysfs/ path will be complicated.
> >
> > I wonder about that too, after all sysfs contains a tree of nodes
> > from which the view is created unlike proc which translates kernel
> > information directly based on what the process should see.
> >
> > We'll need to wait a bit and see what Miklos has in mind for mount
> > table enumeration and nothing has been said about name spaces yet.
>
> Adding Greg for sysfs knowledge.
>
> As far as I understand the sysfs model is, basically:
>
>   - list of devices sorted by class and address
>   - with each class having a given set of attributes
>
> Superblocks and mounts could get enumerated by a unique identifier.
> mnt_id seems to be good for mounts, s_dev may or may not be good for
> superblock, but s_id (as introduced in this patchset) could be used
> instead.
>
> As for namespaces, that's "just" an access control issue, AFAICS.

That's an easy thing to say but not an easy thing to check: it can be
made so for label based namespaces like the network, but the mount
namespace is shared/cloned tree based.  Assessing whether a given
superblock is within your current namespace root can become a large
search exercise.  You can see how much of one in fs/proc_namespaces.c
which controls how /proc/self/mounts appears in your current namespace.

> For example a task with a non-initial mount namespace should not have
> access to attributes of mounts outside of its namespace.  Checking
> access to superblock attributes would be similar: scan the list of
> mounts and only allow access if at least one mount would get access.

That scan can be expensive as I explained above.  That's really why I
think this is a bad idea.
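To make the cost of that scan concrete, here is a minimal userspace sketch in Python. It is only an analogy, not what the kernel does: it assumes the mount table is given as text in the /proc/self/mountinfo format, and it assumes a superblock can be identified by its major:minor device number (the thread itself notes s_dev may not be good enough for that). The function names are made up for illustration. The point is simply that every access check walks the whole list of visible mounts.

```python
def visible_mounts(mountinfo_text):
    """Parse mountinfo-format text into (mnt_id, dev, mount_point) tuples.

    Per proc(5), the leading fields of each line are: mount ID,
    parent ID, major:minor, root, mount point.
    """
    mounts = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        mounts.append((int(fields[0]), fields[2], fields[4]))
    return mounts


def superblock_visible(mountinfo_text, dev):
    """Allow access only if at least one visible mount uses this device.

    This is a linear scan over every mount in the namespace, repeated
    for each check (hypothetical helper, for illustration only).
    """
    return any(m_dev == dev for _, m_dev, _ in visible_mounts(mountinfo_text))


# Hand-written sample in mountinfo format, not taken from a real system.
SAMPLE = (
    "22 1 98:0 / / rw,noatime - ext3 /dev/root rw\n"
    "36 22 0:21 / /proc rw,nosuid - proc proc rw\n"
)
```

With the sample above, `superblock_visible(SAMPLE, "98:0")` finds a matching mount, while a device that has no mount in the list is refused.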
Sysfs itself is currently nicely restricted to system information that
most containers don't need to know, so a lot of the sysfs issues with
containers can be solved by not mounting it.  If you suddenly make it
required for filesystem information and notifications, that security
measure gets blown out of the water.

> > While fsinfo() is not similar to proc it does handle name spaces
> > in a sensible way via. file handles, a bit similar to the proc fs,
> > and ordering is catered for in the fsinfo() enumeration in a
> > natural way. Not sure how that would be handled using sysfs ...
>
> I agree that the access control is much more straightforward with
> fsinfo(2) and this may be the single biggest reason to introduce a
> new syscall.
>
> Let's see what others thing.

Containers are file based entities, so file descriptors are their most
natural thing and they have full ACL protection within the container
(can't open the file, can't then get the fd).  The other reason
container people like file descriptors (all the Xat system calls that
have been introduced) is that if we do actually need to break the
boundaries or privileges of the container, we can do so by getting the
orchestration system to pass in a fd the interior of the container
wouldn't have access to.

James
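The fd-passing idea in the last paragraph can be sketched in userspace with SCM_RIGHTS over a Unix socket. This is a rough illustration under stated assumptions, not an orchestration system: a socketpair stands in for the channel between orchestrator and container, a temporary file stands in for something outside the container, and `pass_fd_demo` is a made-up name. It uses Python's `socket.send_fds` and `socket.recv_fds` (Python 3.9 or later, Unix only).

```python
import os
import socket
import tempfile


def pass_fd_demo():
    """Pass an open fd across a Unix socket and read through it."""
    # Stand-in for the orchestrator <-> container channel (assumption).
    orchestrator, container = socket.socketpair(socket.AF_UNIX,
                                                socket.SOCK_STREAM)
    try:
        with tempfile.TemporaryDirectory() as tmpdir:
            path = os.path.join(tmpdir, "outside")
            with open(path, "w") as f:
                f.write("outside data")

            # "Orchestrator" side: open the file and send the descriptor.
            fd = os.open(path, os.O_RDONLY)
            socket.send_fds(orchestrator, [b"fd"], [fd])
            os.close(fd)

            # Remove the name: the receiver could no longer open it by
            # path, but the descriptor it was handed still works.
            os.unlink(path)

            # "Container" side: receive the descriptor and use it.
            _msg, fds, _flags, _addr = socket.recv_fds(container, 1024, 1)
            try:
                return os.read(fds[0], 100)
            finally:
                for received in fds:
                    os.close(received)
    finally:
        orchestrator.close()
        container.close()
```

The receiver never resolves a path, so whatever path-based access checks apply inside the container are not involved; access is decided by whoever opened the file and chose to hand the descriptor over.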