Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Fri, 28 Feb 2020 07:08:41 -0800

On Fri, 2020-02-28 at 09:35 +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:43 AM Ian Kent <raven@xxxxxxxxxx> wrote:
> 
> > > I'm not sure about sysfs/, you need to somehow resolve namespaces,
> > > order of the mount entries (which one is the last one), etc. IMHO
> > > translating a mountpoint path to a sysfs/ path will be complicated.
> > 
> > I wonder about that too, after all sysfs contains a tree of nodes
> > from which the view is created unlike proc which translates kernel
> > information directly based on what the process should see.
> > 
> > We'll need to wait a bit and see what Miklos has in mind for mount
> > table enumeration and nothing has been said about name spaces yet.
> 
> Adding Greg for sysfs knowledge.
> 
> As far as I understand the sysfs model is, basically:
> 
>   - list of devices sorted by class and address
>   - with each class having a given set of attributes
> 
> Superblocks and mounts could get enumerated by a unique identifier.
> mnt_id seems to be good for mounts, s_dev may or may not be good for
> superblock, but s_id (as introduced in this patchset) could be used
> instead.
> 
> As for namespaces, that's "just" an access control issue, AFAICS.

That's an easy thing to say but not an easy thing to check:  it can be
made so for label based namespaces like the network, but the mount
namespace is shared/cloned tree based.  Assessing whether a given
superblock is within your current namespace root can become a large
search exercise.  You can see how much of one in fs/proc_namespace.c
which controls how /proc/self/mounts appears in your current namespace.
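
To give a feel for the shape of that search, here is a rough userspace
sketch in Python.  It is not what the kernel does: it only parses text
in the /proc/self/mountinfo format and asks whether any mount visible
in that text sits on a given superblock (identified by major:minor).
The names Mount, parse_mountinfo, superblock_visible and SAMPLE are all
made up for illustration, and the sketch ignores the harder part,
deciding whether each mount is actually reachable from the caller's
root.

```python
from typing import List, NamedTuple


class Mount(NamedTuple):
    """First five fields of one /proc/<pid>/mountinfo line."""
    mnt_id: int
    parent_id: int
    dev: str         # "major:minor" of the superblock
    root: str        # root of the mount within the filesystem
    mountpoint: str  # mount point relative to the process's root


def parse_mountinfo(text: str) -> List[Mount]:
    """Parse mountinfo-style text into Mount records (hypothetical helper)."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        mounts.append(
            Mount(int(fields[0]), int(fields[1]), fields[2], fields[3], fields[4])
        )
    return mounts


def superblock_visible(mounts: List[Mount], dev: str) -> bool:
    """Linear scan: allow access if at least one visible mount uses `dev`."""
    return any(m.dev == dev for m in mounts)


# Illustrative data only, not taken from a real system.
SAMPLE = """\
22 1 8:1 / / rw,relatime - ext4 /dev/sda1 rw
23 22 0:21 / /proc rw,nosuid - proc proc rw
36 22 8:1 /var/lib/data /data rw,relatime - ext4 /dev/sda1 rw
"""

mounts = parse_mountinfo(SAMPLE)
```

Even in this toy form the check is a walk over every mount in the
namespace for each superblock queried, which is the cost I'm worried
about.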

> For example a task with a non-initial mount namespace should not have
> access to attributes of mounts outside of its namespace.  Checking
> access to superblock attributes would be similar: scan the list of
> mounts and only allow access if at least one mount would get access.

That scan can be expensive as I explained above.  That's really why I
think this is a bad idea.  Sysfs itself is currently nicely restricted
to system information that most containers don't need to know, so a lot
of the sysfs issues with containers can be solved by not mounting it.
If you suddenly make it required for filesystem information and
notifications, that security measure gets blown out of the water.

> > While fsinfo() is not similar to proc it does handle name spaces
> > in a sensible way via file handles, a bit similar to the proc fs,
> > and ordering is catered for in the fsinfo() enumeration in a
> > natural way. Not sure how that would be handled using sysfs ...
> 
> I agree that the access control is much more straightforward with
> fsinfo(2) and this may be the single biggest reason to introduce a
> new syscall.
> 
> Let's see what others think.

Containers are file based entities, so file descriptors are their most
natural thing and they have full ACL protection within the container
(can't open the file, can't then get the fd).  The other reason
container people like file descriptors (all the Xat system calls that
have been introduced) is that if we do actually need to break the
boundaries or privileges of the container, we can do so by getting the
orchestration system to pass in a fd the interior of the container
wouldn't have access to.
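
As a minimal sketch of that last point, assuming Linux and Python 3.9+
(for socket.send_fds/recv_fds, which wrap SCM_RIGHTS): one end of a
Unix socket pair stands in for the orchestration system and the other
for the interior of the container.  Both ends live in one process here
purely to keep the example self-contained, and pass_fd_demo is a
made-up name.

```python
import os
import socket
import tempfile


def pass_fd_demo() -> bytes:
    """Hand an open fd across a Unix socket and read through it."""
    outside, inside = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "secret.txt")
            with open(path, "wb") as f:
                f.write(b"hello from outside\n")

            # "Orchestrator" side: open the file and send the fd itself.
            fd = os.open(path, os.O_RDONLY)
            try:
                socket.send_fds(outside, [b"x"], [fd])
            finally:
                os.close(fd)  # the in-flight copy keeps the file open

        # The temporary directory (and so the path) is gone by now.
        # "Interior" side: it never sees a path, only the received fd.
        _msg, fds, _flags, _addr = socket.recv_fds(inside, 1, 1)
        received = fds[0]
        try:
            return os.read(received, 64)
        finally:
            os.close(received)
    finally:
        outside.close()
        inside.close()
```

The receiver gets access to exactly one object and nothing else, which
is the property that makes fd based interfaces attractive here.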

James