Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

Steven Rostedt <rostedt@xxxxxxxxxxx> · Fri, 12 Jan 2024 08:53:44 -0500

On Fri, 12 Jan 2024 09:27:24 +0100
Christian Brauner <brauner@xxxxxxxxxx> wrote:

> On Thu, Jan 11, 2024 at 04:53:19PM -0500, Steven Rostedt wrote:
> > On Thu, 11 Jan 2024 22:01:32 +0100
> > Christian Brauner <brauner@xxxxxxxxxx> wrote:
> >   
> > > What I'm pointing out in the current logic is that the caller is
> > > taxed twice:
> > > 
> > > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs")
> > > (2) And again when you call lookup_one_len() in eventfs_start_creating()
> > >     _because_ the permission check in lookup_one_len() is the exact
> > >     same permission check again that the vfs has done
> > >     inode_permission(MAY_EXEC, "xfs").  
> > 
> > As I described in: https://lore.kernel.org/all/20240110133154.6e18feb9@xxxxxxxxxxxxxxxxxx/
> > 
> > The eventfs files below "events" doesn't need the .permissions callback at
> > all. It's only there because the "events" inode uses it.
> > 
> > The .permissions call for eventfs has:  
> 
> It doesn't matter whether there's a ->permission handler. If you don't
> add one explicitly the VFS will simply call generic_permission():
> 
> inode_permission()
> -> do_inode_permission()  
>    {
>         if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
>                 if (likely(inode->i_op->permission))
>                         return inode->i_op->permission(idmap, inode, mask);
>                
>                 /* This gets set once for the inode lifetime */
>                 spin_lock(&inode->i_lock);
>                 inode->i_opflags |= IOP_FASTPERM;
>                 spin_unlock(&inode->i_lock);
>         }
>         return generic_permission(idmap, inode, mask);
>    }

Yes I know that, because that's where I knew what to call in the non
"events" dir case.

> 
> > Anyway, the issue is with "events" directory and remounting, because like
> > the tracefs system, the inode and dentry for "evnets" is created at boot
> > up, before the mount happens. The VFS layer is going to check the
> > permissions of its inode and dentry, which will be incorrect if the mount
> > was mounted with a "gid" option.  
> 
> The gid option has nothing to do with this and it is just handled fine
> if you remove the second permission checking in (2).

I guess I'm confused to what you are having an issue with. Is it just
that the permission check gets called twice?

> 
> You need to remove the inode_permission() code from
> eventfs_start_creating(). It is just an internal lookup and the fact
> that you have it in there allows userspace to break readdir on the
> eventfs portions of tracefs as I've shown in the parts of the mail that
> you cut off.

That's because I didn't see how it was related to the way I fixed the
mount=gid issue. Are you only concerned because of the check in
eventfs_start_creating()?

Yes, you posted code that would make things act funny for some code
that I see no real world use case for. Yeah, it may not act "properly"
but I'm not sure that's bad.

Here, I'll paste it back:

> // We managed to open the directory so we have permission to list
> // directory entries in "xfs".
> fd = open("/sys/kernel/tracing/events/xfs");
> 
> // Remove ownership so we can't open the directory anymore
> chown("/sys/kernel/tracing/events/xfs", 0, 0);
> 
> // Or just remove exec bit for the group and restrict to owner
> chmod("/sys/kernel/tracing/events/xfs", 700);
> 
> // Drop caches to force an eventfs_root_lookup() on everything
> write("/proc/sys/vm/drop_caches", "3", 1);

This requires opening the directory, then having it's permissions
change, and then immediately dropping the caches.

> 
> // Returns 0 even though directory has a lot of entries and we should be
> // able to list them
> getdents64(fd, ...);

And do we care?

Since tracing exposes internal kernel information, perhaps this is a
feature and not a bug. If someone who had access to the tracing system
and you wanted to stop them, if they had a directory open that they no
longer have access to, you don't want them to see what's left in the
directory.

In other words, I like the idea that the getends64(fd, ...) will fail!

If there's a file underneath that wasn't change, and the admin thought
that just keeping the top directory permissions off is good enough,
then that attacker having that directory open before the directory had
it's file permissions change is a way to still have access to the files
below it.

> 
> And the failure is in the inode_permission(MAY_EXEC, "xfs") call in
> lookup_one_len() in eventfs_start_creating() which now fails.

And I think is a good thing!

Again, tracefs is special. It gives you access and possibly control to
the kernel behavior. I like the fact that as soon as someone loses
permission to a directory, they immediately lose it.

-- Steve