Re: [RFC][PATCH] fanotify: introduce filesystem view mark

Jan Kara <jack@xxxxxxx> · Mon, 10 May 2021 12:13:05 +0200

On Wed 05-05-21 16:24:05, Christian Brauner wrote:
> On Wed, May 05, 2021 at 02:28:15PM +0200, Jan Kara wrote:
> > On Mon 03-05-21 21:44:22, Amir Goldstein wrote:
> > > > > Getting back to this old thread, because the "fs view" concept that
> > > > > it presented is very close to two POCs I tried out recently which leverage
> > > > > the availability of mnt_userns in most of the call sites for fsnotify hooks.
> > > > >
> > > > > The first POC was replacing the is_subtree() check with in_userns()
> > > > > which is far less expensive:
> > > > >
> > > > > https://github.com/amir73il/linux/commits/fanotify_in_userns
> > > > >
> > > > > This approach reduces the cost of check per mark, but there could
> > > > > still be a significant number of sb marks to iterate for every fs op
> > > > > in every container.
> > > > >
> > > > > The second POC is based off the first POC but takes the reverse
> > > > > approach - instead of marking the sb object and filtering by userns,
> > > > > it places a mark on the userns object and filters by sb:
> > > > >
> > > > > https://github.com/amir73il/linux/commits/fanotify_idmapped
> > > > >
> > > > > The common use case is a single host filesystem which is
> > > > > idmapped via individual userns objects to many containers,
> > > > > so normally, fs operations inside containers would have to
> > > > > iterate a single mark.
> > > > >
> > > > > I am well aware of your comments about trying to implement full
> > > > > blown subtree marks (up this very thread), but the userns-sb
> > > > > join approach is so much more low hanging than full blown
> > > > > subtree marks. And as a by-product, it very naturally provides
> > > > > the correct capability checks so users inside containers are
> > > > > able to "watch their world".
> > > > >
> > > > > Patches to allow resolving file handles inside userns with the
> > > > > needed permission checks are also available on the POC branch,
> > > > > which makes the solution a lot more useful.
> > > > >
> > > > > In that last POC, I introduced an explicit uapi flag
> > > > > > FAN_MARK_IDMAPPED; in combination with
> > > > > FAN_MARK_FILESYSTEM it provides the new capability.
> > > > > This is equivalent to a new mark type, it was just an aesthetic
> > > > > decision.
> > > >
> > > > So in principle, I have no problem with allowing mount marks for ns-capable
> > > > processes. Also FAN_MARK_FILESYSTEM marks filtered by originating namespace
> > > > look OK to me (although if we extended mount marks to support directory
> > > > events as you try elsewhere, would there still be a compelling usecase for
> > > > this?).
> > > 
> > > In my opinion it would. This is the reason why I stopped that direction.
> > > The difference between FAN_MARK_FILESYSTEM|FAN_MARK_IDMAPPED
> > > and FAN_MARK_MOUNT is that the latter can be easily "escaped" by creating
> > > a bind mount or cloning a mount ns while the former is "sticky" to all additions
> > > to the mount tree that happen below the idmapped mount.
> > 
> > As far as I understood Christian, he was specifically interested in mount
> > events for container runtimes because filtering by 'mount' was desirable
> > for his usecase. But maybe I misunderstood. Christian? Also if you have
> 
> I discussed this with Amir about two weeks ago. For container runtimes
> Amir's idea of generating events based on the userns the fsnotify
> instance was created in is actually quite clever because it gives a way
> for the container to receive events for all filesystems and idmapped
> mounts if its userns is attached to it. The model as we discussed it -
> Amir, please tell me if I'm wrong - is that you'd be setting up an
> fsnotify watch in a given userns and you'd be seeing events from all
> superblocks that have the caller's userns as s_user_ns and all mounts
> that have the caller's userns as mnt_userns. I think that's safe.

OK, so this feature would effectively allow sb-wide watching of events that
are generated from within the container (or its descendants). That sounds
useful. Just one question: If there's some part of a filesystem that is
accessible by multiple containers (and thus multiple namespaces), or if
there's some change done to the filesystem, say by container management SW,
then the event for this change won't be visible inside the container (even
though the fs change itself will be visible). This is kind of a similar
problem to the one we had with mount marks and why sb marks were created.
So aren't we just repeating the mistake with mount marks? Because it seems
to me that more often than not, applications are interested in getting
notification when what they can actually access within the fs has changed
(and this is what they actually get with the inode marks) and they don't
care that much where the change came from... Do you have some idea how
frequent such cross-ns filesystem changes are? I fully appreciate the
simplicity of Amir's proposal but I'm trying to estimate when (or how many)
users are going to come back complaining it is not good enough ;).
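
To make the concern concrete, here is a rough userspace model of the two
filtering rules being compared. It is only a sketch: the struct and function
names are hypothetical and are not kernel structures or code from the POC
branches, and it ignores descendant namespaces entirely.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; these are NOT the kernel's own types. */
struct userns { int id; };
struct sb { int id; };

/* A mark attached to a userns and filtered by sb (the second POC's idea). */
struct ns_mark {
	const struct userns *ns;
	const struct sb *sb;
};

/* A filesystem change, tagged with the userns it was made through. */
struct fs_event {
	const struct sb *sb;
	const struct userns *origin_ns;
};

/* userns-sb join: deliver only if both the sb and the originating ns match. */
static bool ns_mark_matches(const struct ns_mark *m, const struct fs_event *ev)
{
	return m->sb == ev->sb && m->ns == ev->origin_ns;
}

/* Plain sb mark: deliver any change on the sb, whoever made it. */
static bool sb_mark_matches(const struct sb *marked, const struct fs_event *ev)
{
	return marked == ev->sb;
}
```

In this model, a change made on the same sb through a different userns (say
by container management SW) matches the plain sb mark but not the userns-sb
mark, which is exactly the case I am asking about.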

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR