Re: Fanotify: concurrent work and handling files being executed

Sargun Dhillon <sargun@xxxxxxxxx> · Tue, 6 Feb 2024 09:29:55 -0700

On Tue, Feb 6, 2024 at 6:50 AM Jan Kara <jack@xxxxxxx> wrote:
>
> On Tue 06-02-24 09:44:29, Amir Goldstein wrote:
> > On Tue, Feb 6, 2024 at 1:24 AM Sargun Dhillon <sargun@xxxxxxxxx> wrote:
> > >
> > > One of the issues we've hit recently while using fanotify in an HSM is
> > > racing with files that are opened for execution.
> > >
> > > There is a race that can result in ETXTBUSY.
> > > Pid 1: You have a file marked with FAN_OPEN_EXEC_PERM.
> > > Pid 2: execve(file_by_path)
> > > Pid 1: gets notification, with file.fd
> > > Pid 2: blocked, waiting for notification to resolve
> > > Pid 1: Does work with FD (populates the file)
> > > Pid 1: writes FAN_ALLOW to the fanotify file descriptor allowing the event.
> > > Pid 2: continues, and falls through to deny_write_access (and fails)
> > > Pid 1: closes fd
>
> Right, this is kind of nasty.
>
> > > Pid 1 can close the FD before responding, but this can result in a
> > > race if fanotify is being handled in a multi-threaded
> > > manner.
>
> Yep.
>
> > > I.e. if there are two threads operating on the same fanotify group,
> > > and an event's FD has been closed, that can be reused
> > > by another event. This is largely not a problem because the
> > > outstanding events are added in a FIFO manner to the outstanding
> > > event list, and as long as the earlier event is closed and responded
> > > to without interruption, it should be okay, but it's difficult
> > > to guarantee that this happens, unless event responses are serialized
> > > in some fashion, with strict ordering between
> > > responses.
>
> Yes, essentially you must make sure you will not read any new events from
> the notification queue between fd close & writing of the response. Frankly,
> I find this as quite ugly and asking for trouble (subtle bugs down the
> line).
>
Is there a preference for either refactoring fanotify_event_metadata, or
adding this new ID type as a piece of metadata?

I almost feel like the FD should move to being metadata, and we should
use ID in place of fd in fanotify_event_metadata. If we use an xarray,
it should be reasonable to use a 32-bit identifier, so we don't need
to modify the fanotify_event_metadata structure at all.

> > > There are a couple of ways I see around this:
> > > 1. Have a flag in the fanotify response that's like FAN_CLOSE_FD,
> > > where fanotify_write closes the fd when
> > > it processes the response.
> >
> > That seems doable and useful.
> >
I can work on this. We already have FAN_AUDIT and FAN_INFO as optional
bits that can be set in the response. I know there's another thread talking
about using the high-bits for error codes, and I wouldn't want to pollute
the higher bits too much if we go down that route.

I'm also not sure how best to implement capability probing (other than you
get an EINVAL in userspace and retry without that bit being set).

> > > 2. Make the response identifier separate from the FD. This can either
> > > be an IDR / xarray, or a 64-bit always
> > > incrementing number. The benefit of using an xarray is that responses
> > > can than be handled in O(1) speed
> > > whereas the current implementation is O(n) in relationship to the
> > > number of outstanding events.
> >
> > The number of outstanding permission events is usually limited
> > by the number of events that are handled concurrently.
> >
> > I think that a 64-bit event id is a worthy addition to a well designed
> > notifications API. I am pretty sure that both Windows and MacOS
> > filesystem notification events have a 64-bit event id.
>
> I agree that having 64-bit id in an event and just increment it with each
> event would be simple and fine way to identify events. This could then be
> used to match responses to events when we would be reporting permission
> events for FID groups as I outlined in my email yesterday [1].
>
> > If we ever implement filesystem support for persistent change
> > notification journals (as in Windows), those ids would be essential.
>
> How do you want to use it for persistent change journal? I'm mostly asking
> because currently I expect the number to start from 0 for each fsnotify
> group created but if you'd like to persist somewhere the event id, this
> would not be ideal?
>
> > > This can be implemented by adding an additional piece of response
> > > metadata, and then that becomes the
> > > key vs. fd on response.
> > > ---
> > >
> > > An aside, ETXTBUSY / i_writecount is a real bummer. We want to be able
> > > to populate file content lazily,
> > > and I realize there are many steps between removing the write lock,
> > > and being able to do this, but given
> > > that you can achieve this with FUSE, NFS, EROFS / cachefilesd, it
> > > feels somewhat arbitrary to continue
> > > to have this in place for executable files only.
> >
> > I think there are way too many security models built around this
> > behavior to be able to change it for the common case.
> >
> > However, I will note that technically, the filesystem does not
> > need to require a file open for write to fill its content as long as the
> > file content is written directly to disk and as long as page cache
> > is not populated in that region and invalidate_lock is held.
>
> I was thinking about this as well and as you say, technically there is no
> problem with this - just bypass writer the checks when opening the file -
> but I think it just gives userspace too much rope to hang itself on. So if
> we are going to support something like this it will require very careful
> API design and wider consensus among VFS people this is something we are
> willing to support.
>
> > Requiring FMODE_WRITE for btrfs_ioctl_encoded_write()
> > may make sense, but it is not a must.
> > A more fine grained page cache protection approach is also
> > an option, a bit like (or exactly like) exported pNFS layouts.
> >
> > IOW, you could potentially implement lazy filling of executable
> > content with specific file systems that support an out of band
> > file filling API, but then we would also need notifications for
> > page faults, so let's leave that for the far future.
>
> That's another good point that currently we don't generate fsnotify events
> on page faults. For lazy filling of executables this is basically must-have
> functionality and I'd say that even for lazy filling of other files,
> handling notifications on page faults will come up sooner rather than
> later. So this is probably the thing we'll have to handle first.
>
>                                                                 Honza
>
> [1] https://lore.kernel.org/all/20240205182718.lvtgfsxcd6htbqyy@quack3
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR