Re: Fanotify: concurrent work and handling files being executed

On Tue 06-02-24 09:44:29, Amir Goldstein wrote:
> On Tue, Feb 6, 2024 at 1:24 AM Sargun Dhillon <sargun@xxxxxxxxx> wrote:
> >
> > One of the issues we've hit recently while using fanotify in an HSM is
> > racing with files that are opened for execution.
> >
> > There is a race that can result in ETXTBSY.
> > Pid 1: You have a file marked with FAN_OPEN_EXEC_PERM.
> > Pid 2: execve(file_by_path)
> > Pid 1: gets notification, with file.fd
> > Pid 2: blocked, waiting for notification to resolve
> > Pid 1: Does work with FD (populates the file)
> > Pid 1: writes FAN_ALLOW to the fanotify file descriptor allowing the event.
> > Pid 2: continues, and falls through to deny_write_access (and fails)
> > Pid 1: closes fd

Right, this is kind of nasty.
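
To make the window concrete, here is a minimal userspace sketch of the
pattern described above (populate_file() is a hypothetical stand-in for
the HSM's fill logic; it assumes the group hands out O_RDWR event fds,
which is what keeps i_writecount elevated):

#include <fcntl.h>
#include <sys/fanotify.h>
#include <unistd.h>

/* Hypothetical: the HSM fetches and writes the real content here. */
static void populate_file(int fd)
{
        (void)fd;
}

static void handle_exec_perm(int group_fd,
                             const struct fanotify_event_metadata *md)
{
        struct fanotify_response resp = {
                .fd = md->fd,
                .response = FAN_ALLOW,
        };

        populate_file(md->fd);                  /* Pid 1 does its work */
        write(group_fd, &resp, sizeof(resp));   /* Pid 2 resumes execve() */
        /*
         * Window: Pid 2 now runs deny_write_access() while our writable
         * fd is still open, so the execve() fails with -ETXTBSY.
         */
        close(md->fd);
}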

> > Pid 1 can close the FD before responding, but this can result in a
> > race if fanotify is being handled in a multi-threaded
> > manner.

Yep.

> > I.e., if two threads operate on the same fanotify group and an
> > event's FD has been closed, that FD number can be reused by another
> > event. This is largely not a problem because outstanding events are
> > added to the outstanding event list in FIFO order, so as long as the
> > earlier event is closed and responded to without interruption it
> > should be okay, but that is difficult to guarantee unless event
> > responses are serialized in some fashion, with strict ordering
> > between responses.

Yes, essentially you must make sure you will not read any new events from
the notification queue between the fd close & the writing of the response.
Frankly, I find this quite ugly and asking for trouble (subtle bugs down
the line).

> > There are a couple of ways I see around this:
> > 1. Have a flag in the fanotify response that's like FAN_CLOSE_FD,
> > where fanotify_write closes the fd when
> > it processes the response.
> 
> That seems doable and useful.
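
Something like this, presumably (FAN_CLOSE_FD and its value are made up
here, purely to illustrate the proposed semantics):

#include <sys/fanotify.h>
#include <unistd.h>

/* Hypothetical flag; no kernel implements this today. */
#define FAN_CLOSE_FD    0x08

static void allow_and_close(int group_fd, int event_fd)
{
        struct fanotify_response resp = {
                .fd = event_fd,
                .response = FAN_ALLOW | FAN_CLOSE_FD,
        };

        /*
         * The kernel would close event_fd in the caller's file table
         * while processing the response - no window for fd reuse.
         */
        write(group_fd, &resp, sizeof(resp));
}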
> 
> > 2. Make the response identifier separate from the FD. This can either
> > be an IDR / xarray, or a 64-bit always-incrementing number. The
> > benefit of using an xarray is that responses can then be handled in
> > O(1) time, whereas the current implementation is O(n) in the number
> > of outstanding events.
> 
> The number of outstanding permission events is usually limited
> by the number of events that are handled concurrently.
> 
> I think that a 64-bit event id is a worthy addition to a well designed
> notifications API. I am pretty sure that both Windows and MacOS
> filesystem notification events have a 64-bit event id.

I agree that having a 64-bit id in an event and just incrementing it with
each event would be a simple and fine way to identify events. This could
then be used to match responses to events when we would report permission
events for FID groups, as I outlined in my email yesterday [1].
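
Roughly, the uapi could grow something like this (a sketch only; nothing
like it exists today):

/* Hypothetical response keyed by a 64-bit event id instead of the fd. */
struct fanotify_response_id {
        __u64 event_id;         /* echoed back from the event metadata */
        __u32 response;         /* FAN_ALLOW / FAN_DENY */
};

On the kernel side the id could index an xarray of outstanding permission
events, which is where the O(1) lookup Sargun mentions would come from.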

> If we ever implement filesystem support for persistent change
> notification journals (as in Windows), those ids would be essential.

How do you want to use it for a persistent change journal? I'm mostly
asking because currently I expect the number to start from 0 for each
fsnotify group created, but if you'd like to persist the event id
somewhere, that would not be ideal?

> > This can be implemented by adding an additional piece of response
> > metadata, which then becomes the key instead of the fd in the
> > response.
> > ---
> >
> > An aside, ETXTBSY / i_writecount is a real bummer. We want to be able
> > to populate file content lazily, and I realize there are many steps
> > between removing the write lock and being able to do this, but given
> > that you can achieve this with FUSE, NFS, and EROFS / cachefilesd, it
> > feels somewhat arbitrary to continue to have this in place for
> > executable files only.
> 
> I think there are way too many security models built around this
> behavior to be able to change it for the common case.
> 
> However, I will note that technically, the filesystem does not
> need to require a file open for write to fill its content as long as the
> file content is written directly to disk and as long as page cache
> is not populated in that region and invalidate_lock is held.

I was thinking about this as well and, as you say, technically there is no
problem with this - just bypass the write checks when opening the file -
but I think it just gives userspace too much rope to hang itself with. So
if we are going to support something like this, it will require very
careful API design and wider consensus among VFS people that this is
something we are willing to support.
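
For reference, the current behavior is easy to demonstrate: while a
binary is executing, opening it for write fails with ETXTBSY (a
self-contained demo; the /tmp path is just an example):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        /* Use a private copy so an unprivileged user hits ETXTBSY
         * instead of failing the permission check with EACCES. */
        system("cp /bin/sleep /tmp/sleep_copy && chmod 755 /tmp/sleep_copy");

        pid_t pid = fork();
        if (pid == 0) {
                execl("/tmp/sleep_copy", "sleep_copy", "5", (char *)NULL);
                _exit(1);
        }
        sleep(1);       /* crude: give the child time to exec */

        int fd = open("/tmp/sleep_copy", O_WRONLY);
        if (fd < 0)
                printf("open: %s\n", strerror(errno)); /* Text file busy */
        else
                close(fd);
        return 0;
}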

> Requiring FMODE_WRITE for btrfs_ioctl_encoded_write()
> may make sense, but it is not a must.
> A more fine-grained page cache protection approach is also
> an option, a bit like (or exactly like) exported pNFS layouts.
> 
> IOW, you could potentially implement lazy filling of executable
> content with specific file systems that support an out of band
> file filling API, but then we would also need notifications for
> page faults, so let's leave that for the far future.

That's another good point: currently we don't generate fsnotify events on
page faults. For lazy filling of executables this is basically must-have
functionality, and I'd say that even for lazy filling of other files,
handling notifications on page faults will come up sooner rather than
later. So this is probably the thing we'll have to handle first.

								Honza

[1] https://lore.kernel.org/all/20240205182718.lvtgfsxcd6htbqyy@quack3
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR