Re: fanotify HSM open issues

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@xxxxxxx> wrote:
>
> On Fri 18-08-23 10:01:40, Amir Goldstein wrote:
> > [adding fsdevel]
> >
> > On Thu, Aug 17, 2023 at 9:22 PM Jan Kara <jack@xxxxxxx> wrote:
> > >
> > > On Thu 17-08-23 10:13:20, Amir Goldstein wrote:
> > > > [CC Christian and Jens for the NOWAIT semantics]
> > > >
> > > > Jan,
> > > >
> > > > I was going to post start-write-safe patches [1], but now that this
> > > > design issue has emerged, with your permission, I would like to
> > > > take this discussion to fsdevel, so please reply to the list.
> > > >
> > > > For those who just joined, the context is fanotify HSM API [2]
> > > > proposal and avoiding the fanotify deadlocks I described in my
> > > > talk on LSFMM [3].
> > >
> > > OK, sure. I'm resending the reply which I sent only to you here.
> > >
> > > > On Wed, Aug 16, 2023 at 8:18 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> > > > > On Wed, Aug 16, 2023 at 12:47 PM Jan Kara <jack@xxxxxxx> wrote:
> > > > > > On Mon 14-08-23 16:57:48, Amir Goldstein wrote:
> > > > > > > On Mon, Jul 3, 2023 at 11:03 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> > > > > > > > On Mon, Jul 3, 2023, 9:30 PM Jan Kara <jack@xxxxxxx> wrote:
> > > > > > > do_sendfile() or ovl_copy_up() from ovl1 to xfs1, end up calling
> > > > > > > do_splice_direct() with sb_writers(xfs1) held.
> > > > > > > Internally, the splice operation calls into ovl_splice_read(), which
> > > > > > > has to call the rw_verify_area() check with the fsnotify hook on the
> > > > > > > underlying xfs file.
> > > > > >
> > > > > > Right, we can call rw_verify_area() only after overlayfs has told us what
> > > > > > is actually the underlying file that is really used for reading. Hum,
> > > > > > nasty.
> > > > > >
> > > > > > > This is a violation of start-write-safe permission hooks and the
> > > > > > > lockdep_assert that I added in fsnotify_permission() catches this
> > > > > > > violation.
> > > > > > >
> > > > > > > I believe that a similar issue exists with do_splice_direct() from
> > > > > > > an fs that is loop mounted over an image file on xfs1 to xfs1.
> > > > > >
> > > > > > I don't see how that would be possible. If you have a loop image file on
> > > > > > filesystem xfs1, then the filesystem stored in the image is some xfs2.
> > > > > > Overlayfs case is special here because it doesn't really work with
> > > > > > filesystems but rather directory subtrees and that causes the
> > > > > > complications.
> > > > > >
> > > > >
> > > > > I was referring to sendfile() from xfs2 to xfs1.
> > > > > sb_writers of xfs1 is held, but loop needs to read from the image file
> > > > > in xfs1. No?
> > >
> > > Yes, that seems possible and it would indeed trigger rw_verify_area() in
> > > do_iter_read() on xfs1 while freeze protection for xfs1 is held.
> > >
> >
> > Recap for new people joining this thread.
> >
> > The following deadlock is possible in upstream kernel
> > if fanotify permission event handler tries to make
> > modifications to the filesystem it is watching in the context
> > of FAN_ACCESS_PERM handling in some cases:
> >
> > P1                             P2                      P3
> > -----------                    ------------            ------------
> > do_sendfile(fs1.out_fd, fs1.in_fd)
> > -> sb_start_write(fs1.sb)
> >   -> do_splice_direct()                         freeze_super(fs1.sb)
> >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> >       -> security_file_permission()
> >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> >                                  -> do_unlinkat(fs1.dfd, ...)
> >                                    -> sb_start_write(fs1.sb) ......
> >
> > start-write-safe patches [1] (not posted) are trying to solve this
> > deadlock and prepare the ground for a new set of permission events
> > with cleaner/safer semantics.
> >
> > The cases described above of sendfile from a file in loop mounted
> > image over fs1 or overlayfs over fs1 into a file in fs1 can still deadlock
> > despite the start-write-safe patches [1].
>
> Yep, nice summary.
>
> > > > > > > My earlier patches had annotated the rw_verify_area() calls
> > > > > > > in splice iterators as "MAY_NOT_START_WRITE" and the
> > > > > > > userspace event listener was notified via flag whether modifying
> > > > > > > the content of the file was allowed or not.
> > > > > > >
> > > > > > > I do not care so much about HSM being able to fill content of files
> > > > > > > from a nested context like this, but we do need some way for
> > > > > > > userspace to at least deny this access to a file with no content.
> > > > > > >
> > > > > > > Another possibility I thought of is to change file_start_write()
> > > > > > > do use file_start_write_trylock() for files with FMODE_NONOTIFY.
> > > > > > > This should make it safe to fill file content when event is generated
> > > > > > > with sb_writers held (if freeze is in progress modification will fail).
> > > > > > > Right?
> > > > > >
> > > > > > OK, so you mean that the HSM managing application will get an fd with
> > > > > > FMODE_NONOTIFY set from an event and use it for filling in the file
> > > > > > contents and the kernel functions grabbing freeze protection will detect
> > > > > > the file flag and bail with error instead of waiting? That sounds like an
> > > > > > attractive solution - the HSM managing app could even reply with error like
> > > > > > ERESTARTSYS to fanotify event and make the syscall restart (which will
> > > > > > block until the fs is unfrozen and then we can try again) and thus handle
> > > > > > the whole problem transparently for the application generating the event.
> > > > > > But I'm just dreaming now, for start it would be fine to just fail the
> > > > > > syscall.
> > > > > >
> > > > >
> > > > > IMO, a temporary error from an HSM controlled fs is not a big deal.
> > > > > Same as a temporary error from a network fs or FUSE - should be
> > > > > tolerable when the endpoint is not connected.
> > > > > One of my patches allows HSM returning an error that is not EPERM as
> > > > > response - this can be useful in such situations.
> > >
> > > OK.
> > >
> > > > > > I see only three possible problems with the solution. Firstly, the HSM
> > > > > > application will have to be careful to only access the managed filesystem
> > > > > > with the fd returned from fanotify event as otherwise it could deadlock on
> > > > > > frozen filesystem.
> > > > >
> > > > > Isn't that already the case to some extent?
> > > > > It is not wise for permission event handlers to perform operations
> > > > > on fd without  FMODE_NONOTIFY.
> > >
> > > Yes, it isn't a new problem. The amount of bug reports in our bugzilla
> > > boiling down to this kind of self-deadlock just shows that fanotify users
> > > get this wrong all the time.
> > >
> > > > > > That may seem obvious but practice shows that with
> > > > > > complex software stacks with many dependencies, this is far from trivial.
> > > > >
> > > > > It will be especially important when we have permission events
> > > > > on directory operations that need to perform operations on O_PATH
> > > > > dirfd with FMODE_NONOTIFY.
> > > > >
> > > > > > Secondly, conditioning the trylock behavior on FMODE_NONOTIFY seems
> > > > > > somewhat arbitary unless you understand our implementation issues and
> > > > > > possibly it could regress current unsuspecting users. So I'm thinking
> > > > > > whether we shouldn't rather have an explicit open flag requiring erroring
> > > > > > out on frozen filesystem instead of blocking and the HSM application will
> > > > > > need to use it to evade freezing deadlocks. Or we can just depend on
> > > > > > RWF_NOWAIT flag (we currently block on frozen filesystem despite this flag
> > > > > > but that can be viewed as a bug) but that's limited to writes (i.e., no way
> > > > > > to e.g. do fallocate(2) without blocking on frozen fs).
> > > > >
> > > > > User cannot ask for fd with FMODE_NONOTIFY as it is - this is provided
> > > > > as a means to an end by fanotify - so it would not be much different if
> > > > > the new events would provide an fd with FMODE_NONOTIFY |
> > > > > FMODE_NOWAIT. It will be up to documentation to say what is and what
> > > > > is not allowed with the event->fd provided by fanotify.
> > > > >
> > > >
> > > > This part needs clarifying.
> > > > Technically, we can use the flag FMODE_NOWAIT to prevent waiting in
> > > > file_start_write() *when* it is combined with FMODE_NONOTIFY.
> > > >
> > > > Yes, it would be a change of behavior, but I think it would be a good change,
> > > > because current event->fd from FAN_ACCESS_PERM events is really not
> > > > write-safe (could deadlock with freezing fs).
> > >
> > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > current fanotify users need to also do IO on other files in order to handle
> > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > to only trylock freeze protection - that actually makes a lot of sense to
> > > me. The question is whether this is enough or not.
> > >
> >
> > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > to a file is not the only thing that HSM needs to do.
> > Eventually, event handler for lookup permission events should be
> > able to also create files without blocking on vfs level freeze protection.
>
> So this is what I wanted to clarify. The lookup permission event never gets
> called under a freeze protection so the deadlock doesn't exist there. In
> principle the problem exists only for access and modify events where we'd
> be filling in file data and thus RWF_NOWAIT could be enough.

Yes, you are right.
It is possible that RWF_NOWAIT could be enough.

But the discovery of the loop/ovl corner cases has shaken my
confidence is the ability to guarantee that freeze protection is not
held somehow indirectly.

If I am not mistaken, FAN_OPEN_PERM suffers from the exact
same ovl corner case, because with splice from ovl1 to fs1,
fs1 freeze protection is held and:
  ovl_splice_read(ovl1.file)
    ovl_real_fdget()
      ovl_open_realfile(fs1.file)
         ... security_file_open(fs1.file)

> That being
> said I understand this may be assuming too much about the implementations
> of HSM daemons and as you write, we might want to provide a way to do IO
> not blocking on freeze protection from any hook. But I wanted to point this
> out explicitely so that it's a conscious decision.
>
> > In theory, I am not saying we should do it, but as a thought experiment:
> > if the requirement from permission event handler is that is must use a
> > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > any filesystem modifications, then instead of aiming for NOWAIT
> > semantics using sb_start_write_trylock(), we could use a freeze level
> > SB_FREEZE_FSNOTIFY between
> > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> >
> > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > make it clear how userspace should avoid nesting "VFS faults" there is
> > a model that can solve the deadlock correctly.
>
> OK, yes, in principle another freeze level which could be used by handlers
> of fanotify permission events would solve the deadlock as well. Just you
> seem to like to tie this functionality to the particular fd returned from
> fanotify and I'm not convinced that is a good idea. What if the application
> needs to do write to some other location besides the one fd it got passed
> from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> first access to any file in a subtree. Or maybe it wants to write to some
> DB file containing current state or something like that.
>
> One solution I can imagine is to create an open flag that can be specified
> on open which would result in the special behavior wrt fs freezing. If the
> special behavior would be just trylocking the freeze protection then it
> would be really easy. If the behaviour would be another freeze protection
> level, then we'd need to make sure we don't generate another fanotify
> permission event with such fd - autorejecting any such access is an obvious
> solution but I'm not sure if practical for applications.
>

I had also considered marking the listener process with the FSNOTIFY
context and enforcing this context on fanotify_read().
In a way, this is similar to the NOIO and NOFS process context.
It could be used to both act as a stronger form of FMODE_NONOTIFY
and to activate the desired freeze protection behavior
(whether trylock or SB_FREEZE_FSNOTIFY level).

> > > > Then we have two options:
> > > > 1. Generate "write-safe" FAN_PRE_ACCESS events only for fs that set
> > > >     FMODE_NOWAIT.
> > > >     Other fs will still generate the legacy FAN_ACCESS_PERM events
> > > >     which will be documented as write-unsafe
> > > > 2. Use a new internal flag (e.g. FMODE_NOSBWAIT) for the stronger
> > > >     NOWAIT semantics that fanotify will always set on event->fd for the
> > > >     new write-safe FAN_PRE_ACCESS events
> > > >
> > > > TBH, the backing fs for HSM [2] is anyway supposed to be a "normal"
> > > > local fs and I'd be more comfortable with fs opting in to support fanotify
> > > > HSM events, so option #1 doesn't seem like a terrible idea??
> > >
> > > Yes, I don't think 1) would be really be a limitation that would matter too
> > > much in practice.
> > >
> > > > > Currently, the documentation is missing, because there are operations
> > > > > that are not really safe in the permission event context, but there is no
> > > > > documentation about that.
> > > > >
> > > > > > Thirdly, unless we
> > > > > > propagate to the HSM app the information whether the freeze protection is
> > > > > > held in the kernel or not, it doesn't know whether it should just wait for
> > > > > > the filesystem to unfreeze or whether it should rather fail the request to
> > > > > > avoid the deadlock. Hrm...
> > > > >
> > > > > informing HSM if freeze protection is held by this thread may be a little
> > > > > challenging, but it is easy for me to annotate possible risky contexts
> > > > > like the hooks inside splice read.
> > > > > I am just not sure that waiting in HSM context is that important and
> > > > > if it is not better to always fail in the frozen fs case.
> > >
> > > Always failing in frozen fs case is certainly possible but that will make
> > > fs freezing a bit non-transparent - the application may treat such failures
> > > as fatal errors and abort. So it's ok for the first POC but eventually we
> > > should have a plan how we could make fs freezing transparent for the
> > > applications even for HSM managed filesystems.
> > >
> >
> > OK. ATM, the only solution I can think of that is both maintainable
> > and lets HSM live in complete harmony with fsfreeze is adding the
> > extra SB_FREEZE_FSNOTIFY level.
>
> To make things clear: if the only problems would be with those sendfile(2)
> rare corner-cases, then I guess we can live with that and implement retry
> in the kernel if userspace ever complains about unexpected short copy or
> EAGAIN...  The problem I see is that if we advise that all IO from the
> fanotify event handler should happen in the freeze-safe manner, then with
> the non-blocking solution all HSM IO suddently starts failing as soon as
> the filesystem is frozen. And that is IMHO not nice.

I see what you mean. The SB_FREEZE_FSNOTIFY design is much more
clear in that respect.

> > I am not sure how big of an overhead that would be?
> > I imagine that sb_writers is large enough as it is w.r.t fitting into
> > cache lines?
> > I don't think that it adds much complexity or maintenance burden
> > to vfs?? I'm really not sure.
>
> Well, the overhead is effectively one percpu counter per superblock.
> Negligible in terms of CPU time, somewhat annoying in terms of memory but
> bearable. So this may be a way forward.
>

Considering that I had already spend some time optimizing out
the memory and performance overhead of s_write_srcu:
https://github.com/amir73il/linux/commit/655606ca8cffba7636959547dc094651cef56f4d
I guess I may be able to also optimize out the SB_FREEZE_FSNOTIFY
percpu counter under the same object that is allocated lazily for
sb on the first mark with the new write-safe fanotify events.

Ok. I have something to try..

Thanks!
Amir.




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux