Re: fanotify HSM open issues

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@xxxxxxx> wrote:
> >
> > On Fri 18-08-23 10:01:40, Amir Goldstein wrote:
> > > [adding fsdevel]
> > >
> > > On Thu, Aug 17, 2023 at 9:22 PM Jan Kara <jack@xxxxxxx> wrote:
> > > >
> > > > On Thu 17-08-23 10:13:20, Amir Goldstein wrote:
> > > > > [CC Christian and Jens for the NOWAIT semantics]
> > > > >
> > > > > Jan,
> > > > >
> > > > > I was going to post start-write-safe patches [1], but now that this
> > > > > design issue has emerged, with your permission, I would like to
> > > > > take this discussion to fsdevel, so please reply to the list.
> > > > >
> > > > > For those who just joined, the context is fanotify HSM API [2]
> > > > > proposal and avoiding the fanotify deadlocks I described in my
> > > > > talk on LSFMM [3].
> > > >
> > > > OK, sure. I'm resending the reply which I sent only to you here.
> > > >
> > > > > On Wed, Aug 16, 2023 at 8:18 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> > > > > > On Wed, Aug 16, 2023 at 12:47 PM Jan Kara <jack@xxxxxxx> wrote:
> > > > > > > On Mon 14-08-23 16:57:48, Amir Goldstein wrote:
> > > > > > > > On Mon, Jul 3, 2023 at 11:03 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> > > > > > > > > On Mon, Jul 3, 2023, 9:30 PM Jan Kara <jack@xxxxxxx> wrote:
> > > > > > > > do_sendfile() or ovl_copy_up() from ovl1 to xfs1, end up calling
> > > > > > > > do_splice_direct() with sb_writers(xfs1) held.
> > > > > > > > Internally, the splice operation calls into ovl_splice_read(), which
> > > > > > > > has to call the rw_verify_area() check with the fsnotify hook on the
> > > > > > > > underlying xfs file.
> > > > > > >
> > > > > > > Right, we can call rw_verify_area() only after overlayfs has told us what
> > > > > > > is actually the underlying file that is really used for reading. Hum,
> > > > > > > nasty.
> > > > > > >
> > > > > > > > This is a violation of start-write-safe permission hooks and the
> > > > > > > > lockdep_assert that I added in fsnotify_permission() catches this
> > > > > > > > violation.
> > > > > > > >
> > > > > > > > I believe that a similar issue exists with do_splice_direct() from
> > > > > > > > an fs that is loop mounted over an image file on xfs1 to xfs1.
> > > > > > >
> > > > > > > I don't see how that would be possible. If you have a loop image file on
> > > > > > > filesystem xfs1, then the filesystem stored in the image is some xfs2.
> > > > > > > Overlayfs case is special here because it doesn't really work with
> > > > > > > filesystems but rather directory subtrees and that causes the
> > > > > > > complications.
> > > > > > >
> > > > > >
> > > > > > I was referring to sendfile() from xfs2 to xfs1.
> > > > > > sb_writers of xfs1 is held, but loop needs to read from the image file
> > > > > > in xfs1. No?
> > > >
> > > > Yes, that seems possible and it would indeed trigger rw_verify_area() in
> > > > do_iter_read() on xfs1 while freeze protection for xfs1 is held.
> > > >
> > >
> > > Recap for new people joining this thread.
> > >
> > > The following deadlock is possible in upstream kernel
> > > if fanotify permission event handler tries to make
> > > modifications to the filesystem it is watching in the context
> > > of FAN_ACCESS_PERM handling in some cases:
> > >
> > > P1                             P2                      P3
> > > -----------                    ------------            ------------
> > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > -> sb_start_write(fs1.sb)
> > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > >       -> security_file_permission()
> > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > >                                  -> do_unlinkat(fs1.dfd, ...)
> > >                                    -> sb_start_write(fs1.sb) ......
> > >
> > > start-write-safe patches [1] (not posted) are trying to solve this
> > > deadlock and prepare the ground for a new set of permission events
> > > with cleaner/safer semantics.
> > >
> > > The cases described above of sendfile from a file in loop mounted
> > > image over fs1 or overlayfs over fs1 into a file in fs1 can still deadlock
> > > despite the start-write-safe patches [1].
> >
> > Yep, nice summary.
> >
> > > > > > > > My earlier patches had annotated the rw_verify_area() calls
> > > > > > > > in splice iterators as "MAY_NOT_START_WRITE" and the
> > > > > > > > userspace event listener was notified via flag whether modifying
> > > > > > > > the content of the file was allowed or not.
> > > > > > > >
> > > > > > > > I do not care so much about HSM being able to fill content of files
> > > > > > > > from a nested context like this, but we do need some way for
> > > > > > > > userspace to at least deny this access to a file with no content.
> > > > > > > >
> > > > > > > > Another possibility I thought of is to change file_start_write()
> > > > > > > > do use file_start_write_trylock() for files with FMODE_NONOTIFY.
> > > > > > > > This should make it safe to fill file content when event is generated
> > > > > > > > with sb_writers held (if freeze is in progress modification will fail).
> > > > > > > > Right?
> > > > > > >
> > > > > > > OK, so you mean that the HSM managing application will get an fd with
> > > > > > > FMODE_NONOTIFY set from an event and use it for filling in the file
> > > > > > > contents and the kernel functions grabbing freeze protection will detect
> > > > > > > the file flag and bail with error instead of waiting? That sounds like an
> > > > > > > attractive solution - the HSM managing app could even reply with error like
> > > > > > > ERESTARTSYS to fanotify event and make the syscall restart (which will
> > > > > > > block until the fs is unfrozen and then we can try again) and thus handle
> > > > > > > the whole problem transparently for the application generating the event.
> > > > > > > But I'm just dreaming now, for start it would be fine to just fail the
> > > > > > > syscall.
> > > > > > >
> > > > > >
> > > > > > IMO, a temporary error from an HSM controlled fs is not a big deal.
> > > > > > Same as a temporary error from a network fs or FUSE - should be
> > > > > > tolerable when the endpoint is not connected.
> > > > > > One of my patches allows HSM returning an error that is not EPERM as
> > > > > > response - this can be useful in such situations.
> > > >
> > > > OK.
> > > >
> > > > > > > I see only three possible problems with the solution. Firstly, the HSM
> > > > > > > application will have to be careful to only access the managed filesystem
> > > > > > > with the fd returned from fanotify event as otherwise it could deadlock on
> > > > > > > frozen filesystem.
> > > > > >
> > > > > > Isn't that already the case to some extent?
> > > > > > It is not wise for permission event handlers to perform operations
> > > > > > on fd without  FMODE_NONOTIFY.
> > > >
> > > > Yes, it isn't a new problem. The amount of bug reports in our bugzilla
> > > > boiling down to this kind of self-deadlock just shows that fanotify users
> > > > get this wrong all the time.
> > > >
> > > > > > > That may seem obvious but practice shows that with
> > > > > > > complex software stacks with many dependencies, this is far from trivial.
> > > > > >
> > > > > > It will be especially important when we have permission events
> > > > > > on directory operations that need to perform operations on O_PATH
> > > > > > dirfd with FMODE_NONOTIFY.
> > > > > >
> > > > > > > Secondly, conditioning the trylock behavior on FMODE_NONOTIFY seems
> > > > > > > somewhat arbitary unless you understand our implementation issues and
> > > > > > > possibly it could regress current unsuspecting users. So I'm thinking
> > > > > > > whether we shouldn't rather have an explicit open flag requiring erroring
> > > > > > > out on frozen filesystem instead of blocking and the HSM application will
> > > > > > > need to use it to evade freezing deadlocks. Or we can just depend on
> > > > > > > RWF_NOWAIT flag (we currently block on frozen filesystem despite this flag
> > > > > > > but that can be viewed as a bug) but that's limited to writes (i.e., no way
> > > > > > > to e.g. do fallocate(2) without blocking on frozen fs).
> > > > > >
> > > > > > User cannot ask for fd with FMODE_NONOTIFY as it is - this is provided
> > > > > > as a means to an end by fanotify - so it would not be much different if
> > > > > > the new events would provide an fd with FMODE_NONOTIFY |
> > > > > > FMODE_NOWAIT. It will be up to documentation to say what is and what
> > > > > > is not allowed with the event->fd provided by fanotify.
> > > > > >
> > > > >
> > > > > This part needs clarifying.
> > > > > Technically, we can use the flag FMODE_NOWAIT to prevent waiting in
> > > > > file_start_write() *when* it is combined with FMODE_NONOTIFY.
> > > > >
> > > > > Yes, it would be a change of behavior, but I think it would be a good change,
> > > > > because current event->fd from FAN_ACCESS_PERM events is really not
> > > > > write-safe (could deadlock with freezing fs).
> > > >
> > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > current fanotify users need to also do IO on other files in order to handle
> > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > me. The question is whether this is enough or not.
> > > >
> > >
> > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > to a file is not the only thing that HSM needs to do.
> > > Eventually, event handler for lookup permission events should be
> > > able to also create files without blocking on vfs level freeze protection.
> >
> > So this is what I wanted to clarify. The lookup permission event never gets
> > called under a freeze protection so the deadlock doesn't exist there. In
> > principle the problem exists only for access and modify events where we'd
> > be filling in file data and thus RWF_NOWAIT could be enough.
>
> Yes, you are right.
> It is possible that RWF_NOWAIT could be enough.
>
> But the discovery of the loop/ovl corner cases has shaken my
> confidence is the ability to guarantee that freeze protection is not
> held somehow indirectly.
>
> If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> same ovl corner case, because with splice from ovl1 to fs1,
> fs1 freeze protection is held and:
>   ovl_splice_read(ovl1.file)
>     ovl_real_fdget()
>       ovl_open_realfile(fs1.file)
>          ... security_file_open(fs1.file)
>
> > That being
> > said I understand this may be assuming too much about the implementations
> > of HSM daemons and as you write, we might want to provide a way to do IO
> > not blocking on freeze protection from any hook. But I wanted to point this
> > out explicitly so that it's a conscious decision.
> >

I agree and I'd like to explain using an example, why RWF_NOWAIT is
not enough for HSM needs.

The reason is that often, when HSM needs to handle filling content
in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
HSM needs to be able to avoid blocking on freeze protection
for any operations on the filesystem, not just pwrite().

For example, the POC HSM code [1], stores the DATA_DIR_fd
from the lookup event and uses it in the handling of access events to
update the metadata files that store which parts of the file were already
filled (relying of fiemap is not always a valid option).

That is the reason that in the POC patches [2], FMODE_NONOTIFY
is propagated from dirfd to an fd opened with openat(dirfd, ...), so
HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.

Another use case is that HSM may want to download content to a
temp file on the same filesystem, verify the downloaded content and
then clone the data into the accessed file range.

I think that a PF_ flag (see below) would work best for all those cases.

> > > In theory, I am not saying we should do it, but as a thought experiment:
> > > if the requirement from permission event handler is that is must use a
> > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > any filesystem modifications, then instead of aiming for NOWAIT
> > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > SB_FREEZE_FSNOTIFY between
> > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > >
> > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > a model that can solve the deadlock correctly.
> >
> > OK, yes, in principle another freeze level which could be used by handlers
> > of fanotify permission events would solve the deadlock as well. Just you
> > seem to like to tie this functionality to the particular fd returned from
> > fanotify and I'm not convinced that is a good idea. What if the application
> > needs to do write to some other location besides the one fd it got passed
> > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > first access to any file in a subtree. Or maybe it wants to write to some
> > DB file containing current state or something like that.
> >
> > One solution I can imagine is to create an open flag that can be specified
> > on open which would result in the special behavior wrt fs freezing. If the
> > special behavior would be just trylocking the freeze protection then it
> > would be really easy. If the behaviour would be another freeze protection
> > level, then we'd need to make sure we don't generate another fanotify
> > permission event with such fd - autorejecting any such access is an obvious
> > solution but I'm not sure if practical for applications.
> >
>
> I had also considered marking the listener process with the FSNOTIFY
> context and enforcing this context on fanotify_read().
> In a way, this is similar to the NOIO and NOFS process context.
> It could be used to both act as a stronger form of FMODE_NONOTIFY
> and to activate the desired freeze protection behavior
> (whether trylock or SB_FREEZE_FSNOTIFY level).
>

My feeling is that the best approach would be a PF_NOWAIT task flag:

- PF_NOWAIT will prevent blocking on freeze protection
- PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
- PF_NOWAIT could be auto-set on the reader of a permission event
- PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
- We could add user API to set this personality explicitly to any task
- PF_NOWAIT without FMODE_NONOTIFY denies permission events

Please let me know if you agree with this design and if so,
which of the methods to set PF_NOWAIT are a must for the first version
in your opinion?

Do you think we should use this method to fix the existing deadlocks
with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in?

[...]

> > > OK. ATM, the only solution I can think of that is both maintainable
> > > and lets HSM live in complete harmony with fsfreeze is adding the
> > > extra SB_FREEZE_FSNOTIFY level.
> >
> > To make things clear: if the only problems would be with those sendfile(2)
> > rare corner-cases, then I guess we can live with that and implement retry
> > in the kernel if userspace ever complains about unexpected short copy or
> > EAGAIN...  The problem I see is that if we advise that all IO from the
> > fanotify event handler should happen in the freeze-safe manner, then with
> > the non-blocking solution all HSM IO suddently starts failing as soon as
> > the filesystem is frozen. And that is IMHO not nice.
>
> I see what you mean. The SB_FREEZE_FSNOTIFY design is much more
> clear in that respect.
>
> > > I am not sure how big of an overhead that would be?
> > > I imagine that sb_writers is large enough as it is w.r.t fitting into
> > > cache lines?
> > > I don't think that it adds much complexity or maintenance burden
> > > to vfs?? I'm really not sure.
> >
> > Well, the overhead is effectively one percpu counter per superblock.
> > Negligible in terms of CPU time, somewhat annoying in terms of memory but
> > bearable. So this may be a way forward.
> >
>

My feeling is that because we only need this to handle very obscure
corner cases, that adding an extra freeze level is an overkill that
cannot be justified, even if the actual impact on cpu and memory are
rather low.

The HSM API documentation will clearly state that EAGAIN may be
expected when writing to the filesystem.

IMO, for all practical matters, it is perfectly fine if HSM just denies
access in those corner cases, but even a simple solution of triggering
async download of file's content and returning a temporary to user
is a decent solution for the rare corner cases.

FYI, I've already gotten requests from people in the community that
are waiting for this feature and are testing the POC patches,
so my plan is to send out the permission hooks cleanup patches [3]
soon and try to get the first part of the HSM API [4]
(FAN_PRE_ACCESS and FAN_PRE_MODIFY permission events)
ready for the next cycle.

In any case, permission hooks cleanup patches are independent
of the solution we will choose for the corner cases that they do
not handle.

Thanks,
Amir.

[2] https://github.com/amir73il/httpdirfs/commits/fan_lookup_perm
[2] https://github.com/amir73il/linux/commits/fan_lookup_perm
[3] https://github.com/amir73il/linux/commits/start-write-safe
[4] https://github.com/amir73il/linux/commits/fan_pre_content





[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux