Re: fanotify HSM open issues

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Nov 27, 2023 at 4:57 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
>
> On Mon, Nov 27, 2023 at 04:48:23PM +0200, Amir Goldstein wrote:
> > On Mon, Nov 27, 2023 at 3:56 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Nov 20, 2023 at 06:59:47PM +0200, Amir Goldstein wrote:
> > > > On Mon, Nov 20, 2023 at 4:06 PM Jan Kara <jack@xxxxxxx> wrote:
> > > > >
> > > > > Hi Amir,
> > > > >
> > > > > sorry for a bit delayed reply, I did not get to "swapping in" HSM
> > > > > discussion during the Plumbers conference :)
> > > > >
> > > > > On Mon 13-11-23 13:50:03, Amir Goldstein wrote:
> > > > > > On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> > > > > > > On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@xxxxxxx> wrote:
> > > > > > > > > Recap for new people joining this thread.
> > > > > > > > >
> > > > > > > > > The following deadlock is possible in upstream kernel
> > > > > > > > > if fanotify permission event handler tries to make
> > > > > > > > > modifications to the filesystem it is watching in the context
> > > > > > > > > of FAN_ACCESS_PERM handling in some cases:
> > > > > > > > >
> > > > > > > > > P1                             P2                      P3
> > > > > > > > > -----------                    ------------            ------------
> > > > > > > > > do_sendfile(fs1.out_fd, fs1.in_fd)
> > > > > > > > > -> sb_start_write(fs1.sb)
> > > > > > > > >   -> do_splice_direct()                         freeze_super(fs1.sb)
> > > > > > > > >     -> rw_verify_area()                         -> sb_wait_write(fs1.sb) ......
> > > > > > > > >       -> security_file_permission()
> > > > > > > > >         -> fsnotify_perm() --> FAN_ACCESS_PERM
> > > > > > > > >                                  -> do_unlinkat(fs1.dfd, ...)
> > > > > > > > >                                    -> sb_start_write(fs1.sb) ......
> > > > > > > > >
> > > > > > > > > start-write-safe patches [1] (not posted) are trying to solve this
> > > > > > > > > deadlock and prepare the ground for a new set of permission events
> > > > > > > > > with cleaner/safer semantics.
> > > > > > > > >
> > > > > > > > > The cases described above of sendfile from a file in loop mounted
> > > > > > > > > image over fs1 or overlayfs over fs1 into a file in fs1 can still
> > > > > > > > > deadlock despite the start-write-safe patches [1].
> > > > > > > >
> > > > > > > > Yep, nice summary.
> > > > > ...
> > > > > > > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much.
> > > > > > > > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using
> > > > > > > > > > this fd. It says nothing about freeze handling or so. Furthermore as you
> > > > > > > > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all
> > > > > > > > > > current fanotify users need to also do IO on other files in order to handle
> > > > > > > > > > fanotify event. So ideally we'd have a way to do IO to other files in a
> > > > > > > > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag
> > > > > > > > > > to only trylock freeze protection - that actually makes a lot of sense to
> > > > > > > > > > me. The question is whether this is enough or not.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing
> > > > > > > > > to a file is not the only thing that HSM needs to do.
> > > > > > > > > Eventually, event handler for lookup permission events should be
> > > > > > > > > able to also create files without blocking on vfs level freeze protection.
> > > > > > > >
> > > > > > > > So this is what I wanted to clarify. The lookup permission event never gets
> > > > > > > > called under a freeze protection so the deadlock doesn't exist there. In
> > > > > > > > principle the problem exists only for access and modify events where we'd
> > > > > > > > be filling in file data and thus RWF_NOWAIT could be enough.
> > > > > > >
> > > > > > > Yes, you are right.
> > > > > > > It is possible that RWF_NOWAIT could be enough.
> > > > > > >
> > > > > > > But the discovery of the loop/ovl corner cases has shaken my
> > > > > > > confidence is the ability to guarantee that freeze protection is not
> > > > > > > held somehow indirectly.
> > > > > > >
> > > > > > > If I am not mistaken, FAN_OPEN_PERM suffers from the exact
> > > > > > > same ovl corner case, because with splice from ovl1 to fs1,
> > > > > > > fs1 freeze protection is held and:
> > > > > > >   ovl_splice_read(ovl1.file)
> > > > > > >     ovl_real_fdget()
> > > > > > >       ovl_open_realfile(fs1.file)
> > > > > > >          ... security_file_open(fs1.file)
> > > > > > >
> > > > > > > > That being
> > > > > > > > said I understand this may be assuming too much about the implementations
> > > > > > > > of HSM daemons and as you write, we might want to provide a way to do IO
> > > > > > > > not blocking on freeze protection from any hook. But I wanted to point this
> > > > > > > > out explicitly so that it's a conscious decision.
> > > > > > > >
> > > > > >
> > > > > > I agree and I'd like to explain using an example, why RWF_NOWAIT is
> > > > > > not enough for HSM needs.
> > > > > >
> > > > > > The reason is that often, when HSM needs to handle filling content
> > > > > > in FAN_PRE_ACCESS, it is not just about writing to the accessed file.
> > > > > > HSM needs to be able to avoid blocking on freeze protection
> > > > > > for any operations on the filesystem, not just pwrite().
> > > > > >
> > > > > > For example, the POC HSM code [1], stores the DATA_DIR_fd
> > > > > > from the lookup event and uses it in the handling of access events to
> > > > > > update the metadata files that store which parts of the file were already
> > > > > > filled (relying of fiemap is not always a valid option).
> > > > > >
> > > > > > That is the reason that in the POC patches [2], FMODE_NONOTIFY
> > > > > > is propagated from dirfd to an fd opened with openat(dirfd, ...), so
> > > > > > HSM has an indirect way to get a FMODE_NONOTIFY fd on any file.
> > > > > >
> > > > > > Another use case is that HSM may want to download content to a
> > > > > > temp file on the same filesystem, verify the downloaded content and
> > > > > > then clone the data into the accessed file range.
> > > > > >
> > > > > > I think that a PF_ flag (see below) would work best for all those cases.
> > > > >
> > > > > Ok, I agree that just using RWF_NOWAIT from the HSM daemon need not be
> > > > > enough for all sensible usecases to avoid deadlocks with freezing. However
> > > > > note that if we want to really properly handle all possible operations, we
> > > > > need to start handling error from all sb_start_write() and
> > > > > file_start_write() calls and there are quite a few of those.
> > > > >
> > > >
> > > > Darn, forgot about those.
> > > > I am starting to reconsider adding a freeze level.
> > > > I cannot shake the feeling that there is a simpler solution that escapes us...
> > > > Maybe fs anti-freeze (see blow).
> > > >
> > > > > > > > > In theory, I am not saying we should do it, but as a thought experiment:
> > > > > > > > > if the requirement from permission event handler is that is must use a
> > > > > > > > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make
> > > > > > > > > any filesystem modifications, then instead of aiming for NOWAIT
> > > > > > > > > semantics using sb_start_write_trylock(), we could use a freeze level
> > > > > > > > > SB_FREEZE_FSNOTIFY between
> > > > > > > > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT.
> > > > > > > > >
> > > > > > > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we
> > > > > > > > > make it clear how userspace should avoid nesting "VFS faults" there is
> > > > > > > > > a model that can solve the deadlock correctly.
> > > > > > > >
> > > > > > > > OK, yes, in principle another freeze level which could be used by handlers
> > > > > > > > of fanotify permission events would solve the deadlock as well. Just you
> > > > > > > > seem to like to tie this functionality to the particular fd returned from
> > > > > > > > fanotify and I'm not convinced that is a good idea. What if the application
> > > > > > > > needs to do write to some other location besides the one fd it got passed
> > > > > > > > from fanotify event? E.g. imagine it wants to fetch a whole subtree on
> > > > > > > > first access to any file in a subtree. Or maybe it wants to write to some
> > > > > > > > DB file containing current state or something like that.
> > > > > > > >
> > > > > > > > One solution I can imagine is to create an open flag that can be specified
> > > > > > > > on open which would result in the special behavior wrt fs freezing. If the
> > > > > > > > special behavior would be just trylocking the freeze protection then it
> > > > > > > > would be really easy. If the behaviour would be another freeze protection
> > > > > > > > level, then we'd need to make sure we don't generate another fanotify
> > > > > > > > permission event with such fd - autorejecting any such access is an obvious
> > > > > > > > solution but I'm not sure if practical for applications.
> > > > > > > >
> > > > > > >
> > > > > > > I had also considered marking the listener process with the FSNOTIFY
> > > > > > > context and enforcing this context on fanotify_read().
> > > > > > > In a way, this is similar to the NOIO and NOFS process context.
> > > > > > > It could be used to both act as a stronger form of FMODE_NONOTIFY
> > > > > > > and to activate the desired freeze protection behavior
> > > > > > > (whether trylock or SB_FREEZE_FSNOTIFY level).
> > > > > > >
> > > > > >
> > > > > > My feeling is that the best approach would be a PF_NOWAIT task flag:
> > > > > >
> > > > > > - PF_NOWAIT will prevent blocking on freeze protection
> > > > > > - PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT
> > > > > > - PF_NOWAIT could be auto-set on the reader of a permission event
> > > > > > - PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH
> > > > > > - We could add user API to set this personality explicitly to any task
> > > > > > - PF_NOWAIT without FMODE_NONOTIFY denies permission events
> > > > > >
> > > > > > Please let me know if you agree with this design and if so,
> > > > > > which of the methods to set PF_NOWAIT are a must for the first version
> > > > > > in your opinion?
> > > > >
> > > > > Yeah, the PF flag could work. It can be set for the process(es) responsible
> > > > > for processing the fanotify events and filling in filesystem contents. I
> > > > > don't think automatic setting of this flag is desirable though as it has
> > > > > quite wide impact and some of the consequences could be surprising.  I
> > > > > rather think it should be a conscious decision when setting up the process
> > > > > processing the events. So I think API to explicitly set / clear the flag
> > > > > would be the best. Also I think it would be better to capture in the name
> > > > > that this is really about fs freezing. So maybe PF_NOWAIT_FREEZE or
> > > > > something like that?
> > > > >
> > > >
> > > > Sure.
> > > >
> > > > > Also we were thinking about having an open(2) flag for this (instead of PF
> > > > > flag) in the past. That would allow finer granularity control of the
> > > > > behavior but I guess you are worried that it would not cover all the needed
> > > > > operations?
> > > > >
> > > >
> > > > Yeh, it seems like an API that is going to be harder to write safe HSM
> > > > programs with.
> > > >
> > > > > > Do you think we should use this method to fix the existing deadlocks
> > > > > > with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in?
> > > > >
> > > > > No, I think if someone cares about these, they should explicitly set the
> > > > > PF flag in their task processing the events.
> > > > >
> > > >
> > > > OK.
> > > >
> > > > I see an exit hatch in this statement -
> > > > If we are going leave the responsibility to avoid deadlock in corner
> > > > cases completely in the hands of the application, then I do not feel
> > > > morally obligated to create the PF_NOWAIT_FREEZE API *before*
> > > > providing the first HSM API.
> > > >
> > > > If the HSM application is running in a controlled system, on a filesystem
> > > > where fsfreeze is not expected or not needed, then a fully functional and
> > > > safe HSM does not require PF_NOWAIT_FREEZE API.
> > > >
> > > > Perhaps an API to make an fs unfreezable is just as practical and a much
> > > > easier option for the first version of HSM API?
> > > >
> > > > Imagine that HSM opens an fd and sends an EXCLUSIVE_FSFREEZER
> > > > ioctl. Then no other task can freeze the fs, for as long as the fd is open
> > > > apart from the HSM itself using this fd.
> > >
> > > This would mean you also prevent FREEZE_HOLDER_KERNEL requests which xfs
> > > uses for filesystem scrubbing iirc. I would reckon that you also run
> > > into problems with device mapper workloads where freeze/thaw requests
> > > from the block layer and into the filesystem layer are quite common.
> >
> > I agree. These cases will not play nicely with EXCLUSIVE_FSFREEZER.
> > The only case where the EXCLUSIVE_FSFREEZER API makes sense
> > is when the admin does not expect to meet any fsfreeze on the target fs and
> > wants to enforce that.
> >
> > >
> > > Have you given any thought to the idea - similar to a FUSE daemon - that
> > > you could register with a given filesystem as an HSM? Maybe integration
> > > like this is really undesirable for some reason but that may be an
> > > alternative.
> >
> > I am not sure what you mean by "register with a given filesystem"?
> > The comparison to FUSE daemon buffels me.  The main point with fanotify
> > HSM was for the user to be able to work natively on the target filesystem
> > without any "passthrough".
> >
> > FUSE passthrough is a valid way to implement HSM.
> > Many HSM already use FUSE and many HSM will continue to use FUSE.
> > Improving FUSE passthough performance (e.g. FUSE BPF) is another
> > way to improve HSM.
> >
> > Compared to fanotify HSM, FUSE passthrough is more versalite, but it
> > is also more resource expensive and some native fs features (e.g. ioctls)
> > will never work properly with FUSE passthrough.
> >
> > Not sure if that answers your question?
>
> This isn't about FUSE passthrough. Maybe the analogy doesn't work.
>
> What I just meant is similar to how fanotify registers itself as
> watching an inode or a mount or superblock one could have a new HSM
> watch type that lets the fs detect that it is watched by an HSM and then
> refuse to be frozen or other special behavior you might need. I don't
> know much about HSMs so I might just be talking nonsense.

Implementing mandatory anti-fsfreeze for any fs watches by HSM events
would be trivial and does not require specific filesystem integration.

I've already written a vfs API to advertise that filesystems are watches by
pre-modify HSM events in a later part of the series:
https://github.com/amir73il/linux/commit/88db3054b6bfa5ef4240175fa9efd6b3a871818c

However, if we do choose the solution of anti-fsfreeze,
I much prefer to leave it in the hands of userspace via
EXCLUSIVE_FSFREEZER API over mandatory anti-fsfreeze.

The main reason is that unlike what may be inferred from this thread,
HSM + fsfreeze CAN live quite well together, including HSM + xfs scrub,
HSM + LVM.

After the patches that are now in the vfs.rw branch, it takes much more
than just HSM + fsfreeze to cause a deadlock.

It requires HSM + fsfreeze + splice from a file on:
(
  - a nested overlayfs, whose lower^2 fs is on the "host" fs
  OR
  - a loop mounted filesystem, whose image file is on the "host" fs
)
AND
- the splice is to a file on the "host" fs

These two scenarios are not possible in a container, for example,
when the "host" fs is not exposed for write, directly or indirectly,
to the container.

And of course for many systems, those scenarios do not exist at all,
so there is no need for any anti-fsfreeze, not mandatory, nor user
controlled.

Thanks,
Amir.





[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux