On Mon, Nov 27, 2023 at 04:48:23PM +0200, Amir Goldstein wrote: > On Mon, Nov 27, 2023 at 3:56 PM Christian Brauner <brauner@xxxxxxxxxx> wrote: > > > > On Mon, Nov 20, 2023 at 06:59:47PM +0200, Amir Goldstein wrote: > > > On Mon, Nov 20, 2023 at 4:06 PM Jan Kara <jack@xxxxxxx> wrote: > > > > > > > > Hi Amir, > > > > > > > > sorry for a bit delayed reply, I did not get to "swapping in" HSM > > > > discussion during the Plumbers conference :) > > > > > > > > On Mon 13-11-23 13:50:03, Amir Goldstein wrote: > > > > > On Wed, Aug 23, 2023 at 7:31 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote: > > > > > > On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@xxxxxxx> wrote: > > > > > > > > Recap for new people joining this thread. > > > > > > > > > > > > > > > > The following deadlock is possible in upstream kernel > > > > > > > > if fanotify permission event handler tries to make > > > > > > > > modifications to the filesystem it is watching in the context > > > > > > > > of FAN_ACCESS_PERM handling in some cases: > > > > > > > > > > > > > > > > P1 P2 P3 > > > > > > > > ----------- ------------ ------------ > > > > > > > > do_sendfile(fs1.out_fd, fs1.in_fd) > > > > > > > > -> sb_start_write(fs1.sb) > > > > > > > > -> do_splice_direct() freeze_super(fs1.sb) > > > > > > > > -> rw_verify_area() -> sb_wait_write(fs1.sb) ...... > > > > > > > > -> security_file_permission() > > > > > > > > -> fsnotify_perm() --> FAN_ACCESS_PERM > > > > > > > > -> do_unlinkat(fs1.dfd, ...) > > > > > > > > -> sb_start_write(fs1.sb) ...... > > > > > > > > > > > > > > > > start-write-safe patches [1] (not posted) are trying to solve this > > > > > > > > deadlock and prepare the ground for a new set of permission events > > > > > > > > with cleaner/safer semantics. > > > > > > > > > > > > > > > > The cases described above of sendfile from a file in loop mounted > > > > > > > > image over fs1 or overlayfs over fs1 into a file in fs1 can still > > > > > > > > deadlock despite the start-write-safe patches [1]. > > > > > > > > > > > > > > Yep, nice summary. > > > > ... > > > > > > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much. > > > > > > > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using > > > > > > > > > this fd. It says nothing about freeze handling or so. Furthermore as you > > > > > > > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all > > > > > > > > > current fanotify users need to also do IO on other files in order to handle > > > > > > > > > fanotify event. So ideally we'd have a way to do IO to other files in a > > > > > > > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag > > > > > > > > > to only trylock freeze protection - that actually makes a lot of sense to > > > > > > > > > me. The question is whether this is enough or not. > > > > > > > > > > > > > > > > > > > > > > > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing > > > > > > > > to a file is not the only thing that HSM needs to do. > > > > > > > > Eventually, event handler for lookup permission events should be > > > > > > > > able to also create files without blocking on vfs level freeze protection. > > > > > > > > > > > > > > So this is what I wanted to clarify. The lookup permission event never gets > > > > > > > called under a freeze protection so the deadlock doesn't exist there. In > > > > > > > principle the problem exists only for access and modify events where we'd > > > > > > > be filling in file data and thus RWF_NOWAIT could be enough. > > > > > > > > > > > > Yes, you are right. > > > > > > It is possible that RWF_NOWAIT could be enough. > > > > > > > > > > > > But the discovery of the loop/ovl corner cases has shaken my > > > > > > confidence is the ability to guarantee that freeze protection is not > > > > > > held somehow indirectly. > > > > > > > > > > > > If I am not mistaken, FAN_OPEN_PERM suffers from the exact > > > > > > same ovl corner case, because with splice from ovl1 to fs1, > > > > > > fs1 freeze protection is held and: > > > > > > ovl_splice_read(ovl1.file) > > > > > > ovl_real_fdget() > > > > > > ovl_open_realfile(fs1.file) > > > > > > ... security_file_open(fs1.file) > > > > > > > > > > > > > That being > > > > > > > said I understand this may be assuming too much about the implementations > > > > > > > of HSM daemons and as you write, we might want to provide a way to do IO > > > > > > > not blocking on freeze protection from any hook. But I wanted to point this > > > > > > > out explicitly so that it's a conscious decision. > > > > > > > > > > > > > > > > > I agree and I'd like to explain using an example, why RWF_NOWAIT is > > > > > not enough for HSM needs. > > > > > > > > > > The reason is that often, when HSM needs to handle filling content > > > > > in FAN_PRE_ACCESS, it is not just about writing to the accessed file. > > > > > HSM needs to be able to avoid blocking on freeze protection > > > > > for any operations on the filesystem, not just pwrite(). > > > > > > > > > > For example, the POC HSM code [1], stores the DATA_DIR_fd > > > > > from the lookup event and uses it in the handling of access events to > > > > > update the metadata files that store which parts of the file were already > > > > > filled (relying of fiemap is not always a valid option). > > > > > > > > > > That is the reason that in the POC patches [2], FMODE_NONOTIFY > > > > > is propagated from dirfd to an fd opened with openat(dirfd, ...), so > > > > > HSM has an indirect way to get a FMODE_NONOTIFY fd on any file. > > > > > > > > > > Another use case is that HSM may want to download content to a > > > > > temp file on the same filesystem, verify the downloaded content and > > > > > then clone the data into the accessed file range. > > > > > > > > > > I think that a PF_ flag (see below) would work best for all those cases. > > > > > > > > Ok, I agree that just using RWF_NOWAIT from the HSM daemon need not be > > > > enough for all sensible usecases to avoid deadlocks with freezing. However > > > > note that if we want to really properly handle all possible operations, we > > > > need to start handling error from all sb_start_write() and > > > > file_start_write() calls and there are quite a few of those. > > > > > > > > > > Darn, forgot about those. > > > I am starting to reconsider adding a freeze level. > > > I cannot shake the feeling that there is a simpler solution that escapes us... > > > Maybe fs anti-freeze (see blow). > > > > > > > > > > > In theory, I am not saying we should do it, but as a thought experiment: > > > > > > > > if the requirement from permission event handler is that is must use a > > > > > > > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make > > > > > > > > any filesystem modifications, then instead of aiming for NOWAIT > > > > > > > > semantics using sb_start_write_trylock(), we could use a freeze level > > > > > > > > SB_FREEZE_FSNOTIFY between > > > > > > > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT. > > > > > > > > > > > > > > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we > > > > > > > > make it clear how userspace should avoid nesting "VFS faults" there is > > > > > > > > a model that can solve the deadlock correctly. > > > > > > > > > > > > > > OK, yes, in principle another freeze level which could be used by handlers > > > > > > > of fanotify permission events would solve the deadlock as well. Just you > > > > > > > seem to like to tie this functionality to the particular fd returned from > > > > > > > fanotify and I'm not convinced that is a good idea. What if the application > > > > > > > needs to do write to some other location besides the one fd it got passed > > > > > > > from fanotify event? E.g. imagine it wants to fetch a whole subtree on > > > > > > > first access to any file in a subtree. Or maybe it wants to write to some > > > > > > > DB file containing current state or something like that. > > > > > > > > > > > > > > One solution I can imagine is to create an open flag that can be specified > > > > > > > on open which would result in the special behavior wrt fs freezing. If the > > > > > > > special behavior would be just trylocking the freeze protection then it > > > > > > > would be really easy. If the behaviour would be another freeze protection > > > > > > > level, then we'd need to make sure we don't generate another fanotify > > > > > > > permission event with such fd - autorejecting any such access is an obvious > > > > > > > solution but I'm not sure if practical for applications. > > > > > > > > > > > > > > > > > > > I had also considered marking the listener process with the FSNOTIFY > > > > > > context and enforcing this context on fanotify_read(). > > > > > > In a way, this is similar to the NOIO and NOFS process context. > > > > > > It could be used to both act as a stronger form of FMODE_NONOTIFY > > > > > > and to activate the desired freeze protection behavior > > > > > > (whether trylock or SB_FREEZE_FSNOTIFY level). > > > > > > > > > > > > > > > > My feeling is that the best approach would be a PF_NOWAIT task flag: > > > > > > > > > > - PF_NOWAIT will prevent blocking on freeze protection > > > > > - PF_NOWAIT + FMODE_NOWAIT would imply RWF_NOWAIT > > > > > - PF_NOWAIT could be auto-set on the reader of a permission event > > > > > - PF_NOWAIT could be set on init of group FAN_CLASS_PRE_PATH > > > > > - We could add user API to set this personality explicitly to any task > > > > > - PF_NOWAIT without FMODE_NONOTIFY denies permission events > > > > > > > > > > Please let me know if you agree with this design and if so, > > > > > which of the methods to set PF_NOWAIT are a must for the first version > > > > > in your opinion? > > > > > > > > Yeah, the PF flag could work. It can be set for the process(es) responsible > > > > for processing the fanotify events and filling in filesystem contents. I > > > > don't think automatic setting of this flag is desirable though as it has > > > > quite wide impact and some of the consequences could be surprising. I > > > > rather think it should be a conscious decision when setting up the process > > > > processing the events. So I think API to explicitly set / clear the flag > > > > would be the best. Also I think it would be better to capture in the name > > > > that this is really about fs freezing. So maybe PF_NOWAIT_FREEZE or > > > > something like that? > > > > > > > > > > Sure. > > > > > > > Also we were thinking about having an open(2) flag for this (instead of PF > > > > flag) in the past. That would allow finer granularity control of the > > > > behavior but I guess you are worried that it would not cover all the needed > > > > operations? > > > > > > > > > > Yeh, it seems like an API that is going to be harder to write safe HSM > > > programs with. > > > > > > > > Do you think we should use this method to fix the existing deadlocks > > > > > with FAN_OPEN_PERM and FAN_ACCESS_PERM? without opt-in? > > > > > > > > No, I think if someone cares about these, they should explicitly set the > > > > PF flag in their task processing the events. > > > > > > > > > > OK. > > > > > > I see an exit hatch in this statement - > > > If we are going leave the responsibility to avoid deadlock in corner > > > cases completely in the hands of the application, then I do not feel > > > morally obligated to create the PF_NOWAIT_FREEZE API *before* > > > providing the first HSM API. > > > > > > If the HSM application is running in a controlled system, on a filesystem > > > where fsfreeze is not expected or not needed, then a fully functional and > > > safe HSM does not require PF_NOWAIT_FREEZE API. > > > > > > Perhaps an API to make an fs unfreezable is just as practical and a much > > > easier option for the first version of HSM API? > > > > > > Imagine that HSM opens an fd and sends an EXCLUSIVE_FSFREEZER > > > ioctl. Then no other task can freeze the fs, for as long as the fd is open > > > apart from the HSM itself using this fd. > > > > This would mean you also prevent FREEZE_HOLDER_KERNEL requests which xfs > > uses for filesystem scrubbing iirc. I would reckon that you also run > > into problems with device mapper workloads where freeze/thaw requests > > from the block layer and into the filesystem layer are quite common. > > I agree. These cases will not play nicely with EXCLUSIVE_FSFREEZER. > The only case where the EXCLUSIVE_FSFREEZER API makes sense > is when the admin does not expect to meet any fsfreeze on the target fs and > wants to enforce that. > > > > > Have you given any thought to the idea - similar to a FUSE daemon - that > > you could register with a given filesystem as an HSM? Maybe integration > > like this is really undesirable for some reason but that may be an > > alternative. > > I am not sure what you mean by "register with a given filesystem"? > The comparison to FUSE daemon buffels me. The main point with fanotify > HSM was for the user to be able to work natively on the target filesystem > without any "passthrough". > > FUSE passthrough is a valid way to implement HSM. > Many HSM already use FUSE and many HSM will continue to use FUSE. > Improving FUSE passthough performance (e.g. FUSE BPF) is another > way to improve HSM. > > Compared to fanotify HSM, FUSE passthrough is more versalite, but it > is also more resource expensive and some native fs features (e.g. ioctls) > will never work properly with FUSE passthrough. > > Not sure if that answers your question? This isn't about FUSE passthrough. Maybe the analogy doesn't work. What I just meant is similar to how fanotify registers itself as watching an inode or a mount or superblock one could have a new HSM watch type that lets the fs detect that it is watched by an HSM and then refuse to be frozen or other special behavior you might need. I don't know much about HSMs so I might just be talking nonsense.