On Wed, Aug 23, 2023 at 5:37 PM Jan Kara <jack@xxxxxxx> wrote: > > On Fri 18-08-23 10:01:40, Amir Goldstein wrote: > > [adding fsdevel] > > > > On Thu, Aug 17, 2023 at 9:22 PM Jan Kara <jack@xxxxxxx> wrote: > > > > > > On Thu 17-08-23 10:13:20, Amir Goldstein wrote: > > > > [CC Christian and Jens for the NOWAIT semantics] > > > > > > > > Jan, > > > > > > > > I was going to post start-write-safe patches [1], but now that this > > > > design issue has emerged, with your permission, I would like to > > > > take this discussion to fsdevel, so please reply to the list. > > > > > > > > For those who just joined, the context is fanotify HSM API [2] > > > > proposal and avoiding the fanotify deadlocks I described in my > > > > talk on LSFMM [3]. > > > > > > OK, sure. I'm resending the reply which I sent only to you here. > > > > > > > On Wed, Aug 16, 2023 at 8:18 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote: > > > > > On Wed, Aug 16, 2023 at 12:47 PM Jan Kara <jack@xxxxxxx> wrote: > > > > > > On Mon 14-08-23 16:57:48, Amir Goldstein wrote: > > > > > > > On Mon, Jul 3, 2023 at 11:03 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote: > > > > > > > > On Mon, Jul 3, 2023, 9:30 PM Jan Kara <jack@xxxxxxx> wrote: > > > > > > > do_sendfile() or ovl_copy_up() from ovl1 to xfs1, end up calling > > > > > > > do_splice_direct() with sb_writers(xfs1) held. > > > > > > > Internally, the splice operation calls into ovl_splice_read(), which > > > > > > > has to call the rw_verify_area() check with the fsnotify hook on the > > > > > > > underlying xfs file. > > > > > > > > > > > > Right, we can call rw_verify_area() only after overlayfs has told us what > > > > > > is actually the underlying file that is really used for reading. Hum, > > > > > > nasty. > > > > > > > > > > > > > This is a violation of start-write-safe permission hooks and the > > > > > > > lockdep_assert that I added in fsnotify_permission() catches this > > > > > > > violation. > > > > > > > > > > > > > > I believe that a similar issue exists with do_splice_direct() from > > > > > > > an fs that is loop mounted over an image file on xfs1 to xfs1. > > > > > > > > > > > > I don't see how that would be possible. If you have a loop image file on > > > > > > filesystem xfs1, then the filesystem stored in the image is some xfs2. > > > > > > Overlayfs case is special here because it doesn't really work with > > > > > > filesystems but rather directory subtrees and that causes the > > > > > > complications. > > > > > > > > > > > > > > > > I was referring to sendfile() from xfs2 to xfs1. > > > > > sb_writers of xfs1 is held, but loop needs to read from the image file > > > > > in xfs1. No? > > > > > > Yes, that seems possible and it would indeed trigger rw_verify_area() in > > > do_iter_read() on xfs1 while freeze protection for xfs1 is held. > > > > > > > Recap for new people joining this thread. > > > > The following deadlock is possible in upstream kernel > > if fanotify permission event handler tries to make > > modifications to the filesystem it is watching in the context > > of FAN_ACCESS_PERM handling in some cases: > > > > P1 P2 P3 > > ----------- ------------ ------------ > > do_sendfile(fs1.out_fd, fs1.in_fd) > > -> sb_start_write(fs1.sb) > > -> do_splice_direct() freeze_super(fs1.sb) > > -> rw_verify_area() -> sb_wait_write(fs1.sb) ...... > > -> security_file_permission() > > -> fsnotify_perm() --> FAN_ACCESS_PERM > > -> do_unlinkat(fs1.dfd, ...) > > -> sb_start_write(fs1.sb) ...... > > > > start-write-safe patches [1] (not posted) are trying to solve this > > deadlock and prepare the ground for a new set of permission events > > with cleaner/safer semantics. > > > > The cases described above of sendfile from a file in loop mounted > > image over fs1 or overlayfs over fs1 into a file in fs1 can still deadlock > > despite the start-write-safe patches [1]. > > Yep, nice summary. > > > > > > > > My earlier patches had annotated the rw_verify_area() calls > > > > > > > in splice iterators as "MAY_NOT_START_WRITE" and the > > > > > > > userspace event listener was notified via flag whether modifying > > > > > > > the content of the file was allowed or not. > > > > > > > > > > > > > > I do not care so much about HSM being able to fill content of files > > > > > > > from a nested context like this, but we do need some way for > > > > > > > userspace to at least deny this access to a file with no content. > > > > > > > > > > > > > > Another possibility I thought of is to change file_start_write() > > > > > > > do use file_start_write_trylock() for files with FMODE_NONOTIFY. > > > > > > > This should make it safe to fill file content when event is generated > > > > > > > with sb_writers held (if freeze is in progress modification will fail). > > > > > > > Right? > > > > > > > > > > > > OK, so you mean that the HSM managing application will get an fd with > > > > > > FMODE_NONOTIFY set from an event and use it for filling in the file > > > > > > contents and the kernel functions grabbing freeze protection will detect > > > > > > the file flag and bail with error instead of waiting? That sounds like an > > > > > > attractive solution - the HSM managing app could even reply with error like > > > > > > ERESTARTSYS to fanotify event and make the syscall restart (which will > > > > > > block until the fs is unfrozen and then we can try again) and thus handle > > > > > > the whole problem transparently for the application generating the event. > > > > > > But I'm just dreaming now, for start it would be fine to just fail the > > > > > > syscall. > > > > > > > > > > > > > > > > IMO, a temporary error from an HSM controlled fs is not a big deal. > > > > > Same as a temporary error from a network fs or FUSE - should be > > > > > tolerable when the endpoint is not connected. > > > > > One of my patches allows HSM returning an error that is not EPERM as > > > > > response - this can be useful in such situations. > > > > > > OK. > > > > > > > > > I see only three possible problems with the solution. Firstly, the HSM > > > > > > application will have to be careful to only access the managed filesystem > > > > > > with the fd returned from fanotify event as otherwise it could deadlock on > > > > > > frozen filesystem. > > > > > > > > > > Isn't that already the case to some extent? > > > > > It is not wise for permission event handlers to perform operations > > > > > on fd without FMODE_NONOTIFY. > > > > > > Yes, it isn't a new problem. The amount of bug reports in our bugzilla > > > boiling down to this kind of self-deadlock just shows that fanotify users > > > get this wrong all the time. > > > > > > > > > That may seem obvious but practice shows that with > > > > > > complex software stacks with many dependencies, this is far from trivial. > > > > > > > > > > It will be especially important when we have permission events > > > > > on directory operations that need to perform operations on O_PATH > > > > > dirfd with FMODE_NONOTIFY. > > > > > > > > > > > Secondly, conditioning the trylock behavior on FMODE_NONOTIFY seems > > > > > > somewhat arbitary unless you understand our implementation issues and > > > > > > possibly it could regress current unsuspecting users. So I'm thinking > > > > > > whether we shouldn't rather have an explicit open flag requiring erroring > > > > > > out on frozen filesystem instead of blocking and the HSM application will > > > > > > need to use it to evade freezing deadlocks. Or we can just depend on > > > > > > RWF_NOWAIT flag (we currently block on frozen filesystem despite this flag > > > > > > but that can be viewed as a bug) but that's limited to writes (i.e., no way > > > > > > to e.g. do fallocate(2) without blocking on frozen fs). > > > > > > > > > > User cannot ask for fd with FMODE_NONOTIFY as it is - this is provided > > > > > as a means to an end by fanotify - so it would not be much different if > > > > > the new events would provide an fd with FMODE_NONOTIFY | > > > > > FMODE_NOWAIT. It will be up to documentation to say what is and what > > > > > is not allowed with the event->fd provided by fanotify. > > > > > > > > > > > > > This part needs clarifying. > > > > Technically, we can use the flag FMODE_NOWAIT to prevent waiting in > > > > file_start_write() *when* it is combined with FMODE_NONOTIFY. > > > > > > > > Yes, it would be a change of behavior, but I think it would be a good change, > > > > because current event->fd from FAN_ACCESS_PERM events is really not > > > > write-safe (could deadlock with freezing fs). > > > > > > As I wrote above I don't like the abuse of FMODE_NONOTIFY much. > > > FMODE_NONOTIFY means we shouldn't generate new fanotify events when using > > > this fd. It says nothing about freeze handling or so. Furthermore as you > > > observe FMODE_NONOTIFY cannot be set by userspace but practically all > > > current fanotify users need to also do IO on other files in order to handle > > > fanotify event. So ideally we'd have a way to do IO to other files in a > > > manner safe wrt freezing. We could just update handling of RWF_NOWAIT flag > > > to only trylock freeze protection - that actually makes a lot of sense to > > > me. The question is whether this is enough or not. > > > > > > > Maybe, but RWF_NOWAIT doesn't take us far enough, because writing > > to a file is not the only thing that HSM needs to do. > > Eventually, event handler for lookup permission events should be > > able to also create files without blocking on vfs level freeze protection. > > So this is what I wanted to clarify. The lookup permission event never gets > called under a freeze protection so the deadlock doesn't exist there. In > principle the problem exists only for access and modify events where we'd > be filling in file data and thus RWF_NOWAIT could be enough. Yes, you are right. It is possible that RWF_NOWAIT could be enough. But the discovery of the loop/ovl corner cases has shaken my confidence is the ability to guarantee that freeze protection is not held somehow indirectly. If I am not mistaken, FAN_OPEN_PERM suffers from the exact same ovl corner case, because with splice from ovl1 to fs1, fs1 freeze protection is held and: ovl_splice_read(ovl1.file) ovl_real_fdget() ovl_open_realfile(fs1.file) ... security_file_open(fs1.file) > That being > said I understand this may be assuming too much about the implementations > of HSM daemons and as you write, we might want to provide a way to do IO > not blocking on freeze protection from any hook. But I wanted to point this > out explicitely so that it's a conscious decision. > > > In theory, I am not saying we should do it, but as a thought experiment: > > if the requirement from permission event handler is that is must use a > > O_PATH | FMODE_NONOTIFY event->fd provided in the event to make > > any filesystem modifications, then instead of aiming for NOWAIT > > semantics using sb_start_write_trylock(), we could use a freeze level > > SB_FREEZE_FSNOTIFY between > > SB_FREEZE_WRITE and SB_FREEZE_PAGEFAULT. > > > > As a matter of fact, HSM is kind of a "VFS FAULT", so as long as we > > make it clear how userspace should avoid nesting "VFS faults" there is > > a model that can solve the deadlock correctly. > > OK, yes, in principle another freeze level which could be used by handlers > of fanotify permission events would solve the deadlock as well. Just you > seem to like to tie this functionality to the particular fd returned from > fanotify and I'm not convinced that is a good idea. What if the application > needs to do write to some other location besides the one fd it got passed > from fanotify event? E.g. imagine it wants to fetch a whole subtree on > first access to any file in a subtree. Or maybe it wants to write to some > DB file containing current state or something like that. > > One solution I can imagine is to create an open flag that can be specified > on open which would result in the special behavior wrt fs freezing. If the > special behavior would be just trylocking the freeze protection then it > would be really easy. If the behaviour would be another freeze protection > level, then we'd need to make sure we don't generate another fanotify > permission event with such fd - autorejecting any such access is an obvious > solution but I'm not sure if practical for applications. > I had also considered marking the listener process with the FSNOTIFY context and enforcing this context on fanotify_read(). In a way, this is similar to the NOIO and NOFS process context. It could be used to both act as a stronger form of FMODE_NONOTIFY and to activate the desired freeze protection behavior (whether trylock or SB_FREEZE_FSNOTIFY level). > > > > Then we have two options: > > > > 1. Generate "write-safe" FAN_PRE_ACCESS events only for fs that set > > > > FMODE_NOWAIT. > > > > Other fs will still generate the legacy FAN_ACCESS_PERM events > > > > which will be documented as write-unsafe > > > > 2. Use a new internal flag (e.g. FMODE_NOSBWAIT) for the stronger > > > > NOWAIT semantics that fanotify will always set on event->fd for the > > > > new write-safe FAN_PRE_ACCESS events > > > > > > > > TBH, the backing fs for HSM [2] is anyway supposed to be a "normal" > > > > local fs and I'd be more comfortable with fs opting in to support fanotify > > > > HSM events, so option #1 doesn't seem like a terrible idea?? > > > > > > Yes, I don't think 1) would be really be a limitation that would matter too > > > much in practice. > > > > > > > > Currently, the documentation is missing, because there are operations > > > > > that are not really safe in the permission event context, but there is no > > > > > documentation about that. > > > > > > > > > > > Thirdly, unless we > > > > > > propagate to the HSM app the information whether the freeze protection is > > > > > > held in the kernel or not, it doesn't know whether it should just wait for > > > > > > the filesystem to unfreeze or whether it should rather fail the request to > > > > > > avoid the deadlock. Hrm... > > > > > > > > > > informing HSM if freeze protection is held by this thread may be a little > > > > > challenging, but it is easy for me to annotate possible risky contexts > > > > > like the hooks inside splice read. > > > > > I am just not sure that waiting in HSM context is that important and > > > > > if it is not better to always fail in the frozen fs case. > > > > > > Always failing in frozen fs case is certainly possible but that will make > > > fs freezing a bit non-transparent - the application may treat such failures > > > as fatal errors and abort. So it's ok for the first POC but eventually we > > > should have a plan how we could make fs freezing transparent for the > > > applications even for HSM managed filesystems. > > > > > > > OK. ATM, the only solution I can think of that is both maintainable > > and lets HSM live in complete harmony with fsfreeze is adding the > > extra SB_FREEZE_FSNOTIFY level. > > To make things clear: if the only problems would be with those sendfile(2) > rare corner-cases, then I guess we can live with that and implement retry > in the kernel if userspace ever complains about unexpected short copy or > EAGAIN... The problem I see is that if we advise that all IO from the > fanotify event handler should happen in the freeze-safe manner, then with > the non-blocking solution all HSM IO suddently starts failing as soon as > the filesystem is frozen. And that is IMHO not nice. I see what you mean. The SB_FREEZE_FSNOTIFY design is much more clear in that respect. > > I am not sure how big of an overhead that would be? > > I imagine that sb_writers is large enough as it is w.r.t fitting into > > cache lines? > > I don't think that it adds much complexity or maintenance burden > > to vfs?? I'm really not sure. > > Well, the overhead is effectively one percpu counter per superblock. > Negligible in terms of CPU time, somewhat annoying in terms of memory but > bearable. So this may be a way forward. > Considering that I had already spend some time optimizing out the memory and performance overhead of s_write_srcu: https://github.com/amir73il/linux/commit/655606ca8cffba7636959547dc094651cef56f4d I guess I may be able to also optimize out the SB_FREEZE_FSNOTIFY percpu counter under the same object that is allocated lazily for sb on the first mark with the new write-safe fanotify events. Ok. I have something to try.. Thanks! Amir.