On Wed, Jun 28, 2023 at 8:09 PM Ahelenia Ziemiańska <nabijaczleweli@xxxxxxxxxxxxxxxxxx> wrote: > > On Wed, Jun 28, 2023 at 09:33:43AM +0300, Amir Goldstein wrote: > > On Tue, Jun 27, 2023 at 11:50 PM Ahelenia Ziemiańska > > <nabijaczleweli@xxxxxxxxxxxxxxxxxx> wrote: > > > The current behaviour caused an asymmetry where some write APIs > > > (write, sendfile) would notify the written-to/read-from objects, > > > but splice wouldn't. > > > > > > This affected userspace which uses inotify, most notably coreutils > > > tail -f, to monitor pipes. > > > If the pipe buffer had been filled by a splice-family function: > > > * tail wouldn't know and thus wouldn't service the pipe, and > > > * all writes to the pipe would block because it's full, > > > thus service was denied. > > > (For the particular case of tail -f this could be worked around > > > with ---disable-inotify.) > > Is my understanding of the tail code wrong? > > My understanding was that tail_forever_inotify() is not called for > > pipes, or is it being called when tailing a mixed collection of pipes > > and regular files? If there are subtleties like those you need to > > mention them , otherwise people will not be able to reproduce the > > problem that you are describing. > I can't squeak to the code itself, but it's trivial to check: > $ tail -f fifo & > [1] 3213996 > $ echo zupa > fifo > zupa > $ echo zupa > fifo > zupa > $ echo zupa > fifo > zupa > $ cat /bin/tail > fifo > # ... > $ cat /bin/tail > fifo > hangs: the fifo is being watched with inotify. > > This happens regardless of other files being specified. > > tail -f doesn't follow FIFOs or pipes if they're fd 0 > (guaranteed by POSIX, coreutils conforms). > > OTOH, you could theoretically do > $ cat | tail -f /dev/fd/3 3<&0 > which first reads from the pipe until completion (⇔ hangup, cat died), > then hangs, because it's waiting for more data on the pipe. > > This can never happen under a normal scenario, but doing > $ echo zupa > /proc/3238590/fd/3 > a few times reveals it's using classic 1/s polling > (and splicing to /proc/3238590/fd/3 actually yields that data being > output from tail). > > > I need to warn you about something regarding this patch - > > often there are colliding interests among different kernel users - > > fsnotify use cases quite often collide with the interest of users tracking > > performance regressions and IN_ACCESS/IN_MODIFY on anonymous pipes > > specifically have been the source of several performance regression reports > > in the past and have driven optimizations like: > > > > 71d734103edf ("fsnotify: Rearrange fast path to minimise overhead > > when there is no watcher") > > e43de7f0862b ("fsnotify: optimize the case of no marks of any type") > > > > The moral of this story is: even if your patches are accepted by fsnotify > > reviewers, once they are staged for merging they will be subject to > > performance regression tests and I can tell you with certainty that > > performance regression will not be tolerated for the tail -f use case. > > I will push your v4 patches to a branch in my github, to let the kernel > > test bots run the performance regressions on it whenever they get to it. > > > > Moreover, if coreutils will change tail -f to start setting inotify watches > > on anonymous pipes (my understanding is that currently does not?), > > then any tail -f on anonymous pipe can cripple the "no marks on sb" > > performance optimization for all anonymous pipes and that would be > > a *very* unfortunate outcome. > As seen above, it doesn't set inotify watches on anon pipes, and > (since it manages to distinguish "| /dev/fd/3 3<&0" from "fifo", > so it must be going further than S_ISFIFO(fstat())) > this is an explicit design decision. > > If you refuse setting inotifies on anon pipes then that likely won't > impact any userspace program (it's pathological, and for tail-like cases > it'd only be meaningful for magic /proc/$pid/fd/* symlinks), > and if it's in the name of performance then no-one'll likely complain, > or even notice. > Unfortunately, it doesn't work this way - most of the time we are not supposed to break existing applications and I have no way of knowing if those applications exist... > > I think we need to add a rule to fanotify_events_supported() to ban > > sb/mount marks on SB_KERNMOUNT and backport this > > fix to LTS kernels (I will look into it) and then we can fine tune > > the s_fsnotify_connectors optimization in fsnotify_parent() for > > the SB_KERNMOUNT special case. > > This may be able to save your patch for the faith of NACKed > > for performance regression. > This goes over my head, but if Jan says it makes sense > then it must do. > Here you go: https://github.com/amir73il/linux/commits/fsnotify_pipe I ended up using SB_NOUSER which is narrower than SB_KERNMOUNT. Care to test? 1) Functionally - that I did not break your tests. 2) Optimization - that when one anon pipe has an inotify watch write to another anon pipe stops at fsnotify_inode_has_watchers() and does not get to fsnotify(). > > > Generate modify out before access in to let inotify merge the > > > modify out events in thr ipipe case. > > This comment is not clear and does not belong in this context, > > but it very much belongs near the code in question. > Turned it into > /* > * Generate modify out before access in: > * do_splice_from() may've already sent modify out, > * and this ensures the events get merged. > */ > for v5. OK. Thanks, Amir.