On Wed, Jun 28, 2023 at 09:33:43AM +0300, Amir Goldstein wrote: > On Tue, Jun 27, 2023 at 11:50 PM Ahelenia Ziemiańska > <nabijaczleweli@xxxxxxxxxxxxxxxxxx> wrote: > > The current behaviour caused an asymmetry where some write APIs > > (write, sendfile) would notify the written-to/read-from objects, > > but splice wouldn't. > > > > This affected userspace which uses inotify, most notably coreutils > > tail -f, to monitor pipes. > > If the pipe buffer had been filled by a splice-family function: > > * tail wouldn't know and thus wouldn't service the pipe, and > > * all writes to the pipe would block because it's full, > > thus service was denied. > > (For the particular case of tail -f this could be worked around > > with ---disable-inotify.) > Is my understanding of the tail code wrong? > My understanding was that tail_forever_inotify() is not called for > pipes, or is it being called when tailing a mixed collection of pipes > and regular files? If there are subtleties like those you need to > mention them , otherwise people will not be able to reproduce the > problem that you are describing. I can't squeak to the code itself, but it's trivial to check: $ tail -f fifo & [1] 3213996 $ echo zupa > fifo zupa $ echo zupa > fifo zupa $ echo zupa > fifo zupa $ cat /bin/tail > fifo # ... $ cat /bin/tail > fifo hangs: the fifo is being watched with inotify. This happens regardless of other files being specified. tail -f doesn't follow FIFOs or pipes if they're fd 0 (guaranteed by POSIX, coreutils conforms). OTOH, you could theoretically do $ cat | tail -f /dev/fd/3 3<&0 which first reads from the pipe until completion (⇔ hangup, cat died), then hangs, because it's waiting for more data on the pipe. This can never happen under a normal scenario, but doing $ echo zupa > /proc/3238590/fd/3 a few times reveals it's using classic 1/s polling (and splicing to /proc/3238590/fd/3 actually yields that data being output from tail). > I need to warn you about something regarding this patch - > often there are colliding interests among different kernel users - > fsnotify use cases quite often collide with the interest of users tracking > performance regressions and IN_ACCESS/IN_MODIFY on anonymous pipes > specifically have been the source of several performance regression reports > in the past and have driven optimizations like: > > 71d734103edf ("fsnotify: Rearrange fast path to minimise overhead > when there is no watcher") > e43de7f0862b ("fsnotify: optimize the case of no marks of any type") > > The moral of this story is: even if your patches are accepted by fsnotify > reviewers, once they are staged for merging they will be subject to > performance regression tests and I can tell you with certainty that > performance regression will not be tolerated for the tail -f use case. > I will push your v4 patches to a branch in my github, to let the kernel > test bots run the performance regressions on it whenever they get to it. > > Moreover, if coreutils will change tail -f to start setting inotify watches > on anonymous pipes (my understanding is that currently does not?), > then any tail -f on anonymous pipe can cripple the "no marks on sb" > performance optimization for all anonymous pipes and that would be > a *very* unfortunate outcome. As seen above, it doesn't set inotify watches on anon pipes, and (since it manages to distinguish "| /dev/fd/3 3<&0" from "fifo", so it must be going further than S_ISFIFO(fstat())) this is an explicit design decision. If you refuse setting inotifies on anon pipes then that likely won't impact any userspace program (it's pathological, and for tail-like cases it'd only be meaningful for magic /proc/$pid/fd/* symlinks), and if it's in the name of performance then no-one'll likely complain, or even notice. > I think we need to add a rule to fanotify_events_supported() to ban > sb/mount marks on SB_KERNMOUNT and backport this > fix to LTS kernels (I will look into it) and then we can fine tune > the s_fsnotify_connectors optimization in fsnotify_parent() for > the SB_KERNMOUNT special case. > This may be able to save your patch for the faith of NACKed > for performance regression. This goes over my head, but if Jan says it makes sense then it must do. > > Generate modify out before access in to let inotify merge the > > modify out events in thr ipipe case. > This comment is not clear and does not belong in this context, > but it very much belongs near the code in question. Turned it into /* * Generate modify out before access in: * do_splice_from() may've already sent modify out, * and this ensures the events get merged. */ for v5.
Attachment:
signature.asc
Description: PGP signature