Re: [PATCH v4 1/3] splice: always fsnotify_access(in), fsnotify_modify(out) on success

Amir Goldstein <amir73il@xxxxxxxxx> · Wed, 28 Jun 2023 21:38:03 +0300

On Wed, Jun 28, 2023 at 8:09 PM Ahelenia Ziemiańska
<nabijaczleweli@xxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, Jun 28, 2023 at 09:33:43AM +0300, Amir Goldstein wrote:
> > On Tue, Jun 27, 2023 at 11:50 PM Ahelenia Ziemiańska
> > <nabijaczleweli@xxxxxxxxxxxxxxxxxx> wrote:
> > > The current behaviour caused an asymmetry where some write APIs
> > > (write, sendfile) would notify the written-to/read-from objects,
> > > but splice wouldn't.
> > >
> > > This affected userspace which uses inotify, most notably coreutils
> > > tail -f, to monitor pipes.
> > > If the pipe buffer had been filled by a splice-family function:
> > >   * tail wouldn't know and thus wouldn't service the pipe, and
> > >   * all writes to the pipe would block because it's full,
> > > thus service was denied.
> > > (For the particular case of tail -f this could be worked around
> > >  with ---disable-inotify.)
> > Is my understanding of the tail code wrong?
> > My understanding was that tail_forever_inotify() is not called for
> > pipes, or is it being called when tailing a mixed collection of pipes
> > and regular files? If there are subtleties like those you need to
> > mention them , otherwise people will not be able to reproduce the
> > problem that you are describing.
> I can't squeak to the code itself, but it's trivial to check:
>   $ tail -f  fifo &
>   [1] 3213996
>   $ echo zupa > fifo
>   zupa
>   $ echo zupa > fifo
>   zupa
>   $ echo zupa > fifo
>   zupa
>   $ cat /bin/tail > fifo
>   # ...
>   $ cat /bin/tail > fifo
> hangs: the fifo is being watched with inotify.
>
> This happens regardless of other files being specified.
>
> tail -f doesn't follow FIFOs or pipes if they're fd 0
> (guaranteed by POSIX, coreutils conforms).
>
> OTOH, you could theoretically do
>   $ cat | tail -f /dev/fd/3 3<&0
> which first reads from the pipe until completion (⇔ hangup, cat died),
> then hangs, because it's waiting for more data on the pipe.
>
> This can never happen under a normal scenario, but doing
>   $ echo zupa > /proc/3238590/fd/3
> a few times reveals it's using classic 1/s polling
> (and splicing to /proc/3238590/fd/3 actually yields that data being
>  output from tail).
>
> > I need to warn you about something regarding this patch -
> > often there are colliding interests among different kernel users -
> > fsnotify use cases quite often collide with the interest of users tracking
> > performance regressions and IN_ACCESS/IN_MODIFY on anonymous pipes
> > specifically have been the source of several performance regression reports
> > in the past and have driven optimizations like:
> >
> > 71d734103edf ("fsnotify: Rearrange fast path to minimise overhead
> > when there is no watcher")
> > e43de7f0862b ("fsnotify: optimize the case of no marks of any type")
> >
> > The moral of this story is: even if your patches are accepted by fsnotify
> > reviewers, once they are staged for merging they will be subject to
> > performance regression tests and I can tell you with certainty that
> > performance regression will not be tolerated for the tail -f use case.
> > I will push your v4 patches to a branch in my github, to let the kernel
> > test bots run the performance regressions on it whenever they get to it.
> >
> > Moreover, if coreutils will change tail -f to start setting inotify watches
> > on anonymous pipes (my understanding is that currently does not?),
> > then any tail -f on anonymous pipe can cripple the "no marks on sb"
> > performance optimization for all anonymous pipes and that would be
> > a *very* unfortunate outcome.
> As seen above, it doesn't set inotify watches on anon pipes, and
> (since it manages to distinguish "| /dev/fd/3 3<&0" from "fifo",
>  so it must be going further than S_ISFIFO(fstat()))
> this is an explicit design decision.
>
> If you refuse setting inotifies on anon pipes then that likely won't
> impact any userspace program (it's pathological, and for tail-like cases
>  it'd only be meaningful for magic /proc/$pid/fd/* symlinks),
> and if it's in the name of performance then no-one'll likely complain,
> or even notice.
>

Unfortunately, it doesn't work this way - most of the time we are not
supposed to break existing applications and I have no way of knowing if
those applications exist...

> > I think we need to add a rule to fanotify_events_supported() to ban
> > sb/mount marks on SB_KERNMOUNT and backport this
> > fix to LTS kernels (I will look into it) and then we can fine tune
> > the s_fsnotify_connectors optimization in fsnotify_parent() for
> > the SB_KERNMOUNT special case.
> > This may be able to save your patch for the faith of NACKed
> > for performance regression.
> This goes over my head, but if Jan says it makes sense
> then it must do.
>

Here you go:
https://github.com/amir73il/linux/commits/fsnotify_pipe

I ended up using SB_NOUSER which is narrower than
SB_KERNMOUNT.

Care to test?
1) Functionally - that I did not break your tests.
2) Optimization - that when one anon pipe has an inotify watch
write to another anon pipe stops at fsnotify_inode_has_watchers()
and does not get to fsnotify().

> > > Generate modify out before access in to let inotify merge the
> > > modify out events in thr ipipe case.
> > This comment is not clear and does not belong in this context,
> > but it very much belongs near the code in question.
> Turned it into
>                 /*
>                  * Generate modify out before access in:
>                  * do_splice_from() may've already sent modify out,
>                  * and this ensures the events get merged.
>                  */
> for v5.

OK.

Thanks,
Amir.