Re: [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags

Amir Goldstein <amir73il@xxxxxxxxx> · Sat, 19 Dec 2020 11:09:50 +0200

On Sat, Dec 19, 2020 at 3:03 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Fri, 2020-12-18 at 23:44 +0000, Sargun Dhillon wrote:
> > On Thu, Dec 17, 2020 at 04:18:49PM -0500, Jeff Layton wrote:
> > > On Thu, 2020-12-17 at 20:35 +0000, Sargun Dhillon wrote:
> > > > On Thu, Dec 17, 2020 at 10:00:37AM -0500, Jeff Layton wrote:
> > > > > Overlayfs's volatile mounts want to be able to sample an error for their
> > > > > own purposes, without preventing a later opener from potentially seeing
> > > > > the error.
> > > > >
> > > > > The original reason for the ERRSEQ_SEEN flag was to make it so that we
> > > > > didn't need to increment the counter if nothing had observed the latest
> > > > > value and the error was the same. Eventually, a regression was reported
> > > > > in the errseq_t conversion, and we fixed that by using the ERRSEQ_SEEN
> > > > > flag to also mean that the error had been reported to userland at least
> > > > > once somewhere.
> > > > >
> > > > > Those are two different states, however. If we instead take a second
> > > > > flag bit from the counter, we can track these two things separately, and
> > > > > accomodate the overlayfs volatile mount use-case.
> > > > >
> > > > > Rename the ERRSEQ_SEEN flag to ERRSEQ_OBSERVED and use that to indicate
> > > > > that the counter must be incremented the next time an error is set.
> > > > > Also, add a new ERRSEQ_REPORTED flag that indicates whether the current
> > > > > error was returned to userland (and thus doesn't need to be reported on
> > > > > newly open file descriptions).
> > > > >
> > > > > Test only for the OBSERVED bit when deciding whether to increment the
> > > > > counter and only for the REPORTED bit when deciding what to return in
> > > > > errseq_sample.
> > > > >
> > > > > Add a new errseq_peek function to allow for the overlayfs use-case.
> > > > > This just grabs the latest counter and sets the OBSERVED bit, leaving the
> > > > > REPORTED bit untouched.
> > > > >
> > > > > errseq_check_and_advance must now handle a single special case where
> > > > > it races against a "peek" of an as of yet unseen value. The do/while
> > > > > loop looks scary, but shouldn't loop more than once.
> > > > >
> > > > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > > > > ---
> > > > >  Documentation/core-api/errseq.rst |  22 +++--
> > > > >  include/linux/errseq.h            |   1 +
> > > > >  lib/errseq.c                      | 139 ++++++++++++++++++++++--------
> > > > >  3 files changed, 118 insertions(+), 44 deletions(-)
> > > > >
> > > > > v3: rename SEEN/MUSTINC flags to REPORTED/OBSERVED
> > > > >
> > > > > Hopefully the new flag names will make this a bit more clear. We could
> > > > > also rename some of the functions if that helps too. We could consider
> > > > > moving from errseq_sample/_check_and_advance to
> > > > > errseq_observe/errseq_report?  I'm not sure that helps anything though.
> > > > >
> > > > > I know that Vivek and Sargun are working on syncfs() for overlayfs, so
> > > > > we probably don't want to merge this until that work is ready. I think
> > > >
> > > > I disagree. I think that this work can land ahead of that, given that I think
> > > > this is probably backportable to v5.10 without much risk, with the addition of
> > > > your RFC v2 Overlay patch. I think the work proper long-term repair Vivek is
> > > > embarking upon seems like it may be far more invasive.
> > > >
> > > > > the errseq_peek call will need to be part of their solution for volatile
> > > > > mounts, however, so I'm fine with merging this via the overlayfs tree,
> > > > > once that work is complete.
> > > > >
> > > > > diff --git a/Documentation/core-api/errseq.rst b/Documentation/core-api/errseq.rst
> > > > > index ff332e272405..ce46ddcc1487 100644
> > > > > --- a/Documentation/core-api/errseq.rst
> > > > > +++ b/Documentation/core-api/errseq.rst
> > > > > @@ -18,18 +18,22 @@ these functions can be called from any context.
> > > > >  Note that there is a risk of collisions if new errors are being recorded
> > > > >  frequently, since we have so few bits to use as a counter.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > -To mitigate this, the bit between the error value and counter is used as
> > > > > -a flag to tell whether the value has been sampled since a new value was
> > > > > -recorded.  That allows us to avoid bumping the counter if no one has
> > > > > -sampled it since the last time an error was recorded.
> > > > > +To mitigate this, the bits between the error value and counter are used
> > > > > +as flags to tell whether the value has been sampled since a new value
> > > > > +was recorded, and whether the latest error has been seen by userland.
> > > > > +That allows us to avoid bumping the counter if no one has sampled it
> > > > > +since the last time an error was recorded, and also ensures that any
> > > > > +recorded error will be seen at least once.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >  Thus we end up with a value that looks something like this:
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > -+--------------------------------------+----+------------------------+
> > > > > -| 31..13                               | 12 | 11..0                  |
> > > > > -+--------------------------------------+----+------------------------+
> > > > > -| counter                              | SF | errno                  |
> > > > > -+--------------------------------------+----+------------------------+
> > > > > ++---------------------------------+----+----+------------------------+
> > > > > +| 31..14                          | 13 | 12 | 11..0                  |
> > > > > ++---------------------------------+----+----+------------------------+
> > > > > +| counter                         | OF | RF | errno                  |
> > > > > ++---------------------------------+----+----+------------------------+
> > > > > +OF = ERRSEQ_OBSERVED flag
> > > > > +RF = ERRSEQ_REPORTED flag
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >  The general idea is for "watchers" to sample an errseq_t value and keep
> > > > >  it as a running cursor.  That value can later be used to tell whether
> > > > > diff --git a/include/linux/errseq.h b/include/linux/errseq.h
> > > > > index fc2777770768..7e3634269c95 100644
> > > > > --- a/include/linux/errseq.h
> > > > > +++ b/include/linux/errseq.h
> > > > > @@ -9,6 +9,7 @@ typedef u32     errseq_t;
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >  errseq_t errseq_set(errseq_t *eseq, int err);
> > > > >  errseq_t errseq_sample(errseq_t *eseq);
> > > > > +errseq_t errseq_peek(errseq_t *eseq);
> > > > >  int errseq_check(errseq_t *eseq, errseq_t since);
> > > > >  int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
> > > > >  #endif
> > > > > diff --git a/lib/errseq.c b/lib/errseq.c
> > > > > index 81f9e33aa7e7..8fd6be134dcc 100644
> > > > > --- a/lib/errseq.c
> > > > > +++ b/lib/errseq.c
> > > > > @@ -21,10 +21,14 @@
> > > > >   * Note that there is a risk of collisions if new errors are being recorded
> > > > >   * frequently, since we have so few bits to use as a counter.
> > > > >   *
> > > > > - * To mitigate this, one bit is used as a flag to tell whether the value has
> > > > > - * been sampled since a new value was recorded. That allows us to avoid bumping
> > > > > - * the counter if no one has sampled it since the last time an error was
> > > > > - * recorded.
> > > > > + * To mitigate this, one bit is used as a flag to tell whether the value has been
> > > > > + * observed in some fashion. That allows us to avoid bumping the counter if no
> > > > > + * one has sampled it since the last time an error was recorded.
> > > > > + *
> > > > > + * A second flag bit is used to indicate whether the latest error that has been
> > > > > + * recorded has been reported to userland. If the REPORTED bit is not set when the
> > > > > + * file is opened, then we ensure that the opener will see the error by setting
> > > > > + * its sample to 0.
> > > >
> > > > Since there are only a few places that report to userland (as far as I can tell,
> > > > a bit of usage in ceph), does it make sense to maintain this specific flag that
> > > > indicates it's reported to userspace? Instead can userspace keep a snapshot
> > > > of the last errseq it reported (say on the superblock), and use that to drive
> > > > reports to userspace?
> > > >
> > > > It's a 32-bit sacrifice per SB though, but it means we can get rid of
> > > > errseq_check_and_advance and potentially remove any need for locking and just
> > > > rely on cmpxchg.
> > >
> > > I think it makes sense. You are essentially adding a new class of
> > > "samplers" that use the error for their own purposes and won't be
> > > reporting it to userland via normal channels (syncfs, etc.). A single
> > > bit to indicate whether it has only been observed by such samplers is
> > > not a huge sacrifice.
> > >
> > > I worry too about race conditions when tracking this information across
> > > multiple words. You'll either need to use some locking to manage that,
> > > or get clever with memory barriers. Keeping everything in one word makes
> > > things a lot simpler.
> > > --
> > > Jeff Layton <jlayton@xxxxxxxxxx>
> > >
> >
> > I'll wait for Amir or Miklos to chime in, but I'm happy with this, and it solves
> > my problems.
> >
> > Do you want to respin this patch with the overlayfs patch as well, so
> > we can cherry-pick to stable, and then focus on how we want to deal
> > with this problem in the future?
>
> Assuming no one sees issues with it and that this solves the problem of
> writeback errors on volatile mounts, I'm fine with this going in via the
> overlayfs tree, just ahead of the patch that adds the first caller of
> errseq_peek.
>

I like the ERRSEQ_OBSERVED/ERRSEQ_REPORTED abstraction.
I agree with Jeff that ERRSEQ_SEEN wrongly multiplexies two
completely different things.

We've had to maintain backward compact to the syncfs() behavior
expected by existing users, but I can also imagine that fsinfo() would
want to check for sb error without consuming it, so errseq_peek()
looks like the right direction to take.

> I think we're finding that the thornier problem is how to pass along
> writeback errors on non-volatile mounts. That's probably going to
> require some vfs-layer surgery, so it may be best to wait until the
> shape of that is clear.

I have to say, following the thread of each of those problems is pretty
challenging. Following both issues in several intewinding threads is a
workout...

Thanks,
Amir.