Re: [PATCH v3] errseq: split the ERRSEQ_SEEN flag into two new flags

Sargun Dhillon <sargun@xxxxxxxxx> · Thu, 17 Dec 2020 20:35:24 +0000

On Thu, Dec 17, 2020 at 10:00:37AM -0500, Jeff Layton wrote:
> Overlayfs's volatile mounts want to be able to sample an error for their
> own purposes, without preventing a later opener from potentially seeing
> the error.
> 
> The original reason for the ERRSEQ_SEEN flag was to make it so that we
> didn't need to increment the counter if nothing had observed the latest
> value and the error was the same. Eventually, a regression was reported
> in the errseq_t conversion, and we fixed that by using the ERRSEQ_SEEN
> flag to also mean that the error had been reported to userland at least
> once somewhere.
> 
> Those are two different states, however. If we instead take a second
> flag bit from the counter, we can track these two things separately, and
> accomodate the overlayfs volatile mount use-case.
> 
> Rename the ERRSEQ_SEEN flag to ERRSEQ_OBSERVED and use that to indicate
> that the counter must be incremented the next time an error is set.
> Also, add a new ERRSEQ_REPORTED flag that indicates whether the current
> error was returned to userland (and thus doesn't need to be reported on
> newly open file descriptions).
> 
> Test only for the OBSERVED bit when deciding whether to increment the
> counter and only for the REPORTED bit when deciding what to return in
> errseq_sample.
> 
> Add a new errseq_peek function to allow for the overlayfs use-case.
> This just grabs the latest counter and sets the OBSERVED bit, leaving the
> REPORTED bit untouched.
> 
> errseq_check_and_advance must now handle a single special case where
> it races against a "peek" of an as of yet unseen value. The do/while
> loop looks scary, but shouldn't loop more than once.
> 
> Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> ---
>  Documentation/core-api/errseq.rst |  22 +++--
>  include/linux/errseq.h            |   1 +
>  lib/errseq.c                      | 139 ++++++++++++++++++++++--------
>  3 files changed, 118 insertions(+), 44 deletions(-)
> 
> v3: rename SEEN/MUSTINC flags to REPORTED/OBSERVED
> 
> Hopefully the new flag names will make this a bit more clear. We could
> also rename some of the functions if that helps too. We could consider
> moving from errseq_sample/_check_and_advance to
> errseq_observe/errseq_report?  I'm not sure that helps anything though.
> 
> I know that Vivek and Sargun are working on syncfs() for overlayfs, so
> we probably don't want to merge this until that work is ready. I think

I disagree. I think that this work can land ahead of that, given that I think 
this is probably backportable to v5.10 without much risk, with the addition of 
your RFC v2 Overlay patch. I think the work proper long-term repair Vivek is 
embarking upon seems like it may be far more invasive.

> the errseq_peek call will need to be part of their solution for volatile
> mounts, however, so I'm fine with merging this via the overlayfs tree,
> once that work is complete.
> 
> diff --git a/Documentation/core-api/errseq.rst b/Documentation/core-api/errseq.rst
> index ff332e272405..ce46ddcc1487 100644
> --- a/Documentation/core-api/errseq.rst
> +++ b/Documentation/core-api/errseq.rst
> @@ -18,18 +18,22 @@ these functions can be called from any context.
>  Note that there is a risk of collisions if new errors are being recorded
>  frequently, since we have so few bits to use as a counter.
>  
> -To mitigate this, the bit between the error value and counter is used as
> -a flag to tell whether the value has been sampled since a new value was
> -recorded.  That allows us to avoid bumping the counter if no one has
> -sampled it since the last time an error was recorded.
> +To mitigate this, the bits between the error value and counter are used
> +as flags to tell whether the value has been sampled since a new value
> +was recorded, and whether the latest error has been seen by userland.
> +That allows us to avoid bumping the counter if no one has sampled it
> +since the last time an error was recorded, and also ensures that any
> +recorded error will be seen at least once.
>  
>  Thus we end up with a value that looks something like this:
>  
> -+--------------------------------------+----+------------------------+
> -| 31..13                               | 12 | 11..0                  |
> -+--------------------------------------+----+------------------------+
> -| counter                              | SF | errno                  |
> -+--------------------------------------+----+------------------------+
> ++---------------------------------+----+----+------------------------+
> +| 31..14                          | 13 | 12 | 11..0                  |
> ++---------------------------------+----+----+------------------------+
> +| counter                         | OF | RF | errno                  |
> ++---------------------------------+----+----+------------------------+
> +OF = ERRSEQ_OBSERVED flag
> +RF = ERRSEQ_REPORTED flag
>  
>  The general idea is for "watchers" to sample an errseq_t value and keep
>  it as a running cursor.  That value can later be used to tell whether
> diff --git a/include/linux/errseq.h b/include/linux/errseq.h
> index fc2777770768..7e3634269c95 100644
> --- a/include/linux/errseq.h
> +++ b/include/linux/errseq.h
> @@ -9,6 +9,7 @@ typedef u32	errseq_t;
>  
>  errseq_t errseq_set(errseq_t *eseq, int err);
>  errseq_t errseq_sample(errseq_t *eseq);
> +errseq_t errseq_peek(errseq_t *eseq);
>  int errseq_check(errseq_t *eseq, errseq_t since);
>  int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
>  #endif
> diff --git a/lib/errseq.c b/lib/errseq.c
> index 81f9e33aa7e7..8fd6be134dcc 100644
> --- a/lib/errseq.c
> +++ b/lib/errseq.c
> @@ -21,10 +21,14 @@
>   * Note that there is a risk of collisions if new errors are being recorded
>   * frequently, since we have so few bits to use as a counter.
>   *
> - * To mitigate this, one bit is used as a flag to tell whether the value has
> - * been sampled since a new value was recorded. That allows us to avoid bumping
> - * the counter if no one has sampled it since the last time an error was
> - * recorded.
> + * To mitigate this, one bit is used as a flag to tell whether the value has been
> + * observed in some fashion. That allows us to avoid bumping the counter if no
> + * one has sampled it since the last time an error was recorded.
> + *
> + * A second flag bit is used to indicate whether the latest error that has been
> + * recorded has been reported to userland. If the REPORTED bit is not set when the
> + * file is opened, then we ensure that the opener will see the error by setting
> + * its sample to 0.

Since there are only a few places that report to userland (as far as I can tell, 
a bit of usage in ceph), does it make sense to maintain this specific flag that
indicates it's reported to userspace? Instead can userspace keep a snapshot
of the last errseq it reported (say on the superblock), and use that to drive
reports to userspace?

It's a 32-bit sacrifice per SB though, but it means we can get rid of 
errseq_check_and_advance and potentially remove any need for locking and just
rely on cmpxchg.

>   *
>   * A new errseq_t should always be zeroed out.  A errseq_t value of all zeroes
>   * is the special (but common) case where there has never been an error. An all
> @@ -35,11 +39,33 @@
>  /* The low bits are designated for error code (max of MAX_ERRNO) */
>  #define ERRSEQ_SHIFT		ilog2(MAX_ERRNO + 1)
>  
> -/* This bit is used as a flag to indicate whether the value has been seen */
> -#define ERRSEQ_SEEN		(1 << ERRSEQ_SHIFT)
> +/* Flag to indicate whether the value will be or has been reported */
> +#define ERRSEQ_REPORTED		BIT(ERRSEQ_SHIFT)
> +
> +/* Flag to ndicate that error must be recorded */
> +#define ERRSEQ_OBSERVED		BIT(ERRSEQ_SHIFT + 1)
>  
>  /* The lowest bit of the counter */
> -#define ERRSEQ_CTR_INC		(1 << (ERRSEQ_SHIFT + 1))
> +#define ERRSEQ_CTR_INC		BIT(ERRSEQ_SHIFT + 2)
> +
> +/* Mask that just contains the counter bits */
> +#define ERRSEQ_CTR_MASK		~(ERRSEQ_CTR_INC - 1)
> +
> +/* Mask that just contains flags */
> +#define ERRSEQ_FLAG_MASK	(ERRSEQ_REPORTED|ERRSEQ_OBSERVED)
> +
> +/**
> + * errseq_same - return true if the errseq counters and values are the same
> + * @a: first errseq
> + * @b: second errseq
> + *
> + * Compare two errseqs and return true if they are the same, ignoring their
> + * flag bits.
> + */
> +static inline bool errseq_same(errseq_t a, errseq_t b)
> +{
> +	return (a & ~ERRSEQ_FLAG_MASK) == (b & ~ERRSEQ_FLAG_MASK);
> +}
>  
>  /**
>   * errseq_set - set a errseq_t for later reporting
> @@ -53,7 +79,7 @@
>   *
>   * Return: The previous value, primarily for debugging purposes. The
>   * return value should not be used as a previously sampled value in later
> - * calls as it will not have the SEEN flag set.
> + * calls as it will not have the OBSERVED flag set.
>   */
>  errseq_t errseq_set(errseq_t *eseq, int err)
>  {
> @@ -77,11 +103,11 @@ errseq_t errseq_set(errseq_t *eseq, int err)
>  	for (;;) {
>  		errseq_t new;
>  
> -		/* Clear out error bits and set new error */
> -		new = (old & ~(MAX_ERRNO|ERRSEQ_SEEN)) | -err;
> +		/* Clear out flag bits and old errors, and set new error */
> +		new = (old & ERRSEQ_CTR_MASK) | -err;
>  
> -		/* Only increment if someone has looked at it */
> -		if (old & ERRSEQ_SEEN)
> +		/* Only increment if we have to */
> +		if (old & ERRSEQ_OBSERVED)
>  			new += ERRSEQ_CTR_INC;
>  
>  		/* If there would be no change, then call it done */
> @@ -108,11 +134,38 @@ errseq_t errseq_set(errseq_t *eseq, int err)
>  EXPORT_SYMBOL(errseq_set);
>  
>  /**
> - * errseq_sample() - Grab current errseq_t value.
> + * errseq_peek - Grab current errseq_t value
> + * @eseq: Pointer to errseq_t to be sampled.
> + *
> + * In some cases, we need to be able to sample the errseq_t, but we're not
> + * in a situation where we can report the value to userland. Use this
> + * function to do that. This ensures that later errors will be recorded,
> + * and that any current errors are reported at least once when it is
> + * next sampled.
I would add that this function has side effects, and mutates errseq_t,
so callers understand they're mutating data.

> + *
> + * Context: Any context.
> + * Return: The current errseq value.
> + */
> +errseq_t errseq_peek(errseq_t *eseq)
> +{
> +	errseq_t old = READ_ONCE(*eseq);
> +	errseq_t new = old;
> +
> +	if (old != 0) {
> +		new |= ERRSEQ_OBSERVED;
> +		if (old != new)
> +			cmpxchg(eseq, old, new);
> +	}
> +	return new;
> +}
> +EXPORT_SYMBOL(errseq_peek);
> +
> +/**
> + * errseq_sample() - Sample errseq_t value, and ensure that unseen errors are reported
>   * @eseq: Pointer to errseq_t to be sampled.
>   *
>   * This function allows callers to initialise their errseq_t variable.
> - * If the error has been "seen", new callers will not see an old error.
> + * If the latest error has been "seen", new callers will not see an old error.
>   * If there is an unseen error in @eseq, the caller of this function will
>   * see it the next time it checks for an error.
>   *
> @@ -121,12 +174,11 @@ EXPORT_SYMBOL(errseq_set);
>   */
>  errseq_t errseq_sample(errseq_t *eseq)
>  {
> -	errseq_t old = READ_ONCE(*eseq);
> +	errseq_t new = errseq_peek(eseq);
>  
> -	/* If nobody has seen this error yet, then we can be the first. */
> -	if (!(old & ERRSEQ_SEEN))
> -		old = 0;
> -	return old;
> +	if (!(new & ERRSEQ_REPORTED))
> +		return 0;
> +	return new;
>  }
>  EXPORT_SYMBOL(errseq_sample);
>  
> @@ -145,7 +197,7 @@ int errseq_check(errseq_t *eseq, errseq_t since)
>  {
>  	errseq_t cur = READ_ONCE(*eseq);
>  
> -	if (likely(cur == since))
> +	if (errseq_same(cur, since))
>  		return 0;
>  	return -(cur & MAX_ERRNO);
>  }
> @@ -159,9 +211,9 @@ EXPORT_SYMBOL(errseq_check);
>   * Grab the eseq value, and see whether it matches the value that @since
>   * points to. If it does, then just return 0.
>   *
> - * If it doesn't, then the value has changed. Set the "seen" flag, and try to
> - * swap it into place as the new eseq value. Then, set that value as the new
> - * "since" value, and return whatever the error portion is set to.
> + * If it doesn't, then the value has changed. Set the REPORTED+OBSERVED flags, and
> + * try to swap it into place as the new eseq value. Then, set that value as
> + * the new "since" value, and return whatever the error portion is set to.
>   *
>   * Note that no locking is provided here for concurrent updates to the "since"
>   * value. The caller must provide that if necessary. Because of this, callers
> @@ -183,21 +235,38 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
>  	 */
>  	old = READ_ONCE(*eseq);
>  	if (old != *since) {
> +		int loops = 0;
> +
>  		/*
> -		 * Set the flag and try to swap it into place if it has
> -		 * changed.
> +		 * Set the flag and try to swap it into place if it has changed.
> +		 *
> +		 * If the swap doesn't occur, then it has either been updated by a
> +		 * writer who is setting a new error and/or bumping the counter, or
> +		 * another reader who is setting flags.
>  		 *
> -		 * We don't care about the outcome of the swap here. If the
> -		 * swap doesn't occur, then it has either been updated by a
> -		 * writer who is altering the value in some way (updating
> -		 * counter or resetting the error), or another reader who is
> -		 * just setting the "seen" flag. Either outcome is OK, and we
> -		 * can advance "since" and return an error based on what we
> -		 * have.
> +		 * We only need to retry in one case -- if we raced with another
> +		 * reader that is only setting the OBSERVED flag. We need the
> +		 * current value to have the REPORTED bit set if the other fields
> +		 * didn't change, or we might report the same error on newly opened
> +		 * files.
>  		 */
> -		new = old | ERRSEQ_SEEN;
> -		if (new != old)
> -			cmpxchg(eseq, old, new);
> +		do {
> +			if (unlikely(loops >= 2)) {
Any reason not to just make this:

if (WARN_ON_ONCE(loops >= 2))
	break

Given WARN_ON_ONCE already does the unlikely.

> +				/*
> +				 * This should never loop more than once, as any
> +				 * change not involving the REPORTED bit would also
> +				 * involve non-flag bits. WARN and just go with
> +				 * what we have in that case.
Couldn't we get a race of errors were being incremented faster than this loop runs?
> +				 */
> +				WARN_ON_ONCE(true);
> +				break;
> +			}
> +			loops++;
> +			new = old | ERRSEQ_REPORTED | ERRSEQ_OBSERVED;
> +			if (new == old)
> +				break;
> +			old = cmpxchg(eseq, old, new);
> +		} while (old == (new & ~ERRSEQ_REPORTED));
>  		*since = new;
>  		err = -(new & MAX_ERRNO);
>  	}
> -- 
> 2.29.2
> 

Thanks,

I'd like to see if we can get this in, and a janky version of the ovl_syncfs 
changes for volatile in stable, and then close the loop on the work that Vivek 
is doing to fix this for all of VFS in a later release.

Do you think that's reasonable?