Re: [PATCH v2 0/3] Check errors on sync for volatile overlayfs mounts

Jeff Layton <jlayton@xxxxxxxxxx> · Sat, 12 Dec 2020 06:48:07 -0500

On Sat, 2020-12-12 at 06:21 -0500, Jeff Layton wrote:
> On Fri, 2020-12-11 at 15:49 -0800, Sargun Dhillon wrote:
> > The semantics of errseq and syncfs are such that it is impossible to track
> > if any errors have occurred between the time the first error occurred, and
> > the user checks for the error (calls syncfs, and subsequently
> > errseq_check_and_advance.
> > 
> > Overlayfs has a volatile feature which short-circuits syncfs. This, in turn
> > makes it so that the user can have silent data corruption and not know
> > about it. The third patch in the series introduces behaviour that makes it
> > so that we can track errors, and bubble up whether the user has put
> > themselves in bad situation.
> > 
> > This required some gymanstics in errseq, and adding a wrapper around it
> > called "errseq_counter" (errseq + counter). The data structure uses an
> > atomic to track overflow errors. This approach, rather than moving to an
> > atomic64 / u64 is so we can avoid bloating every person that subscribes to
> > an errseq, and only add the subscriber behaviour to those who care (at the
> > expense of space.
> > 
> > The datastructure is write-optimized, and rightfully so, as the users
> > of the counter feature are just overlayfs, and it's called in fsync
> > checking, which is a rather seldom operation, and not really on
> > any hotpaths.
> > 
> > [1]: https://lore.kernel.org/linux-fsdevel/20201202092720.41522-1-sargun@xxxxxxxxx/
> > 
> > Sargun Dhillon (3):
> >   errseq: Add errseq_counter to allow for all errors to be observed
> >   errseq: Add mechanism to snapshot errseq_counter and check snapshot
> >   overlay: Implement volatile-specific fsync error behaviour
> > 
> >  Documentation/filesystems/overlayfs.rst |   8 ++
> >  fs/buffer.c                             |   2 +-
> >  fs/overlayfs/file.c                     |   5 +-
> >  fs/overlayfs/overlayfs.h                |   1 +
> >  fs/overlayfs/ovl_entry.h                |   3 +
> >  fs/overlayfs/readdir.c                  |   5 +-
> >  fs/overlayfs/super.c                    |  26 +++--
> >  fs/overlayfs/util.c                     |  28 +++++
> >  fs/super.c                              |   1 +
> >  fs/sync.c                               |   3 +-
> >  include/linux/errseq.h                  |  18 ++++
> >  include/linux/fs.h                      |   6 +-
> >  include/linux/pagemap.h                 |   2 +-
> >  lib/errseq.c                            | 129 ++++++++++++++++++++----
> >  14 files changed, 202 insertions(+), 35 deletions(-)
> > 
> 
> It would hel if you could more clearly lay out the semantics you're
> looking for. If I understand correctly:
> 
> You basically want to be able to sample the sb->s_wb_err of the upper
> layer at mount time and then always return an error if any new errors
> were recorded since that point.
> 
> If that's correct, then I'm not sure I get need for all of this extra
> counter machinery. Why not just sample it at mount time without
> recording it as 0 if the seen flag isn't set. Then just do an
> errseq_check against the upper superblock (without advancing) in the
> overlayfs ->sync_fs routine and just errseq_set that error into the
> overlayfs superblock? The syncfs syscall wrapper should then always
> report the latest error.
> 
> Or (even better) rework all of the sync_fs/syncfs mess to be more sane,
> so that overlayfs has more control over what errors get returned to
> userland. ISTM that the main problem you have is that the
> errseq_check_and_advance is done in the syscall wrapper, and that's
> probably not appropriate for your use-case.
> 

Something like this is what I was thinking (completely untested, of
course and needs comments). Would this not give you what you want for
volatile mount syncfs semantics?

diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 1b5a2094df8e..b7ada3fd04fd 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -79,6 +79,7 @@ struct ovl_fs {
 	atomic_long_t last_ino;
 	/* Whiteout dentry cache */
 	struct dentry *whiteout;
+	errseq_t err;
 };
 
 static inline struct vfsmount *ovl_upper_mnt(struct ovl_fs *ofs)
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 290983bcfbb3..35780776360c 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -264,8 +264,12 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
 	if (!ovl_upper_mnt(ofs))
 		return 0;
 
-	if (!ovl_should_sync(ofs))
-		return 0;
+	if (!ovl_should_sync(ofs)) {
+		ret = errseq_check(&upper_sb->s_wb_err, ofs->err);
+		errseq_set(&sb->s_wb_err, ret);
+		return ret;
+	}
+
 	/*
 	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
 	 * All the super blocks will be iterated, including upper_sb.
@@ -1945,7 +1949,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 
 		sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
 		sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
-
+		ofs->err = errseq_sample(&ovl_upper_mnt(ofs)->mnt_sb->s_wb_err);
 	}
 	oe = ovl_get_lowerstack(sb, splitlower, numlower, ofs, layers);
 	err = PTR_ERR(oe);
diff --git a/include/linux/errseq.h b/include/linux/errseq.h
index fc2777770768..0d9ead687dc1 100644
--- a/include/linux/errseq.h
+++ b/include/linux/errseq.h
@@ -9,6 +9,7 @@ typedef u32	errseq_t;
 
 errseq_t errseq_set(errseq_t *eseq, int err);
 errseq_t errseq_sample(errseq_t *eseq);
+errseq_t errseq_sample_advance(errseq_t *eseq);
 int errseq_check(errseq_t *eseq, errseq_t since);
 int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
 #endif
diff --git a/lib/errseq.c b/lib/errseq.c
index 81f9e33aa7e7..27ab3000507f 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -130,6 +130,25 @@ errseq_t errseq_sample(errseq_t *eseq)
 }
 EXPORT_SYMBOL(errseq_sample);
 
+errseq_t errseq_sample_advance(errseq_t *eseq)
+{
+	errseq_t old = READ_ONCE(*eseq);
+	errseq_t new = old;
+
+	/*
+	 * For the common case of no errors ever having been set, we can skip
+	 * marking the SEEN bit. Once an error has been set, the value will
+	 * never go back to zero.
+	 */
+	if (old != 0) {
+		new |= ERRSEQ_SEEN;
+		if (old != new)
+			cmpxchg(eseq, old, new);
+	}
+	return new;
+}
+EXPORT_SYMBOL(errseq_sample_advance);
+
 /**
  * errseq_check() - Has an error occurred since a particular sample point?
  * @eseq: Pointer to errseq_t value to be checked.