Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

Amir Goldstein <amir73il@xxxxxxxxx> · Thu, 24 Dec 2020 11:32:55 +0200

On Wed, Dec 23, 2020 at 10:44 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Wed, Dec 23, 2020 at 08:21:41PM +0000, Sargun Dhillon wrote:
> > On Wed, Dec 23, 2020 at 08:07:46PM +0000, Matthew Wilcox wrote:
> > > On Wed, Dec 23, 2020 at 07:29:41PM +0000, Sargun Dhillon wrote:
> > > > On Wed, Dec 23, 2020 at 06:50:44PM +0000, Matthew Wilcox wrote:
> > > > > On Wed, Dec 23, 2020 at 06:20:27PM +0000, Sargun Dhillon wrote:
> > > > > > I fail to see why this is neccessary if you incorporate error reporting into the
> > > > > > sync_fs callback. Why is this separate from that callback? If you pickup Jeff's
> > > > > > patch that adds the 2nd flag to errseq for "observed", you should be able to
> > > > > > stash the first errseq seen in the ovl_fs struct, and do the check-and-return
> > > > > > in there instead instead of adding this new infrastructure.
> > > > >
> > > > > You still haven't explained why you want to add the "observed" flag.
> > > >
> > > >
> > > > In the overlayfs model, many users may be using the same filesystem (super block)
> > > > for their upperdir. Let's say you have something like this:
> > > >
> > > > /workdir [Mounted FS]
> > > > /workdir/upperdir1 [overlayfs upperdir]
> > > > /workdir/upperdir2 [overlayfs upperdir]
> > > > /workdir/userscratchspace
> > > >
> > > > The user needs to be able to do something like:
> > > > sync -f ${overlayfs1}/file
> > > >
> > > > which in turn will call sync on the the underlying filesystem (the one mounted
> > > > on /workdir), and can check if the errseq has changed since the overlayfs was
> > > > mounted, and use that to return an error to the user.
> > >
> > > OK, but I don't see why the current scheme doesn't work for this.  If
> > > (each instance of) overlayfs samples the errseq at mount time and then
> > > check_and_advances it at sync time, it will see any error that has occurred
> > > since the mount happened (and possibly also an error which occurred before
> > > the mount happened, but hadn't been reported to anybody before).
> > >
> >
> > If there is an outstanding error at mount time, and the SEEN flag is unset,
> > subsequent errors will not increment the counter, until the user calls sync on
> > the upperdir's filesystem. If overlayfs calls check_and_advance on the upperdir's
> > super block at any point, it will then set the seen block, and if the user calls
> > syncfs on the upperdir, it will not return that there is an outstanding error,
> > since overlayfs just cleared it.
>
> Your concern is this case:
>
> fs is mounted on /workdir
> /workdir/A is written to and then closed.
> writeback happens and -EIO happens, but there's nobody around to care.
> /workdir/upperdir1 becomes part of an overlayfs mount
> overlayfs samples the error
> a user writes to /workdir/B, another -EIO occurs, but nothing happens
> someone calls syncfs on /workdir/upperdir/A, gets the EIO.
> a user opens /workdir/B and calls syncfs, but sees no error
>
> do i have that right?  or is it something else?

IMO it is something else. Others may disagree.
IMO the level of interference between users accessing overlay and users
accessing upper fs directly is not well defined and it can stay this way.

Concurrent access to  /workdir/upperdir/A via overlay and underlying fs
is explicitly warranted against in Documentation/filesystems/overlayfs.rst#
Changes to underlying filesystems:
"Changes to the underlying filesystems while part of a mounted overlay
filesystem are not allowed.  If the underlying filesystem is changed,
the behavior of the overlay is undefined, though it will not result in
a crash or deadlock."

The question is whether syncfs(open(/workdir/B)) is considered
"Changes to the underlying filesystems". Regardless of the answer,
this is not an interesting case IMO.

The real issue is with interference between overlays that share the
same upper fs, because this is by far and large the common use case
that is creating real problems for a lot of container users.

Workloads running inside containers (with overlayfs storage driver)
will never be as isolated as workloads running inside VMs, but it
doesn't mean we cannot try to improve.

In current master, syncfs() on any file by any container user will
result in full syncfs() of the upperfs, which is very bad for container
isolation. This has been partly fixed by Chengguang Xu [1] and I expect
his work will be merged soon. Overlayfs still does not do the writeback
and syncfs() in overlay still waits for all upper fs writeback to complete,
but at least syncfs() in overlay only kicks writeback for upper fs files
dirtied by this overlay.

[1] https://lore.kernel.org/linux-unionfs/CAJfpegsbb4iTxW8ZyuRFVNc63zg7Ku7vzpSNuzHASYZH-d5wWA@xxxxxxxxxxxxxx/

Sharing the same SEEN flag among thousands of containers is also
far from ideal, because effectively this means that any given workload
in any single container has very little chance of observing the SEEN flag.

To this end, I do agree with Matthew that overlayfs should sample errseq
and the best patchset to implement it so far IMO is Jeff's patchset [2].
This patch set was written to cater only "volatile" overlayfs mount, but
there is no reason not to use the same mechanism for regular overlay
mount. The only difference being that "volatile" overlay only checks for
error since mount on syncfs() (because "volatile" overlay does NOT
syncfs upper fs) and regular overlay checks and advances the overlay's
errseq sample on syncfs (and does syncfs upper fs).

Matthew, I hope that my explanation of the use case and Jeff's answer
is sufficient to understand why the split of the SEEN flag is needed.

[2] https://lore.kernel.org/linux-unionfs/20201213132713.66864-1-jlayton@xxxxxxxxxx/

w.r.t Vivek's patchset (this one), I do not object to it at all, but it fixes
a problem that Jeff's patch had already solved with an ugly hack:

  /* Propagate errors from upper to overlayfs */
  ret = errseq_check(&upper_sb->s_wb_err, ofs->err_mark);
  errseq_set(&sb->s_wb_err, ret);

Since Jeff's patch is minimal, I think that it should be the fix applied
first and proposed for stable (with adaptations for non-volatile overlay).

I guess that Vivek's patch 1/3 from this series [3] is also needed to
complement the work that should go to stable.

Vivek, Sargun,

Do you understand my proposal?
Do you agree with it as a way forward to address the various syncfs
issues for volatile/non-volatile that both of you were trying to address?

Sargun, I know this all discussion has forked from your volatile re-use
patch set, but let's not confuse fsdevel forks more than we have to.
The way forward for volatile re-use from this proposal is straight forward.

Thanks,
Amir.