On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote: > On Wed, Dec 23, 2020 at 10:44 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: > > > > On Wed, Dec 23, 2020 at 08:21:41PM +0000, Sargun Dhillon wrote: > > > On Wed, Dec 23, 2020 at 08:07:46PM +0000, Matthew Wilcox wrote: > > > > On Wed, Dec 23, 2020 at 07:29:41PM +0000, Sargun Dhillon wrote: > > > > > On Wed, Dec 23, 2020 at 06:50:44PM +0000, Matthew Wilcox wrote: > > > > > > On Wed, Dec 23, 2020 at 06:20:27PM +0000, Sargun Dhillon wrote: > > > > > > > I fail to see why this is neccessary if you incorporate error reporting into the > > > > > > > sync_fs callback. Why is this separate from that callback? If you pickup Jeff's > > > > > > > patch that adds the 2nd flag to errseq for "observed", you should be able to > > > > > > > stash the first errseq seen in the ovl_fs struct, and do the check-and-return > > > > > > > in there instead instead of adding this new infrastructure. > > > > > > > > > > > > You still haven't explained why you want to add the "observed" flag. > > > > > > > > > > > > > > > In the overlayfs model, many users may be using the same filesystem (super block) > > > > > for their upperdir. Let's say you have something like this: > > > > > > > > > > /workdir [Mounted FS] > > > > > /workdir/upperdir1 [overlayfs upperdir] > > > > > /workdir/upperdir2 [overlayfs upperdir] > > > > > /workdir/userscratchspace > > > > > > > > > > The user needs to be able to do something like: > > > > > sync -f ${overlayfs1}/file > > > > > > > > > > which in turn will call sync on the the underlying filesystem (the one mounted > > > > > on /workdir), and can check if the errseq has changed since the overlayfs was > > > > > mounted, and use that to return an error to the user. > > > > > > > > OK, but I don't see why the current scheme doesn't work for this. If > > > > (each instance of) overlayfs samples the errseq at mount time and then > > > > check_and_advances it at sync time, it will see any error that has occurred > > > > since the mount happened (and possibly also an error which occurred before > > > > the mount happened, but hadn't been reported to anybody before). > > > > > > > > > > If there is an outstanding error at mount time, and the SEEN flag is unset, > > > subsequent errors will not increment the counter, until the user calls sync on > > > the upperdir's filesystem. If overlayfs calls check_and_advance on the upperdir's > > > super block at any point, it will then set the seen block, and if the user calls > > > syncfs on the upperdir, it will not return that there is an outstanding error, > > > since overlayfs just cleared it. > > > > Your concern is this case: > > > > fs is mounted on /workdir > > /workdir/A is written to and then closed. > > writeback happens and -EIO happens, but there's nobody around to care. > > /workdir/upperdir1 becomes part of an overlayfs mount > > overlayfs samples the error > > a user writes to /workdir/B, another -EIO occurs, but nothing happens > > someone calls syncfs on /workdir/upperdir/A, gets the EIO. > > a user opens /workdir/B and calls syncfs, but sees no error > > > > do i have that right? or is it something else? > > IMO it is something else. Others may disagree. > IMO the level of interference between users accessing overlay and users > accessing upper fs directly is not well defined and it can stay this way. > > Concurrent access to /workdir/upperdir/A via overlay and underlying fs > is explicitly warranted against in Documentation/filesystems/overlayfs.rst# > Changes to underlying filesystems: > "Changes to the underlying filesystems while part of a mounted overlay > filesystem are not allowed. If the underlying filesystem is changed, > the behavior of the overlay is undefined, though it will not result in > a crash or deadlock." > > The question is whether syncfs(open(/workdir/B)) is considered > "Changes to the underlying filesystems". Regardless of the answer, > this is not an interesting case IMO. > > The real issue is with interference between overlays that share the > same upper fs, because this is by far and large the common use case > that is creating real problems for a lot of container users. > > Workloads running inside containers (with overlayfs storage driver) > will never be as isolated as workloads running inside VMs, but it > doesn't mean we cannot try to improve. > > In current master, syncfs() on any file by any container user will > result in full syncfs() of the upperfs, which is very bad for container > isolation. This has been partly fixed by Chengguang Xu [1] and I expect > his work will be merged soon. Overlayfs still does not do the writeback > and syncfs() in overlay still waits for all upper fs writeback to complete, > but at least syncfs() in overlay only kicks writeback for upper fs files > dirtied by this overlay. > > [1] https://lore.kernel.org/linux-unionfs/CAJfpegsbb4iTxW8ZyuRFVNc63zg7Ku7vzpSNuzHASYZH-d5wWA@xxxxxxxxxxxxxx/ > > Sharing the same SEEN flag among thousands of containers is also > far from ideal, because effectively this means that any given workload > in any single container has very little chance of observing the SEEN flag. > > To this end, I do agree with Matthew that overlayfs should sample errseq > and the best patchset to implement it so far IMO is Jeff's patchset [2]. > This patch set was written to cater only "volatile" overlayfs mount, but > there is no reason not to use the same mechanism for regular overlay > mount. The only difference being that "volatile" overlay only checks for > error since mount on syncfs() (because "volatile" overlay does NOT > syncfs upper fs) and regular overlay checks and advances the overlay's > errseq sample on syncfs (and does syncfs upper fs). > > Matthew, I hope that my explanation of the use case and Jeff's answer > is sufficient to understand why the split of the SEEN flag is needed. > > [2] https://lore.kernel.org/linux-unionfs/20201213132713.66864-1-jlayton@xxxxxxxxxx/ > > w.r.t Vivek's patchset (this one), I do not object to it at all, but it fixes > a problem that Jeff's patch had already solved with an ugly hack: > > /* Propagate errors from upper to overlayfs */ > ret = errseq_check(&upper_sb->s_wb_err, ofs->err_mark); > errseq_set(&sb->s_wb_err, ret); > > Since Jeff's patch is minimal, I think that it should be the fix applied > first and proposed for stable (with adaptations for non-volatile overlay). > > I guess that Vivek's patch 1/3 from this series [3] is also needed to > complement the work that should go to stable. > > Vivek, Sargun, > > Do you understand my proposal? Yes. I agree that Jeff's patch should be added to stable. The fact we don't bubble up writeback errors turns out to be a real problem that I never knew was happening, but upon investigating, it looks like a real thing. I think we can use Jeff's hacky approach in stable, as it's far more minimal, and has a much lower chance of causing issues, but if we make further improvements, we wont be able to backport them to stable as easily. I have nothing explicitly against Vivek's approach though. > Do you agree with it as a way forward to address the various syncfs > issues for volatile/non-volatile that both of you were trying to address? Yes. I think Vivek's patchset of introducing a new superblock callback is the best approach. > > Sargun, I know this all discussion has forked from your volatile re-use > patch set, but let's not confuse fsdevel forks more than we have to. > The way forward for volatile re-use from this proposal is straight forward. I think that Vivek's patchset of adding a new callback, plus Jeff's new flag solves detection of errors for normal mounts, volatile mounts, and volatile remounts. I think I responded to Jeff's patch and it looked good, bar one of the loops. My only suggestion is that someone add the intended behaviour here as comment to super_ops of the new callback. > > Thanks, > Amir.