Re: fsync() errors is unsafe and risks data loss

Jeff Layton <jlayton@xxxxxxxxxx> · Fri, 13 Apr 2018 09:18:56 -0400

On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:
> On Thu, Apr 12, 2018 at 11:08:50AM -0400, Jeff Layton wrote:
> > On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote:
> > > On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:
> > > > When there is a writeback error, what should be done with the dirty
> > > > page(s)? Right now, we usually just mark them clean and carry on. Is
> > > > that the right thing to do?
> > > 
> > > There isn't a right thing. Whatever we do will be wrong for someone.
> > > 
> > > > One possibility would be to invalidate the range that failed to be
> > > > written (or the whole file) and force the pages to be faulted in again
> > > > on the next access. It could be surprising for some applications to not
> > > > see the results of their writes on a subsequent read after such an
> > > > event.
> > > 
> > > Not to mention a POSIX IO ordering violation. Seeing stale data
> > > after a "successful" write is simply not allowed.
> > 
> > I'm not so sure here, given that we're dealing with an error condition.
> > Are we really obligated not to allow any changes to pages that we can't
> > write back?
> 
> Posix says this about write():
> 
>   After a write() to a regular file has successfully returned:
> 
>      Any successful read() from each byte position in the file that
>      was modified by that write shall return the data specified by
>      the write() for that position until such byte positions are
>      again modified.
> 
> IOWs, even if there is a later error, we told the user the write was
> successful, and so according to POSIX we are not allowed to wind
> back the data to what it was before the write() occurred.
> 
> > Given that the pages are clean after these failures, we aren't doing
> > this even today:
> > 
> > Suppose we're unable to do writes but can do reads vs. the backing
> > store. After a wb failure, the page has the dirty bit cleared. If it
> > gets kicked out of the cache before the read occurs, it'll have to be
> > faulted back in. Poof -- your write just disappeared.
> 
> Yes - I was pointing out what the specification we supposedly
> conform to says about this behaviour, not that our current behaviour
> conforms to the spec.  Indeed, have you even noticed
> xfs_aops_discard_page() and it's surrounding context on page
> writeback submission errors?
> 
> To save you looking, XFS will trash the page contents completely on
> a filesystem level ->writepage error. It doesn't mark them "clean",
> doesn't attempt to redirty and rewrite them - it clears the uptodate
> state and may invalidate it completely. IOWs, the data written
> "sucessfully" to the cached page is now gone. It will be re-read
> from disk on the next read() call, in direct violation of the above
> POSIX requirements.
> 
> This is my point: we've done that in XFS knowing that we violate
> POSIX specifications in this specific corner case - it's the lesser
> of many evils we have to chose between. Hence if we chose to encode
> that behaviour as the general writeback IO error handling algorithm,
> then it needs to done with the knowledge it is a specification
> violation. Not to mention be documented as a POSIX violation in the
> various relevant man pages and that this is how all filesystems will
> behave on async writeback error.....
> 

Got it, thanks.

Yes, I think we ought to probably do the same thing globally. It's nice
to know that xfs has already been doing this. That makes me feel better
about making this behavior the gold standard for Linux filesystems.

So to summarize, at this point in the discussion, I think we want to
consider doing the following:

* better reporting from syncfs (report an error when even one inode
failed to be written back since last syncfs call). We'll probably
implement this via a per-sb errseq_t in some fashion, though there are
some implementation issues to work out.

* invalidate or clear uptodate flag on pages that experience writeback
errors, across filesystems. Encourage this as standard behavior for
filesystems and maybe add helpers to make it easier to do this.

Did I miss anything? Would that be enough to help the Pg usecase?

I don't see us ever being able to reasonably support its current
expectation that writeback errors will be seen on fd's that were opened
after the error occurred. That's a really thorny problem from an object
lifetime perspective.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>