Re: fsync() errors is unsafe and risks data loss

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 13 Apr 2018 08:44:04 +1000

On Thu, Apr 12, 2018 at 11:08:50AM -0400, Jeff Layton wrote:
> On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote:
> > On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:
> > > When there is a writeback error, what should be done with the dirty
> > > page(s)? Right now, we usually just mark them clean and carry on. Is
> > > that the right thing to do?
> > 
> > There isn't a right thing. Whatever we do will be wrong for someone.
> > 
> > > One possibility would be to invalidate the range that failed to be
> > > written (or the whole file) and force the pages to be faulted in again
> > > on the next access. It could be surprising for some applications to not
> > > see the results of their writes on a subsequent read after such an
> > > event.
> > 
> > Not to mention a POSIX IO ordering violation. Seeing stale data
> > after a "successful" write is simply not allowed.
> 
> I'm not so sure here, given that we're dealing with an error condition.
> Are we really obligated not to allow any changes to pages that we can't
> write back?

Posix says this about write():

  After a write() to a regular file has successfully returned:

     Any successful read() from each byte position in the file that
     was modified by that write shall return the data specified by
     the write() for that position until such byte positions are
     again modified.

IOWs, even if there is a later error, we told the user the write was
successful, and so according to POSIX we are not allowed to wind
back the data to what it was before the write() occurred.

> Given that the pages are clean after these failures, we aren't doing
> this even today:
> 
> Suppose we're unable to do writes but can do reads vs. the backing
> store. After a wb failure, the page has the dirty bit cleared. If it
> gets kicked out of the cache before the read occurs, it'll have to be
> faulted back in. Poof -- your write just disappeared.

Yes - I was pointing out what the specification we supposedly
conform to says about this behaviour, not that our current behaviour
conforms to the spec.  Indeed, have you even noticed
xfs_aops_discard_page() and it's surrounding context on page
writeback submission errors?

To save you looking, XFS will trash the page contents completely on
a filesystem level ->writepage error. It doesn't mark them "clean",
doesn't attempt to redirty and rewrite them - it clears the uptodate
state and may invalidate it completely. IOWs, the data written
"sucessfully" to the cached page is now gone. It will be re-read
from disk on the next read() call, in direct violation of the above
POSIX requirements.

This is my point: we've done that in XFS knowing that we violate
POSIX specifications in this specific corner case - it's the lesser
of many evils we have to chose between. Hence if we chose to encode
that behaviour as the general writeback IO error handling algorithm,
then it needs to done with the knowledge it is a specification
violation. Not to mention be documented as a POSIX violation in the
various relevant man pages and that this is how all filesystems will
behave on async writeback error.....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx