Hi,

On 2018-04-12 17:21:44 -0400, Theodore Y. Ts'o wrote:
> On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote:
> > On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:
> > > I think a per-file or even per-blockdev/fs error state that'd be
> > > returned by fsync() would be more than sufficient.
> >
> > Ah; this was my suggestion to Jeff on IRC.  That we add a
> > per-superblock wb_err and then allow syncfs() to return it.  So you'd
> > open an fd on a directory (for example), and call syncfs() which
> > would return -EIO or -ENOSPC if either of those conditions had
> > occurred since you opened the fd.
>
> When or how would the per-superblock wb_err flag get cleared?

I don't think unmount + resettable via /sys would be an insane approach.
Requiring explicit action to acknowledge data loss isn't a crazy
concept.  But I think that's something reasonable minds could disagree
on.

> Would all subsequent fsync() calls on that file system now return EIO?
> Or would only all subsequent syncfs() calls return EIO?

If it were tied to syncfs(), I wonder if there's a way to have some
errseq-type logic: store a per-superblock (or whatever the equivalent
thing is) errseq value of errors.  Each fd calling syncfs() gets the
error reported once, and then stores the current errseq value in a
separate per-fd field so it doesn't see the same error again.  And if
that's considered too weird, only report errors to fds that were opened
before the error occurred.  (A rough sketch of the bookkeeping is at the
end of this mail.)

I can see writing a tool 'pg_run_and_sync /directo /ries -- command'
which opens an fd for each of the filesystems the directories reside on,
runs the command, and then calls syncfs() on each of those fds (also
sketched below).  That'd allow using backup/restore tools at least
semi-safely.

> > > I don't see that
> > > that'd realistically trigger OOM or the inability to unmount a
> > > filesystem.
> >
> > Ted's referring to the current state of affairs where the writeback
> > error is held in the inode; if we can't evict the inode because it's
> > holding the error indicator, that can send us OOM.  If instead we
> > transfer the error indicator to the superblock, then there's no
> > problem.
>
> Actually, I was referring to the pg-hackers original ask, which was
> that after an error, all of the dirty pages that couldn't be written
> out would stay dirty.

Well, it's an open list; everyone can argue.  People initially didn't
know about the OOM explanation, and it takes some time to revise one's
priors :).

I think it's a design question that reasonable people can disagree on
(at least if hot-removed devices are handled by throwing the data away
regardless).  But as it's clearly not viable, we can move on to
something that can actually solve the problem.

> If it's only a single inode which is pinned in memory with the dirty
> flag, that's bad, but it's not as bad as pinning all of the memory
> pages for which there was a failed write.  We would still need to
> invent some mechanism or define some semantic for when it would be OK
> to clear the per-inode flag and let the memory associated with that
> pinned inode get released, though.

Yeah, I agree that's not obvious.  One way would be to say that it's
only automatically cleared when you unlink the file.  A bit
heavy-handed, but not too crazy.

Greetings,

Andres Freund
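
A rough sketch of the errseq bookkeeping mentioned above, using the
existing errseq_* helpers.  The s_wb_err and f_sb_err fields are
hypothetical here, just to show where the per-superblock value and the
per-fd cursor would live; the real plumbing would obviously need more
thought:

#include <linux/errseq.h>
#include <linux/fs.h>

/* Writeback hits an error somewhere on the fs: record it in a
 * (hypothetical) per-superblock errseq_t. */
static void sb_mark_wb_err(struct super_block *sb, int err)
{
	errseq_set(&sb->s_wb_err, err);
}

/* At open() time, sample the current value so errors that happened
 * before this fd existed aren't reported to it. */
static void file_sample_sb_err(struct file *file)
{
	file->f_sb_err = errseq_sample(&file_inode(file)->i_sb->s_wb_err);
}

/* From syncfs(): return an error recorded since this fd last saw one,
 * and advance the per-fd cursor so the same error is only reported
 * once per fd. */
static int file_check_sb_err(struct file *file)
{
	struct super_block *sb = file_inode(file)->i_sb;

	return errseq_check_and_advance(&sb->s_wb_err, &file->f_sb_err);
}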
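
And roughly what I have in mind for the tool, relying only on the
existing syncfs(2) wrapper.  This is just a sketch: it only helps if
syncfs() actually reports errors back as discussed above, and
deduplicating directories that share a filesystem (e.g. by st_dev) is
left out for brevity:

#define _GNU_SOURCE		/* for syncfs() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fds[argc], nfds = 0, i, status, ret = 0;
	pid_t pid;

	/* Directories come first, then "--", then the command to run. */
	for (i = 1; i < argc && strcmp(argv[i], "--") != 0; i++) {
		fds[nfds] = open(argv[i], O_RDONLY | O_DIRECTORY);
		if (fds[nfds] < 0) {
			perror(argv[i]);
			return 1;
		}
		nfds++;
	}
	if (i >= argc - 1 || nfds == 0) {
		fprintf(stderr, "usage: %s dir... -- command [args]\n",
			argv[0]);
		return 1;
	}
	i++;			/* skip the "--" */

	pid = fork();
	if (pid == 0) {
		execvp(argv[i], &argv[i]);
		perror("execvp");
		_exit(127);
	}
	if (pid < 0 || waitpid(pid, &status, 0) < 0) {
		perror("fork/waitpid");
		return 1;
	}
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		ret = 1;

	/* The point of the exercise: force writeback on each filesystem
	 * and make any error that occurred since open() visible. */
	for (i = 0; i < nfds; i++) {
		if (syncfs(fds[i]) != 0) {
			perror("syncfs");
			ret = 1;
		}
	}
	return ret;
}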