Hi,

On 2018-04-12 17:21:44 -0400, Theodore Y. Ts'o wrote:
> On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote:
> > On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:
> > > I think a per-file or even per-blockdev/fs error state that'd be
> > > returned by fsync() would be more than sufficient.
> >
> > Ah; this was my suggestion to Jeff on IRC.  That we add a
> > per-superblock wb_err and then allow syncfs() to return it.  So you'd
> > open an fd on a directory (for example), and call syncfs() which
> > would return -EIO or -ENOSPC if either of those conditions had
> > occurred since you opened the fd.
>
> When or how would the per-superblock wb_err flag get cleared?

I don't think unmount + resettable via /sys would be an insane approach.
Requiring explicit action to acknowledge data loss isn't a crazy
concept.  But I think that's something reasonable minds could disagree
on.

> Would all subsequent fsync() calls on that file system now return EIO?
> Or would only all subsequent syncfs() calls return EIO?

If it were tied to syncfs(), I wonder if there's a way to have some
errseq-type logic: store a per-superblock (or whatever the equivalent
thing is) errseq value of errors.  Each fd calling syncfs() gets the
error reported once, and then stores the current errseq value in a
separate per-fd field so it doesn't see the same error again.  And if
that's considered too weird, only report errors to fds that were opened
before the error occurred.  (A rough sketch of the bookkeeping is at the
end of this mail.)

I can see writing a tool 'pg_run_and_sync /directo /ries -- command'
which opens an fd for each of the filesystems the directories reside on,
runs the command, and then calls syncfs() on each of those fds (also
sketched below).  That'd allow using backup/restore tools at least
semi-safely.

> > > I don't see that
> > > that'd realistically trigger OOM or the inability to unmount a
> > > filesystem.
> >
> > Ted's referring to the current state of affairs where the writeback
> > error is held in the inode; if we can't evict the inode because it's
> > holding the error indicator, that can send us OOM.  If instead we
> > transfer the error indicator to the superblock, then there's no
> > problem.
>
> Actually, I was referring to the pg-hackers original ask, which was
> that after an error, all of the dirty pages that couldn't be written
> out would stay dirty.

Well, it's an open list; everyone can argue.  People initially didn't
know about the OOM explanation, and it takes some time to revise one's
priors :).

I think it's a design question that reasonable people can disagree on
(at least if hot-removed devices are handled by throwing the data away
regardless).  But as it's clearly not viable, we can move on to
something that can actually solve the problem.

> If it's only a single inode which is pinned in memory with the dirty
> flag, that's bad, but it's not as bad as pinning all of the memory
> pages for which there was a failed write.  We would still need to
> invent some mechanism or define some semantic for when it would be OK
> to clear the per-inode flag and let the memory associated with that
> pinned inode get released, though.

Yeah, I agree that's not obvious.  One way would be to say that it's
only automatically cleared when you unlink the file.  A bit
heavy-handed, but not too crazy.

Greetings,

Andres Freund
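
A rough sketch of the errseq bookkeeping mentioned above, using the
existing errseq_* helpers.  The s_wb_err and f_sb_err fields are
hypothetical here, just to show where the per-superblock value and the
per-fd cursor would live; the real plumbing would obviously need more
thought:

#include <linux/errseq.h>
#include <linux/fs.h>

/* Writeback hits an error somewhere on the fs: record it in a
 * (hypothetical) per-superblock errseq_t. */
static void sb_mark_wb_err(struct super_block *sb, int err)
{
	errseq_set(&sb->s_wb_err, err);
}

/* At open() time, sample the current value so errors that happened
 * before this fd existed aren't reported to it. */
static void file_sample_sb_err(struct file *file)
{
	file->f_sb_err = errseq_sample(&file_inode(file)->i_sb->s_wb_err);
}

/* From syncfs(): return an error recorded since this fd last saw one,
 * and advance the per-fd cursor so the same error is only reported
 * once per fd. */
static int file_check_sb_err(struct file *file)
{
	struct super_block *sb = file_inode(file)->i_sb;

	return errseq_check_and_advance(&sb->s_wb_err, &file->f_sb_err);
}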
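
And roughly what I have in mind for the tool, relying only on the
existing syncfs(2) wrapper.  This is just a sketch: it only helps if
syncfs() actually reports errors back as discussed above, and
deduplicating directories that share a filesystem (e.g. by st_dev) is
left out for brevity:

#define _GNU_SOURCE		/* for syncfs() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fds[argc], nfds = 0, i, status, ret = 0;
	pid_t pid;

	/* Directories come first, then "--", then the command to run. */
	for (i = 1; i < argc && strcmp(argv[i], "--") != 0; i++) {
		fds[nfds] = open(argv[i], O_RDONLY | O_DIRECTORY);
		if (fds[nfds] < 0) {
			perror(argv[i]);
			return 1;
		}
		nfds++;
	}
	if (i >= argc - 1 || nfds == 0) {
		fprintf(stderr, "usage: %s dir... -- command [args]\n",
			argv[0]);
		return 1;
	}
	i++;			/* skip the "--" */

	pid = fork();
	if (pid == 0) {
		execvp(argv[i], &argv[i]);
		perror("execvp");
		_exit(127);
	}
	if (pid < 0 || waitpid(pid, &status, 0) < 0) {
		perror("fork/waitpid");
		return 1;
	}
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		ret = 1;

	/* The point of the exercise: force writeback on each filesystem
	 * and make any error that occurred since open() visible. */
	for (i = 0; i < nfds; i++) {
		if (syncfs(fds[i]) != 0) {
			perror("syncfs");
			ret = 1;
		}
	}
	return ret;
}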