(Sorry if I screwed up the thread structure - I had to reconstruct the reply-to and CC list from a web archive, as I've not found a way to properly download an mbox or such of the old content. I was subscribed to fsdevel but not the ext4 lists.)

Hi,

2018-04-10 18:43:56 Ted wrote:
> I'll try to give as unbiased a description as possible, but certainly
> some of this is going to be filtered by my own biases no matter how
> careful I can be.

Same ;)

2018-04-10 18:43:56 Ted wrote:
> So for better or for worse, there has not been as much investment in
> buffered I/O and data robustness in the face of exception handling of
> storage devices.

That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCMs, package managers and editors care whether they can get proper responses back from fsync() implying that things actually were synced.

2018-04-10 18:43:56 Ted wrote:
> So this is the explanation for why Linux handles I/O errors by
> clearing the dirty bit after reporting the error up to user space.
> And why there is not eagerness to solve the problem simply by "don't
> clear the dirty bit".  For every one Postgres installation that might
> have a better recover after an I/O error, there's probably a thousand
> clueless Fedora and Ubuntu users who will have a much worse user
> experience after a USB stick pull happens.

I don't think these goals are necessarily as contradictory as you paint them. At least in postgres' case we can deal with the fact that an fsync retry isn't going to fix the problem, by reentering crash recovery or just shutting down - therefore we don't need the kernel to keep all the dirty buffers around. A per-inode or per-superblock bit that causes further fsyncs to fail would be entirely sufficient for that.

While there are some differing opinions in the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure.
If writeback happens in the background, encounters an error, and undirties the buffer, we will happily carry on because we've never seen the error. That's when we're majorly screwed.

Both in postgres, *and* in a lot of other applications, it's not at all guaranteed that there's consistently one FD open for every file written. Therefore even the more recent per-fd errseq logic doesn't guarantee that the failure will ever be seen by an application diligently fsync()ing.

You'd not even need per-inode information or such in the case that the block device goes away entirely. As the FS isn't generally unmounted in that case, you could trivially keep a per-mount (or superblock?) bit that says "I died", and set that instead of keeping per-inode/whatever information.

2018-04-10 18:43:56 Ted wrote:
> If you are aware of a company who is willing to pay to have a new
> kernel feature implemented to meet your needs, we might be able to
> refer you to a company or a consultant who might be able to do that
> work.

I find that a bit of a disappointing response. It would be fair to say that for advanced features, but we're talking about the basic guarantee that fsync actually does something even remotely reasonable.

2018-04-10 19:44:48 Andreas wrote:
> The confusion is whether fsync() is a "level" state (return error
> forever if there were pages that could not be written), or an "edge"
> state (return error only for any write failures since the previous
> fsync() call).

I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. The fact that one needs to have an FD open from before any failing writes occurred to get a failure, *THAT'S* the big issue.

Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory, fsync everything, and *then* assume you're safe.
But unless I severely misunderstand something, that'd only be safe if you kept an FD open for every file, which isn't realistic for pretty obvious reasons.

2018-04-10 18:43:56 Ted wrote:
> I think Anthony Iliopoulos was pretty clear in his multiple
> descriptions in that thread of why the current behaviour is needed
> (OOM of the whole system if dirty pages are kept around forever), but
> many others were stuck on "I can't believe this is happening???  This
> is totally unacceptable and every kernel needs to change to match my
> expectations!!!" without looking at the larger picture of what is
> practical to change and where the issue should best be fixed.

Everyone can participate in discussions...

Greetings,

Andres Freund