On Apr 10, 2018, at 10:50 AM, Joshua D. Drake <jd@xxxxxxxxxxxxxxxxx> wrote:
>
> -ext4,
>
> If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). You can see the thread here:
>
> https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@xxxxxxxxxxx
>
> The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.

Yes, this is a very long thread. The summary is Postgres is unhappy that fsync() on Linux (and also other OSes) returns an error once if there was a prior write() failure, instead of keeping dirty pages in memory forever and trying to rewrite them.

This behaviour has existed on Linux forever, and (for better or worse) is the only reasonable behaviour that the kernel can take. I've argued for the opposite behaviour at times, and some subsystems already do limited retries before finally giving up on a failed write, though there are also times when retrying at lower levels is pointless if a higher level of code can handle the failure (e.g. mirrored block devices, filesystem data mirroring, userspace data mirroring, or cross-node replication).

The confusion is whether fsync() is a "level" state (return error forever if there were pages that could not be written), or an "edge" state (return error only for any write failures since the previous fsync() call).

I think Anthony Iliopoulos was pretty clear in his multiple descriptions in that thread of why the current behaviour is needed (OOM of the whole system if dirty pages are kept around forever), but many others were stuck on "I can't believe this is happening??? This is totally unacceptable and every kernel needs to change to match my expectations!!!" without looking at the larger picture of what is practical to change and where the issue should best be fixed.

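To make the "level" versus "edge" distinction concrete, here is a toy model of the two reporting styles. It is only a sketch: ToyFile, writeback_failed() and fsync_results() are hypothetical names invented for illustration, nothing here is kernel code, and a boolean stands in for the error that the real fsync() would return.

```python
from dataclasses import dataclass


@dataclass
class ToyFile:
    """Toy model of how a writeback failure is reported to fsync() callers.

    Illustration only; not kernel code, and all names are hypothetical.
    """
    mode: str                              # "edge" or "level"
    failed_since_last_fsync: bool = False  # cleared by every fsync() call
    failed_ever: bool = False              # never cleared

    def writeback_failed(self):
        """Record that a dirty page could not be written to disk."""
        self.failed_since_last_fsync = True
        self.failed_ever = True

    def fsync(self):
        """Return True for success, False where a real fsync() would fail."""
        if self.mode == "level":
            # Level: keep reporting the error forever.
            return not self.failed_ever
        # Edge: report failures since the previous fsync(), then forget them.
        ok = not self.failed_since_last_fsync
        self.failed_since_last_fsync = False
        return ok


def fsync_results(mode, calls=3):
    """One writeback failure followed by `calls` consecutive fsync() calls."""
    f = ToyFile(mode)
    f.writeback_failed()
    return [f.fsync() for _ in range(calls)]


print("edge: ", fsync_results("edge"))   # [False, True, True]
print("level:", fsync_results("level"))  # [False, False, False]
```

In the edge model the second call reports success even though the failed page never reached disk, so a caller that simply retries fsync() until it succeeds draws the wrong conclusion.
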
Regardless of why this is the case, the net is that PG needs to deal with all of the systems that currently exist that have this behaviour, even if some day in the future it may change (though that is unlikely).

It seems ironic that "keep dirty pages in userspace until fsync() returns success" is totally unacceptable, but "keep dirty pages in the kernel" is fine. My (limited) understanding of databases was that they preferred to cache everything in userspace and use O_DIRECT to write to disk (which returns an error immediately if the write fails and does not double buffer data).

Cheers, Andreas
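As an illustration of "keep dirty pages in userspace until fsync() returns success", a minimal sketch follows. It is not how PostgreSQL handles this: write_durably() and its attempts parameter are hypothetical, the whole buffer is rewritten on every retry, and whether a retry against the same file can succeed at all depends on why the original write failed.

```python
import os


def write_durably(path, data, attempts=3):
    """Write `data` to `path`, holding on to `data` until fsync() succeeds.

    Hypothetical helper; assumes attempts >= 1. After a failed fsync() the
    buffer is written again from the userspace copy instead of calling
    fsync() a second time, because a later fsync() can report success
    without the failed pages ever having reached disk.
    """
    last_error = None
    for _ in range(attempts):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            view = memoryview(data)
            while len(view) > 0:
                written = os.write(fd, view)  # may be a short write
                view = view[written:]
            os.fsync(fd)
            return  # reported durable; the caller may drop its copy now
        except OSError as exc:
            last_error = exc  # keep the data and try the whole write again
        finally:
            os.close(fd)
    raise last_error
```

The caller owns the buffer for the whole time, so nothing depends on the kernel keeping failed pages around.
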