Last year, I talked about an interest in giving databases such as MySQL the ability to issue writes that would not be torn as they write 16k database pages[1].

[1] https://lwn.net/Articles/932900/

There is a patch set being worked on by John Garry which provides stronger guarantees than what is actually required for this use case, called "atomic writes". The proposed interface for this facility involves passing a new flag to pwritev2(2), RWF_ATOMIC, which requests that the specific write be written to the storage device in an all-or-nothing fashion, and if that cannot be guaranteed, that the write should fail. In this interface, if userspace sends a 128k write with the RWF_ATOMIC flag, and the storage device supports an all-or-nothing write of the given size and alignment, the kernel will guarantee that it is sent as a single 128k request --- although from the database's perspective, if it is using 16k database pages, it only needs a guarantee that if the write is torn, the tear only happens on a 16k boundary. That is, if the write is split into a 32k and a 96k request, that would be totally fine as far as the database is concerned --- and so the RWF_ATOMIC interface is a stronger guarantee than what might be needed.

So far, the "atomic write" patchset has only focused on Direct I/O, where this stronger guarantee is mostly harmless, even if it is unneeded for the original motivating use case. That might be OK, since there may be future use cases where an application wants some 32k writes to be "atomic" and some 128k writes to be "atomic" (that is to say, persisted with all-or-nothing semantics), and the proposed RWF_ATOMIC interface would permit that --- even though no one can seem to come up with a credible use case that would require this.

However, this proposed interface is highly problematic when it comes to buffered writes, and the Postgres database uses buffered, not direct I/O, writes.
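To make the gap between the two guarantees concrete, here is a minimal sketch in C. It assumes the RWF_ATOMIC interface as proposed; the fallback flag value is the one later mainline headers use and should be checked against your own headers. The helper names write_all_or_nothing() and split_is_harmless() are made up for illustration and are not part of the patch set.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Assumption: RWF_ATOMIC is the flag proposed by the patch set. The
 * fallback value below is the one later mainline uapi headers use;
 * verify it against your own kernel headers before relying on it.
 */
#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040
#endif

/*
 * Hypothetical helper: issue one write that must be persisted
 * all-or-nothing. Returns -1 with errno set if the request is rejected.
 */
ssize_t write_all_or_nothing(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

	return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
}

/*
 * Hypothetical helper: what the database actually needs. If a write that
 * starts on a database page boundary is split after first_part bytes,
 * the split is harmless as long as it also falls on a page boundary.
 */
bool split_is_harmless(size_t first_part, size_t db_page)
{
	assert(db_page != 0);
	return first_part % db_page == 0;
}
```

With 16k database pages, split_is_harmless(32 * 1024, 16 * 1024) is true, which is the 32k + 96k case above; a split after 20k would not be acceptable.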
Suppose the database performs a 16k write, followed by a 64k write, followed by a 128k write --- and these writes are done using a file descriptor that does not have O_DIRECT enabled, and let's suppose they are written using the proposed RWF_ATOMIC flag. In order to provide the (stronger than we need) RWF_ATOMIC guarantee, the kernel would need to store the fact that certain pages in the page cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages were dirtied as part of a 64k RWF_ATOMIC write, etc., so that the writeback code knows what "atomic" guarantee was made at write time. This very quickly becomes a mess.

Another interface, one that would be much simpler to implement for buffered writes, would be one where the untorn write granularity is set on a per-file descriptor basis, using fcntl(2). We validate whether the untorn write granularity is one that can be supported when fcntl(2) is called, and we also store in the inode the largest untorn write granularity that has been requested by a file descriptor for that inode. (When the last file descriptor opened for writing has been closed, the largest untorn write granularity for that inode can be set back down to zero.)

The write(2) system call will check whether the size and alignment of the write are valid given the requested untorn write granularity. And in the writeback path, the writeback code will detect whether there are contiguous (aligned) dirty pages, and make sure they are sent to the storage device in multiples of the largest requested untorn write granularity. This provides only the guarantees required by databases, and obviates the need to track which pages were dirtied by a write with the RWF_ATOMIC flag, and the size of that RWF_ATOMIC write.
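The two checks in the per-file-descriptor scheme can be sketched as plain C. This is only an illustration under assumptions: no such fcntl(2) command exists yet, the function names untorn_write_ok() and untorn_writeback_span() are hypothetical, and the granularity is assumed to be a power of two in bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical write(2)-time check: with an untorn granularity set on the
 * file descriptor, a write is valid only if its offset and length are
 * both multiples of that granularity. gran == 0 means "not requested".
 */
bool untorn_write_ok(uint64_t offset, size_t len, uint32_t gran)
{
	if (gran == 0)
		return true;
	if (gran & (gran - 1))		/* assumption: power of two only */
		return false;
	return len != 0 && offset % gran == 0 && len % gran == 0;
}

/*
 * Hypothetical writeback-time helper: given a contiguous dirty byte range
 * [start, end), compute the largest sub-range that starts and ends on a
 * multiple of gran, the inode's largest requested granularity. Returns
 * false if no whole unit fits inside the range.
 */
bool untorn_writeback_span(uint64_t start, uint64_t end, uint32_t gran,
			   uint64_t *span_start, uint64_t *span_end)
{
	assert(gran != 0 && (gran & (gran - 1)) == 0);

	*span_start = (start + gran - 1) / gran * gran;	/* round up */
	*span_end = end / gran * gran;			/* round down */
	return *span_start < *span_end;
}
```

Note that nothing here records the size of any individual write. The only per-inode state is the single granularity value, which is the point of the proposal.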
I'd like to discuss at LSF/MM what the best interface would be for buffered, untorn writes (I am deliberately avoiding the use of the word "atomic" since that presumes stronger guarantees than what we need, and because it has led to confusion in previous discussions), and what might be needed to support it.

						- Ted