On Wed, Feb 28, 2024 at 8:13 AM Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> Last year, I talked about an interest in providing databases such as
> MySQL with the ability to issue writes that would not be torn as they
> write 16k database pages[1].
>
> [1] https://lwn.net/Articles/932900/
>
> There is a patch set being worked on by John Garry which provides
> stronger guarantees than what is actually required for this use case,
> called "atomic writes". The proposed interface for this facility
> involves passing a new flag to pwritev2(2), RWF_ATOMIC, which requests
> that the specific write be written to the storage device in an
> all-or-nothing fashion, and if it can not be guaranteed, that the
> write should fail. In this interface, if userspace sends a 128k
> write with the RWF_ATOMIC flag, and the storage device supports an
> all-or-nothing write with the given size and alignment, the
> kernel will guarantee that it will be sent as a single 128k request
> --- although from the database perspective, if it is using 16k
> database pages, it only needs a guarantee that if the write is torn,
> it only happens on a 16k boundary. That is, if the write is split into
> 32k and 96k requests, that would be totally fine as far as the database
> is concerned --- and so the RWF_ATOMIC interface is a stronger
> guarantee than what might be needed.
>
> So far, the "atomic write" patchset has only focused on Direct I/O,
> where this stronger guarantee is mostly harmless, even if it is
> unneeded for the original motivating use case. Which might be OK,
> since perhaps there might be other future use cases where they might
> want some 32k writes to be "atomic", while other 128k writes might
> want to be "atomic" (that is to say, persisted with all-or-nothing
> semantics), and the proposed RWF_ATOMIC interface might permit that
> --- even though no one can seem to come up with a credible use case
> that would require this.
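
For reference, a minimal sketch of what a caller of the proposed interface
might look like, assuming the glibc pwritev2(2) wrapper and the RWF_ATOMIC
flag as described in the patch set. The helper name atomic_pwrite() is made
up for illustration, and the fallback #define is an assumption for headers
that do not carry the flag (the value shown is the one used by later
mainline uapi headers).

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/* Assumption: fallback for uapi headers that predate the flag. */
#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040
#endif

/*
 * Hypothetical helper: request one all-or-nothing write of len bytes
 * at offset off. Returns the number of bytes written, or -1 with errno
 * set if the write cannot be issued with that guarantee (or on any
 * other error).
 */
static ssize_t atomic_pwrite(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

	return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
}
```

Note that the guarantee is requested per call here, which is exactly what
makes it awkward to carry through the page cache.
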
>
> However, this proposed interface is highly problematic when it comes
> to buffered writes, and the Postgres database uses buffered, not direct
> I/O writes. Suppose the database performs a 16k write, followed by a
> 64k write, followed by a 128k write --- and these writes are done
> using a file descriptor that does not have O_DIRECT enabled, and let's
> suppose they are written using the proposed RWF_ATOMIC flag. In
> order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
> kernel would need to store the fact that certain pages in the page
> cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
> were dirtied as part of a 32k RWF_ATOMIC write, etc, so that the
> writeback code knows what "atomic" guarantee was made at
> write time. This very quickly becomes a mess.
>
> Another interface that would be much simpler to implement for buffered
> writes would be one where the untorn write granularity is set on a
> per-file descriptor basis, using fcntl(2). We validate whether the untorn
> write granularity is one that can be supported when fcntl(2) is
> called, and we also store in the inode the largest untorn write
> granularity that has been requested by a file descriptor for that
> inode. (When the last file descriptor opened for writing has been
> closed, the largest untorn write granularity for that inode can be set
> back down to zero.)
>
> The write(2) system call will check whether the size and alignment of
> the write are valid given the requested untorn write granularity. And
> in the writeback path, the writeback will detect if there are
> contiguous (aligned) dirty pages, and make sure they are sent to the
> storage device in multiples of the largest requested untorn write
> granularity. This provides only the guarantees required by databases,
> and obviates the need to track which pages were dirtied by an
> RWF_ATOMIC flag, and the size of the RWF_ATOMIC write.
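
The two checks described above can be sketched in a few lines. This is only
an illustration under stated assumptions: the function names and the max_io
parameter (a device limit on request size) are hypothetical and are not
taken from any patch, and the granularity is simply a byte count with 0
meaning "none requested".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical write(2)-time check: with an untorn granularity of gran
 * bytes set on the descriptor, a write is acceptable only if both its
 * offset and its length are multiples of gran.
 */
static bool untorn_write_ok(uint64_t off, size_t len, uint64_t gran)
{
	if (gran == 0)
		return true;	/* no granularity requested, no constraint */
	return len != 0 && off % gran == 0 && len % gran == 0;
}

/*
 * Hypothetical writeback-time helper: given `remaining` bytes of
 * contiguous, gran-aligned dirty data, pick the size of the next
 * request so that it is a whole multiple of gran and no larger than
 * max_io. Returns 0 if the device cannot take even one unit.
 * Precondition: gran > 0 and remaining is a multiple of gran.
 */
static uint64_t next_request_len(uint64_t remaining, uint64_t max_io,
				 uint64_t gran)
{
	uint64_t cap = max_io - max_io % gran;	/* round down to a multiple */

	if (cap == 0)
		return 0;
	return remaining < cap ? remaining : cap;
}
```

With gran set to 16k, a 96k write at offset 32k passes the first check, and
a 128k dirty range with a 40k device limit goes out in 32k requests, so any
tear can only land on a 16k boundary.
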
>
> I'd like to discuss at LSF/MM what the best interface would be for
> buffered, untorn writes (I am deliberately avoiding the use of the
> word "atomic" since that presumes stronger guarantees than what we
> need, and because it has led to confusion in previous discussions),
> and what might be needed to support it.
>

Seems a duplicate of this topic proposed by Luis?

https://lore.kernel.org/linux-fsdevel/ZdfDxN26VOFaT_Tv@xxxxxxxxxxxxxxxxxxxxxx/

Maybe you guys want to co-lead this session?

Thanks,
Amir.