Re: Semantics of racy O_DIRECT writes

"Theodore Ts'o" <tytso@xxxxxxx> · Thu, 9 Jan 2025 10:51:19 -0500

On Thu, Jan 09, 2025 at 11:16:41AM -0300, Travis Downs wrote:
>  - So the question is largely about the possible outcomes of doing a zero-
>    copy O_DIRECT write (where the block driver will ultimately be reading
>    directly from the pages allocated by and passed to the kernel by the
>    userspace application) in the situation where a portion of the passed
>    pages are modified in a racy way by the userspace application by a
>    subsequent O_DIRECT write.

Yeah, sorry, I thought "modified via memcpy()" meant a memcpy to an
mmap'ed region, which would mean it's in the page cache.  If what you
mean is one thread modifying a block while the O_DIRECT write is
underway, the answer is "it depends".  For non-Linux systems, it will
almost certainly be racy.

For Linux, it depends on whether the block device requires stable
writes.  Examples are iSCSI writes which include a checksum, SCSI
devices with DIF/DIX enabled, and some software RAID 5 block devices.
On such devices a racy write might lead to an I/O error on the write
or, in the case of RAID 5, on a subsequent read of the block.  Linux
protects against this by marking the page read-only while the I/O is
underway, whether it's happening via buffered writeback or O_DIRECT
writes, and then marking the page read/write again afterwards.  Doing
this has performance implications, since changing the page tables and
sending the global interprocessor interrupts that this requires is not
free.  So we only do it for those block devices that require stable
writes, and even if you are interested in a Linux-only answer, it's
still "it depends".
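
As an illustration only (this is not spelled out in the message
above): recent kernels expose a per-queue sysfs attribute,
/sys/block/<disk>/queue/stable_writes, which reads as 1 when memory
must not be modified while it is being used in a write request to that
device.  The sketch below assumes that attribute is present; the
function names read_bool_attr() and disk_requires_stable_writes() are
hypothetical helpers, not kernel or libc APIs, and the disk name is
whatever appears under /sys/block on your system.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a sysfs-style boolean attribute file whose content is "0\n" or "1\n".
 * Returns 1 or 0 on success, -1 if the file cannot be opened or parsed.
 * (Hypothetical helper, for illustration.)
 */
int read_bool_attr(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;          /* attribute missing, e.g. older kernel */

    int value = -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;         /* unexpected content */
    fclose(f);

    return (value == 0 || value == 1) ? value : -1;
}

/*
 * Returns 1 if the named disk (e.g. "sda") reports that it requires
 * stable writes, 0 if it does not, and -1 if this cannot be determined.
 * Assumes the queue/stable_writes attribute exists on this kernel.
 */
int disk_requires_stable_writes(const char *disk)
{
    char path[256];

    assert(disk != NULL);
    int n = snprintf(path, sizeof(path),
                     "/sys/block/%s/queue/stable_writes", disk);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;          /* name too long for the buffer */

    return read_bool_attr(path);
}
```

A caller might check disk_requires_stable_writes("sda") before relying
on any particular behaviour, and treat both 0 and -1 as "assume a racy
modification can reach the device".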

Cheers,

					- Ted