Re: Semantics of racy O_DIRECT writes

Travis Downs <travis.downs@xxxxxxxxx> · Fri, 31 Jan 2025 17:06:50 -0300

On Fri, Jan 10, 2025 at 5:58 AM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
>
> On Thu, Jan 09, 2025 at 10:51:19AM -0500, Theodore Ts'o wrote:
> > For Linux, if the block device is one that requires stable writes
> > (e.g., for iSCSI writes which include a checksum, or SCSI devices with
> > DIF/DIX enabled, or some software RAID 5 block device), where a racy
> > write might lead to an I/O error on the write or in the case of RAID
> > 5, in the subsequent read of the block, Linux will protect against
> > this happening by marking the page read-only while the I/O is
> > underway, either if it's happening via buffered writeback or O_DIRECT
> > writes, and then marking the page read/write afterwards.
>
> This only happens for buffered I/O, and not for direct I/O.

Thank you. To clarify, "this" means the RO protection, right? So in direct IO
there is no such protection?

>
> But that only matters when your operation is inside the sector (LBA)
> boundary that the device interface operates on, e.g. if you using 512
> byte sector size as long your stay outside of that you're still fine.

Sorry, it's not clear to me whether you are talking about the buffered or
the direct I/O case here.

Also, my problem description was not clear enough. I made it sound as if
our concurrent writes only touch different 1k blocks than the data we care
about, but there is actually no such alignment: we might write to the
immediately adjacent bytes, i.e. at alignment 1.

That is, we may write bytes [0, 777) of some 4K block, then send it down for
direct IO via io_submit, and before that returns we may write the next region
[777, 1234) or whatever. So we are definitely interested in the case where
there are writes within the same 512-byte sector.
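
To make the pattern concrete, here is a minimal single-threaded sketch in C.
It performs no real I/O: "submit" is modeled as the moment a checksumming
layer computes a per-sector checksum over the buffer, and the second memset
stands in for our racy write landing before the I/O completes. The names
(toy_checksum, racy_write_mismatch_mask) and the additive checksum are
illustrative assumptions only; a real device or file system would use
something like a CRC, and may compute it at a different point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    BLOCK_SIZE  = 4096,
    SECTOR_SIZE = 512,
    NSECTORS    = BLOCK_SIZE / SECTOR_SIZE
};

/* Toy additive checksum; a stand-in for a real CRC or guard tag. */
static uint32_t toy_checksum(const unsigned char *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];
    return sum;
}

/*
 * Returns a bitmask with bit i set if sector i's checksum taken at
 * "submit" time no longer matches the buffer after the racy write.
 */
unsigned racy_write_mismatch_mask(void)
{
    static unsigned char buf[BLOCK_SIZE];
    uint32_t at_submit[NSECTORS];
    unsigned mask = 0;

    /* First write: bytes [0, 777) of the 4K block. */
    memset(buf, 0, sizeof buf);
    memset(buf, 0xAA, 777);

    /* Hypothetical "submit": checksums are computed per 512-byte sector. */
    for (int i = 0; i < NSECTORS; i++)
        at_submit[i] = toy_checksum(buf + i * SECTOR_SIZE, SECTOR_SIZE);

    /* Racy write before completion: bytes [777, 1234). */
    memset(buf + 777, 0xBB, 1234 - 777);

    /* What the device would see if it read the buffer only now. */
    for (int i = 0; i < NSECTORS; i++)
        if (toy_checksum(buf + i * SECTOR_SIZE, SECTOR_SIZE) != at_submit[i])
            mask |= 1u << i;

    return mask;
}
```

Calling racy_write_mismatch_mask() returns 0x6: sector 0 (bytes 0..511) is
untouched, while sectors 1 and 2 both change under the in-flight I/O. Sector 1
is the interesting one, since it holds both data from the first write and
data from the racy one.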

>
> BUT: that assumes device checksums.  File systems can have checksums
> as well and have the same problem.  Because of that for example running
> Windows VM images which tend to somehow generate this pattern on qemu
> using direct I/O on btrfs files has historically causes a lot of
> problems.

So is it fair to say that for direct IO these types of racy writes are not safe?

Specifically, we are looking at behavior in a 3rd party, proprietary block
device (implemented as a kernel module) and are wondering if these types of
racy writes break the implied or explicit semantics of safe direct IO writes.