On Fri, Jan 10, 2025 at 5:58 AM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
>
> On Thu, Jan 09, 2025 at 10:51:19AM -0500, Theodore Ts'o wrote:
> > For Linux, if the block device is one that requires stable writes
> > (e.g., for iSCSI writes which include a checksum, or SCSI devices with
> > DIF/DIX enabled, or some software RAID 5 block device), where a racy
> > write might lead to an I/O error on the write or in the case of RAID
> > 5, in the subsequent read of the block, Linux will protect against
> > this happening by marking the page read-only while the I/O is
> > underway, either if it's happening via buffered writeback or O_DIRECT
> > writes, and then marking the page read/write afterwards.
>
> This only happens for buffered I/O, and not for direct I/O.

Thank you. To clarify, "this" means the RO protection, right? So in
direct IO there is no such protection?

>
> But that only matters when your operation is inside the sector (LBA)
> boundary that the device interface operates on, e.g. if you using 512
> byte sector size as long your stay outside of that you're still fine.

Sorry it's not clear if you are talking about the buffered or direct
I/O case here.

Also, my problem description was not clear enough. I made it sound as
if we only concurrently write to different 1k blocks than the data we
care about, but there is actually no such alignment: we might write to
adjacent bytes of alignment 1. That is, we may write bytes [0, 777) of
some 4K block, then send it down for direct IO via io_submit, and
before that returns we may write the next region [777, 1234) or
whatever. So we are definitely interested in the case where there are
writes within the same 512-byte sector.

>
> BUT: that assumes device checksums. File systems can have checksums
> as well and have the same problem. Because of that for example running
> Windows VM images which tend to somehow generate this pattern on qemu
> using direct I/O on btrfs files has historically causes a lot of
> problems.

So is it fair to say that for direct IO these types of racy writes are
not safe? Specifically, we are looking at the behavior of a third-party,
proprietary block device (implemented as a kernel module), and we are
wondering whether these types of racy writes break the implied or
explicit semantics of safe direct IO writes.
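
To make the pattern concrete, here is a minimal user-space sketch of the
race I am asking about. It is only a simulation: it does not issue any
real I/O and does not claim to reflect what the kernel or any particular
driver does. The names `submit` and `complete`, and the use of a
per-sector CRC32, are assumptions made up for illustration of a layer
that checksums the buffer at submission time and transfers the data
later.

```python
import zlib

BLOCK = 4096   # size of the buffer handed to direct I/O
SECTOR = 512   # assumed sector (LBA) size of the device interface


def sector_sums(buf):
    """CRC32 of each 512-byte sector in the buffer."""
    return [zlib.crc32(bytes(buf[off:off + SECTOR]))
            for off in range(0, BLOCK, SECTOR)]


def submit(buf):
    """Hypothetical checksumming layer: record checksums at submit time."""
    return sector_sums(buf)


def complete(buf, sums_at_submit):
    """Hypothetical completion: the data is read from the buffer only now.

    Returns the indices of sectors whose contents no longer match the
    checksum computed at submit time.
    """
    now = sector_sums(buf)
    return [i for i, (a, b) in enumerate(zip(sums_at_submit, now)) if a != b]


buf = bytearray(BLOCK)

# Step 1: fill bytes [0, 777) of the 4K block.
buf[0:777] = b"a" * 777

# Step 2: hand the whole block down (stand-in for io_submit).
sums = submit(buf)

# Step 3: before the I/O completes, write the adjacent region [777, 1234).
buf[777:1234] = b"b" * (1234 - 777)

# Step 4: the I/O completes and sees data that changed after submission.
bad = complete(buf, sums)
print(bad)  # sectors 1 and 2 were modified in flight
```

In this sketch the first region ends inside sector 1, so the second write
lands in the same 512-byte sector as data that was already submitted, and
also spills into sector 2. Sector 0 is untouched and still matches.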