On Wed, Jan 08, 2025 at 01:33:07PM -0300, Travis Downs wrote:
> Hello linux-block,
>
> We are experiencing data corruption in our storage intensive server
> application and are wondering about the semantics of "racy" O_DIRECT
> writes.
>
> Normally we target XFS, but the question is a general one.
>
> Specifically, imagine that we are writing a single 4K aligned page,
> with contents AB00 (each char being 1K bytes). We only care about
> the first 2048 bytes (the AB part). We are using libaio writes
> (io_submit) with O_DIRECT semantics. While the write is in flight,
> i.e., after we have submitted it and before we reap it in
> io_getevents, the userspace application writes into second half of
> the page, changing it to ABCD (let's say via memcpy). The first half
> is not changed.
>
> The question then is: is this safe in the sense that would result in
> ABxx being written where xx "is don't care"? Or could it do something
> crazier, like cause later writes to be ignored (e.g. if something in
> the kernel storage layer hashes the page for some purpose and
> this hash is out of sync with the page at the time it was captured, or
> something like that).
>
> Of course, the easy answer is "don't do that", but I still want to
> know what happens if we do.

Don't do that. Really.

First of all, your program might need to run on OS's other than Linux,
such as Legacy Unix systems, Mac OS X, etc., and officially, there are
zero guarantees about cache coherency between O_DIRECT writes and the
page cache. So if you use O_DIRECT I/O and buffered I/O or mmap access
to the same file... all bets are off.

By definition O_DIRECT I/O bypasses the page cache, so if there is a
copy of the file's data block in the page cache, for some
implementations of some OS's the page cache might contain the previous,
stale version of the block, so buffered reads might not see the updated
copy written by the O_DIRECT write.
And if the page is mmap'ed into some process's address space, and the
process dirties that page, that page will get written back to the disk,
potentially overwriting the O_DIRECT write.

Linux will make best efforts to maintain cache coherency between
O_DIRECT and the page cache. It does that by writing out the page in
the page cache if it is dirty, and then evicting the page from the page
cache. In practice this is good enough for something like a database
which normally uses O_DIRECT for its transactions, but which locks the
database so it can take a consistent snapshot and then does the backup
via buffered I/O. That will work, since if the database weren't locked
while taking the backup, the result would be a complete mess, and
O_DIRECT vs. page cache coherency would be the *least* of your worries.

But in general, don't mix buffered/mmap and O_DIRECT I/O to the same
file. Just don't. It might work, but remember that the raison d'etre
for O_DIRECT is performance in support of databases and storage systems
where developers Know What They Are Doing(tm) and Follow The Rules(tm).
Linux's cache coherency is best efforts only (and other OS's might not
even bother), and database developers and sysadmins would be sad if we
compromised O_DIRECT performance just to make things 100% safe for
people who want to do things which break the rules.

If you like breaking rules, don't use O_DIRECT. You'll be happier for
it, as will hapless future users of your programs. :-)

Remember, good programs are maintainable and portable. What if someone
takes your program and tries to make it work on MacOS?

Cheers,

					- Ted

P.S. I commend to you the ten commandments for C programmers,
especially the last one. Remember, all the world's not Linux!

https://www.lysator.liu.se/c/ten-commandments.html
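
For the scenario in the quoted question, one way to follow the "don't
do that" advice is to never hand the kernel a page the application may
still scribble on. The sketch below copies the bytes that matter into a
private, aligned staging buffer and writes only from that. The helper
name write_block_stable, the 4096-byte BLOCK_SIZE, and the retry
without O_DIRECT are assumptions made for illustration; none of them
come from this thread. It uses a synchronous pwrite() to stay free of
library dependencies. With io_submit() the same idea would apply, but
the staging buffer would have to stay untouched until io_getevents()
reaps the completion.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on glibc */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* headers without O_DIRECT: plain write */
#endif

#define BLOCK_SIZE 4096         /* assumed block size and alignment */

/*
 * Hypothetical helper: write one BLOCK_SIZE block at `offset` from a
 * private, zero-padded, aligned copy of the first `len` bytes of `src`.
 * The kernel only ever sees the staging copy, so later changes to `src`
 * cannot race with the write.
 * Returns 0 on success, -1 on error with errno set.
 */
static int write_block_stable(const char *path, off_t offset,
                              const void *src, size_t len)
{
    /* First try O_DIRECT, then (assumption, to keep the sketch runnable
     * on filesystems that reject O_DIRECT) an ordinary write. A real
     * storage engine would probably treat the rejection as fatal. */
    const int modes[2] = { O_DIRECT, 0 };
    void *staging = NULL;
    int rc = -1;

    if (len > BLOCK_SIZE || offset % BLOCK_SIZE != 0) {
        errno = EINVAL;
        return -1;
    }

    /* O_DIRECT wants the memory buffer aligned, not just the offset. */
    int err = posix_memalign(&staging, BLOCK_SIZE, BLOCK_SIZE);
    if (err != 0) {
        errno = err;
        return -1;
    }
    memset(staging, 0, BLOCK_SIZE);
    memcpy(staging, src, len);

    for (int i = 0; i < 2 && rc != 0; i++) {
        int fd = open(path, O_WRONLY | O_CREAT | modes[i], 0644);
        if (fd >= 0) {
            ssize_t n = pwrite(fd, staging, BLOCK_SIZE, offset);
            if (n == BLOCK_SIZE)
                rc = 0;
            else if (n >= 0)
                errno = EIO;    /* short write: report as an I/O error */
            int saved = errno;
            close(fd);
            errno = saved;
        }
        if (rc != 0 && errno != EINVAL)
            break;              /* only EINVAL triggers the fallback */
    }

    free(staging);
    return rc;
}
```

The price is one extra memcpy per write. In exchange, the application
can change its own copy of the page whenever it likes, and the file is
only ever touched through one kind of I/O path per open.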