Re: Semantics of racy O_DIRECT writes

On Wed, Jan 08, 2025 at 01:33:07PM -0300, Travis Downs wrote:
> Hello linux-block,
> 
> We are experiencing data corruption in our storage intensive server
> application and are wondering about the semantics of "racy" O_DIRECT
> writes.
> 
> Normally we target XFS, but the question is a general one.
> 
> Specifically, imagine that we are writing a single 4K aligned page,
> with contents AB00 (each char being 1K bytes). We only care about
> the first 2048 bytes (the AB part). We are using libaio writes
> (io_submit) with O_DIRECT semantics. While the write is in flight,
> i.e., after we have submitted it and before we reap it in
> io_getevents, the userspace application writes into the second half
> of the page, changing it to ABCD (let's say via memcpy). The first
> half is not changed.
> 
> The question then is: is this safe in the sense that it would result
> in ABxx being written, where xx is "don't care"? Or could it do something
> crazier, like cause later writes to be ignored (e.g. if something in
> the kernel storage layer hashes the page for some purpose and
> this hash is out of sync with the page at the time it was captured, or
> something like that).
> 
> Of course, the easy answer is "don't do that", but I still want to
> know what happens if we do.

Don't do that.  Really.

First of all, your program might need to run on OS's other than Linux,
such as legacy Unix systems, Mac OS X, etc., and officially, there are
zero guarantees about cache coherency between O_DIRECT writes and the
page cache.  So if you mix O_DIRECT I/O with buffered I/O or mmap
access to the same file... all bets are off.

By definition O_DIRECT I/O bypasses the page cache, so if there is a
copy of the file's data block in the page cache, for some
implementations of some OS's the page cache might contain the previous
stale version of the block, so buffered reads might not see the
updated data from the O_DIRECT write.  And if the page is mmap'ed into
some process's address space, and the process dirties that page, that
page will get written back to the disk, potentially overwriting the
O_DIRECT write.

Linux will make best efforts to maintain cache coherency between
O_DIRECT and the page cache.  It does that by writing out the page in
the page cache if it is dirty, and then evicting the page from the
page cache.  In practice this is good enough for a program like a
database that normally uses O_DIRECT for its transactions, but locks
the database so it can take a consistent snapshot and then does the
backup via buffered I/O.  That will work.  And if the database wasn't
locked while taking the backup, the result would be a complete mess,
and O_DIRECT vs. page cache coherency would be the *least* of your
worries.
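
If you really must hand a file over from buffered I/O to O_DIRECT, the
flush-and-evict step described above can also be done explicitly by
the application rather than left to the kernel's best efforts.  What
follows is only a minimal sketch in Python, not anything taken from
the kernel: the helper name buffered_then_direct and the 4096-byte
BLOCK size are assumptions made for illustration,
posix_fadvise(POSIX_FADV_DONTNEED) is purely advisory, and the code
falls back to a plain open() where O_DIRECT is missing or rejected.

```python
import mmap
import os
import tempfile

BLOCK = 4096  # assumed I/O alignment; real code should query the device


def buffered_then_direct(path):
    """Buffered write, explicit flush + cache drop, then an O_DIRECT overwrite."""
    # Phase 1: ordinary buffered write through the page cache.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, b"A" * BLOCK)
        os.fsync(fd)  # write the dirty page back ourselves
        if hasattr(os, "posix_fadvise"):
            # Advisory request to drop the (now clean) cached pages.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

    # Phase 2: direct I/O only from here on.
    try:
        fd = os.open(path, os.O_RDWR | getattr(os, "O_DIRECT", 0))
    except OSError:
        fd = os.open(path, os.O_RDWR)  # e.g. a filesystem that rejects O_DIRECT
    buf = mmap.mmap(-1, BLOCK)  # anonymous mappings are page-aligned
    try:
        buf[:] = b"B" * BLOCK
        os.pwrite(fd, buf, 0)
        buf[:] = bytes(BLOCK)  # clear, so the read below has to fetch the data
        os.lseek(fd, 0, os.SEEK_SET)
        os.readv(fd, [buf])
        return bytes(buf)
    finally:
        buf.close()
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = buffered_then_direct(os.path.join(tmp, "block"))
        print(data[:8])
```

The point of the two phases is that no buffered or mmap access
overlaps with the direct I/O; the sketch does not make mixing the two
at the same time any safer.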

But in general, don't mix buffered/mmap and O_DIRECT I/O to the same
file.  Just don't.  It might work, but remember that the raison d'etre
for O_DIRECT is performance in support of databases and storage
systems where developers Know What They Are Doing(tm) and Follow The
Rules(tm).  Linux's cache coherency is best efforts only (and other
OS's might not even bother), and database developers and sysadmins
would be sad if we compromised O_DIRECT performance just to make
things 100% safe for people who want to do things that break the
rules.

If you like breaking rules, don't use O_DIRECT.  You'll be happier for
it, as will hapless future users of your programs.  :-)
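
For the scenario in the original question, one way to follow the rules
is to never touch a buffer the kernel may still be reading: copy the
page into a private, aligned buffer at submission time and keep
mutating only the staging copy.  The sketch below is a hedged
illustration in Python rather than libaio code.  A thread pool stands
in for io_submit/io_getevents, and the names open_maybe_direct and
submit_snapshot, the PAGE size, and the use of '0' characters for the
padding are all assumptions made for the example.

```python
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

PAGE = 4096  # assumed page / alignment size


def open_maybe_direct(path):
    """Open with O_DIRECT where available, else fall back to a plain open."""
    flags = os.O_RDWR | os.O_CREAT
    try:
        return os.open(path, flags | getattr(os, "O_DIRECT", 0), 0o600)
    except OSError:
        return os.open(path, flags, 0o600)


def submit_snapshot(pool, fd, staging, offset):
    """Copy `staging` (PAGE bytes) into a private buffer and write that copy."""
    inflight = mmap.mmap(-1, PAGE)  # page-aligned, owned by this write only
    inflight[:] = bytes(staging)

    def do_write():
        try:
            return os.pwrite(fd, inflight, offset)
        finally:
            inflight.close()  # released only once the write has finished

    return pool.submit(do_write)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp, ThreadPoolExecutor(1) as pool:
        path = os.path.join(tmp, "page")
        fd = open_maybe_direct(path)
        staging = bytearray(b"A" * 1024 + b"B" * 1024 + b"0" * 2048)  # AB00
        pending = submit_snapshot(pool, fd, staging, 0)
        # Safe: this touches the staging copy, not the in-flight buffer.
        staging[2048:] = b"C" * 1024 + b"D" * 1024  # now ABCD
        print(pending.result())  # number of bytes written
        os.close(fd)
```

The price is one extra copy per write.  The other rule-abiding choice
is simply to wait for the completion before modifying the buffer.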

Remember, good programs are maintainable and portable.  What if
someone takes your program and tries to make it work on MacOS?

Cheers,

					- Ted

P.S.  I commend to you the ten commandments for C programmers,
especially the last one.  Remember, all the world's not Linux!
	   
      https://www.lysator.liu.se/c/ten-commandments.html



