Re: Semantics of racy O_DIRECT writes

On Thu, Jan 9, 2025 at 1:57 AM Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> Don't do that.  Really.
>
> First of all, your program might need to run on OS's other than Linux,
> such as Legacy Unix systems, Mac OS X, etc, and officially, there are
> zero guarantees about cache coherency between O_DIRECT writes and the
> page cache.  So if you use O_DIRECT I/O and buffered I/O or mmap
> access to a file.... all bets are off.

Thanks Theodore for your comprehensive reply. I probably was not very
clear in the way I posed my question. To clarify:

 - There is only one process involved here making all the writes
 - We do only O_DIRECT reads and writes, so I don't expect the page
   cache to be involved in the usual case (but we can't exclude it entirely).
 - So the question is largely about the possible outcomes of doing a
   zero-copy O_DIRECT write (where the block driver will ultimately be
   reading directly from the pages allocated by the userspace application
   and passed to the kernel) in the situation where a portion of the passed
   pages is modified in a racy way by the userspace application by a
   subsequent O_DIRECT write.
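
To make the scenario concrete, here is a minimal sketch of the kind of
race I mean. It is an illustration under stated assumptions, not our
actual code: the function name racy_direct_write, the 4096-byte block
size and the 'A'/'B' fill patterns are all made up for the example, and
it uses the raw Linux AIO syscalls so that it needs no extra library.
It submits one O_DIRECT write, modifies the second half of the buffer
while the write may still be in flight, and then counts which bytes
ended up on disk.

```c
#define _GNU_SOURCE               /* needed for O_DIRECT */
#include <assert.h>
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define BLOCK 4096                /* assumed to satisfy O_DIRECT alignment */

/* Hypothetical demo. Returns 0 and fills the counters on success, or -1
 * if the environment cannot run it (no O_DIRECT, no AIO, short I/O). */
int racy_direct_write(const char *path, size_t *old_bytes, size_t *new_bytes)
{
    unsigned char *buf = NULL, *check = NULL;
    aio_context_t ctx = 0;
    struct iocb cb;
    struct iocb *list[1] = { &cb };
    struct io_event ev;
    int fd = -1, rc = -1;

    assert(path && old_bytes && new_bytes);
    if (posix_memalign((void **)&buf, BLOCK, BLOCK) != 0) buf = NULL;
    if (posix_memalign((void **)&check, BLOCK, BLOCK) != 0) check = NULL;
    if (!buf || !check) goto out;

    fd = open(path, O_CREAT | O_TRUNC | O_RDWR | O_DIRECT, 0600);
    if (fd < 0) goto out;
    if (syscall(SYS_io_setup, 1, &ctx) < 0) { ctx = 0; goto out; }

    /* Submit a write of a block that is entirely 'A'. */
    memset(buf, 'A', BLOCK);
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_lio_opcode = IOCB_CMD_PWRITE;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = BLOCK;
    cb.aio_offset = 0;
    if (syscall(SYS_io_submit, ctx, 1, list) != 1) goto out;

    /* The race: touch the second half before reaping the completion.
     * If io_submit happened to complete synchronously, nothing races. */
    memset(buf + BLOCK / 2, 'B', BLOCK / 2);

    if (syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) != 1) goto out;
    if (ev.res != BLOCK) goto out;

    /* Read back what actually reached the file and classify each byte. */
    if (pread(fd, check, BLOCK, 0) != BLOCK) goto out;
    *old_bytes = *new_bytes = 0;
    for (size_t i = 0; i < BLOCK; i++) {
        if (check[i] == 'A') (*old_bytes)++;
        else if (check[i] == 'B') (*new_bytes)++;
    }
    rc = 0;
out:
    if (ctx) syscall(SYS_io_destroy, ctx);  /* waits for in-flight I/O */
    if (fd >= 0) { close(fd); unlink(path); }
    free(buf);
    free(check);
    return rc;
}
```

The first half of the block is never modified, so it should always come
back as 'A'. What I am asking about is the second half: the sketch only
observes which mix of old and new bytes landed on disk in one run, and
says nothing about what the kernel guarantees.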

> By definition O_DIRECT I/O bypasses the page cache, so if there is a
> copy of the file's data block in the page cache, for some
> implementations of some OS's the page cache might contain the previous
> stale version of the block, so buffered reads might not have the updated
> copy reflected by the O_DIRECT write.  And if the page is mmap'ed into
> some process's address space, and the process dirties that page, that
> page will get written back to the disk, potentially overwriting the
> O_DIRECT write.
>
> Linux will make best efforts to maintain cache coherency between
> O_DIRECT and the page cache.  It does that by writing out the page in
> the page cache if it is dirty, and then evicting the page from the
> page cache.  In practice this will be good enough to keep programs
> like a database which locks the database so it can take a consistent
> snapshot, and then does the backup via buffered write, when the
> database normally uses O_DIRECT for its transactions, it will work ---
> since if the database wasn't locked while taking the backup, it would
> be completely a mess, and the O_DIRECT vs page cache coherency is the
> *least* of your worries.

Note that we run only on Linux and are heavily tied to the details of Linux
AIO and io_uring, so a "Linux-only" response is fine. I am quite sure that
after an O_DIRECT write completes, a subsequent read through any
Linux API is going to return the newly written value, not a stale value from
the page cache.
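
As a rough illustration of that expectation (a sketch only, not a proof
of any guarantee), the following first populates the page cache with old
contents, then overwrites the block with an O_DIRECT write and reads it
back through a plain buffered descriptor. The function name
direct_write_then_buffered_read and the 4096-byte block size are
assumptions made up for the example.

```c
#define _GNU_SOURCE               /* needed for O_DIRECT */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096                /* assumed to satisfy O_DIRECT alignment */

/* Hypothetical demo. Returns 1 if the buffered read saw the new data,
 * 0 if it saw anything else, -1 if the environment cannot run it. */
int direct_write_then_buffered_read(const char *path)
{
    unsigned char *wbuf = NULL;
    unsigned char rbuf[BLOCK];
    int dfd = -1, bfd = -1, rc = -1;

    assert(path != NULL);
    if (posix_memalign((void **)&wbuf, BLOCK, BLOCK) != 0) return -1;

    dfd = open(path, O_CREAT | O_TRUNC | O_WRONLY | O_DIRECT, 0600);
    if (dfd < 0) goto out;
    bfd = open(path, O_RDONLY);   /* ordinary buffered descriptor */
    if (bfd < 0) goto out;

    /* Write 'A' directly, then read it buffered so that the page cache
     * holds a copy of the old contents. */
    memset(wbuf, 'A', BLOCK);
    if (pwrite(dfd, wbuf, BLOCK, 0) != BLOCK) goto out;
    if (pread(bfd, rbuf, BLOCK, 0) != BLOCK) goto out;

    /* Overwrite with 'B' via O_DIRECT and read again through the cache. */
    memset(wbuf, 'B', BLOCK);
    if (pwrite(dfd, wbuf, BLOCK, 0) != BLOCK) goto out;
    if (pread(bfd, rbuf, BLOCK, 0) != BLOCK) goto out;

    rc = (memcmp(rbuf, wbuf, BLOCK) == 0);
out:
    if (bfd >= 0) close(bfd);
    if (dfd >= 0) { close(dfd); unlink(path); }
    free(wbuf);
    return rc;
}
```

A return value of 1 is what I would expect to see on Linux; a single run
of course only shows what happened on that kernel and filesystem.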

>
> But in general, don't mix buffered/mmap and O_DIRECT I/O to the same
> file.  Just don't.  It might work, but remember that the raison d'etre for
> O_DIRECT is performance in support of databases and storage systems
> where developers Know What They Are Doing(tm) and Follow The
> Rules(tm).  Linux's cache coherency is best efforts only (and other
> OS's might not even bother), and database developers and sysadmins
> would be sad if we compromised O_DIRECT performance just to make things
> 100% safe for people who want to do things which are breaking the rules.

This is us: we know what we are doing and are writing a database-like
product. We are heavy users of AIO, and in fact many of the discussions
of AIO and O_DIRECT behavior here on the LKML and elsewhere are
driven by users of the same framework we use (Seastar), so you can
consider us expert users from that point of view.

>
> If you like breaking rules, don't use O_DIRECT.  You'll be happier for
> it, as will hapless future users of your programs.  :-)
>
> Remember, good programs are maintainable and portable.  What if someone
> attempts to take your program and tries to make it work on MacOS?

Fair enough. In our case, we are writing a high-performance, clustered event
store (Redpanda), which is a piece of infrastructure with very little demand
to run on anything other than Linux, except for "dev" scenarios, where
emulation is suitable. We make heavy use of AIO (and, later, io_uring) and
tune for specific kernel features like RWF_NOWAIT, etc.

Thanks again,
Travis




