Re: RFC: Clarifying Direct I/O Semantics

Lawrence Greenfield <leg@xxxxxxxxxx> · Sat, 22 Aug 2009 09:25:20 -0400

On Fri, Aug 21, 2009 at 8:07 PM, Theodore Tso<tytso@xxxxxxx> wrote:
> On Fri, Aug 21, 2009 at 06:28:53PM -0400, jim owens wrote:
>>> The Linux man page does not state what happens if the alignment
>>> restrictions are not met; does the kernel start running rogue or
>>> nethack; does it send a signal such as SIGSEGV or SIGABRT, and kill the
>>> running process; or does it fall back to buffered I/O? Today, the answer
>>> is the latter; but it's not specified anywhere.
>>
>> retval = -EINVAL; is what __blockdev_direct_IO does in that case
>> and what I was making btrfs directIO do.  but fall back is OK too
>> if we really want. what existing code fixes up the EINVAL?
>
> You're right; I thought it did the fallback in all cases, but it only
> does it when writing into holes.  Oops.  I should have tested this
> before saying it.
>
> I'll fix up the wiki page.

I think failing when O_DIRECT can't be honored is the right thing.
Applications can't verify O_DIRECT behavior themselves, so it's
important for the kernel to tell an application when it can't do what
the application is asking for.
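
To make "fail, don't fall back" concrete from the application side,
here is a minimal sketch in C. The alignment value is an assumption the
caller has to supply (512 is common, but the real requirement depends on
the filesystem and device), and dio_is_aligned and dio_pwrite_strict are
hypothetical helper names, not existing APIs. The wrapper rejects
misaligned requests up front and reports any pwrite() error as a
negative errno value instead of retrying through the buffer cache.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: nonzero if the buffer address, the length and the
 * file offset are all multiples of align. align must be a power of two;
 * the correct value for a given file is an assumption left to the caller. */
static int dio_is_aligned(const void *buf, size_t len, off_t off, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)buf & (align - 1)) == 0 &&
           (len & (align - 1)) == 0 &&
           ((uint64_t)off & (align - 1)) == 0;
}

/* Hypothetical wrapper around pwrite() for an fd opened with O_DIRECT.
 * Returns the byte count on success or a negative errno value on failure.
 * It never retries the write as buffered I/O. */
static ssize_t dio_pwrite_strict(int fd, const void *buf, size_t len,
                                 off_t off, size_t align)
{
    ssize_t n;

    if (!dio_is_aligned(buf, len, off, align))
        return -EINVAL;            /* same error the kernel path reports */

    n = pwrite(fd, buf, len, off);
    return n < 0 ? -(ssize_t)errno : n;
}
```

A caller would typically pass a buffer obtained from posix_memalign()
and treat -EINVAL as a hard failure rather than reopening the file
without O_DIRECT.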

>
>>> This is relatively well understood by most implementors and users of
>>> O_DIRECT as part of the "oral lore", so simply updating the Linux man
>>> page should not be controversial.
>>>
>>
>> The following section includes "sparse" AKA "allocating" writes but
>> just says "extending".  Either sparse-filling writes need to be covered
>> separately or we should say "allocating" instead of "extending".
>
> Yup, good point.
>
>> Possibly it should just be stated that directIO write data integrity
>> is based on the setting of posix O_SYNC and O_DSYNC.  Then it is their
>> choice to run slow-and-safe or fast.  O_SYNC requires metadata on disk.
>
> The question in my mind is whether we should guarantee that the data
> block is written synchronously for allocating writes when the file
> metadata is not written synchronously; what's the point?  After all,
> the application can't distinguish between the data block not making it
> out to disk, versus the metadata that will allow the data block to be
> accessed after a crash, why should one be synchronous but not the
> other?

O_DIRECT is about avoiding polluting the buffer cache, not only about
data integrity. If an application wants allocating writes to have a
data integrity guarantee, it can open the file with O_DIRECT|O_DSYNC,
at the cost that a write it expects to take one disk seek may end up
taking two (or more). But please don't fall back to putting the data
into the buffer cache!
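
As a rough illustration of that choice, the sketch below builds the
open() flags for the two modes: O_DIRECT alone for cache avoidance, and
O_DIRECT|O_DSYNC when the application also wants the integrity
guarantee. The names dio_open_flags and dio_open are hypothetical, and
opening write-only is just an assumption to keep the example short.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

/* Hypothetical helper: O_DIRECT always, plus O_DSYNC when the caller
 * wants each write to carry a data integrity guarantee. */
static int dio_open_flags(int want_integrity)
{
    int flags = O_WRONLY | O_DIRECT;

    if (want_integrity)
        flags |= O_DSYNC;
    return flags;
}

/* Hypothetical helper: open an existing file for direct I/O.
 * Returns the fd on success or a negative errno value on failure.
 * There is deliberately no retry without O_DIRECT. */
static int dio_open(const char *path, int want_integrity)
{
    int fd;

    assert(path != NULL);
    fd = open(path, dio_open_flags(want_integrity));
    return fd < 0 ? -errno : fd;
}
```

If the filesystem rejects O_DIRECT, dio_open() hands the error back to
the caller, which can then decide what to do instead of silently
getting buffered I/O.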

I think it would be useful to spell out for applications what they
need to do for O_DIRECT writes to be guaranteed visible after a crash.
As a naive application writer, I would have thought using
posix_fallocate would have been "good enough". If I understand
correctly, an application that wants its O_DIRECT writes both to avoid
the buffer cache and to be visible after a crash must guarantee that it
has previously written to those blocks, either with O_DSYNC set or
followed by fdatasync() on the file. All subsequent writes to those
blocks can then be done with only O_DIRECT.

That means that a database must explicitly initialize its files by
writing 0s: it can't rely on posix_fallocate. (Amusingly, it would
have worked before fallocate() was introduced into the kernel!)
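
For what it's worth, the initialization step could look roughly like
the sketch below: write zeros over the whole range, then call
fdatasync() once. prezero_file is a hypothetical name, the chunk size is
left to the caller, and a short write is treated as an error to keep
offsets aligned. Whether this is sufficient on a given filesystem is my
reading of the discussion above, not something the code verifies.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write zeros over [0, size) in chunk-sized pieces
 * and flush them with fdatasync(). chunk must be a power of two and size
 * a multiple of chunk. Returns 0 on success or a negative errno value. */
static int prezero_file(int fd, off_t size, size_t chunk)
{
    void *buf = NULL;
    off_t off = 0;
    int rc;

    assert(chunk >= sizeof(void *) && (chunk & (chunk - 1)) == 0);
    assert(size >= 0 && size % (off_t)chunk == 0);

    /* Aligned buffer, so the same code also works on an O_DIRECT fd. */
    rc = posix_memalign(&buf, chunk, chunk);
    if (rc != 0)
        return -rc;
    memset(buf, 0, chunk);

    while (off < size) {
        ssize_t n = pwrite(fd, buf, chunk, off);

        if (n < 0 && errno == EINTR)
            continue;              /* interrupted, retry the same chunk */
        if (n < 0) {
            rc = -errno;
            goto out;
        }
        if ((size_t)n != chunk) {  /* short write: give up, stay aligned */
            rc = -EIO;
            goto out;
        }
        off += n;
    }

    /* Make the allocated, zeroed blocks durable before relying on them. */
    if (fdatasync(fd) < 0)
        rc = -errno;
out:
    free(buf);
    return rc;
}
```

After prezero_file() returns 0, later overwrites of that range are no
longer allocating writes, which is the case the paragraph above
describes.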

Larry

>
>                                                - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>