On Fri, Aug 21, 2009 at 8:07 PM, Theodore Tso <tytso@xxxxxxx> wrote:
> On Fri, Aug 21, 2009 at 06:28:53PM -0400, jim owens wrote:
>>> The Linux man page does not state what happens if the alignment
>>> restrictions are not met; does the kernel start running rogue or
>>> nethack; does it send a signal such as SIGSEGV or SIGABORT, and kill
>>> the running process; or does it fall back to buffered I/O?  Today,
>>> the answer is the latter; but it's not specified anywhere.
>>
>> retval = -EINVAL; is what __blockdev_direct_IO does in that case
>> and what I was making btrfs directIO do.  But fallback is OK too,
>> if we really want it.  What existing code fixes up the EINVAL?
>
> You're right; I thought it did the fallback in all cases, but it only
> does it when writing into holes.  Oops.  I should have tested this
> before saying it.
>
> I'll fix up the wiki page.

I think failing when O_DIRECT can't be honored is the right thing.
Applications can't verify O_DIRECT behavior, so it's important to tell
an application that the kernel can't do what it's asking for.

>>> This is relatively well understood by most implementors and users of
>>> O_DIRECT as part of the "oral lore", so simply updating the Linux man
>>> page should not be controversial.
>>
>> The following section includes "sparse" (AKA "allocating") writes but
>> just says "extending".  Either sparse-filling writes need to be
>> covered separately, or we should say "allocating" instead of
>> "extending".
>
> Yup, good point.
>
>> Possibly it should just be stated that directIO write data integrity
>> is based on the setting of POSIX O_SYNC and O_DSYNC.  Then it is their
>> choice to run slow-and-safe or fast.  O_SYNC requires metadata on
>> disk.
>
> The question in my mind is whether we should guarantee that the data
> block is written synchronously for allocating writes when the file
> metadata is not written synchronously; what's the point?  After all,
> the application can't distinguish between the data block not making it
> out to disk, versus the metadata that will allow the data block to be
> accessed after a crash; why should one be synchronous but not the
> other?

O_DIRECT is about avoiding polluting the buffer cache, not only about
data integrity.  If an application wants allocating writes to have a
data integrity guarantee, it can open the file O_DIRECT|O_DSYNC, at the
cost that writes it thinks might be one disk seek end up being two (or
more).  But please don't fall back to putting the data into the buffer
cache!

I think it would be useful to be explicit to applications about what
they need to do for O_DIRECT writes to be guaranteed to be visible
after a crash.  As a naive application writer, I would have thought
using posix_fallocate() would have been "good enough".

If I understand correctly, an application that wants to know that
O_DIRECT writes will both avoid the buffer cache and be visible after a
crash must guarantee that it has previously written to those blocks
either with O_DSYNC set or with fdatasync() called on the file after
such writes.  All subsequent writes can then be done with only
O_DIRECT.  That means that a database must explicitly initialize its
files by writing zeros: it can't rely on posix_fallocate().
(Amusingly, posix_fallocate() would have been good enough before
fallocate() was introduced into the kernel!)
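For what it's worth, here is a minimal sketch (mine, not from the
thread) of the alignment rules under discussion: buffer address, file
offset, and length all aligned, with 4096 assumed as the logical block
size and "testfile" as a placeholder path.  Break the alignment and, as
Jim says, __blockdev_direct_IO hands back EINVAL, except in the
write-into-holes case where it currently falls back to buffered I/O.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            const size_t blksize = 4096;  /* assumed; really the device's
                                             logical block size */
            void *buf;
            int fd;

            if (posix_memalign(&buf, blksize, blksize))
                    return 1;
            memset(buf, 0xab, blksize);

            fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
            if (fd < 0) {
                    perror("open(O_DIRECT)");
                    return 1;
            }

            /* buffer, offset, and length are all multiples of blksize;
             * a misaligned buffer or offset typically gets EINVAL */
            if (pwrite(fd, buf, blksize, 0) < 0)
                    perror("pwrite");

            close(fd);
            free(buf);
            return 0;
    }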
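If an application does want the per-write guarantee for allocating
writes, my understanding of the O_DIRECT|O_DSYNC suggestion above looks
roughly like this (file name and block size are placeholders); the
price is the extra seek or seeks per write mentioned earlier:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            const size_t blksize = 4096;  /* assumed device block size */
            void *buf;
            int fd;

            if (posix_memalign(&buf, blksize, blksize))
                    return 1;
            memset(buf, 0, blksize);

            /* O_DSYNC: each write returns only after the data, and the
             * metadata needed to retrieve it, reach stable storage.  For
             * an allocating write that is the "two seeks (or more)" cost. */
            fd = open("dbfile", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC,
                      0644);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            if (pwrite(fd, buf, blksize, 0) != (ssize_t)blksize)
                    perror("pwrite");

            close(fd);
            free(buf);
            return 0;
    }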
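And the initialization requirement in the last paragraph, sketched out
as I understand it (the helper name, path, size, and block size are
mine): write real zeros and fdatasync() once, instead of calling
posix_fallocate(), so that later plain O_DIRECT writes to those blocks
can be expected to survive a crash.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical helper: pre-zero a file so the blocks are really
     * allocated on disk before the database relies on O_DIRECT alone. */
    static int prezero_file(const char *path, off_t size)
    {
            const size_t blksize = 4096;  /* assumed device block size */
            void *zeros;
            off_t off;
            int fd;

            if (posix_memalign(&zeros, blksize, blksize))
                    return -1;
            memset(zeros, 0, blksize);

            fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
            if (fd < 0) {
                    free(zeros);
                    return -1;
            }

            for (off = 0; off < size; off += blksize)
                    if (pwrite(fd, zeros, blksize, off) != (ssize_t)blksize)
                            goto fail;

            /* One fdatasync() to push out the allocating metadata; the
             * alternative is opening with O_DSYNC for this phase. */
            if (fdatasync(fd))
                    goto fail;

            free(zeros);
            return close(fd);
    fail:
            close(fd);
            free(zeros);
            return -1;
    }

    int main(void)
    {
            return prezero_file("dbfile", 1 << 20) ? 1 : 0;
    }

After that one-time initialization, the database's normal writes can
use just O_DIRECT and still expect the data to be visible after a
crash, per the paragraph above.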
Larry