On Sat, Aug 22, 2009 at 09:25:20AM -0400, Lawrence Greenfield wrote:
> > The question in my mind is whether we should guarantee that the data
> > block is written synchronously for allocating writes when the file
> > metadata is not written synchronously; what's the point? After all,
> > the application can't distinguish between the data block not making it
> > out to disk, versus the metadata that will allow the data block to be
> > accessed after a crash, why should one be synchronous but not the
> > other?
>
> O_DIRECT is about avoiding polluting the buffer cache, not only about
> data integrity. If an application wants allocating writes to have a
> data integrity guarantee, they can open the file O_DIRECT|O_DSYNC, at
> the cost that writes they think might be one disk seek end up being 2
> (or more). But please don't fall back to putting the data into the
> buffer cache!

Well, it really depends on who you talk to. This goes back to the
problem that O_DIRECT's goals and semantics aren't well defined. I
find it really hard to believe that the main point is to avoid
polluting the page/buffer cache. If that were true, then fadvise's
FADV_NOREUSE would be sufficient, with much simpler semantics to
implement than O_DIRECT's rather baroque restrictions and
requirements.

For the enterprise database folks (who were the ones who originally
asked the Solaris, AIX, and Irix OS's of the world for this feature)
it was always about performance/speed; they wanted to avoid copying
data in and out of the buffer/page cache for speed reasons. But if
you need to take time out to manipulate allocation data structures,
the disk reads/writes are in the noise compared to the memory copy in
and out of the buffer cache.

> I think it would be useful to be explicit to applications what they
> need to do for O_DIRECT writes to be guaranteed to be visible after a
> crash. As a naive application writer, I would have thought using
> posix_fallocate would have been "good enough".
> If I understand correctly, an application that wants to know that
> O_DIRECT writes will both avoid the buffer cache and be visible after
> a crash must guarantee that it has previously written to those blocks
> either with O_DSYNC or has used fdatasync() on the file after such
> writes. All subsequent writes can be done with only O_DIRECT.
>
> That means that a database must explicitly initialize its files by
> writing 0s: it can't rely on posix_fallocate. (Amusingly, it would
> have worked before fallocate() was introduced into the kernel!)

Well, all a database needs to do is use fdatasync() after an
application-level commit. If there haven't been any metadata changes,
the fdatasync() is cheap. If the application is keeping track of when
it might be doing an allocating write() and when it isn't, it can try
to work out when it can omit the fdatasync() call.

                                                - Ted
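
For illustration, the "fdatasync() after an application-level commit"
approach described above might look roughly like the C sketch below.
The helper name commit_block() and the 4096-byte BLOCK_SIZE are
assumptions made for this example, not anything from the thread; real
code would query the device for its alignment requirements. The
fallback to buffered I/O when open() rejects O_DIRECT is only there so
the sketch runs on filesystems without O_DIRECT support.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096         /* assumed alignment, for illustration only */

/*
 * Hypothetical helper: write one block with O_DIRECT, then call
 * fdatasync() so that any metadata needed to find the block after a
 * crash (e.g. from an allocating write) is also on disk.
 * Returns 0 on success, -1 on failure.
 */
int commit_block(const char *path, off_t block_no, const void *data, size_t len)
{
    void *buf = NULL;
    int fd, rc = -1;

    if (len > BLOCK_SIZE) {
        errno = EINVAL;
        return -1;
    }

    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0 && errno == EINVAL)
        /* Sketch-only convenience: retry without O_DIRECT. */
        fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    /* O_DIRECT wants an aligned buffer, transfer length and file offset. */
    if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0) {
        buf = NULL;
        goto out;
    }
    memset(buf, 0, BLOCK_SIZE);
    memcpy(buf, data, len);

    if (pwrite(fd, buf, BLOCK_SIZE, block_no * (off_t)BLOCK_SIZE) != BLOCK_SIZE)
        goto out;

    /* Commit point: cheap if the write did not change any metadata. */
    if (fdatasync(fd) != 0)
        goto out;

    rc = 0;
out:
    free(buf);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A caller would invoke something like commit_block("db.dat", 7, rec,
rec_len) at each commit. An application that tracks which blocks it
has already written and synced could skip the fdatasync() for
non-allocating overwrites, as suggested above.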