Sorry for the late reply.

On Wed, Mar 20, 2013 at 10:45:23AM -0400, Theodore Ts'o wrote:
> On Wed, Mar 20, 2013 at 09:14:42AM -0500, Eric Sandeen wrote:
> >
> > As an aside, is there any reason to have "dioread_nolock" as an option
> > at this point?  If it works now, would you ever *not* want it?
> >
> > (granted it doesn't work with some journaling options etc, but that
> > behavior could be automatic, w/o the need for special mount options).
>
> The primary restriction is that diread_nolock doesn't work when fs
> block size != page size.  If your proposal is that we automatically
> enable diread_nolock when we can use it safely, that's definitely
> something to consider for the next merge window.

Yes, I also think we can automatically enable dioread_nolock because it
brings us some benefits.

BTW, I think there is a minor improvement for the dio overwrite codepath
with indirect-based files.  We don't need to take i_mutex in this case,
just as we already do for extent-based files.  A user who mounts an
ext2/3 file system with the ext4 kernel module could then get lower
latency.  But it seems that it would break the dio semantics of ext2/3.
Currently in ext2/3, if we issue an overwrite dio and then issue a read
dio, the read always returns the latest data because it waits on
i_mutex.  After parallelizing overwrite dio, this guarantee might no
longer hold.  I re-read the doc, but it doesn't seem to describe this
case.  Do we need to keep this semantic?
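
A minimal userspace sketch of the case in question (an illustration
only, not a kernel patch).  It overwrites an already-allocated block
with direct I/O, reads it back with direct I/O, and reports whether the
read returned the new data.  The function name dio_overwrite_then_read,
the 4096-byte block size and the fallback to buffered I/O when O_DIRECT
is rejected are assumptions made for this example.  Issued from a
single thread like this, the read always follows the completed write;
the open question above concerns a second thread issuing the read while
the overwrite is still in flight.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux */
#include <assert.h>             /* for callers that assert on the result */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096                /* assumed block size and DIO alignment */

/*
 * Hypothetical helper: returns 1 if the read saw the overwritten data,
 * 0 if it saw stale data, -1 on any setup or I/O error.
 */
int dio_overwrite_then_read(const char *path)
{
    void *mem = NULL;
    char *buf;
    int fd, seen_new = -1;

    /* Direct I/O needs an aligned buffer. */
    if (posix_memalign(&mem, BLK, BLK) != 0)
        return -1;
    buf = mem;

    /* Allocate the block first so the later write is a pure overwrite. */
    fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0)
        goto out_free;
    memset(buf, 'A', BLK);
    if (write(fd, buf, BLK) != BLK || fsync(fd) != 0) {
        close(fd);
        goto out_unlink;
    }
    close(fd);

    /* Reopen for direct I/O; fall back if the fs rejects O_DIRECT. */
    fd = open(path, O_RDWR | O_DIRECT);
    if (fd < 0 && errno == EINVAL)
        fd = open(path, O_RDWR);
    if (fd < 0)
        goto out_unlink;

    /* Overwrite dio: no block allocation, no i_size change. */
    memset(buf, 'B', BLK);
    if (pwrite(fd, buf, BLK, 0) != BLK)
        goto out_close;

    /* Read dio issued after the overwrite has returned. */
    memset(buf, 0, BLK);
    if (pread(fd, buf, BLK, 0) != BLK)
        goto out_close;

    seen_new = (buf[0] == 'B' && buf[BLK - 1] == 'B');

out_close:
    close(fd);
out_unlink:
    unlink(path);
out_free:
    free(mem);
    return seen_new;
}
```

Calling dio_overwrite_then_read() on a scratch file and checking for a
return value of 1 covers the sequential case.  Exercising the race
would need the pread() moved into a second thread that starts while the
pwrite() is still in progress.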
>
> My long range plan/hope is that we eventually be able to use the
> extent status tree so that we do allocating writes, we first (a)
> allocate the blocks, and mark them as in use as far as the mballoc
> data structures are concerned, but we do _not_ mark them as in use in
> the on-disk allocation bitmaps, then (b) we write the data blocks, and
> then triggered by the block I/O completion, (c) in a single journal
> trnasaction, we update the allocation bitmaps, update the inode's
> extent tree, and update the inode's i_size field.
>
> This is different from the dioread_nolock approach in that we're not
> initially inserting the blocks in the extent tree as uninitialized,
> and then convert the extent tree entries from uninit to init after the
> I/O completion.

Yes, this approach is better.  I am happy to work on this.

Regards,
                                                - Zheng