On Aug 24, 2009  18:07 -0400, Theodore Ts'o wrote:
> What ext3 and ext4 does by default is this:
>
> 1)  Write data blocks required by data=ordered mode (if any)
>
> 2)  Write the journal blocks
>
> 3)  Wait for the journal blocks to be sent to disk.  (We don't actually
>     do a barrier operation), so this just means the blocks have been
>     sent to the disk, not necessarily that they are forced to a platter.

Hmm, I think you are missing a step here.  In both jbd and jbd2 there
is a wait for these buffers to hit the disk.  In the jbd case it is at
"commit phase 2", and in jbd2 it is at "wait_for_iobuf".

> 4)  Write the commit block, with the barrier flag set.
>
> 5)  Wait for the commit block.
>
> -----
>
> What the current async commit code does is this:
>
> 1)  Write data blocks required by data=ordered mode (if any)
>
> 2)  Write the journal blocks
>
> 3)  Write the commit block, without a barrier.
>
> 4)  Wait for the journal blocks to be sent to disk.
>
> 5)  Wait for the commit block (since a barrier is requested, this is
>     just when it was sent to the disk, not when it is actually committed
>     to stable store).

Similarly, in the async case, all of the data blocks and the commit
block are waited on, AFAICS.  It's just that with async_commit the
commit block is submitted with the data blocks, and in case of a crash
the transaction checksum is needed to determine if the commit block is
valid or not.

> What I think we can do safely in ext4 is this:
>
> 1)  Write data blocks required by data=ordered mode (if any)
>
> 2)  Write the journal blocks
>
> 3)  Write the commit block, WITH a barrier requested.
>
> 4)  Wait for the commit block to be completed.
>
> 5)  Wait for the journal blocks to be sent to disk.  #4 implies that
>     all of the journal block I/O will have been completed, so this is just
>     to collect the commit completion status; we should actually block
>     during step #5, assuming the block layer's barrier operation was
>     implemented correctly.
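To illustrate the async_commit remark above (that after a crash the transaction checksum decides whether a commit block is valid), here is a minimal sketch. It is a toy model, not the jbd2 implementation: `make_commit_block`, `commit_is_valid`, and the dictionary layout are hypothetical names chosen for illustration, and `zlib.crc32` merely stands in for whatever checksum the journal actually uses.

```python
import zlib


def checksum(blocks):
    """Running CRC32 over all journal blocks of one transaction."""
    crc = 0
    for blk in blocks:
        crc = zlib.crc32(blk, crc)
    return crc


def make_commit_block(tid, journal_blocks):
    """Hypothetical commit record: transaction id, block count, checksum."""
    return {"tid": tid, "nblocks": len(journal_blocks),
            "crc": checksum(journal_blocks)}


def commit_is_valid(commit, blocks_on_disk):
    """Recovery-time check: replay only if the journal blocks found on
    disk match the checksum stored in the commit block."""
    if len(blocks_on_disk) != commit["nblocks"]:
        return False
    return checksum(blocks_on_disk) == commit["crc"]


blocks = [b"metadata-block-A" * 4, b"metadata-block-B" * 4]
commit = make_commit_block(tid=42, journal_blocks=blocks)

# All journal blocks reached the disk before the crash.
intact = commit_is_valid(commit, blocks)

# The commit block reached the disk, but the second journal block was
# torn: its last byte still holds stale contents.
torn_block = blocks[1][:-1] + b"\x00"
torn = commit_is_valid(commit, [blocks[0], torn_block])

print(intact, torn)
```

In this model the commit block can be submitted together with the journal blocks because recovery does not trust it on its own: a commit record whose checksum does not match the blocks on disk is treated as if the transaction never committed.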
Since a barrier is a painful operation, it is better to just wait
explicitly on the completion of the various blocks as needed (i.e.
journal data + commit block).  That avoids the huge wait on many other
blocks that may have been sent to disk unrelated to the journal itself,
if the journal is on the same device as the filesystem.

> This should save us a little bit, since it implies the commit record
> will be sent to disk in the same I/O request to the storage device as
> the other journal blocks, which is _not_ currently the case today.

Are you _really_ sure that isn't what is done today?  My reading of the
code is different, but it's of course possible that I'm seeing what I
want to see (which is how it was originally designed) and not what is
really there.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
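As a rough illustration of the point above about waiting explicitly on journal blocks instead of issuing a barrier, the following toy model counts which queued writes each approach has to wait for when the journal shares a device with the filesystem. Everything here is an assumption made for the sketch: the `Write` class, the queue contents, and both wait functions are hypothetical, and the model says nothing about whether completed writes have actually reached stable storage.

```python
from dataclasses import dataclass


@dataclass
class Write:
    name: str
    journal: bool       # True for journal data blocks and the commit block
    done: bool = False


def wait_with_barrier(queue):
    """Toy barrier: every write already queued on the device must finish."""
    waited = [w.name for w in queue if not w.done]
    for w in queue:
        w.done = True
    return waited


def wait_explicit(queue):
    """Wait only on the writes that belong to the transaction."""
    waited = [w.name for w in queue if w.journal and not w.done]
    for w in queue:
        if w.journal:
            w.done = True
    return waited


def make_queue():
    # Journal writes interleaved with unrelated file data on one device.
    return [
        Write("file-data-1", journal=False),
        Write("journal-1", journal=True),
        Write("file-data-2", journal=False),
        Write("journal-2", journal=True),
        Write("commit", journal=True),
    ]


barrier_waits = wait_with_barrier(make_queue())
explicit_waits = wait_explicit(make_queue())

print(len(barrier_waits), "writes waited on with a barrier")
print(len(explicit_waits), "writes waited on explicitly:", explicit_waits)
```

The gap between the two counts grows with the amount of unrelated I/O in flight, which is the cost being argued against; the sketch does not capture the ordering guarantee that a real barrier provides.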