tytso@xxxxxxx writes:

> On Mon, Mar 22, 2010 at 05:04:19PM +0300, Dmitry Monakhov wrote:
>> tytso@xxxxxxx writes:
>>
>> > On Fri, Mar 12, 2010 at 08:26:49PM +0300, Dmitry Monakhov wrote:
>> >>  start_journal_io:
>> >> +        if (bufs)
>> >> +                commit_transaction->t_flushed_data_blocks = 1;
>> >> +
>> >
>> > I'm not convinced this is right.
>> >
>> > From your test case, the problem isn't because we have journaled
>> > metadata blocks (which is what bufs counts), but because fsync()
>> > depends on data blocks also getting flushed out to disk.
>> >
>> > However, if we aren't closing the transaction because of fsync(), I
>> > don't think we need to do a barrier in the case of an external
>> > journal.  So instead of effectively unconditionally setting
>> > t_flushed_data_blocks (since bufs is nearly always going to be
>> > non-zero), I think the better fix is to test whether the journal
>> > device != the fs data device in fsync(), and if so, start the
>> > barrier operation there.
>> >
>> > Do you agree?
>> Yes.
>
> Just to be clear, since I realized I wrote fsync() when I should have
> written fs/ext4/fsync.c, my proposal was to put this check in
> ext4_sync_file().
This means that we will end up with two io-barriers in a row for the
external journal case:

ext4_sync_file()
  ->jbd2_log_start_commit()
      ->blkdev_issue_flush(j_fs_dev)
  /* the following flush seems to be mandatory to guarantee metadata
   * consistency */
  ->blkdev_issue_flush(j_fs_dev)

Maybe it would be better to pass some sort of barrier flag to
jbd2_log_start_commit(); this function would mark journal->j_commit_request
with ->t_flushed_data_blocks = 1, so the io-barrier would be issued even
if the transaction has only metadata (a rough sketch of what I mean is
below).
Advantages:
1) the approach works for all data modes
2) only one io-barrier is needed to guarantee data and metadata
   consistency
3) the changes are not that complicated

I've already started to work on the patch but was surprised by the
commit logic:

->__jbd2_log_start_commit(target)
      journal->j_commit_request = target
      ->wake_up(&journal->j_wait_commit)
->kjournald2
      transaction = journal->j_running_transaction
      ->jbd2_journal_commit_transaction(journal);
            commit_transaction = journal->j_running_transaction
            ...
/* So j_commit_request is used only for comparison, but we actually
 * commit the transaction with journal->j_running_transaction->t_tid
 * instead of j_commit_request.  Why? */
>
>> BTW would it be correct to update j_tail in
>> jbd2_journal_commit_transaction() to something more recent
>> if we have issued an io-barrier to j_fs_dev?
>> This would help to reduce journal recovery time, which may be really
>> painful on some slow devices.
>
> Um, maybe.  We are already calling __jbd2_journal_clean_checkpoint_list(),
> and the barrier operation *is* expensive.
>
> On the other hand, updating the journal superblock on every sync is
> another seek that would have to be made before the barrier operation,
> and I'm a bit concerned that this seek would be noticeable.  If it is
> noticeable, is it worth it to optimize for the uncommon case (a power
> failure requiring a journal replay) when it might cost us something,
> however small, on every single journal update?
>
> Do we really think the journal replay time is really something that is
> a major pain point?  I can think of optimizations that involve
> skipping writes that will get updated later in future transactions,
> but it means complicating the replay code, which has been stable for a
> long time, and it's not clear to me that the costs are worth the
> benefits.
Never mind.  It was just my thoughts.  The price of a broken recovery
stage is too expensive to pay for such rare cases.
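To make the flag idea above a bit more concrete, here is roughly what I
mean (an untested sketch only; the function name is a placeholder, and I
mark the running transaction since j_commit_request turned out to be
just a tid):

/* sketch: let the caller ask for an io-barrier on the commit it starts,
 * so the commit code flushes j_fs_dev even for a metadata-only
 * transaction */
int jbd2_log_start_commit_flush(journal_t *journal, tid_t tid)
{
        int ret;

        spin_lock(&journal->j_state_lock);
        ret = __jbd2_log_start_commit(journal, tid);
        if (journal->j_running_transaction &&
            journal->j_running_transaction->t_tid == tid)
                /* force jbd2_journal_commit_transaction() to issue the
                 * barrier even if only metadata is queued */
                journal->j_running_transaction->t_flushed_data_blocks = 1;
        spin_unlock(&journal->j_state_lock);
        return ret;
}

Then ext4_sync_file() could call this instead of jbd2_log_start_commit()
for the external journal case; the case where the tid has already been
committed would still need its own flush.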
>
>> I've taken a look at the async commit logic in fs/jbd2/commit.c:
>> void jbd2_journal_commit_transaction(journal_t *journal)
>> {
>> 725:    /* Done it all: now write the commit record asynchronously. */
>>         if (JBD2_HAS_INCOMPAT_FEATURE(journal,
>>                                       JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT))
>>         {
>>                 err = journal_submit_commit_record(journal,
>>                                                    commit_transaction,
>>                                                    &cbh, crc32_sum);
>>                 if (err)
>>                         __jbd2_journal_abort_hard(journal);
>>                 if (journal->j_flags & JBD2_BARRIER)
>>                         blkdev_issue_flush(journal->j_dev, NULL);
>> <<< blkdev_issue_flush() waits for the barrier to complete by default,
>> <<< but in fact we don't have to wait for the barrier here.  I've
>> <<< prepared a patch which adds flags to control the
>> <<< blkdev_issue_flush() wait behavior, and this is the place for the
>> <<< no-wait variant.
>
> I think that's right, as long as we're confident that subsequent
> writes won't get scheduled before the no-wait barrier.  If they did, it
> would be a bug in the block I/O layer, so it should be OK.
>
>                                                 - Ted
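For reference, this is roughly how that call site would look with such a
flag (a sketch only, against my unposted patch; the extra argument and
the flag name are placeholders, not what is in mainline today):

        /* sketch: submit the barrier for the async commit record but do
         * not wait for it here; ordering against later writes is enough */
        if (journal->j_flags & JBD2_BARRIER)
                blkdev_issue_flush(journal->j_dev, NULL,
                                   BLKDEV_IFL_NOWAIT);

As you say, the assumption is that the block layer keeps later writes
ordered after the submitted barrier.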