On Tue 13-03-12 22:17:45, Ted Tso wrote: > I'm not sure why you pulled out the code for > jbd2_journal_get_log_tail(), since there's no other usage of the > function, but it did make this commit harder to review. It's good for > code movement patches to be segregated to their own commit, just to > make it easier to review.... I pulled the code out because the last patch in the series uses it from commit code. So it was not a deliberate code movement. But I agree it might have been easier to review if the code movement was a separate patch. Should I do that or have you already coped with the patch as is? Honza > On Wed, Feb 15, 2012 at 07:34:09PM +0100, Jan Kara wrote: > > When we reach jbd2_cleanup_journal_tail(), there is no guarantee that > > checkpointed buffers are on a stable storage - especially if buffers were > > written out by jbd2_log_do_checkpoint(), they are likely to be only in disk's > > caches. Thus when we update journal superblock effectively removing old > > transaction from journal, this write of superblock can get to stable storage > > before those checkpointed buffers which can result in filesystem corruption > > after a crash. Thus we must unconditionally issue a cache flush before we > > update journal superblock in these cases. > > > > A similar problem can also occur if journal superblock is written only in > > disk's caches, other transaction starts reusing space of the transaction > > cleaned from the log and power failure happens. Subsequent journal replay would > > still try to replay the old transaction but some of it's blocks may be already > > overwritten by the new transaction. For this reason we must use WRITE_FUA when > > updating log tail and we must first write new log tail to disk and update > > in-memory information only after that. > > > > Signed-off-by: Jan Kara <jack@xxxxxxx> > > --- > > fs/jbd2/checkpoint.c | 75 ++++-------------------- > > fs/jbd2/commit.c | 11 +++- > > fs/jbd2/journal.c | 136 ++++++++++++++++++++++++++++++++++++------ > > fs/jbd2/recovery.c | 5 +- > > include/linux/jbd2.h | 6 ++- > > include/trace/events/jbd2.h | 2 +- > > 6 files changed, 148 insertions(+), 87 deletions(-) > > > > diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c > > index aa09868..de76f1f 100644 > > --- a/fs/jbd2/checkpoint.c > > +++ b/fs/jbd2/checkpoint.c > > @@ -478,79 +478,28 @@ out: > > > > int jbd2_cleanup_journal_tail(journal_t *journal) > > { > > - transaction_t * transaction; > > tid_t first_tid; > > - unsigned long blocknr, freed; > > + unsigned long blocknr; > > > > if (is_journal_aborted(journal)) > > return 1; > > > > - /* OK, work out the oldest transaction remaining in the log, and > > - * the log block it starts at. > > - * > > - * If the log is now empty, we need to work out which is the > > - * next transaction ID we will write, and where it will > > - * start. */ > > - > > - write_lock(&journal->j_state_lock); > > - spin_lock(&journal->j_list_lock); > > - transaction = journal->j_checkpoint_transactions; > > - if (transaction) { > > - first_tid = transaction->t_tid; > > - blocknr = transaction->t_log_start; > > - } else if ((transaction = journal->j_committing_transaction) != NULL) { > > - first_tid = transaction->t_tid; > > - blocknr = transaction->t_log_start; > > - } else if ((transaction = journal->j_running_transaction) != NULL) { > > - first_tid = transaction->t_tid; > > - blocknr = journal->j_head; > > - } else { > > - first_tid = journal->j_transaction_sequence; > > - blocknr = journal->j_head; > > - } > > - spin_unlock(&journal->j_list_lock); > > - J_ASSERT(blocknr != 0); > > - > > - /* If the oldest pinned transaction is at the tail of the log > > - already then there's not much we can do right now. */ > > - if (journal->j_tail_sequence == first_tid) { > > - write_unlock(&journal->j_state_lock); > > + if (!jbd2_journal_get_log_tail(journal, &first_tid, &blocknr)) > > return 1; > > - } > > - > > - /* OK, update the superblock to recover the freed space. > > - * Physical blocks come first: have we wrapped beyond the end of > > - * the log? */ > > - freed = blocknr - journal->j_tail; > > - if (blocknr < journal->j_tail) > > - freed = freed + journal->j_last - journal->j_first; > > - > > - trace_jbd2_cleanup_journal_tail(journal, first_tid, blocknr, freed); > > - jbd_debug(1, > > - "Cleaning journal tail from %d to %d (offset %lu), " > > - "freeing %lu\n", > > - journal->j_tail_sequence, first_tid, blocknr, freed); > > - > > - journal->j_free += freed; > > - journal->j_tail_sequence = first_tid; > > - journal->j_tail = blocknr; > > - write_unlock(&journal->j_state_lock); > > + J_ASSERT(blocknr != 0); > > > > /* > > - * If there is an external journal, we need to make sure that > > - * any data blocks that were recently written out --- perhaps > > - * by jbd2_log_do_checkpoint() --- are flushed out before we > > - * drop the transactions from the external journal. It's > > - * unlikely this will be necessary, especially with a > > - * appropriately sized journal, but we need this to guarantee > > - * correctness. Fortunately jbd2_cleanup_journal_tail() > > - * doesn't get called all that often. > > + * We need to make sure that any blocks that were recently written out > > + * --- perhaps by jbd2_log_do_checkpoint() --- are flushed out before > > + * we drop the transactions from the journal. It's unlikely this will > > + * be necessary, especially with an appropriately sized journal, but we > > + * need this to guarantee correctness. Fortunately > > + * jbd2_cleanup_journal_tail() doesn't get called all that often. > > */ > > - if ((journal->j_fs_dev != journal->j_dev) && > > - (journal->j_flags & JBD2_BARRIER)) > > + if (journal->j_flags & JBD2_BARRIER) > > blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL); > > - if (!(journal->j_flags & JBD2_ABORT)) > > - jbd2_journal_update_sb_log_tail(journal); > > + > > + __jbd2_update_log_tail(journal, first_tid, blocknr); > > return 0; > > } > > > > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c > > index 39978c1..f37b783 100644 > > --- a/fs/jbd2/commit.c > > +++ b/fs/jbd2/commit.c > > @@ -341,7 +341,16 @@ void jbd2_journal_commit_transaction(journal_t *journal) > > if (journal->j_flags & JBD2_FLUSHED) { > > jbd_debug(3, "super block updated\n"); > > mutex_lock(&journal->j_checkpoint_mutex); > > - jbd2_journal_update_sb_log_tail(journal); > > + /* > > + * We hold j_checkpoint_mutex so tail cannot change under us. > > + * We don't need any special data guarantees for writing sb > > + * since journal is empty and it is ok for write to be > > + * flushed only with transaction commit. > > + */ > > + jbd2_journal_update_sb_log_tail(journal, > > + journal->j_tail_sequence, > > + journal->j_tail, > > + WRITE_SYNC); > > mutex_unlock(&journal->j_checkpoint_mutex); > > } else { > > jbd_debug(3, "superblock not updated\n"); > > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c > > index 183a099..192558f 100644 > > --- a/fs/jbd2/journal.c > > +++ b/fs/jbd2/journal.c > > @@ -744,6 +744,85 @@ struct journal_head *jbd2_journal_get_descriptor_buffer(journal_t *journal) > > return jbd2_journal_add_journal_head(bh); > > } > > > > +/* > > + * Return tid of the oldest transaction in the journal and block in the journal > > + * where the transaction starts. > > + * > > + * If the journal is now empty, return which will be the next transaction ID > > + * we will write and where will that transaction start. > > + * > > + * The return value is 0 if journal tail cannot be pushed any further, 1 if > > + * it can. > > + */ > > +int jbd2_journal_get_log_tail(journal_t *journal, tid_t *tid, > > + unsigned long *block) > > +{ > > + transaction_t *transaction; > > + int ret; > > + > > + read_lock(&journal->j_state_lock); > > + spin_lock(&journal->j_list_lock); > > + transaction = journal->j_checkpoint_transactions; > > + if (transaction) { > > + *tid = transaction->t_tid; > > + *block = transaction->t_log_start; > > + } else if ((transaction = journal->j_committing_transaction) != NULL) { > > + *tid = transaction->t_tid; > > + *block = transaction->t_log_start; > > + } else if ((transaction = journal->j_running_transaction) != NULL) { > > + *tid = transaction->t_tid; > > + *block = journal->j_head; > > + } else { > > + *tid = journal->j_transaction_sequence; > > + *block = journal->j_head; > > + } > > + ret = tid_gt(*tid, journal->j_tail_sequence); > > + spin_unlock(&journal->j_list_lock); > > + read_unlock(&journal->j_state_lock); > > + > > + return ret; > > +} > > + > > +/* > > + * Update information in journal structure and in on disk journal superblock > > + * about log tail. This function does not check whether information passed in > > + * really pushes log tail further. It's responsibility of the caller to make > > + * sure provided log tail information is valid (e.g. by holding > > + * j_checkpoint_mutex all the time between computing log tail and calling this > > + * function as is the case with jbd2_cleanup_journal_tail()). > > + * > > + * Requires j_checkpoint_mutex > > + */ > > +void __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block) > > +{ > > + unsigned long freed; > > + > > + BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex)); > > + > > + /* > > + * We cannot afford for write to remain in drive's caches since as > > + * soon as we update j_tail, next transaction can start reusing journal > > + * space and if we lose sb update during power failure we'd replay > > + * old transaction with possibly newly overwritten data. > > + */ > > + jbd2_journal_update_sb_log_tail(journal, tid, block, WRITE_FUA); > > + write_lock(&journal->j_state_lock); > > + freed = block - journal->j_tail; > > + if (block < journal->j_tail) > > + freed += journal->j_last - journal->j_first; > > + > > + trace_jbd2_update_log_tail(journal, tid, block, freed); > > + jbd_debug(1, > > + "Cleaning journal tail from %d to %d (offset %lu), " > > + "freeing %lu\n", > > + journal->j_tail_sequence, tid, block, freed); > > + > > + journal->j_free += freed; > > + journal->j_tail_sequence = tid; > > + journal->j_tail = block; > > + write_unlock(&journal->j_state_lock); > > +} > > + > > struct jbd2_stats_proc_session { > > journal_t *journal; > > struct transaction_stats_s *stats; > > @@ -1127,17 +1206,29 @@ static int journal_reset(journal_t *journal) > > } else { > > /* Lock here to make assertions happy... */ > > mutex_lock(&journal->j_checkpoint_mutex); > > - /* Add the dynamic fields and write it to disk. */ > > - jbd2_journal_update_sb_log_tail(journal); > > + /* > > + * Update log tail information. We use WRITE_FUA since new > > + * transaction will start reusing journal space and so we > > + * must make sure information about current log tail is on > > + * disk before that. > > + */ > > + jbd2_journal_update_sb_log_tail(journal, > > + journal->j_tail_sequence, > > + journal->j_tail, > > + WRITE_FUA); > > mutex_unlock(&journal->j_checkpoint_mutex); > > } > > return jbd2_journal_start_thread(journal); > > } > > > > -static void jbd2_write_superblock(journal_t *journal) > > +static void jbd2_write_superblock(journal_t *journal, int write_op) > > { > > struct buffer_head *bh = journal->j_sb_buffer; > > + int ret; > > > > + if (!(journal->j_flags & JBD2_BARRIER)) > > + write_op &= ~(REQ_FUA | REQ_FLUSH); > > + lock_buffer(bh); > > if (buffer_write_io_error(bh)) { > > /* > > * Oh, dear. A previous attempt to write the journal > > @@ -1153,40 +1244,45 @@ static void jbd2_write_superblock(journal_t *journal) > > clear_buffer_write_io_error(bh); > > set_buffer_uptodate(bh); > > } > > - > > - BUFFER_TRACE(bh, "marking dirty"); > > - mark_buffer_dirty(bh); > > - sync_dirty_buffer(bh); > > + get_bh(bh); > > + bh->b_end_io = end_buffer_write_sync; > > + ret = submit_bh(write_op, bh); > > + wait_on_buffer(bh); > > if (buffer_write_io_error(bh)) { > > - printk(KERN_ERR "JBD2: I/O error detected " > > - "when updating journal superblock for %s.\n", > > - journal->j_devname); > > clear_buffer_write_io_error(bh); > > set_buffer_uptodate(bh); > > + ret = -EIO; > > + } > > + if (ret) { > > + printk(KERN_ERR "JBD2: Error %d detected when updating " > > + "journal superblock for %s.\n", ret, > > + journal->j_devname); > > } > > } > > > > /** > > * jbd2_journal_update_sb_log_tail() - Update log tail in journal sb on disk. > > * @journal: The journal to update. > > + * @tail_tid: TID of the new transaction at the tail of the log > > + * @tail_block: The first block of the transaction at the tail of the log > > + * @write_op: With which operation should we write the journal sb > > * > > * Update a journal's superblock information about log tail and write it to > > * disk, waiting for the IO to complete. > > */ > > -void jbd2_journal_update_sb_log_tail(journal_t *journal) > > +void jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid, > > + unsigned long tail_block, int write_op) > > { > > journal_superblock_t *sb = journal->j_superblock; > > > > BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex)); > > - read_lock(&journal->j_state_lock); > > - jbd_debug(1, "JBD2: updating superblock (start %ld, seq %d)\n", > > - journal->j_tail, journal->j_tail_sequence); > > + jbd_debug(1, "JBD2: updating superblock (start %lu, seq %u)\n", > > + tail_block, tail_tid); > > > > - sb->s_sequence = cpu_to_be32(journal->j_tail_sequence); > > - sb->s_start = cpu_to_be32(journal->j_tail); > > - read_unlock(&journal->j_state_lock); > > + sb->s_sequence = cpu_to_be32(tail_tid); > > + sb->s_start = cpu_to_be32(tail_block); > > > > - jbd2_write_superblock(journal); > > + jbd2_write_superblock(journal, write_op); > > > > /* Log is no longer empty */ > > write_lock(&journal->j_state_lock); > > @@ -1215,7 +1311,7 @@ static void jbd2_mark_journal_empty(journal_t *journal) > > sb->s_start = cpu_to_be32(0); > > read_unlock(&journal->j_state_lock); > > > > - jbd2_write_superblock(journal); > > + jbd2_write_superblock(journal, WRITE_FUA); > > > > /* Log is no longer empty */ > > write_lock(&journal->j_state_lock); > > @@ -1241,7 +1337,7 @@ static void jbd2_journal_update_sb_errno(journal_t *journal) > > sb->s_errno = cpu_to_be32(journal->j_errno); > > read_unlock(&journal->j_state_lock); > > > > - jbd2_write_superblock(journal); > > + jbd2_write_superblock(journal, WRITE_SYNC); > > } > > > > /* > > diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c > > index da6d7ba..c1a0335 100644 > > --- a/fs/jbd2/recovery.c > > +++ b/fs/jbd2/recovery.c > > @@ -21,6 +21,7 @@ > > #include <linux/jbd2.h> > > #include <linux/errno.h> > > #include <linux/crc32.h> > > +#include <linux/blkdev.h> > > #endif > > > > /* > > @@ -265,7 +266,9 @@ int jbd2_journal_recover(journal_t *journal) > > err2 = sync_blockdev(journal->j_fs_dev); > > if (!err) > > err = err2; > > - > > + /* Make sure all replayed data is on permanent storage */ > > + if (journal->j_flags & JBD2_BARRIER) > > + blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL); > > return err; > > } > > > > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h > > index 78da912..1b10c2a 100644 > > --- a/include/linux/jbd2.h > > +++ b/include/linux/jbd2.h > > @@ -971,6 +971,9 @@ extern void __journal_clean_data_list(transaction_t *transaction); > > /* Log buffer allocation */ > > extern struct journal_head * jbd2_journal_get_descriptor_buffer(journal_t *); > > int jbd2_journal_next_log_block(journal_t *, unsigned long long *); > > +int jbd2_journal_get_log_tail(journal_t *journal, tid_t *tid, > > + unsigned long *block); > > +void __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block); > > > > /* Commit management */ > > extern void jbd2_journal_commit_transaction(journal_t *); > > @@ -1082,7 +1085,8 @@ extern int jbd2_journal_destroy (journal_t *); > > extern int jbd2_journal_recover (journal_t *journal); > > extern int jbd2_journal_wipe (journal_t *, int); > > extern int jbd2_journal_skip_recovery (journal_t *); > > -extern void jbd2_journal_update_sb_log_tail (journal_t *); > > +extern void jbd2_journal_update_sb_log_tail (journal_t *, tid_t, > > + unsigned long, int); > > extern void __jbd2_journal_abort_hard (journal_t *); > > extern void jbd2_journal_abort (journal_t *, int); > > extern int jbd2_journal_errno (journal_t *); > > diff --git a/include/trace/events/jbd2.h b/include/trace/events/jbd2.h > > index 7596441..5c74007 100644 > > --- a/include/trace/events/jbd2.h > > +++ b/include/trace/events/jbd2.h > > @@ -200,7 +200,7 @@ TRACE_EVENT(jbd2_checkpoint_stats, > > __entry->forced_to_close, __entry->written, __entry->dropped) > > ); > > > > -TRACE_EVENT(jbd2_cleanup_journal_tail, > > +TRACE_EVENT(jbd2_update_log_tail, > > > > TP_PROTO(journal_t *journal, tid_t first_tid, > > unsigned long block_nr, unsigned long freed), > > -- > > 1.7.1 > > -- Jan Kara <jack@xxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html