On Thu, Jul 11 2024, Andreas Dilger wrote: > On Jul 11, 2024, at 10:16 AM, Wang Jianjian <wangjianjian0@xxxxxxxxxxx> wrote: >> On 2024/7/11 23:16, Luis Henriques wrote: >>> On Thu, Jul 11 2024, wangjianjian (C) wrote: >>> >>>> On 2024/7/11 16:35, Luis Henriques (SUSE) wrote: >>>>> When a full journal commit is on-going, any fast commit has to be enqueued >>>>> into a different queue: FC_Q_STAGING instead of FC_Q_MAIN. This enqueueing >>>>> is done only once, i.e. if an inode is already queued in a previous fast >>>>> commit entry it won't be enqueued again. However, if a full commit starts >>>>> _after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to >>>>> be done into FC_Q_STAGING. And this is not being done in function >>>>> ext4_fc_track_template(). >>>>> This patch fixes the issue by re-enqueuing an inode into the STAGING queue >>>>> during the fast commit clean-up callback if it has a tid (i_sync_tid) >>>>> greater than the one being handled. The STAGING queue will then be spliced >>>>> back into MAIN. >>>>> This bug was found using fstest generic/047. This test creates several 32k >>>>> bytes files, sync'ing each of them after it's creation, and then shutting >>>>> down the filesystem. Some data may be loss in this operation; for example a >>>>> file may have it's size truncated to zero. >>>>> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@xxxxxxxxx> >>>>> --- >>>>> Hi! >>>>> v4 of this patch enqueues the inode into STAGING *only* if the current tid >>>>> is non-zero. It will be zero when doing an fc commit, and this would mean >>>>> to always re-enqueue the inode. This fixes the regressions caught by Ted >>>>> in v3 with fstests generic/472 generic/496 generic/643. >>>>> Also, since 2nd patch of v3 has already been merged, I've rebased this patch >>>>> to be applied on top of it. >>>>> fs/ext4/fast_commit.c | 10 ++++++++++ >>>>> 1 file changed, 10 insertions(+) >>>>> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c >>>>> index 3926a05eceee..facbc8dbbaa2 100644 >>>>> --- a/fs/ext4/fast_commit.c >>>>> +++ b/fs/ext4/fast_commit.c >>>>> @@ -1290,6 +1290,16 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid) >>>>> EXT4_STATE_FC_COMMITTING); >>>>> if (tid_geq(tid, iter->i_sync_tid)) >>>>> ext4_fc_reset_inode(&iter->vfs_inode); >>>>> + } else if (tid) { >>>>> + /* >>>>> + * If the tid is valid (i.e. non-zero) re-enqueue the >>>> one quick question about tid, if one disk is using long time and its tid get >>>> wrapped to 0, is it a valid seq? I don't find code handling this situation. >>> Hmm... OK. So, to answer to your question, the 'tid' is expected to wrap. >>> That's why we use: >>> >>> if (tid_geq(tid, iter->i_sync_tid)) >> Yes, I know this. >>> >>> instead of: >>> >>> if (tid >= iter->i_sync_tid) >>> >>> (The second patch in v3 actually fixed a few places where the tid_*() >>> helpers weren't being used.) >>> >>> But your question shows me that my patch is wrong as '0' may actually be a >>> valid 'tid' value. >> >> Actually my question is, there are some place use '0' to check if a transaction is valid, e.g. >> >> In ext4_wait_for_tail_page_commit() >> >> 5218 while (1) { >> 5219 struct folio *folio = filemap_lock_folio(inode->i_mapping, >> 5220 inode->i_size >> PAGE_SHIFT); >> 5221 if (IS_ERR(folio)) >> 5222 return; >> 5223 ret = __ext4_journalled_invalidate_folio(folio, offset, >> 5224 folio_size(folio) - offset); >> 5225 folio_unlock(folio); >> 5226 folio_put(folio); >> 5227 if (ret != -EBUSY) >> 5228 return; >> 5229 commit_tid = 0; >> 5230 read_lock(&journal->j_state_lock); >> 5231 if (journal->j_committing_transaction) >> 5232 commit_tid = journal->j_committing_transaction->t_tid; >> 5233 read_unlock(&journal->j_state_lock); >> 5234 if (commit_tid) >> 5235 jbd2_log_wait_commit(journal, commit_tid); >> 5236 } >> 5237 We only wait commit if tid is not zero. >> >> And in __jbd2_log_wait_for_space() >> >> 79 if (space_left < nblocks) { >> 80 int chkpt = journal->j_checkpoint_transactions != NULL; >> 81 tid_t tid = 0; >> 82 >> 83 if (journal->j_committing_transaction) >> 84 tid = journal->j_committing_transaction->t_tid; >> 85 spin_unlock(&journal->j_list_lock); >> 86 write_unlock(&journal->j_state_lock); >> 87 if (chkpt) { >> 88 jbd2_log_do_checkpoint(journal); >> 89 } else if (jbd2_cleanup_journal_tail(journal) == 0) { >> 90 /* We were able to recover space; yay! */ >> 91 ; >> 92 } else if (tid) { >> 93 /* >> 94 * jbd2_journal_commit_transaction() may want >> 95 * to take the checkpoint_mutex if JBD2_FLUSHED >> 96 * is set. So we need to temporarily drop it. >> 97 */ >> 98 mutex_unlock(&journal->j_checkpoint_mutex); >> 99 jbd2_log_wait_commit(journal, tid); >> 100 write_lock(&journal->j_state_lock); >> 101 continue; >> We also only wait commit if tid is not zero. >> >> Does it mean all these have bugs if '0' is a valid 'tid' ? >> >> But on the other hand, if we don't consider sync and fsync, and default commit interval is 5s, >> >> time of tid wrap to 0 is nearly 680 years. However, we can run sync/fsync to make tid to increase >> >> more quickly in real world ? > > The simple solution is that "0" is not a valid sequence. It looks like > this is a bug in jbd2_get_transaction() where it is incrementing the TID: > > transaction->t_tid = journal->j_transaction_sequence++; > > it should add a check to handle the wrap-around: > > if (unlikely(transaction->t_tid == 0)) > transaction->t_tid = journal->j_transaction_sequence++; Sound good to me. I can prepare a patch with this change if no one else sees other issues. As far as I can see, this shouldn't be a problem even when replaying journals that still have a '0' tid. Thanks, Andreas. And thanks Wang, for spotting this. Cheers, -- Luís