On Jul 11, 2024, at 10:16 AM, Wang Jianjian <wangjianjian0@xxxxxxxxxxx> wrote: > On 2024/7/11 23:16, Luis Henriques wrote: >> On Thu, Jul 11 2024, wangjianjian (C) wrote: >> >>> On 2024/7/11 16:35, Luis Henriques (SUSE) wrote: >>>> When a full journal commit is on-going, any fast commit has to be enqueued >>>> into a different queue: FC_Q_STAGING instead of FC_Q_MAIN. This enqueueing >>>> is done only once, i.e. if an inode is already queued in a previous fast >>>> commit entry it won't be enqueued again. However, if a full commit starts >>>> _after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to >>>> be done into FC_Q_STAGING. And this is not being done in function >>>> ext4_fc_track_template(). >>>> This patch fixes the issue by re-enqueuing an inode into the STAGING queue >>>> during the fast commit clean-up callback if it has a tid (i_sync_tid) >>>> greater than the one being handled. The STAGING queue will then be spliced >>>> back into MAIN. >>>> This bug was found using fstest generic/047. This test creates several 32k >>>> bytes files, sync'ing each of them after it's creation, and then shutting >>>> down the filesystem. Some data may be loss in this operation; for example a >>>> file may have it's size truncated to zero. >>>> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@xxxxxxxxx> >>>> --- >>>> Hi! >>>> v4 of this patch enqueues the inode into STAGING *only* if the current tid >>>> is non-zero. It will be zero when doing an fc commit, and this would mean >>>> to always re-enqueue the inode. This fixes the regressions caught by Ted >>>> in v3 with fstests generic/472 generic/496 generic/643. >>>> Also, since 2nd patch of v3 has already been merged, I've rebased this patch >>>> to be applied on top of it. >>>> fs/ext4/fast_commit.c | 10 ++++++++++ >>>> 1 file changed, 10 insertions(+) >>>> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c >>>> index 3926a05eceee..facbc8dbbaa2 100644 >>>> --- a/fs/ext4/fast_commit.c >>>> +++ b/fs/ext4/fast_commit.c >>>> @@ -1290,6 +1290,16 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid) >>>> EXT4_STATE_FC_COMMITTING); >>>> if (tid_geq(tid, iter->i_sync_tid)) >>>> ext4_fc_reset_inode(&iter->vfs_inode); >>>> + } else if (tid) { >>>> + /* >>>> + * If the tid is valid (i.e. non-zero) re-enqueue the >>> one quick question about tid, if one disk is using long time and its tid get >>> wrapped to 0, is it a valid seq? I don't find code handling this situation. >> Hmm... OK. So, to answer to your question, the 'tid' is expected to wrap. >> That's why we use: >> >> if (tid_geq(tid, iter->i_sync_tid)) > Yes, I know this. >> >> instead of: >> >> if (tid >= iter->i_sync_tid) >> >> (The second patch in v3 actually fixed a few places where the tid_*() >> helpers weren't being used.) >> >> But your question shows me that my patch is wrong as '0' may actually be a >> valid 'tid' value. > > Actually my question is, there are some place use '0' to check if a transaction is valid, e.g. > > In ext4_wait_for_tail_page_commit() > > 5218 while (1) { > 5219 struct folio *folio = filemap_lock_folio(inode->i_mapping, > 5220 inode->i_size >> PAGE_SHIFT); > 5221 if (IS_ERR(folio)) > 5222 return; > 5223 ret = __ext4_journalled_invalidate_folio(folio, offset, > 5224 folio_size(folio) - offset); > 5225 folio_unlock(folio); > 5226 folio_put(folio); > 5227 if (ret != -EBUSY) > 5228 return; > 5229 commit_tid = 0; > 5230 read_lock(&journal->j_state_lock); > 5231 if (journal->j_committing_transaction) > 5232 commit_tid = journal->j_committing_transaction->t_tid; > 5233 read_unlock(&journal->j_state_lock); > 5234 if (commit_tid) > 5235 jbd2_log_wait_commit(journal, commit_tid); > 5236 } > 5237 We only wait commit if tid is not zero. > > And in __jbd2_log_wait_for_space() > > 79 if (space_left < nblocks) { > 80 int chkpt = journal->j_checkpoint_transactions != NULL; > 81 tid_t tid = 0; > 82 > 83 if (journal->j_committing_transaction) > 84 tid = journal->j_committing_transaction->t_tid; > 85 spin_unlock(&journal->j_list_lock); > 86 write_unlock(&journal->j_state_lock); > 87 if (chkpt) { > 88 jbd2_log_do_checkpoint(journal); > 89 } else if (jbd2_cleanup_journal_tail(journal) == 0) { > 90 /* We were able to recover space; yay! */ > 91 ; > 92 } else if (tid) { > 93 /* > 94 * jbd2_journal_commit_transaction() may want > 95 * to take the checkpoint_mutex if JBD2_FLUSHED > 96 * is set. So we need to temporarily drop it. > 97 */ > 98 mutex_unlock(&journal->j_checkpoint_mutex); > 99 jbd2_log_wait_commit(journal, tid); > 100 write_lock(&journal->j_state_lock); > 101 continue; > We also only wait commit if tid is not zero. > > Does it mean all these have bugs if '0' is a valid 'tid' ? > > But on the other hand, if we don't consider sync and fsync, and default commit interval is 5s, > > time of tid wrap to 0 is nearly 680 years. However, we can run sync/fsync to make tid to increase > > more quickly in real world ? The simple solution is that "0" is not a valid sequence. It looks like this is a bug in jbd2_get_transaction() where it is incrementing the TID: transaction->t_tid = journal->j_transaction_sequence++; it should add a check to handle the wrap-around: if (unlikely(transaction->t_tid == 0)) transaction->t_tid = journal->j_transaction_sequence++; Cheers, Andreas
Attachment:
signature.asc
Description: Message signed with OpenPGP