On 2024/11/8 18:33, Jan Kara wrote: > On Tue 05-11-24 14:44:28, Li Dongyang wrote: >> Resizable hashtable should improve journal replay time when >> we have million of revoke records. >> Notice that rhashtable is used during replay only, >> as removal with list_del() is less expensive and it's still used >> during regular processing. >> >> before: >> 1048576 records - 95 seconds >> 2097152 records - 580 seconds > > These are really high numbers of revoke records. Deleting couple GB of > metadata doesn't happen so easily. Are they from a real workload or just > a stress test? > >> after: >> 1048576 records - 2 seconds >> 2097152 records - 3 seconds >> 4194304 records - 7 seconds > > The gains are very nice :). > >> Signed-off-by: Alex Zhuravlev <bzzz@xxxxxxxxxxxxx> >> Signed-off-by: Li Dongyang <dongyangli@xxxxxxx> > >> diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c >> index 667f67342c52..d9287439171c 100644 >> --- a/fs/jbd2/recovery.c >> +++ b/fs/jbd2/recovery.c >> @@ -294,6 +294,10 @@ int jbd2_journal_recover(journal_t *journal) >> memset(&info, 0, sizeof(info)); >> sb = journal->j_superblock; >> >> + err = jbd2_journal_init_recovery_revoke(journal); >> + if (err) >> + return err; >> + >> /* >> * The journal superblock's s_start field (the current log head) >> * is always zero if, and only if, the journal was cleanly >> diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c >> index 4556e4689024..d6e96099e9c9 100644 >> --- a/fs/jbd2/revoke.c >> +++ b/fs/jbd2/revoke.c >> @@ -90,6 +90,7 @@ >> #include <linux/bio.h> >> #include <linux/log2.h> >> #include <linux/hash.h> >> +#include <linux/rhashtable.h> >> #endif >> >> static struct kmem_cache *jbd2_revoke_record_cache; >> @@ -101,7 +102,10 @@ static struct kmem_cache *jbd2_revoke_table_cache; >> >> struct jbd2_revoke_record_s >> { >> - struct list_head hash; >> + union { >> + struct list_head hash; >> + struct rhash_head linkage; >> + }; >> tid_t sequence; /* Used for recovery only */ >> unsigned long long blocknr; >> }; >> @@ -680,13 +684,22 @@ static void flush_descriptor(journal_t *journal, >> * single block. >> */ >> >> +static const struct rhashtable_params revoke_rhashtable_params = { >> + .key_len = sizeof(unsigned long long), >> + .key_offset = offsetof(struct jbd2_revoke_record_s, blocknr), >> + .head_offset = offsetof(struct jbd2_revoke_record_s, linkage), >> +}; >> + > > I'd probably view your performance results as: "JOURNAL_REVOKE_DEFAULT_HASH > is just too small for replays of a journal with huge numbers of revoked > blocks". Or did you observe that JOURNAL_REVOKE_DEFAULT_HASH is causing > performance issues also during normal operation when we track there revokes > for the current transaction? > > If my interpretation is correct, then rhashtable is unnecessarily huge > hammer for this. Firstly, as the big hash is needed only during replay, > there's no concurrent access to the data structure. Secondly, we just fill > the data structure in the PASS_REVOKE scan and then use it. Thirdly, we > know the number of elements we need to store in the table in advance (well, > currently we don't but it's trivial to modify PASS_SCAN to get that > number). > > So rather than playing with rhashtable, I'd modify PASS_SCAN to sum up > number of revoke records we're going to process and then prepare a static > hash of appropriate size for replay (we can just use the standard hashing > fs/jbd2/revoke.c uses, just with differently sized hash table allocated for > replay and point journal->j_revoke to it). And once recovery completes > jbd2_journal_clear_revoke() can free the table and point journal->j_revoke > back to the original table. What do you think? > Sounds reasonable to me. I'd vote for this solution, this is a really simple and clear solution, and I believe it can achieve similar gains as rhashtable. Thanks, Yi.