Re: [PATCH V2] jbd2: use rhashtable for revoke records during replay

Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> · Sat, 9 Nov 2024 11:12:30 +0800

On 2024/11/8 18:33, Jan Kara wrote:
> On Tue 05-11-24 14:44:28, Li Dongyang wrote:
>> Resizable hashtable should improve journal replay time when
>> we have million of revoke records.
>> Notice that rhashtable is used during replay only,
>> as removal with list_del() is less expensive and it's still used
>> during regular processing.
>>
>> before:
>> 1048576 records - 95 seconds
>> 2097152 records - 580 seconds
> 
> These are really high numbers of revoke records. Deleting couple GB of
> metadata doesn't happen so easily. Are they from a real workload or just
> a stress test?
>  
>> after:
>> 1048576 records - 2 seconds
>> 2097152 records - 3 seconds
>> 4194304 records - 7 seconds
> 
> The gains are very nice :).
> 
>> Signed-off-by: Alex Zhuravlev <bzzz@xxxxxxxxxxxxx>
>> Signed-off-by: Li Dongyang <dongyangli@xxxxxxx>
> 
>> diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
>> index 667f67342c52..d9287439171c 100644
>> --- a/fs/jbd2/recovery.c
>> +++ b/fs/jbd2/recovery.c
>> @@ -294,6 +294,10 @@ int jbd2_journal_recover(journal_t *journal)
>>  	memset(&info, 0, sizeof(info));
>>  	sb = journal->j_superblock;
>>  
>> +	err = jbd2_journal_init_recovery_revoke(journal);
>> +	if (err)
>> +		return err;
>> +
>>  	/*
>>  	 * The journal superblock's s_start field (the current log head)
>>  	 * is always zero if, and only if, the journal was cleanly
>> diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
>> index 4556e4689024..d6e96099e9c9 100644
>> --- a/fs/jbd2/revoke.c
>> +++ b/fs/jbd2/revoke.c
>> @@ -90,6 +90,7 @@
>>  #include <linux/bio.h>
>>  #include <linux/log2.h>
>>  #include <linux/hash.h>
>> +#include <linux/rhashtable.h>
>>  #endif
>>  
>>  static struct kmem_cache *jbd2_revoke_record_cache;
>> @@ -101,7 +102,10 @@ static struct kmem_cache *jbd2_revoke_table_cache;
>>  
>>  struct jbd2_revoke_record_s
>>  {
>> -	struct list_head  hash;
>> +	union {
>> +		struct list_head  hash;
>> +		struct rhash_head linkage;
>> +	};
>>  	tid_t		  sequence;	/* Used for recovery only */
>>  	unsigned long long	  blocknr;
>>  };
>> @@ -680,13 +684,22 @@ static void flush_descriptor(journal_t *journal,
>>   * single block.
>>   */
>>  
>> +static const struct rhashtable_params revoke_rhashtable_params = {
>> +	.key_len     = sizeof(unsigned long long),
>> +	.key_offset  = offsetof(struct jbd2_revoke_record_s, blocknr),
>> +	.head_offset = offsetof(struct jbd2_revoke_record_s, linkage),
>> +};
>> +
> 
> I'd probably view your performance results as: "JOURNAL_REVOKE_DEFAULT_HASH
> is just too small for replays of a journal with huge numbers of revoked
> blocks". Or did you observe that JOURNAL_REVOKE_DEFAULT_HASH is causing
> performance issues also during normal operation when we track there revokes
> for the current transaction?
> 
> If my interpretation is correct, then rhashtable is unnecessarily huge
> hammer for this. Firstly, as the big hash is needed only during replay,
> there's no concurrent access to the data structure. Secondly, we just fill
> the data structure in the PASS_REVOKE scan and then use it. Thirdly, we
> know the number of elements we need to store in the table in advance (well,
> currently we don't but it's trivial to modify PASS_SCAN to get that
> number). 
> 
> So rather than playing with rhashtable, I'd modify PASS_SCAN to sum up
> number of revoke records we're going to process and then prepare a static
> hash of appropriate size for replay (we can just use the standard hashing
> fs/jbd2/revoke.c uses, just with differently sized hash table allocated for
> replay and point journal->j_revoke to it). And once recovery completes
> jbd2_journal_clear_revoke() can free the table and point journal->j_revoke
> back to the original table. What do you think?
> 

Sounds reasonable to me. I'd vote for this solution, this is a really simple
and clear solution, and I believe it can achieve similar gains as rhashtable.

Thanks,
Yi.