Re: [PATCH V2] jbd2: use rhashtable for revoke records during replay

Jan Kara <jack@xxxxxxx> · Wed, 13 Nov 2024 15:47:52 +0100

On Tue 12-11-24 11:44:11, Andreas Dilger wrote:
> On Nov 8, 2024, at 9:11 AM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> > 
> > On Fri, Nov 08, 2024 at 11:33:58AM +0100, Jan Kara wrote:
> >>> 1048576 records - 95 seconds
> >>> 2097152 records - 580 seconds
> >> 
> >> These are really high numbers of revoke records. Deleting couple GB of
> >> metadata doesn't happen so easily. Are they from a real workload or just
> >> a stress test?
> > 
> > For context, the background of this is that this has been an
> > out-of-tree that's been around for a very long time, for use with
> > Lustre servers where apparently, this very large number of revoke
> > records is a real thing.
> 
> Yes, we've seen this in production if there was a crash after deleting
> many millions of log records.  This causes remount to take potentially
> several hours before completing (and this was made worse by HA causing
> failovers due to mount being "stuck" doing the journal replay).

Thanks for clarification!

> >> If my interpretation is correct, then rhashtable is unnecessarily
> >> huge hammer for this. Firstly, as the big hash is needed only during
> >> replay, there's no concurrent access to the data
> >> structure. Secondly, we just fill the data structure in the
> >> PASS_REVOKE scan and then use it. Thirdly, we know the number of
> >> elements we need to store in the table in advance (well, currently
> >> we don't but it's trivial to modify PASS_SCAN to get that number).
> >> 
> >> So rather than playing with rhashtable, I'd modify PASS_SCAN to sum
> >> up number of revoke records we're going to process and then prepare
> >> a static hash of appropriate size for replay (we can just use the
> >> standard hashing fs/jbd2/revoke.c uses, just with differently sized
> >> hash table allocated for replay and point journal->j_revoke to
> >> it). And once recovery completes jbd2_journal_clear_revoke() can
> >> free the table and point journal->j_revoke back to the original
> >> table. What do you think?
> > 
> > Hmm, that's a really nice idea; Andreas, what do you think?
> 
> Implementing code to manually count and resize the recovery hashtable
> will also have its own complexity, including possible allocation size
> limits for a huge hash table.  That could be worked around by kvmalloc(),
> but IMHO this essentially starts "open coding" something rhashtable was
> exactly designed to avoid.

Well, I'd say the result is much simpler than rhashtable code since you
don't need all that dynamic reallocation and complex locking. Attached is a
patch that implements my suggestion. I'd say it is simpler than having two
types of revoke block hashing depending on whether we are doing recovery or
running the journal. I've tested it and it seems to work fine (including
replay of a journal with sufficiently many revoke blocks) but I'm not sure
I can do a meaningful performance testing (I cannot quite reproduce the
slow replay times even when shutting down the filesystem after deleting
1000000 directories). So can you please give it a spin?

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR