On Wed 15-01-25 17:08:03, Andreas Dilger wrote:
> On Nov 13, 2024, at 7:47 AM, Jan Kara <jack@xxxxxxx> wrote:
> >
> > On Tue 12-11-24 11:44:11, Andreas Dilger wrote:
> >> On Nov 8, 2024, at 9:11 AM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> >>>
> >>> On Fri, Nov 08, 2024 at 11:33:58AM +0100, Jan Kara wrote:
> >>>>> 1048576 records - 95 seconds
> >>>>> 2097152 records - 580 seconds
> >>>>
> >>>> These are really high numbers of revoke records. Deleting a couple GB
> >>>> of metadata doesn't happen so easily. Are they from a real workload or
> >>>> just a stress test?
> >>>
> >>> For context, the background of this is that this has been an
> >>> out-of-tree patch that's been around for a very long time, for use
> >>> with Lustre servers where, apparently, this very large number of
> >>> revoke records is a real thing.
> >>
> >> Yes, we've seen this in production if there was a crash after deleting
> >> many millions of log records. This causes remount to take potentially
> >> several hours before completing (and this was made worse by HA causing
> >> failovers due to mount being "stuck" doing the journal replay).
> >
> > Thanks for the clarification!
> >
> >>>> If my interpretation is correct, then rhashtable is an unnecessarily
> >>>> huge hammer for this. Firstly, as the big hash is needed only during
> >>>> replay, there's no concurrent access to the data structure. Secondly,
> >>>> we just fill the data structure in the PASS_REVOKE scan and then use
> >>>> it. Thirdly, we know the number of elements we need to store in the
> >>>> table in advance (well, currently we don't, but it's trivial to
> >>>> modify PASS_SCAN to get that number).
> >>>>
> >>>> So rather than playing with rhashtable, I'd modify PASS_SCAN to sum
> >>>> up the number of revoke records we're going to process and then
> >>>> prepare a static hash of appropriate size for replay (we can just use
> >>>> the standard hashing fs/jbd2/revoke.c uses, just with a differently
> >>>> sized hash table allocated for replay, and point journal->j_revoke to
> >>>> it). And once recovery completes, jbd2_journal_clear_revoke() can
> >>>> free the table and point journal->j_revoke back to the original
> >>>> table. What do you think?
> >>>
> >>> Hmm, that's a really nice idea; Andreas, what do you think?
> >>
> >> Implementing code to manually count and resize the recovery hashtable
> >> will also have its own complexity, including possible allocation size
> >> limits for a huge hash table. That could be worked around by
> >> kvmalloc(), but IMHO this essentially starts "open coding" something
> >> rhashtable was exactly designed to avoid.
> >
> > Well, I'd say the result is much simpler than rhashtable code since you
> > don't need all that dynamic reallocation and complex locking. Attached
> > is a patch that implements my suggestion. I'd say it is simpler than
> > having two types of revoke block hashing depending on whether we are
> > doing recovery or running the journal.
> >
> > I've tested it and it seems to work fine (including replay of a journal
> > with sufficiently many revoke blocks) but I'm not sure I can do
> > meaningful performance testing (I cannot quite reproduce the slow
> > replay times even when shutting down the filesystem after deleting
> > 1000000 directories). So can you please give it a spin?
>
> Alex posted test results on the other rhashtable revoke thread, which
> show both the rhashtable and Jan's dynamically-allocated hash table
> perform much better than the original fixed-size hash table.

Yes, thanks for the testing!
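To make the proposal quoted above concrete, here is a minimal sketch of the
idea in fs/jbd2/revoke.c style. The helper names
(jbd2_journal_init_recovery_revoke()/jbd2_journal_exit_recovery_revoke()),
the sizing policy, and the table restore at the end are illustrative
assumptions, not the actual patch; journal->j_revoke, j_revoke_table[],
JOURNAL_REVOKE_DEFAULT_HASH and jbd2_journal_clear_revoke() are the existing
jbd2 names it builds on.

/*
 * Sketch only (hypothetical helpers): size the revoke hash table for
 * journal replay based on the revoke record count gathered in PASS_SCAN,
 * then drop it again once recovery is done.  Assumes the revoke table
 * struct layout from fs/jbd2/revoke.c.
 */
#include <linux/jbd2.h>
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/list.h>
#include <linux/log2.h>
#include <linux/minmax.h>

struct jbd2_revoke_table_s {		/* private to fs/jbd2/revoke.c */
	int		  hash_size;	/* number of buckets, power of 2 */
	int		  hash_shift;
	struct list_head *hash_table;
};

/*
 * Called after PASS_SCAN has summed up the revoke records in the log
 * (nr_revokes).  Allocate a hash table sized to that count and make it
 * the active table for the PASS_REVOKE and PASS_REPLAY passes.
 */
static int jbd2_journal_init_recovery_revoke(journal_t *journal,
					     unsigned long nr_revokes)
{
	struct jbd2_revoke_table_s *table;
	int i, hash_size;

	/* Roughly one bucket per record, clamped to something sane. */
	hash_size = clamp_t(int, roundup_pow_of_two(nr_revokes),
			    JOURNAL_REVOKE_DEFAULT_HASH, 1 << 20);

	table = kzalloc(sizeof(*table), GFP_KERNEL);
	if (!table)
		return -ENOMEM;
	table->hash_size = hash_size;
	table->hash_shift = ilog2(hash_size);
	/* kvmalloc so a very large bucket array can fall back to vmalloc */
	table->hash_table = kvmalloc_array(hash_size,
					   sizeof(struct list_head),
					   GFP_KERNEL);
	if (!table->hash_table) {
		kfree(table);
		return -ENOMEM;
	}
	for (i = 0; i < hash_size; i++)
		INIT_LIST_HEAD(&table->hash_table[i]);

	/* Recovery is single-threaded, so just switch the active table. */
	journal->j_revoke = table;
	return 0;
}

/*
 * Once recovery completes (e.g. from jbd2_journal_clear_revoke()), free
 * the replay-sized table and go back to the regular runtime table set up
 * by jbd2_journal_init_revoke().  Simplified: real code would remember
 * which of the two runtime tables was active before recovery.
 */
static void jbd2_journal_exit_recovery_revoke(journal_t *journal)
{
	struct jbd2_revoke_table_s *table = journal->j_revoke;

	kvfree(table->hash_table);
	kfree(table);
	journal->j_revoke = journal->j_revoke_table[1];
}

With something along these lines, the running journal keeps its small
default hash table untouched; only the replay path pays for the larger
allocation, and it is freed again before the filesystem is mounted.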
Patch submitted.

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR