On Wed 15-01-25 17:08:03, Andreas Dilger wrote:
> On Nov 13, 2024, at 7:47 AM, Jan Kara <jack@xxxxxxx> wrote:
> >
> > On Tue 12-11-24 11:44:11, Andreas Dilger wrote:
> >> On Nov 8, 2024, at 9:11 AM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> >>>
> >>> On Fri, Nov 08, 2024 at 11:33:58AM +0100, Jan Kara wrote:
> >>>>> 1048576 records - 95 seconds
> >>>>> 2097152 records - 580 seconds
> >>>>
> >>>> These are really high numbers of revoke records. Deleting a couple GB
> >>>> of metadata doesn't happen so easily. Are they from a real workload or
> >>>> just a stress test?
> >>>
> >>> For context, the background of this is that this has been an
> >>> out-of-tree patch that's been around for a very long time, for use
> >>> with Lustre servers where, apparently, this very large number of
> >>> revoke records is a real thing.
> >>
> >> Yes, we've seen this in production if there was a crash after deleting
> >> many millions of log records. This causes remount to take potentially
> >> several hours before completing (and this was made worse by HA causing
> >> failovers due to mount being "stuck" doing the journal replay).
> >
> > Thanks for the clarification!
> >
> >>>> If my interpretation is correct, then rhashtable is an unnecessarily
> >>>> huge hammer for this. Firstly, as the big hash is needed only during
> >>>> replay, there's no concurrent access to the data structure. Secondly,
> >>>> we just fill the data structure in the PASS_REVOKE scan and then use
> >>>> it. Thirdly, we know the number of elements we need to store in the
> >>>> table in advance (well, currently we don't, but it's trivial to
> >>>> modify PASS_SCAN to get that number).
> >>>>
> >>>> So rather than playing with rhashtable, I'd modify PASS_SCAN to sum
> >>>> up the number of revoke records we're going to process and then
> >>>> prepare a static hash of appropriate size for replay (we can just use
> >>>> the standard hashing fs/jbd2/revoke.c uses, just with a differently
> >>>> sized hash table allocated for replay, and point journal->j_revoke to
> >>>> it). And once recovery completes, jbd2_journal_clear_revoke() can
> >>>> free the table and point journal->j_revoke back to the original
> >>>> table. What do you think?
> >>>
> >>> Hmm, that's a really nice idea; Andreas, what do you think?
> >>
> >> Implementing code to manually count and resize the recovery hashtable
> >> will also have its own complexity, including possible allocation size
> >> limits for a huge hash table. That could be worked around by
> >> kvmalloc(), but IMHO this essentially starts "open coding" something
> >> rhashtable was exactly designed to avoid.
> >
> > Well, I'd say the result is much simpler than rhashtable code since you
> > don't need all that dynamic reallocation and complex locking. Attached
> > is a patch that implements my suggestion. I'd say it is simpler than
> > having two types of revoke block hashing depending on whether we are
> > doing recovery or running the journal.
> >
> > I've tested it and it seems to work fine (including replay of a journal
> > with sufficiently many revoke blocks) but I'm not sure I can do
> > meaningful performance testing (I cannot quite reproduce the slow
> > replay times even when shutting down the filesystem after deleting
> > 1000000 directories). So can you please give it a spin?
>
> Alex posted test results on the other rhashtable revoke thread, which
> show both the rhashtable and Jan's dynamically-allocated hash table
> perform much better than the original fixed-size hash table.

Yes, thanks for the testing!
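To make the proposal quoted above concrete, here is a minimal sketch of the
idea in fs/jbd2/revoke.c style. The helper names
(jbd2_journal_init_recovery_revoke()/jbd2_journal_exit_recovery_revoke()),
the sizing policy, and the table restore at the end are illustrative
assumptions, not the actual patch; journal->j_revoke, j_revoke_table[],
JOURNAL_REVOKE_DEFAULT_HASH and jbd2_journal_clear_revoke() are the existing
jbd2 names it builds on.

/*
 * Sketch only (hypothetical helpers): size the revoke hash table for
 * journal replay based on the revoke record count gathered in PASS_SCAN,
 * then drop it again once recovery is done.  Assumes the revoke table
 * struct layout from fs/jbd2/revoke.c.
 */
#include <linux/jbd2.h>
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/list.h>
#include <linux/log2.h>
#include <linux/minmax.h>

struct jbd2_revoke_table_s {		/* private to fs/jbd2/revoke.c */
	int		  hash_size;	/* number of buckets, power of 2 */
	int		  hash_shift;
	struct list_head *hash_table;
};

/*
 * Called after PASS_SCAN has summed up the revoke records in the log
 * (nr_revokes).  Allocate a hash table sized to that count and make it
 * the active table for the PASS_REVOKE and PASS_REPLAY passes.
 */
static int jbd2_journal_init_recovery_revoke(journal_t *journal,
					     unsigned long nr_revokes)
{
	struct jbd2_revoke_table_s *table;
	int i, hash_size;

	/* Roughly one bucket per record, clamped to something sane. */
	hash_size = clamp_t(int, roundup_pow_of_two(nr_revokes),
			    JOURNAL_REVOKE_DEFAULT_HASH, 1 << 20);

	table = kzalloc(sizeof(*table), GFP_KERNEL);
	if (!table)
		return -ENOMEM;
	table->hash_size = hash_size;
	table->hash_shift = ilog2(hash_size);
	/* kvmalloc so a very large bucket array can fall back to vmalloc */
	table->hash_table = kvmalloc_array(hash_size,
					   sizeof(struct list_head),
					   GFP_KERNEL);
	if (!table->hash_table) {
		kfree(table);
		return -ENOMEM;
	}
	for (i = 0; i < hash_size; i++)
		INIT_LIST_HEAD(&table->hash_table[i]);

	/* Recovery is single-threaded, so just switch the active table. */
	journal->j_revoke = table;
	return 0;
}

/*
 * Once recovery completes (e.g. from jbd2_journal_clear_revoke()), free
 * the replay-sized table and go back to the regular runtime table set up
 * by jbd2_journal_init_revoke().  Simplified: real code would remember
 * which of the two runtime tables was active before recovery.
 */
static void jbd2_journal_exit_recovery_revoke(journal_t *journal)
{
	struct jbd2_revoke_table_s *table = journal->j_revoke;

	kvfree(table->hash_table);
	kfree(table);
	journal->j_revoke = journal->j_revoke_table[1];
}

With something along these lines, the running journal keeps its small
default hash table untouched; only the replay path pays for the larger
allocation, and it is freed again before the filesystem is mounted.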
Patch submitted.

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR