On Sun, Oct 20, 2024 at 02:21:50PM -0700, Linus Torvalds wrote:
> On Sun, 20 Oct 2024 at 13:54, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Sun, 20 Oct 2024 at 13:30, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
> > >
> > > Latency for journal replay?
> >
> > No, latency for the journaling itself.
>
> Side note: latency of the journal replay can actually be quite
> critical indeed for any "five nines" operation, and big journals are
> not necessarily a good idea for that reason.
>
> There's a very real reason many places don't use filesystems that do
> fsck any more.

I need to ask one of the guys with a huge filesystem (if you're
listening and have numbers, please chime in), but I don't think journal
replay is bad compared to system boot time.

At this point it would be completely trivial to do journal replay in
the background, after the filesystem is mounted: all we need to do prior
to mount is read the journal and sort+dedup the keys, replaying all the
updates is the expensive part - but like I mentioned the btree API
transparently overlays the journal keys until journal replay is
finished, and this was necessary for solving various bootstrap issues.

So if someone complains, I'll flip that on and we'll start testing it.

Fsck is the real concern, yes, and there's lots to be done there. I have
the majority of the work completed for online fsck, but that isn't
enough - because if fsck takes a week to complete and it takes most of
system capacity while it's running, that's not acceptable either (and
that would be the case today if you tried bcachefs on a petabyte
filesystem).

So for that, we need to turn as many of the consistency checks and
repairs that fsck does into things we can do whenever other operations
are touching that metadata (this is mainly what I mean by self healing),
and we need to either reduce our dependency on passes that go "walk
everything and check references", or add ways to shard them (and only
check the parts of the filesystem that are suspected to have damage).
Checking extent backpointers is the big offender, and fortunately that's
the easiest one to fix.
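
To make the background-replay idea above concrete, here is a toy model of "sort+dedup before mount, overlay until replay finishes". It is only a sketch, not bcachefs code: `ToyFs`, `sort_dedup`, `lookup` and `replay_step` are hypothetical names, a dict stands in for the on-disk btree, journal entries are assumed to be `(pos, seq, value)` tuples where a higher `seq` is newer, and deletions are not modeled.

```python
from bisect import bisect_left


def sort_dedup(journal_entries):
    """Return journal keys sorted by position, keeping the newest per position.

    Each entry is a (pos, seq, value) tuple; a higher seq is a newer update.
    """
    newest = {}
    for pos, seq, value in journal_entries:
        if pos not in newest or seq > newest[pos][0]:
            newest[pos] = (seq, value)
    return [(pos, newest[pos][1]) for pos in sorted(newest)]


class ToyFs:
    """Toy filesystem: a dict plays the role of the on-disk btree."""

    def __init__(self, btree, journal_entries):
        self.btree = dict(btree)
        # The only work done before "mount": read, sort and dedup the journal.
        self.journal_keys = sort_dedup(journal_entries)
        self.positions = [pos for pos, _ in self.journal_keys]
        self.replayed = 0  # how many journal keys have been applied so far

    def lookup(self, pos):
        """Read a key; journal keys not yet replayed override the btree."""
        i = bisect_left(self.positions, pos)
        found = i < len(self.positions) and self.positions[i] == pos
        if found and i >= self.replayed:
            return self.journal_keys[i][1]
        return self.btree.get(pos)

    def replay_step(self, count=1):
        """Apply the next `count` journal keys to the btree (the slow part)."""
        end = min(self.replayed + count, len(self.journal_keys))
        for pos, value in self.journal_keys[self.replayed:end]:
            self.btree[pos] = value
        self.replayed = end

    def replay_done(self):
        return self.replayed == len(self.journal_keys)
```

In this model `lookup` returns the same answer whether zero, some, or all journal keys have been replayed, which is the property that would let replay run after mount instead of before it.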