Re: [PATCH] mm: Drop INT_MAX limit from kvmalloc()

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Sun, 20 Oct 2024 17:40:55 -0400

On Sun, Oct 20, 2024 at 02:21:50PM -0700, Linus Torvalds wrote:
> On Sun, 20 Oct 2024 at 13:54, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Sun, 20 Oct 2024 at 13:30, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
> > >
> > > Latency for journal replay?
> >
> > No, latency for the journaling itself.
> 
> Side note: latency of the journal replay can actually be quite
> critical indeed for any "five nines" operation, and big journals are
> not necessarily a good idea for that reason.
> 
> There's a very real reason many places don't use filesystems that do
> fsck any more.

I need to ask one of the guys with a huge filesystem (if you're
listening and have numbers, please chime in), but I don't think journal
replay is bad compared to system boot time.

At this point it would be completely trivial to do journal replay in the
background, after the filesystem is mounted: all we need to do prior to
mount is read the journal and sort+dedup the keys. Replaying all the
updates is the expensive part, but like I mentioned, the btree API
transparently overlays the journal keys until journal replay is
finished (this was necessary for solving various bootstrap issues).
So if someone complains, I'll flip that on and we'll start testing it.
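
To make the idea concrete, here is a toy sketch (Python, not bcachefs
code) of "sort+dedup the keys, then overlay them on lookups until replay
finishes". Everything in it is an assumption made for illustration: the
names sort_dedup and OverlayBtree, the use of a dict as the "btree", and
None as the marker for a deletion.

```python
from bisect import bisect_left

def sort_dedup(journal_entries):
    """Keep only the newest update per key, returned sorted by key.

    journal_entries: iterable of (seq, key, value); a higher seq is newer.
    A value of None stands for a deletion (illustrative convention).
    """
    latest = {}
    for seq, key, value in journal_entries:
        if key not in latest or seq >= latest[key][0]:
            latest[key] = (seq, value)
    return sorted(((k, v) for k, (_, v) in latest.items()),
                  key=lambda kv: kv[0])

class OverlayBtree:
    """Lookups see pending journal keys on top of the stale btree."""

    def __init__(self, btree, journal_keys):
        self.btree = btree                      # dict standing in for the btree
        self.keys = [k for k, _ in journal_keys]
        self.vals = [v for _, v in journal_keys]
        self.replayed = 0                       # keys[:replayed] are applied

    def lookup(self, key):
        # Only keys not yet replayed need to be overlaid.
        i = bisect_left(self.keys, key, lo=self.replayed)
        if i < len(self.keys) and self.keys[i] == key:
            return self.vals[i]                 # None means "deleted"
        return self.btree.get(key)

    def replay_one(self):
        """Apply the next pending key; could run in the background."""
        if self.replayed == len(self.keys):
            return False
        k, v = self.keys[self.replayed], self.vals[self.replayed]
        if v is None:
            self.btree.pop(k, None)
        else:
            self.btree[k] = v
        self.replayed += 1
        return True
```

In this sketch only sort_dedup and the constructor have to run before
"mount"; replay_one can be called at any later point, and lookup returns
the same answers before, during and after replay.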

Fsck is the real concern, yes, and there's lots to be done there. I have
the majority of the work completed for online fsck, but that isn't
enough - because if fsck takes a week to complete and takes up most of
the system's capacity while it's running, that's not acceptable either (and
that would be the case today if you tried bcachefs on a petabyte
filesystem).

So for that, we need to turn as many of fsck's consistency checks and
repairs as possible into things we can do whenever other operations are
touching that metadata (this is mainly what I mean by self healing),
and we need to either reduce our dependency
on passes that go "walk everything and check references", or add ways to
shard them (and only check parts of the filesystem that are suspected to
have damage). Checking extent backpointers is the big offender, and
fortunately that's the easiest one to fix.
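
As a rough illustration of what sharding a "check references" pass could
look like, here is a toy sketch (Python, not bcachefs code) that checks
extent <-> backpointer consistency for one bucket range at a time. The
data model is entirely hypothetical: extents maps an extent id to the set
of buckets it points into, and backpointers maps a bucket to the set of
extent ids that claim it.

```python
def check_backpointers(extents, backpointers, bucket_range):
    """Check one shard: only buckets in [start, end) are examined.

    Returns a sorted list of (kind, bucket, extent_id) inconsistencies.
    """
    start, end = bucket_range
    errors = []

    # Forward direction: every extent pointer into this shard must have
    # a matching backpointer.
    for ext, buckets in extents.items():
        for b in buckets:
            if start <= b < end and ext not in backpointers.get(b, set()):
                errors.append(("missing_backpointer", b, ext))

    # Backward direction: every backpointer in this shard must refer to
    # an extent that really points at that bucket.
    for b, exts in backpointers.items():
        if start <= b < end:
            for ext in exts:
                if b not in extents.get(ext, set()):
                    errors.append(("dangling_backpointer", b, ext))

    return sorted(errors)
```

Calling this with only the bucket ranges suspected of damage limits what
gets checked and reported. Note that the forward loop in this toy still
scans every extent and merely filters by range, so it shows the shape of
the idea rather than the cost savings a real implementation would need.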