First, a short update on development:

- Experimental should be coming off in ~6 months. This will mean stronger backwards compatibility guarantees (no more forced on disk format upgrades) and more backports. The main criteria for taking the experimental label off will be going a cycle without any critical bug reports, and making sure we have all the critical on disk format changes we want (i.e. things that'll be a hassle to fix later).

- Major developments:
  - The next merge window pull request pushes practical scalability limits to ~50 PB. With the recent changes to backpointers fsck, we fsck'd a filesystem with 10 PB of data in an hour and a half.
  - Online self healing is now the default, and more and more codepaths are being converted so that we can fix errors at runtime that would otherwise require fsck to correct. The goal is that users should never have to explicitly run fsck - even in extreme disaster scenarios - and we're pretty far along.
  - Online fsck still has a ways to go, but the locking issues for the main check_allocations pass are nearly sorted, so we're past the hard technical hurdles. This is a high priority item.
  - Scrub will be landing in the next few weeks, or sooner.

- Stabilization: "user is not able to access filesystem" bug reports have essentially stopped; fsck is looking quite robust. Overall stability, going by bug reports and user feedback, is shaping up quickly. We do have reports of outstanding data corruption (nix package builders are providing excellent torture testing), and some severe performance bugs to address; these are the highest priority outstanding items.

  There are good things to report on the performance front: one tidbit is that users are reporting that in situations where btrfs falls over without nocow mode (databases, bittorrent), bcachefs does fine, and "bcachefs doesn't need nocow" is now the common advice to new users.

Now that that's out of the way - my plan this year isn't to talk about code, but rather about development process - and the need to get organized.

First, we need to talk about the historical track record of filesystem development. Historically, most filesystem efforts have failed - and we ought to see what lessons we can learn.

By my count, in the past few decades there have been 3 "next generation" filesystems that advanced the state of the art and actually delivered on their goals:

- XFS
- ZFS
- and most recently, APFS

Pointedly, none of these came from the open source community. While I would give ext3 half credit - it was a pragmatic approach that did deliver on its goals - the ext4 codebase also showcases some of the disadvantages of our "community based" approach. This isn't an open vs. closed source thing; Microsoft also failed with their ntfs replacement; filesystems are hard.

If there's a single overarching diagnosis to be made, I would say the issue is organizational: doing a filesystem right requires funding a team consistently for many years, with the right kind of long term focus. The most successful efforts came from the big Unix vendors, when that was still a thing, and now from Apple, who is known for being able to organize and support engineering teams.

All this is to say that I'd like for us to be able to set some long term priorities in the filesystem space, decide what we need to push for, and figure out how to get it done.
The Linux kernel world is not poorly funded, but efforts don't get funded without a plan, and historically our filesystem development has suffered from a short term "project manager" type focus - a lot of effort being spent on individual, highly niche features for customers with deep pockets, while bread and butter stuff gets neglected.

Here's my list of things we actually do need:

Process, tooling:
-----------------

- Firstly, a filesystem is not just the code itself. It's the tooling, the test infrastructure, the time spent working with users who are digging in and finding out what works and what doesn't: it is _whatever it takes to get the job done_.

  I've often heard talk from engineers who think of tooling as something "other people work on", or corporate types who don't want to work with users because "that's unpaid user support": but we don't get this done without a community, and that includes the _user_ community, and developers who aren't kernel developers. We need to be leveraging all the resources we have, and we need to be bringing the right attitude if we want to deliver the best work we can.

Some specific things that I see still lacking, within the bcachefs world and the filesystem world as a whole:

- Our testing automation still needs to be better. I've built developer focused testing automation, but it still needs work and I could use help.

- We badly need automated performance testing. I still see people at the bigcorps doing performance testing manually, and what automated performance testing there is lives in the basements of certain engineers. This needs to be a standard thing we're all using.

- Code coverage analysis needs to be a standard thing we're all looking at - it should be something we can trivially glance at when we're doing code review. (If anyone wants to help with this one, there's some trivial makefile work that needs to happen and my test infrastructure has the rest implemented.)

- bcachefs still needs real, methodical, automated error injection testing (I know XFS has this, props to them); we won't be able to consider fsck rock solid until this is done.

Technical milestones:
---------------------

bcachefs has achieved nearly all of the critical technical milestones I've laid out - or they're far enough along that we're past the "how well will this work" uncertainty. But here are my criteria for any major next gen filesystem:

- Check/repair:

  Contrary to what certain people have voiced about "filesystems that don't need fsck", fsck will _always_ be a critical component of any filesystem - shit happens, and we need to be able to recover, and check that the system is in a consistent state (else many bugs will go undiscovered). Data loss is flatly unacceptable in any filesystem suitable for real usage: I do not care what happened or how a filesystem was corrupted, if there is data still present it is our job to recover it and get the system back to a working state.

  Additionally, fsck is _the_ scalability limitation as systems continue to grow. Inherently so, as there are many global invariants/references to be checked. As mentioned, bcachefs is now well into the petabyte range, which should be good for a bit - for most users. Long term, we're going to need allocation groups so that we can efficiently shard the main fsck passes; allocation groups will get bcachefs into the exabyte range (rough sketch below).
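  Here's a minimal sketch of what I mean by sharding - hypothetical code, not bcachefs's actual structures or fsck passes - illustrating how splitting the key space into allocation groups lets the expensive checks run per group, with only a small cross-group pass at the end:

    /*
     * Hypothetical sketch - not actual bcachefs code or on disk format - of
     * how allocation groups shard the expensive fsck work: each group owns
     * a range of buckets, per-group invariants are checked independently
     * (and in parallel), and only cross-group references need a final
     * global pass.
     */
    #include <pthread.h>
    #include <stdio.h>

    #define NR_GROUPS 8

    struct alloc_group {
        unsigned      idx;
        unsigned long first_bucket;   /* key range owned by this group */
        unsigned long nr_buckets;
    };

    /* Check invariants that only involve keys within a single group. */
    static void *check_group(void *p)
    {
        struct alloc_group *g = p;

        /*
         * A real fsck pass would walk extents/backpointers whose targets
         * fall inside [first_bucket, first_bucket + nr_buckets) and repair
         * the group's own allocation info; here we just report the shard.
         */
        printf("group %u: checking buckets %lu-%lu\n",
               g->idx, g->first_bucket,
               g->first_bucket + g->nr_buckets - 1);
        return NULL;
    }

    int main(void)
    {
        struct alloc_group groups[NR_GROUPS];
        pthread_t threads[NR_GROUPS];
        unsigned i;

        for (i = 0; i < NR_GROUPS; i++) {
            groups[i] = (struct alloc_group) {
                .idx          = i,
                .first_bucket = i * 1024UL,
                .nr_buckets   = 1024,
            };
            pthread_create(&threads[i], NULL, check_group, &groups[i]);
        }

        for (i = 0; i < NR_GROUPS; i++)
            pthread_join(threads[i], NULL);

        /*
         * Only references crossing group boundaries remain, so the global
         * portion of fsck no longer scales with total filesystem size.
         */
        printf("cross-group reconciliation pass\n");
        return 0;
    }

  (Compile with -pthread; the point is just that the per-group work has no shared state, so it shards - and parallelizes - cleanly.)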
- Self healing, online fsck:

  Having the filesystem be offline for check/repair is also becoming a non-option, so anything that can be repaired at runtime must be - and we need to have mechanisms for getting the system up and running in RW mode and doing fsck online even when the filesystem is corrupt. (bcachefs has this covered, naturally.)

- Scalability:

  Besides just the size of the dataset itself, large systems with large numbers of drives and complex topologies need to be supported. These systems exist, and today the methods of managing those large numbers of drives are lacking; we can do better.

- Versioning, upgrade and downgrade flexibility, and compatibility w.r.t. on disk format:

  A common complaint from users is being stuck on an old on disk format version, without even being aware of it, and being subject to bugs/issues which have long since been fixed. We need a better story for on disk format changes.

  bcachefs also has this one covered; while in the experimental phase we've been making extensive use of our ability to roll out new on disk format versions automatically and seamlessly _and still support downgrades_.

- Real database fundamentals:

  Filesystems are databases, and if we steal as much as possible of the programming model from the database world it becomes drastically easier to roll out features and improvements; our code becomes more flexible and compatibility becomes much easier.

What do people want, and how do we get organized?
-------------------------------------------------

This part will be dependent on participation, naturally. It's all up to us, the engineers :)

I'm hoping to get more community involvement from developers this year. I want to see this thing be as big a success as it deserves to be, and I want users to have something they can depend on - peace of mind. And I want this filesystem to be a place where people can bring their best ideas and see them to fruition. There's still interesting research to be done and interesting problems to be solved.

Let's see what we can make happen...