Relatively quiet cycle here - which is good, because 6.11 was huge. I think we'll be able to take off EXPERIMENTAL in inside of a year, see below for more. Cheers, Kent The following changes since commit 16005147cca41a0f67b5def2a4656286f8c0db4a: bcachefs: Don't delete open files in online fsck (2024-09-09 09:41:47 -0400) are available in the Git repository at: git://evilpiepirate.org/bcachefs.git tags/bcachefs-2024-09-21 for you to fetch changes up to 025c55a4c7f11ea38521c6e797f3192ad8768c93: bcachefs: return err ptr instead of null in read sb clean (2024-09-21 11:39:49 -0400) ---------------------------------------------------------------- bcachefs changes for 6.12-rc1 rcu_pending, btree key cache rework: this solves lock contenting in the key cache, eliminating the biggest source of the srcu lock hold time warnings, and drastically improving performance on some metadata heavy workloads - on multithreaded creates we're now 3-4x faster than xfs. We're now using an rhashtable instead of the system inode hash table; this is another significant performance improvement on multithreaded metadata workloads, eliminating more lock contention. for_each_btree_key_in_subvolume_upto(): new helper for iterating over keys within a specific subvolume, eliminating a lot of open coded "subvolume_get_snapshot()" and also fixing another source of srcu lock time warnings, by running each loop iteration in its own transaction (as the existing for_each_btree_key() does). More work on btree_trans locking asserts; we now assert that we don't hold btree node locks when trans->locked is false, which is important because we don't use lockdep for tracking individual btree node locks. Some cleanups and improvements in the bset.c btree node lookup code, from Alan. Rework of btree node pinning, which we use in backpointers fsck. The old hacky implementation, where the shrinker just skipped over nodes in the pinned range, was causing OOMs; instead we now use another shrinker with a much higher seeks number for pinned nodes. Rebalance now uses BCH_WRITE_ONLY_SPECIFIED_DEVS; this fixes an issue where rebalance would sometimes fall back to allocating from the full filesystem, which is not what we want when it's trying to move data to a specific target. Use __GFP_ACCOUNT, GFP_RECLAIMABLE for btree node, key cache allocations. Idmap mounts are now supported - Hongbo. Rename whiteouts are now supported - Hongbo. Erasure coding can now handle devices being marked as failed, or forcibly removed. We still need the evacuate path for erasure coding, but it's getting very close to ready for people to start using. Status, and when will we be taking off experimental: ---------------------------------------------------- Going by critical, user facing bugs getting found and fixed, we're nearly there. There are a couple key items that need to be finished before we can take off the experimental label: - The end-user experience is still pretty painful when the root filesystem needs a fsck; we need some form of limited self healing so that necessary repair gets run automatically. Errors (by type) are recorded in the superblock, so what we need to do next is convert remaining inconsistent() errors to fsck() errors (so that all runtime inconsistencies are logged in the superblock), and we need to go through the list of fsck errors and classify them by which fsck passes are needed to repair them. - We need comprehensive torture testing for all our repair paths, to shake out remaining bugs there. Thomas has been working on the tooling for this, so this is coming soonish. Slightly less critical items: - We need to improve the end-user experience for degraded mounts: right now, a degraded root filesystem means dropping to an initramfs shell or somehow inputting mount options manually (we don't want to allow degraded mounts without some form of user input, except on unattended servers) - we need the mount helper to prompt the user to allow mounting degraded, and make sure this works with systemd. - Scalabiity: we have users running 100TB+ filesystems, and that's effectively the limit right now due to fsck times. We have some reworks in the pipeline to address this, we're aiming to make petabyte sized filesystems practical. ---------------------------------------------------------------- Alan Huang (8): bcachefs: Remove unused parameter of bkey_mantissa bcachefs: Remove unused parameter of bkey_mantissa_bits_dropped bcachefs: Remove dead code in __build_ro_aux_tree bcachefs: Convert open-coded extra computation to helper bcachefs: Minimize the search range used to calculate the mantissa bcachefs: Remove the prev array stuff bcachefs: Remove unused parameter bcachefs: Refactor bch2_bset_fix_lookup_table Alyssa Ross (1): bcachefs: Fix negative timespecs Chen Yufan (1): bcachefs: Convert to use jiffies macros Diogo Jahchan Koike (1): bcachefs: return err ptr instead of null in read sb clean Feiko Nanninga (1): bcachefs: Fix sysfs rebalance duration waited formatting Hongbo Li (2): bcachefs: support idmap mounts bcachefs: Fix compilation error for bch2_sb_member_alloc Julian Sun (4): bcachefs: remove the unused macro definition bcachefs: fix macro definition allocate_dropping_locks_errcode bcachefs: fix macro definition allocate_dropping_locks bcachefs: remove the unused parameter in macro bkey_crc_next Kent Overstreet (68): inode: make __iget() a static inline bcachefs: switch to rhashtable for vfs inodes hash bcachefs: Fix deadlock in __wait_on_freeing_inode() lib/generic-radix-tree.c: genradix_ptr_inlined() lib/generic-radix-tree.c: add preallocation bcachefs: rcu_pending bcachefs: rcu_pending now works in userspace bcachefs: Rip out freelists from btree key cache bcachefs: key cache can now allocate from pending bcachefs: data_allowed is now an opts.h option bcachefs: bch2_opt_set_sb() can now set (some) device options bcachefs: Opt_durability can now be set via bch2_opt_set_sb() bcachefs: Add check for btree_path ref overflow bcachefs: Btree path tracepoints bcachefs: kill bch2_btree_iter_peek_and_restart() bcachefs: bchfs_read(): call trans_begin() on every loop iter bcachefs: bch2_fiemap(): call trans_begin() on every loop iter bcachefs: for_each_btree_key_in_subvolume_upto() bcachefs: bch2_readdir() -> for_each_btree_key_in_subvolume_upto bcachefs: bch2_xattr_list() -> for_each_btree_key_in_subvolume_upto bcachefs: bch2_seek_data() -> for_each_btree_key_in_subvolume_upto bcachefs: bch2_seek_hole() -> for_each_btree_key_in_subvolume_upto bcachefs: range_has_data() -> for_each_btree_key_in_subvolume_upto bcachefs: bch2_folio_set() -> for_each_btree_key_in_subvolume_upto bcachefs: quota_reserve_range() -> for_each_btree_key_in_subvolume_upto bcachefs: Move rebalance_status out of sysfs/internal bcachefs: promote_whole_extents is now a normal option bcachefs: trivial open_bucket_add_buckets() cleanup bcachefs: bch2_sb_nr_devices() bcachefs: Drop memalloc_nofs_save() in bch2_btree_node_mem_alloc() bcachefs: bch2_time_stats_reset() bcachefs: Assert that we don't lock nodes when !trans->locked bcachefs: darray: convert to alloc_hooks() bcachefs: Switch gc bucket array to a genradix bcachefs: Add pinned to btree cache not freed counters bcachefs: do_encrypt() now handles allocation failures bcachefs: convert __bch2_encrypt_bio() to darray bcachefs: kill redundant is_vmalloc_addr() bcachefs: fix prototype to bch2_alloc_sectors_start_trans() bcachefs: BCH_WRITE_ALLOC_NOWAIT no longer applies to open bucket allocation bcachefs: rebalance writes use BCH_WRITE_ONLY_SPECIFIED_DEVS bcachefs: Use __GFP_ACCOUNT for reclaimable memory bcachefs: Use mm_account_reclaimed_pages() when freeing btree nodes bcachefs: Options for recovery_passes, recovery_passes_exclude bcachefs: Move tabstop setup to bch2_dev_usage_to_text() bcachefs: bch2_dev_remove_alloc() -> alloc_background.c bcachefs: bch2_sb_member_alloc() bcachefs: improve "no device to read from" message bcachefs: bch2_opts_to_text() bcachefs: Progress indicator for extents_to_backpointers bcachefs: bch2_dev_rcu_noerror() bcachefs: Failed devices no longer require mounting in degraded mode bcachefs: Don't count "skipped access bit" as touched in btree cache scan bcachefs: btree cache counters should be size_t bcachefs: split up btree cache counters for live, freeable bcachefs: Rework btree node pinning bcachefs: EIO errcode cleanup bcachefs: stripe_to_mem() bcachefs: bch_stripe.disk_label bcachefs: ec_stripe_head.nr_created bcachefs: improve bch2_new_stripe_to_text() bcachefs: improve error message on too few devices for ec bcachefs: improve error messages in bch2_ec_read_extent() bcachefs: bch2_trigger_ptr() calculates sectors even when no device bcachefs: bch2_dev_remove_stripes() bcachefs: bch_fs.rw_devs_change_count bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices bcachefs: Don't drop devices with stripe pointers Matthew Wilcox (Oracle) (1): bcachefs: Do not check folio_has_private() Nathan Chancellor (1): bcachefs: Fix format specifier in bch2_btree_key_cache_to_text() Reed Riley (1): bcachefs: Replace div_u64 with div64_u64 where second param is u64 Sasha Finkelstein (1): bcachefs: Hook up RENAME_WHITEOUT in rename. Thorsten Blum (3): bcachefs: Annotate struct bucket_array with __counted_by() bcachefs: Annotate struct bch_xattr with __counted_by() bcachefs: Annotate bch_replicas_entry_{v0,v1} with __counted_by() Xiaxi Shen (1): bcachefs: Fix a spelling error in docs Yang Li (1): bcachefs: Remove duplicated include in backpointers.c Youling Tang (4): bcachefs: allocate inode by using alloc_inode_sb() bcachefs: Mark bch_inode_info as SLAB_ACCOUNT bcachefs: drop unused posix acl handlers bcachefs: Simplify bch2_xattr_emit() implementation Documentation/filesystems/bcachefs/CodingStyle.rst | 2 +- fs/bcachefs/Kconfig | 7 + fs/bcachefs/Makefile | 1 + fs/bcachefs/acl.c | 2 +- fs/bcachefs/alloc_background.c | 45 +- fs/bcachefs/alloc_background.h | 3 +- fs/bcachefs/alloc_foreground.c | 59 +- fs/bcachefs/alloc_foreground.h | 5 +- fs/bcachefs/backpointers.c | 106 +++- fs/bcachefs/backpointers.h | 23 +- fs/bcachefs/bcachefs.h | 14 +- fs/bcachefs/bcachefs_format.h | 2 + fs/bcachefs/bset.c | 182 +++--- fs/bcachefs/bset.h | 4 +- fs/bcachefs/btree_cache.c | 273 ++++++--- fs/bcachefs/btree_cache.h | 3 + fs/bcachefs/btree_gc.c | 21 +- fs/bcachefs/btree_io.c | 8 +- fs/bcachefs/btree_io.h | 4 +- fs/bcachefs/btree_iter.c | 63 +- fs/bcachefs/btree_iter.h | 52 +- fs/bcachefs/btree_key_cache.c | 405 +++---------- fs/bcachefs/btree_key_cache_types.h | 18 +- fs/bcachefs/btree_locking.h | 13 +- fs/bcachefs/btree_trans_commit.c | 2 +- fs/bcachefs/btree_types.h | 60 +- fs/bcachefs/btree_update.c | 12 +- fs/bcachefs/btree_update_interior.c | 37 +- fs/bcachefs/btree_update_interior.h | 2 + fs/bcachefs/buckets.c | 35 +- fs/bcachefs/buckets.h | 15 +- fs/bcachefs/buckets_types.h | 8 - fs/bcachefs/checksum.c | 101 ++-- fs/bcachefs/clock.h | 9 - fs/bcachefs/darray.c | 4 +- fs/bcachefs/darray.h | 26 +- fs/bcachefs/data_update.c | 2 +- fs/bcachefs/dirent.c | 66 +-- fs/bcachefs/ec.c | 303 +++++++--- fs/bcachefs/ec.h | 11 +- fs/bcachefs/ec_format.h | 9 +- fs/bcachefs/ec_types.h | 1 + fs/bcachefs/errcode.h | 14 +- fs/bcachefs/extents.c | 33 +- fs/bcachefs/extents.h | 24 +- fs/bcachefs/fs-common.c | 5 +- fs/bcachefs/fs-io-buffered.c | 41 +- fs/bcachefs/fs-io-direct.c | 2 +- fs/bcachefs/fs-io-pagecache.c | 90 ++- fs/bcachefs/fs-io-pagecache.h | 4 +- fs/bcachefs/fs-io.c | 178 ++---- fs/bcachefs/fs-ioctl.c | 4 +- fs/bcachefs/fs.c | 427 +++++++++----- fs/bcachefs/fs.h | 18 +- fs/bcachefs/inode.c | 2 +- fs/bcachefs/io_read.c | 18 +- fs/bcachefs/io_write.c | 7 +- fs/bcachefs/journal_io.c | 6 +- fs/bcachefs/journal_reclaim.c | 7 +- fs/bcachefs/opts.c | 85 ++- fs/bcachefs/opts.h | 61 +- fs/bcachefs/rcu_pending.c | 650 +++++++++++++++++++++ fs/bcachefs/rcu_pending.h | 27 + fs/bcachefs/rebalance.c | 3 + fs/bcachefs/recovery.c | 22 +- fs/bcachefs/recovery_passes.c | 10 +- fs/bcachefs/replicas.c | 10 +- fs/bcachefs/replicas_format.h | 9 +- fs/bcachefs/sb-clean.c | 2 +- fs/bcachefs/sb-members.c | 57 ++ fs/bcachefs/sb-members.h | 22 +- fs/bcachefs/str_hash.h | 2 +- fs/bcachefs/subvolume.h | 45 ++ fs/bcachefs/subvolume_types.h | 3 +- fs/bcachefs/super-io.c | 12 +- fs/bcachefs/super.c | 85 +-- fs/bcachefs/sysfs.c | 55 +- fs/bcachefs/thread_with_file.c | 2 +- fs/bcachefs/time_stats.c | 14 + fs/bcachefs/time_stats.h | 3 +- fs/bcachefs/trace.h | 465 ++++++++++++++- fs/bcachefs/util.c | 16 +- fs/bcachefs/util.h | 2 +- fs/bcachefs/xattr.c | 81 +-- fs/bcachefs/xattr_format.h | 2 +- fs/inode.c | 8 - include/linux/fs.h | 9 +- include/linux/generic-radix-tree.h | 105 +++- lib/generic-radix-tree.c | 80 +-- 89 files changed, 3155 insertions(+), 1690 deletions(-) create mode 100644 fs/bcachefs/rcu_pending.c create mode 100644 fs/bcachefs/rcu_pending.h