Here it is: the disk accounting rewrite I've been talking about since
forever.

git link:
https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-disk-accounting-rewrite

test dashboard (just rebased, results are regenerating as of this writing,
but there shouldn't be any regressions left):
https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs-disk-accounting-rewrite

The old disk accounting scheme was fast, but had some limitations:

 - lack of scalability: it was based on percpu counters additionally
   sharded by outstanding journal buffer, and then just prior to journal
   write we'd roll up the counters and add them to the journal entry.

   But this meant that all counters were added to every journal write,
   which meant it'd never be able to support per-snapshot counters.

 - it was a pain to extend

   this was why, until now, we didn't have proper compressed accounting,
   and getting the compression ratio required a full btree scan

In the new scheme:

 - every set of counters is a bkey, a key in a btree
   (BTREE_ID_accounting).

   this means they aren't pinned in the journal

 - the key has structure, and is extensible

   disk_accounting_key is a tagged union, and it's just union'd over bpos

 - counters are deltas, until flushed to the underlying btree

   this means counter updates are normal btree updates; the btree write
   buffer makes counter updates efficient.

Since reading counters from the btree would be expensive - it'd require a
write buffer flush to get up-to-date counters - we also maintain a
parallel set of accounting in memory, a bit like the old scheme but
without the per-journal-buffer sharding. The in-memory counters are
indexed in an eytzinger tree by disk_accounting_key/bpos, with the
counters themselves being percpu u64s.

Reviewers: do an "is this adequately documented, can I find my way
around, do things make sense" review, not a line-by-line "does this have
bugs" review.
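For anyone who wants a feel for the shape of the new scheme before
opening the patches, here is a rough userspace sketch. It is not the code
in this series: every demo_* name is made up for illustration, the key
layout is a stand-in for disk_accounting_key/bpos, a sorted array with
binary search stands in for the eytzinger tree, and plain int64_t
counters stand in for percpu u64s. It only shows two ideas: a typed key
overlaid on a position-sized struct, and updates applied as deltas that
accumulate per key.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NR_COUNTERS 3
#define DEMO_MAX_ENTRIES 16

/* Stand-in for bpos: a fixed-size btree position (layout is hypothetical) */
struct demo_pos { uint64_t inode, offset; uint32_t snapshot, pad; };

/* Hypothetical counter types; the tag selects how the key is interpreted */
enum demo_acct_type { DEMO_ACCT_NR_INODES = 1, DEMO_ACCT_COMPRESSION = 2 };

/* Typed view of the key: a tag plus one type-specific argument */
struct demo_acct_key { uint8_t type; uint8_t arg; uint8_t pad[22]; };

/* The typed key is overlaid on the position, so it can index a btree */
union demo_key { struct demo_pos pos; struct demo_acct_key acct; };

_Static_assert(sizeof(struct demo_pos) == sizeof(struct demo_acct_key),
	       "typed key must exactly cover the position");

struct demo_entry { union demo_key key; int64_t v[DEMO_NR_COUNTERS]; };
struct demo_table { struct demo_entry e[DEMO_MAX_ENTRIES]; size_t nr; };

/* Build a fully zeroed key so that bytewise comparison is well defined */
static union demo_key demo_mk_key(uint8_t type, uint8_t arg)
{
	union demo_key k = { .acct = { .type = type, .arg = arg } };
	return k;
}

static int demo_key_cmp(const union demo_key *a, const union demo_key *b)
{
	return memcmp(a, b, sizeof(*a));
}

/* Binary search: index of the first entry whose key is >= k */
static size_t demo_find(const struct demo_table *t, const union demo_key *k)
{
	size_t lo = 0, hi = t->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (demo_key_cmp(&t->e[mid].key, k) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Apply a delta: create the entry on first use, otherwise accumulate */
static void demo_acct_add(struct demo_table *t, union demo_key k,
			  const int64_t *delta)
{
	size_t i = demo_find(t, &k);

	if (i == t->nr || demo_key_cmp(&t->e[i].key, &k)) {
		assert(t->nr < DEMO_MAX_ENTRIES);
		memmove(&t->e[i + 1], &t->e[i],
			(t->nr - i) * sizeof(t->e[0]));
		memset(&t->e[i], 0, sizeof(t->e[0]));
		t->e[i].key = k;
		t->nr++;
	}
	for (unsigned j = 0; j < DEMO_NR_COUNTERS; j++)
		t->e[i].v[j] += delta[j];
}

/* Read one counter; a key that was never updated reads as zero */
static int64_t demo_acct_read(const struct demo_table *t, union demo_key k,
			      unsigned idx)
{
	size_t i = demo_find(t, &k);

	return i < t->nr && !demo_key_cmp(&t->e[i].key, &k)
		? t->e[i].v[idx] : 0;
}
```

A caller would zero-initialize a struct demo_table, call demo_acct_add()
with a key and an array of DEMO_NR_COUNTERS deltas (negative deltas are
fine), and read totals back with demo_acct_read(). Adding a new kind of
counter only means adding a new tag value, which is the extensibility
point made above.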

Compatibility: this is in no way compatible with the old disk accounting
on-disk format, and it's not feasible to write out accounting in the old
format - that means we have to regenerate accounting when upgrading or
downgrading past this version.

That should work more or less seamlessly with the most recent compat bits
(bch_sb_field downgrade, so we can tell older versions what recovery
passes to run and what to fix); additionally, userspace fsck now checks
if the kernel bcachefs version better matches the on-disk version than
itself and, if so, uses the kernel fsck implementation with the
OFFLINE_FSCK ioctl - so we shouldn't be bouncing back and forth between
versions if your tools and kernel don't match.

upgrade/downgrade still need a bit more testing, but transparently using
kernel fsck is well tested as of latest versions.

but: 6.7 users (& possibly 6.8) beware, the sb_downgrade section is in
6.7 but BCH_IOCTL_OFFLINE_FSCK is not, and backporting that doesn't look
likely given the current -stable process fiasco.

merge ETA - this stuff may make the next merge window; I'd like to get
per-snapshot-id accounting done with it, that should be the biggest item
left.

Cheers,
Kent

Kent Overstreet (21):
  bcachefs: KEY_TYPE_accounting
  bcachefs: Accumulate accounting keys in journal replay
  bcachefs: btree write buffer knows how to accumulate bch_accounting keys
  bcachefs: Disk space accounting rewrite
  bcachefs: dev_usage updated by new accounting
  bcachefs: Kill bch2_fs_usage_initialize()
  bcachefs: Convert bch2_ioctl_fs_usage() to new accounting
  bcachefs: kill bch2_fs_usage_read()
  bcachefs: Kill writing old accounting to journal
  bcachefs: Delete journal-buf-sharded old style accounting
  bcachefs: Kill bch2_fs_usage_to_text()
  bcachefs: Kill fs_usage_online
  bcachefs: Kill replicas_journal_res
  bcachefs: Convert gc to new accounting
  bcachefs: Convert bch2_replicas_gc2() to new accounting
  bcachefs: bch2_verify_accounting_clean()
  bcachefs: Eytzinger accumulation for accounting keys
  bcachefs: bch_acct_compression
  bcachefs: Convert bch2_compression_stats_to_text() to new accounting
  bcachefs: bch2_fs_accounting_to_text()
  bcachefs: bch2_fs_usage_base_to_text()

 fs/bcachefs/Makefile                   |   3 +-
 fs/bcachefs/alloc_background.c         | 137 +++--
 fs/bcachefs/alloc_background.h         |   2 +
 fs/bcachefs/bcachefs.h                 |  22 +-
 fs/bcachefs/bcachefs_format.h          |  81 +--
 fs/bcachefs/bcachefs_ioctl.h           |   7 +-
 fs/bcachefs/bkey_methods.c             |   1 +
 fs/bcachefs/btree_gc.c                 | 259 ++++------
 fs/bcachefs/btree_iter.c               |   9 -
 fs/bcachefs/btree_journal_iter.c       |  23 +-
 fs/bcachefs/btree_journal_iter.h       |  15 +
 fs/bcachefs/btree_trans_commit.c       |  71 ++-
 fs/bcachefs/btree_types.h              |   1 -
 fs/bcachefs/btree_update.h             |  22 +-
 fs/bcachefs/btree_write_buffer.c       | 120 ++++-
 fs/bcachefs/btree_write_buffer.h       |  50 +-
 fs/bcachefs/btree_write_buffer_types.h |   2 +
 fs/bcachefs/buckets.c                  | 663 ++++---------------------
 fs/bcachefs/buckets.h                  |  70 +--
 fs/bcachefs/buckets_types.h            |  14 +-
 fs/bcachefs/chardev.c                  |  75 +--
 fs/bcachefs/disk_accounting.c          | 584 ++++++++++++++++++++++
 fs/bcachefs/disk_accounting.h          | 203 ++++++++
 fs/bcachefs/disk_accounting_format.h   | 145 ++++++
 fs/bcachefs/disk_accounting_types.h    |  20 +
 fs/bcachefs/ec.c                       | 166 ++++---
 fs/bcachefs/inode.c                    |  42 +-
 fs/bcachefs/journal_io.c               |  13 +-
 fs/bcachefs/recovery.c                 | 126 +++--
 fs/bcachefs/recovery_types.h           |   1 +
 fs/bcachefs/replicas.c                 | 242 ++-------
 fs/bcachefs/replicas.h                 |  16 +-
 fs/bcachefs/replicas_format.h          |  21 +
 fs/bcachefs/replicas_types.h           |  16 -
 fs/bcachefs/sb-clean.c                 |  62 ---
 fs/bcachefs/sb-downgrade.c             |  12 +-
 fs/bcachefs/sb-errors_types.h          |   4 +-
 fs/bcachefs/super.c                    |  74 ++-
 fs/bcachefs/sysfs.c                    | 109 ++--
 39 files changed, 1873 insertions(+), 1630 deletions(-)
 create mode 100644 fs/bcachefs/disk_accounting.c
 create mode 100644 fs/bcachefs/disk_accounting.h
 create mode 100644 fs/bcachefs/disk_accounting_format.h
 create mode 100644 fs/bcachefs/disk_accounting_types.h
 create mode 100644 fs/bcachefs/replicas_format.h

-- 
2.43.0