On 29/1/25 14:02, Kanchan Joshi wrote:
>
>
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> SSD for data checksumming.
>
> Now, the longer version for why/how.
>
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.
>
> Btrfs has its own data and metadata checksumming, which is currently
> disconnected from the above.
> It maintains a separate on-device 'checksum tree' for data checksums,
> while the block layer will also be checksumming each Btrfs I/O.
>
> There is value in avoiding Copy-on-write (COW) checksum tree on
> a device that can anyway store checksums inline (as part of PI).
> This would eliminate extra checksum writes/reads, making I/O
> more CPU-efficient.
> Additionally, usable space would increase, and write
> amplification, both in Btrfs and eventually at the device level, would
> be reduced [*].
>
> NVMe drives can also automatically insert and strip the PI/checksum
> and provide a per-I/O control knob (the PRACT bit) for this.
> Block layer currently makes no attempt to know/advertise this offload.
>
> This patch series: (a) adds checksum offload awareness to the
> block layer (patch #1),
> (b) enables the NVMe driver to register and support the offload
> (patch #2), and
> (c) introduces an opt-in (datasum_offload mount option) in Btrfs to
> apply checksum offload for data (patch #3).
>
> [*] Here are some perf/write-amplification numbers from randwrite test [1]
> on 3 configs (same device):
> Config 1: No meta format (4K) + Btrfs (base)
> Config 2: Meta format (4K + 8b) + Btrfs (base)
> Config 3: Meta format (4K + 8b) + Btrfs (datasum_offload)
>
> In config 1 and 2, Btrfs will operate with a checksum tree.
> Only in config 2, block-layer will attach integrity buffer with each I/O and
> do checksum/reftag verification.
> Only in config 3, offload will take place and device will generate/verify
> the checksum.
>
> AppW: writes issued by app, 120G (4 Jobs, each writing 30G)
> FsW: writes issued to device (from iostat)
> ExtraW: extra writes compared to AppW
>
> Direct I/O
> ---------------------------------------------------------
> Config     IOPS(K)     FsW(G)     ExtraW(G)
> 1          144         186        66
> 2          141         181        61
> 3          172         129        9
>
> Buffered I/O
> ---------------------------------------------------------
> Config     IOPS(K)     FsW(G)     ExtraW(G)
> 1          82          255        135
> 2          80          252        132
> 3          100         199        79
>
> Write amplification is generally high (and that's understandable given
> B-trees) but not sure why buffered I/O shows that much.
>
> [1] fio --name=btrfswrite --ioengine=io_uring --directory=/mnt --blocksize=4k --readwrite=randwrite --filesize=30G --numjobs=4 --iodepth=32 --randseed=0 --direct=1 -output=out --group_reporting
>
>
> Kanchan Joshi (3):
>   block: add integrity offload
>   nvme: support integrity offload
>   btrfs: add checksum offload
>
>  block/bio-integrity.c     | 42 ++++++++++++++++++++++++++++++++++++++-
>  block/t10-pi.c            |  7 +++++++
>  drivers/nvme/host/core.c  | 24 ++++++++++++++++++++++
>  drivers/nvme/host/nvme.h  |  1 +
>  fs/btrfs/bio.c            | 12 +++++++++++
>  fs/btrfs/fs.h             |  1 +
>  fs/btrfs/super.c          |  9 +++++++++
>  include/linux/blk_types.h |  3 +++
>  include/linux/blkdev.h    |  7 +++++++
>  9 files changed, 105 insertions(+), 1 deletion(-)
>

There's also checksumming done on the metadata trees, which could be
avoided if we're trusting the block device to do it.

Maybe rather than putting this behind a new compat flag, add a new csum
type of "none"? With the logic being that it also zeroes out the csum
field in the B-tree headers.

Mark
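
P.S. For anyone cross-checking the quoted tables: ExtraW is just FsW minus AppW (120G). Below is a minimal sketch of that arithmetic for the direct I/O rows. The write-amplification ratio (FsW / AppW) is my assumption of the intended definition, and the function and variable names are purely illustrative.

```python
# AppW from the cover letter: 4 jobs, each writing 30G.
APP_WRITES_G = 4 * 30

# FsW (G) per config for the direct I/O runs, as quoted above.
DIRECT_FSW_G = {1: 186, 2: 181, 3: 129}


def extra_writes(fsw_g, appw_g=APP_WRITES_G):
    """Writes reaching the device beyond what the application issued."""
    return fsw_g - appw_g


def write_amplification(fsw_g, appw_g=APP_WRITES_G):
    """Assumed definition: device writes divided by application writes."""
    return fsw_g / appw_g


direct_extra = {cfg: extra_writes(fsw) for cfg, fsw in DIRECT_FSW_G.items()}
direct_waf = {cfg: write_amplification(fsw) for cfg, fsw in DIRECT_FSW_G.items()}

for cfg in sorted(DIRECT_FSW_G):
    print(f"config {cfg}: ExtraW={direct_extra[cfg]}G, WAF={direct_waf[cfg]:.3f}")
```

On these numbers, the offload config brings the extra writes for direct I/O down from 66G (config 1) to 9G (config 3).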