TL;DR first: this makes Btrfs chuck its checksum tree and leverage the
NVMe SSD for data checksumming. Now, the longer version of why/how.

End-to-end data protection (E2EDP)-capable drives require the transfer
of integrity metadata (PI). This is currently handled by the block
layer, without filesystem involvement/awareness: the block layer
attaches the metadata buffer, generates the checksum (and reftag) for
write I/O, and verifies it during read I/O.

Btrfs has its own data and metadata checksumming, which is currently
disconnected from the above. It maintains a separate on-device
'checksum tree' for data checksums, while the block layer also
checksums each Btrfs I/O.

There is value in avoiding a copy-on-write (COW) checksum tree on a
device that can store checksums inline anyway (as part of PI). This
would eliminate the extra checksum writes/reads, making I/O more
CPU-efficient. Additionally, usable space would increase, and write
amplification, both in Btrfs and eventually at the device level, would
be reduced [*].

NVMe drives can also insert and strip the PI/checksum automatically,
and they provide a per-I/O control knob for this (the PRACT bit). The
block layer currently makes no attempt to know about or advertise this
offload.

This patch series:
(a) adds checksum offload awareness to the block layer (patch #1),
(b) enables the NVMe driver to register and support the offload
    (patch #2), and
(c) introduces an opt-in (the datasum_offload mount option) in Btrfs
    to apply checksum offload for data (patch #3).

Rough sketches of each piece follow the perf numbers below.

[*] Here are some perf/write-amplification numbers from a randwrite
test [1] on three configs (same device):

Config 1: no meta format (4K) + Btrfs (base)
Config 2: meta format (4K + 8b) + Btrfs (base)
Config 3: meta format (4K + 8b) + Btrfs (datasum_offload)

In configs 1 and 2, Btrfs operates with a checksum tree. Only in
config 2 does the block layer attach an integrity buffer to each I/O
and perform checksum/reftag verification. Only in config 3 does the
offload take place, with the device generating/verifying the checksum.

AppW:   writes issued by the app, 120G (4 jobs, each writing 30G)
FsW:    writes issued to the device (from iostat)
ExtraW: extra writes compared to AppW

Direct I/O
---------------------------------------------------------
Config    IOPS(K)    FsW(G)    ExtraW(G)
1         144        186       66
2         141        181       61
3         172        129       9

Buffered I/O
---------------------------------------------------------
Config    IOPS(K)    FsW(G)    ExtraW(G)
1         82         255       135
2         80         181       132
3         100        199       79

Write amplification is generally high (understandable, given B-trees),
but I am not sure why buffered I/O shows that much.

[1] fio --name=btrfswrite --ioengine=io_uring --directory=/mnt \
    --blocksize=4k --readwrite=randwrite --filesize=30G --numjobs=4 \
    --iodepth=32 --randseed=0 --direct=1 --output=out --group_reporting
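To make the intent concrete, below is a minimal sketch of the patch #1
idea: a queue-level capability that a filesystem can query before it
relies on the device for checksums. struct blk_integrity and
bdev_get_integrity() are existing block-layer interfaces; the
BLK_INTEGRITY_OFFLOAD bit and the helper name are made up here for
illustration and need not match the actual patch.

    #include <linux/blkdev.h>

    /* Made-up capability bit, standing in for whatever patch #1 defines */
    #define BLK_INTEGRITY_OFFLOAD	(1 << 7)

    /*
     * Sketch only: report whether the underlying device can generate and
     * verify PI by itself, so the filesystem may skip its own checksums.
     */
    static bool bdev_csum_offload_capable(struct block_device *bdev)
    {
    	struct blk_integrity *bi = bdev_get_integrity(bdev);

    	return bi && (bi->flags & BLK_INTEGRITY_OFFLOAD);
    }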
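On the NVMe side (patch #2), the offload maps to the PRACT bit in the
command's protection information control field. NVME_RW_PRINFO_PRACT
and struct nvme_command are existing definitions from
include/linux/nvme.h; how the per-I/O offload decision reaches the
driver is exactly what the patch wires up, so it is reduced to a bare
bool in this sketch.

    #include <linux/nvme.h>

    /*
     * Sketch only: per-I/O PRACT control. With PRACT set and no separate
     * metadata buffer attached, the controller inserts PI on writes and
     * verifies/strips it on reads; the host never carries the checksum.
     */
    static void nvme_set_pract(struct nvme_command *cmnd, bool offload)
    {
    	u16 control = le16_to_cpu(cmnd->rw.control);

    	if (offload)
    		control |= NVME_RW_PRINFO_PRACT;

    	cmnd->rw.control = cpu_to_le16(control);
    }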
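And on the Btrfs side (patch #3), the opt-in boils down to bypassing
the checksum-tree path for data when the mount option is set. The
datasum_offload option name is the one this series adds;
DATASUM_OFFLOAD as a btrfs_test_opt() flag name is assumed here for
illustration.

    /* fs/btrfs/ context; DATASUM_OFFLOAD is an assumed flag name */
    static bool btrfs_data_csum_in_tree(struct btrfs_fs_info *fs_info)
    {
    	/*
    	 * Mounted with -o datasum_offload: leave data checksums to the
    	 * device (PI/PRACT); no csum-tree items are written or looked up.
    	 */
    	if (btrfs_test_opt(fs_info, DATASUM_OFFLOAD))
    		return false;

    	/* Default: data checksums live in the COW checksum tree */
    	return true;
    }

Since it is a mount option, enabling the offload is just
'mount -o datasum_offload /dev/nvme0n1 /mnt' on a suitably formatted
namespace; nothing changes for filesystems that do not opt in.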
Kanchan Joshi (3):
  block: add integrity offload
  nvme: support integrity offload
  btrfs: add checksum offload

 block/bio-integrity.c     | 42 ++++++++++++++++++++++++++++++++++++++-
 block/t10-pi.c            |  7 +++++++
 drivers/nvme/host/core.c  | 24 ++++++++++++++++++++++
 drivers/nvme/host/nvme.h  |  1 +
 fs/btrfs/bio.c            | 12 +++++++++++
 fs/btrfs/fs.h             |  1 +
 fs/btrfs/super.c          |  9 +++++++++
 include/linux/blk_types.h |  3 +++
 include/linux/blkdev.h    |  7 +++++++
 9 files changed, 105 insertions(+), 1 deletion(-)

-- 
2.25.1