Re: [RFC 0/3] Btrfs checksum offload

Johannes Thumshirn <Johannes.Thumshirn@xxxxxxx> · Wed, 29 Jan 2025 14:55:40 +0000

On 29.01.25 15:13, Kanchan Joshi wrote:
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> SSD for data checksumming.
> 
> Now, the longer version for why/how.
> 
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.
> 
> Btrfs has its own data and metadata checksumming, which is currently
> disconnected from the above.
> It maintains a separate on-device 'checksum tree' for data checksums,
> while the block layer will also be checksumming each Btrfs I/O.
> 
> There is value in avoiding Copy-on-write (COW) checksum tree on
> a device that can anyway store checksums inline (as part of PI).
> This would eliminate extra checksum writes/reads, making I/O
> more CPU-efficient.
> Additionally, usable space would increase, and write
> amplification, both in Btrfs and eventually at the device level, would
> be reduced [*].
> 
> NVMe drives can also automatically insert and strip the PI/checksum
> and provide a per-I/O control knob (the PRACT bit) for this.
> Block layer currently makes no attempt to know/advertise this offload.
> 
> This patch series: (a) adds checksum offload awareness to the
> block layer (patch #1),
> (b) enables the NVMe driver to register and support the offload
> (patch #2), and
> (c) introduces an opt-in (datasum_offload mount option) in Btrfs to
> apply checksum offload for data (patch #3).

Hi Kanchan,

This is an interesting approach to offloading the checksum work. I've
only had a quick glance over it with a bird's-eye view, and one thing
I noticed is the missing connection of error reporting between the layers.

For instance, if we get a checksum error on btrfs, we not only report it
in dmesg but also try to repair the affected sector if we have a
data profile with redundancy.

So while this patchset offloads the submission side work of the checksum 
tree to the PI code, I don't see the back-propagation of the errors into 
btrfs and the triggering of the repair code.
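
To make the missing piece concrete, here is a rough user-space sketch of
the read-side flow I mean. It is not kernel code and not taken from the
patchset: struct mirror, read_mirror(), repair_mirror() and
read_with_repair() are made-up names for illustration, and
IO_PROTECTION_ERR only loosely stands in for an integrity failure such as
the block layer's BLK_STS_PROTECTION. The assumption is a mirrored data
profile where the device's PI check fails on one copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical completion status; IO_PROTECTION_ERR models a PI mismatch. */
enum io_status { IO_OK, IO_PROTECTION_ERR };

/* One copy of a sector on a redundant profile (illustrative only). */
struct mirror {
	bool pi_ok;	/* would the device-side PI check pass for this copy? */
	int data;	/* stand-in for the sector payload */
};

/* Simulated read: the device verifies PI and fails the I/O on mismatch. */
static enum io_status read_mirror(const struct mirror *m, int *out)
{
	if (!m->pi_ok)
		return IO_PROTECTION_ERR;
	*out = m->data;
	return IO_OK;
}

/* Simulated repair: rewrite the bad copy from known-good data. */
static void repair_mirror(struct mirror *m, int good)
{
	m->data = good;
	m->pi_ok = true;
}

/*
 * Try each copy in turn. An integrity error has to reach this level,
 * otherwise the filesystem can neither report it nor fall back to the
 * other copy and repair the failed one.
 * Returns 0 and fills *out on success, -1 if every copy failed.
 */
int read_with_repair(struct mirror *mirrors, int nr_mirrors,
		     int *out, int *nr_repaired)
{
	assert(nr_mirrors > 0);
	*nr_repaired = 0;

	for (int i = 0; i < nr_mirrors; i++) {
		if (read_mirror(&mirrors[i], out) != IO_OK) {
			fprintf(stderr, "checksum error on mirror %d\n", i);
			continue;
		}
		/* Every copy before index i failed, so rewrite them all. */
		for (int j = 0; j < i; j++) {
			repair_mirror(&mirrors[j], *out);
			(*nr_repaired)++;
		}
		return 0;
	}
	return -1;
}
```

With two copies where the first one fails the PI check, the read should
succeed from the second copy and the first should be rewritten. With a
single failing copy, the error has to be returned to the caller.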

I get it's an RFC, but as it is now it essentially breaks functionality
we rely on. Can you add this part as well, so we can evaluate the
patchset not only from the write side but also from the read side?

Byte,
	Johannes