Re: [RFC 0/3] Btrfs checksum offload

"Martin K. Petersen" <martin.petersen@xxxxxxxxxx> · Thu, 30 Jan 2025 15:21:45 -0500

Hi Kanchan!

> There is value in avoiding Copy-on-write (COW) checksum tree on a
> device that can anyway store checksums inline (as part of PI). This
> would eliminate extra checksum writes/reads, making I/O more
> CPU-efficient. Additionally, usable space would increase, and write
> amplification, both in Btrfs and eventually at the device level, would
> be reduced [*].

I have a couple of observations.

First of all, there is no inherent benefit to PI if it is generated at
the same time as the ECC. The ECC is usually far superior when it comes
to protecting data at rest. And you'll still get an error if uncorrected
corruption is detected. So BLK_INTEGRITY_OFFLOAD_NO_BUF does not offer
any benefits in my book.

The motivation for T10 PI is that it is generated in close temporal
proximity to the data. I.e. ideally the PI protecting the data is
calculated as soon as the data has been created in memory. And then the
I/O will eventually be queued, submitted, traverse the kernel, through
the storage fabric, and out to the end device. The PI and data have
traveled along different paths (potentially, more on that later) to get
there. The device will calculate the ECC and then perform a validation
of the PI wrt. the data buffer. And if those two line up, we know the
ECC is also good. At that point we have confirmed that the data to be
stored matches the data that was used as input when the PI was generated
N seconds ago in host memory. And therefore we can write.
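
A minimal sketch of that flow, assuming a 512-byte sector and only the
16-bit guard tag (CRC16 with the T10-DIF polynomial 0x8BB7). The names
host_generate_guard() and device_verify() are hypothetical and are not
kernel or firmware APIs. The point is that the guard is computed once,
early, and handed to the verifier separately from the data buffer.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512

/* Bitwise CRC16, T10-DIF polynomial 0x8BB7, initial value 0, no reflection. */
uint16_t crc_t10dif(const unsigned char *buf, size_t len)
{
	uint16_t crc = 0;

	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)(buf[i] << 8);
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
					     : (uint16_t)(crc << 1);
	}
	return crc;
}

/* Hypothetical host side: run as soon as the data exists in memory. */
uint16_t host_generate_guard(const unsigned char *data)
{
	return crc_t10dif(data, SECTOR_SIZE);
}

/*
 * Hypothetical device side: run just before committing to media.
 * Recomputes the guard over the data that actually arrived and compares
 * it with the guard generated earlier. Returns 1 on match, 0 otherwise.
 */
int device_verify(const unsigned char *data, uint16_t guard)
{
	return crc_t10dif(data, SECTOR_SIZE) == guard;
}
```

Any corruption of the data buffer between host_generate_guard() and
device_verify() that the CRC can detect makes the write fail instead of
being silently persisted.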

I.e. the goal of PI is to protect against problems that happen between data
creation time and the data being persisted to media. Once the ECC has
been calculated, PI essentially stops being interesting.

The second point I would like to make is that the separation between PI
and data that we introduced with DIX, and which NVMe subsequently
adopted, was a feature. It was not just to avoid the inconvenience of
having to deal with buffers that were multiples of 520 bytes in host
memory. The separation between the data and its associated protection
information had proven critical for data protection in many common
corruption scenarios. Inline protection had been tried and had failed to
catch many of the scenarios we had come across in the field.
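
To make the two layouts concrete, here is a rough sketch, assuming
512-byte sectors and the 8-byte T10 PI tuple (guard, application and
reference tags). The struct names are made up for illustration and do
not match any kernel definition. On the wire the tags are big-endian;
plain integer types are used here only to show the sizes.

```c
#include <stdint.h>

/* 8 bytes of protection information per sector. */
struct pi_tuple {
	uint16_t guard_tag;	/* CRC of the sector data */
	uint16_t app_tag;	/* application-defined */
	uint32_t ref_tag;	/* typically tied to the LBA */
};

/* Inline layout: data and PI interleaved, 520 bytes per sector. */
struct inline_sector {
	unsigned char data[512];
	struct pi_tuple pi;
};

/* Separated layout: data and PI live in different buffers. */
struct separated_io {
	unsigned char *data;	/* nr_sectors * 512 bytes */
	struct pi_tuple *pi;	/* nr_sectors * 8 bytes, separate allocation */
	unsigned int nr_sectors;
};

_Static_assert(sizeof(struct pi_tuple) == 8, "PI tuple is 8 bytes");
_Static_assert(sizeof(struct inline_sector) == 520, "512 + 8 inline");
```

With the separated layout, a stray write or misdirected DMA that
clobbers the data buffer does not also land on the matching PI, which
is what lets the mismatch be detected.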

At the time T10 PI was designed, spinning rust was the only game in town.
And nobody was willing to take the performance hit of having to seek
twice per I/O to store PI separately from the data. And while schemes
involving sending all the PI ahead of the data were entertained, they
never came to fruition. Storing 512+8 in the same sector was a necessity
in the context of SCSI drives, not a desired behavior. Addressing that
in DIX was key.

So to me, it's a highly desirable feature that btrfs stores its
checksums elsewhere on media. But that's obviously a trade-off a user
can make. In some cases media WAR may be more important than extending
the protection envelope for the data, and that's OK. I would suggest you
look at using CRC32C given the intended 4KB block use case, though,
because the 16-bit CRC isn't fantastic for large blocks.
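
For reference, a bitwise CRC32C (Castagnoli) over one 4KB block could
look like the sketch below. This is an illustration only: real code
would use a table-driven or hardware-accelerated implementation, and
BLOCK_SIZE and checksum_block() are names assumed for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

/*
 * Bitwise CRC32C: reflected polynomial 0x82F63B78, initial value
 * 0xFFFFFFFF, final XOR 0xFFFFFFFF.
 */
uint32_t crc32c(const unsigned char *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
	}
	return ~crc;
}

/* Hypothetical helper: one 32-bit checksum per 4KB block. */
uint32_t checksum_block(const unsigned char *block)
{
	return crc32c(block, BLOCK_SIZE);
}
```

A 32-bit guard gives 2^32 possible values per 4KB block, compared with
2^16 for the 16-bit CRC.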

-- 
Martin K. Petersen	Oracle Linux Engineering