Hi Kanchan!

> There is value in avoiding Copy-on-write (COW) checksum tree on a
> device that can anyway store checksums inline (as part of PI). This
> would eliminate extra checksum writes/reads, making I/O more
> CPU-efficient. Additionally, usable space would increase, and write
> amplification, both in Btrfs and eventually at the device level, would
> be reduced [*].

I have a couple of observations.

First of all, there is no inherent benefit to PI if it is generated at
the same time as the ECC. The ECC is usually far superior when it comes
to protecting data at rest. And you'll still get an error if
uncorrected corruption is detected. So BLK_INTEGRITY_OFFLOAD_NO_BUF does
not offer any benefits in my book.

The motivation for T10 PI is that it is generated in close temporal
proximity to the data. I.e. ideally the PI protecting the data is
calculated as soon as the data has been created in memory. And then the
I/O will eventually be queued, submitted, traverse the kernel, through
the storage fabric, and out to the end device. The PI and data have
traveled along different paths (potentially, more on that later) to get
there. The device will calculate the ECC and then perform a validation
of the PI wrt. the data buffer. And if those two line up, we know the
ECC is also good. At that point we have confirmed that the data to be
stored matches the data that was used as input when the PI was
generated N seconds ago in host memory. And therefore we can write.

I.e. the goal of PI is to protect against problems that happen between
data creation time and the data being persisted to media. Once the ECC
has been calculated, PI essentially stops being interesting.

The second point I would like to make is that the separation between PI
and data that we introduced with DIX, and which NVMe subsequently
adopted, was a feature. It was not just to avoid the inconvenience of
having to deal with buffers that were multiples of 520 bytes in host
memory.
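
For illustration, here is a minimal Python sketch of the generate-early,
verify-late flow described above. It assumes Type 1 protection with a
512-byte sector and an 8-byte PI tuple (16-bit CRC guard tag,
16-bit application tag, 32-bit reference tag holding the low bits of the
LBA), and it keeps the tuple in a separate buffer the way DIX does
rather than interleaving it into 520-byte units. The helper names
crc_t10dif, generate_pi and verify_pi are hypothetical and are not
kernel or device APIs.

```python
import struct

T10DIF_POLY = 0x8BB7  # generator polynomial of the 16-bit T10 PI guard CRC


def crc_t10dif(data: bytes) -> int:
    """Bitwise CRC-16/T10-DIF: initial value 0, no reflection, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def generate_pi(data: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Host side: build the 8-byte tuple as soon as the data exists.

    Layout (big-endian): guard tag, application tag, reference tag.
    app_tag=0 is an arbitrary choice made for this sketch.
    """
    return struct.pack(">HHI", crc_t10dif(data), app_tag, lba & 0xFFFFFFFF)


def verify_pi(data: bytes, pi: bytes, lba: int) -> bool:
    """Device side: recompute the guard from the data that actually arrived
    and compare it, plus the reference tag, with the separately carried PI."""
    guard, _app_tag, ref_tag = struct.unpack(">HHI", pi)
    return guard == crc_t10dif(data) and ref_tag == (lba & 0xFFFFFFFF)


# Usage: PI is generated at data creation time, then checked after the
# data has (notionally) travelled through the stack.
sector = bytes(range(256)) * 2          # 512 bytes of sample data
lba = 0x1234
pi = generate_pi(sector, lba)

damaged = bytearray(sector)
damaged[100] ^= 0x01                    # single bit flipped in flight

print(verify_pi(sector, pi, lba))           # intact data
print(verify_pi(bytes(damaged), pi, lba))   # corrupted data
```

The check only says something useful because the tuple was computed
before the data started its journey; computing it at the same point as
the ECC would verify nothing that the ECC does not already cover.
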
The separation between the data and its associated protection
information had proven critical for data protection in many common
corruption scenarios. Inline protection had been tried and had failed
to catch many of the scenarios we had come across in the field.

At the time T10 PI was designed, spinning rust was the only game in
town. And nobody was willing to take the performance hit of having to
seek twice per I/O to store PI separately from the data. And while
schemes involving sending all the PI ahead of the data were
entertained, they never came to fruition. Storing 512+8 in the same
sector was a necessity in the context of SCSI drives, not a desired
behavior. Addressing that in DIX was key.

So to me, it's a highly desirable feature that btrfs stores its
checksums elsewhere on media. But that's obviously a trade-off a user
can make. In some cases media WAR may be more important than extending
the protection envelope for the data; that's OK. I would suggest you
look at using CRC32C given the intended 4KB block use case, though,
because the 16-bit CRC isn't fantastic for large blocks.

-- 
Martin K. Petersen	Oracle Linux Engineering