Re: [RFC 0/3] Btrfs checksum offload

"Martin K. Petersen" <martin.petersen@xxxxxxxxxx> · Mon, 03 Feb 2025 14:31:13 -0500

Christoph,

> Except for SSDs it generally doesn't - the fact that they are written
> at the same time means there is a very high chance they will end up
> on media together for traditional SSDs designs.
>
> This might be different when explicitly using some form of data
> placement scheme, and SSD vendors might be able to place PI/metadata
> different under the hood when using a big enough customer aks for it
> (they might not be very happy about the request :)).

There was a multi-vendor effort many years ago (first gen SSD era) to
make vendors guarantee that metadata and data would be written to
different channels. But performance got in the way, obviously.

> One thing that I did implement for my XFS hack/prototype is the ability
> to store a crc32c in the non-PI metadata support by nvme.  This allows
> for low overhead data checksumming as you don't need a separate data
> structure to track where the checksums for a data block are located and
> doesn't require out of place writes.  It doesn't provide a reg tag
> equivalent or device side checking of the guard tag unfortunately.

That sounds fine. Again, I don't have a problem with having the ability
to choose whether checksum placement or WAF is more important for a
given application.

> I never could come up with a good use of the app_tag for file systems,
> so not wasting space for that is actually a good thing.

I wish we could just do 4 bytes of CRC32C + 4 bytes of ref tag. I think
that would be a reasonable compromise between space and utility. But we
can't do that because of the app tag escape. We're essentially wasting 2
bytes per block to store a single bit flag.

In general I think 4096+16 is a reasonable format going forward. With
either CRC32C or CRC64 plus full LBA as ref tag.

-- 
Martin K. Petersen	Oracle Linux Engineering