Re: [PATCH v1 00/16] ext4: Add metadata checksumming

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Mon, 5 Sep 2011 11:45:24 -0700

On Sun, Sep 04, 2011 at 07:41:03AM -0400, Martin K. Petersen wrote:
> >>>>> "Darrick" == Darrick J Wong <djwong@xxxxxxxxxx> writes:
> 
> Darrick,
> 
> Darrick> Furthermore, the nice thing about the in-filesystem checksum is
> Darrick> that we bake in other things like the FS UUID and the inode
> Darrick> number, which gives you a somewhat better assurance that the
> Darrick> data block belongs to the fs and the file that the code thinks
> Darrick> it belongs to.
> 
> Yeah, I view DIF/DIX mostly as in-flight protection for writes. Whereas
> FS metadata checksumming is great for problem detection at read time.
> 
> Another problem with using the DIF app tag to store filesystem metadata
> is that many array vendors use it internally and thus only disk drives
> are likely to provide the app tag space.
> 
> 
> Darrick> The DIX interface allows for a 32-bit block number and a 16-bit
> Darrick> application tag ... which is unfortunately small given 64-bit
> Darrick> block numbers and 32-bit inode numbers.
> 
> I never understood the 32-bit ref tag. Seems silly to have a check that
> wraps at the exact boundary where problems are most likely to occur.
> 
> I advocated for a DIF Type with 16-bit guard tag and 48-bit ref tag but
> that never went anywhere. Too bad - would have been easy for the storage
> vendors to implement.
> 
> 
> Darrick> As a side note, the crc-t10dif implementation is quite slow --
> Darrick> the hardware accelerated crc32c is 15x faster, and the sw
> Darrick> implementation is usually 3-6x faster.  I suspect somebody will
> Darrick> want to fix that before DIF becomes more widespread...
> 
> The CRC32C op on Nehalem and beyond is really, really fast. It's
> essentially free except for pulling the data through the cache. So it's
> not entirely fair to use that as baseline for a pure software
> implementation. Which faster sw implementation are you referring to,
> btw?

I have some benchmarking data for various crc algorithms here:
https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums#Benchmarking

The "faster sw implementation" that I was talking about is the slice-by-8
implementation that I sent to the crypto list a few days ago; it is based on
Bob Pearson's slice-by-8 crc32 patch.

In the huge table, "crc32c-by8-le" is crc32c slice-by-8.

> lib/crc-t10dif is a regular 256-entry table-based CRC implementation. It
> is done pretty much like all our other software CRCs. I seem to recall
> attempting a bigger table but that yielded worse real life results due
> to cache pollution.

Yes, the only downside to the slice-by-8 method is that it eats 8K of data
cache for the table.  Not a huge issue on recent Intel and POWER where the L1D
is 32K, but I imagine it could be painful elsewhere.
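
As a rough illustration of where that 8K goes, here is a minimal userspace
sketch of slice-by-8 for crc32c.  It is not the patch posted to the crypto
list: the names crc32c_tab, crc32c_init and crc32c_by8 are made up for this
example, and the seed/inversion convention is the common standalone one
(invert on entry and exit), which may differ from the kernel's.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32C_POLY_LE 0x82F63B78u	/* reflected Castagnoli polynomial */

/* Hypothetical names. 8 tables * 256 entries * 4 bytes = 8 KiB. */
static uint32_t crc32c_tab[8][256];

static void crc32c_init(void)
{
	for (uint32_t i = 0; i < 256; i++) {
		uint32_t crc = i;

		/* Table 0 is the ordinary byte-at-a-time table. */
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? CRC32C_POLY_LE : 0);
		crc32c_tab[0][i] = crc;
	}
	/* Table t is table t-1 advanced over one more zero byte. */
	for (uint32_t i = 0; i < 256; i++)
		for (int t = 1; t < 8; t++) {
			uint32_t prev = crc32c_tab[t - 1][i];

			crc32c_tab[t][i] = (prev >> 8) ^
					   crc32c_tab[0][prev & 0xff];
		}
}

static uint32_t crc32c_by8(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	crc = ~crc;
	while (len >= 8) {
		/* Bytes are assembled by hand, so host endianness is moot. */
		uint32_t lo = crc ^ ((uint32_t)p[0] | (uint32_t)p[1] << 8 |
				     (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24);
		uint32_t hi = (uint32_t)p[4] | (uint32_t)p[5] << 8 |
			      (uint32_t)p[6] << 16 | (uint32_t)p[7] << 24;

		/* Eight independent lookups, one per input byte. */
		crc = crc32c_tab[7][lo & 0xff] ^
		      crc32c_tab[6][(lo >> 8) & 0xff] ^
		      crc32c_tab[5][(lo >> 16) & 0xff] ^
		      crc32c_tab[4][lo >> 24] ^
		      crc32c_tab[3][hi & 0xff] ^
		      crc32c_tab[2][(hi >> 8) & 0xff] ^
		      crc32c_tab[1][(hi >> 16) & 0xff] ^
		      crc32c_tab[0][hi >> 24];
		p += 8;
		len -= 8;
	}
	/* Tail: fall back to one byte per step using table 0 only. */
	while (len--)
		crc = (crc >> 8) ^ crc32c_tab[0][(crc ^ *p++) & 0xff];
	return ~crc;
}
```

Call crc32c_init() once, then crc32c_by8(0, buf, len); passing a previous
result back in as the first argument continues the checksum over the next
chunk.  The speedup comes from the eight lookups per iteration having no
dependency on each other, at the price of touching all eight tables.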

Do you know of any faster crc16 algorithms?  I guess it wouldn't be hard to
make a family of crcs, each with different cache/speed characteristics.
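
For reference, the small-cache end of such a family is the plain 256-entry
table approach.  Below is a minimal userspace sketch for the T10 DIF CRC16
(polynomial 0x8BB7, MSB-first, zero seed).  It is written from the published
CRC parameters, not copied from lib/crc-t10dif, and the names t10dif_tab,
t10dif_init and t10dif_crc are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define T10DIF_POLY 0x8BB7u

/* Hypothetical names. 256 entries * 2 bytes = 512 bytes of table. */
static uint16_t t10dif_tab[256];

static void t10dif_init(void)
{
	for (unsigned int i = 0; i < 256; i++) {
		uint16_t crc = (uint16_t)(i << 8);

		/* MSB-first: shift left, fold in the polynomial on carry. */
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ?
			      (uint16_t)((crc << 1) ^ T10DIF_POLY) :
			      (uint16_t)(crc << 1);
		t10dif_tab[i] = crc;
	}
}

static uint16_t t10dif_crc(uint16_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	/* One dependent table lookup per input byte. */
	while (len--)
		crc = (uint16_t)((crc << 8) ^ t10dif_tab[(crc >> 8) ^ *p++]);
	return crc;
}
```

Each step depends on the previous one, which is why this form is slow
compared with a sliced variant; a slice-by-N crc16 would trade N * 512 bytes
of table for N lookups per iteration.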

> On Westmere and beyond it is possible to accelerate generic CRC
> calculation using the PCLMULQDQ operation. There are many of our CRC
> functions that could benefit from this. However, so far Intel have not
> been willing to contribute the relevant code to Linux.
> 
> 
> Darrick> The good news is that if you're really worried about integrity,
> Darrick> metadata_csum and DIF/DIX aren't mutually exclusive features.
> Darrick> Rejecting corrupted write commands at write time seems like a
> Darrick> useful feature. :)
> 
> Yup!
> 
> -- 
> Martin K. Petersen	Oracle Linux Engineering
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html