Re: {WHAT?} read checksum verification

pg@xxxxxxxxxxxxxxxxxxxxx (Peter Grandi) · Wed, 12 Jul 2023 23:29:09 +0100

> I used NILFS over ISCSI. I had random block corruption during
> one week, silently destroying data until NILFS finally
> crashed. First of all, I thought about a NILFS bug, so I
> created a BTRFS volume

I use both, for main filesystem and for backup, for "diversity",
and I value NILFS2 because it is very robust (I don't really use
either filesystem's snapshotting features).

> and restored the backup from one week earlier to it. After
> minutes, the BTRFS volume gave checksum errors, so the
> culprit was found, the ISCSI server.

There used to be a good argument that checksumming (or
compressing) data should be done end-to-end, and that doing it
in the filesystem is a bit too much; in any case, when LOGFS and
NILFS/NILFS2 were designed I guess CPUs were too slow to
checksum everything. Even so, excellent recent filesystems like
F2FS still don't do data integrity checking, for various reasons.
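
As a rough illustration of the end-to-end approach, here is a
minimal Python sketch in which the application itself records a
digest when writing and checks it when reading, so the check
covers the whole path (filesystem, iSCSI, disks). The function
names and the ".sha256" sidecar file are hypothetical choices
made for this example only.

```python
import hashlib


def write_with_digest(path, data):
    """Write data plus a sidecar file holding its SHA-256 hex digest."""
    with open(path, "wb") as f:
        f.write(data)
    # Hypothetical convention: digest stored next to the data file.
    with open(path + ".sha256", "w") as f:
        f.write(hashlib.sha256(data).hexdigest())


def read_verified(path):
    """Read data back; raise IOError if it no longer matches the digest."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    if hashlib.sha256(data).hexdigest() != expected:
        raise IOError(f"checksum mismatch for {path}")
    return data
```

The digest is computed before the data leaves the application,
which is what distinguishes this from a checksum added lower
down the stack.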

In theory your iSCSI server or its host adapter should have told
you about errors... Many can enable after-write verification
(even if it's quite expensive). Alternatively, some people
regularly run daemons that detect silent corruption, in case
their hardware does not report corruption or it escapes the
relevant checks for various reasons:

https://indico.desy.de/event/257/contributions/58082/attachments/37574/46878/kelemen-2007-HEPiX-Silent_Corruptions.pdf
https://storagemojo.com/2007/09/19/cerns-data-corruption-research/
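
The idea behind such a checker can be sketched in a few lines of
Python: write a file with a known pattern, keep the per-block
digests, and periodically re-read and compare. This is only an
illustration under stated assumptions, not the tool described in
the links above; the names write_probe and verify_probe, the
block size and the seed are all made up for the example. Note
also that a re-read soon after the write is likely to be served
from the page cache, so a real checker would have to bypass or
drop the cache to actually exercise the storage.

```python
import hashlib
import os


def write_probe(path, n_blocks=256, block_size=4096, seed=b"probe"):
    """Write deterministic pseudo-random blocks; return per-block digests."""
    digests = []
    with open(path, "wb") as f:
        for i in range(n_blocks):
            # Each block is derived from the seed and its index.
            block = hashlib.shake_128(seed + i.to_bytes(8, "big")).digest(block_size)
            f.write(block)
            digests.append(hashlib.sha256(block).digest())
        f.flush()
        os.fsync(f.fileno())  # push the data out to the device
    return digests


def verify_probe(path, digests, block_size=4096):
    """Re-read the probe file; return indices of blocks that changed."""
    bad = []
    with open(path, "rb") as f:
        for i, expected in enumerate(digests):
            block = f.read(block_size)
            if hashlib.sha256(block).digest() != expected:
                bad.append(i)
    return bad
```

A daemon would call write_probe() once and then verify_probe()
on a schedule, logging any non-empty result.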

> [...] NILFS creates checksums on block writes. It would really
> be a good addition to verify these checksums on read [...]

It would be interesting to have data integrity checking or
compression in NILFS2, and a log-structured filesystem makes
that easier (the Btrfs code, by contrast, is rather complex),
but modifying mature and stable filesystems is a risky thing...

My understanding is that these checksums are not quite suitable
for data integrity checks but are designed for log-sequence
recovery, a bit like journal checksums for journal-based
filesystems.

https://www.spinics.net/lists/linux-nilfs/msg01063.html
"nilfs2 store checksums for all data. However, at least the
current implementation does not verify it when reading.
Actually, the main purpose of the checksums is recovery after
unexpected reboot; it does not suit for per-file data
verification because the checksums are given per ``log''."
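
To make the per-log point concrete, here is a toy Python model
(not the NILFS2 on-disk format; the function names and the use
of zlib.crc32 are assumptions for illustration). With a single
checksum per log, checking one block on read means re-reading
every block in that log, and a mismatch does not say which block
is bad. With one checksum per block, a read only has to touch
the block being read.

```python
import zlib


def log_crc(blocks):
    """Single CRC32 covering every block of one log, in order."""
    crc = 0
    for block in blocks:
        crc = zlib.crc32(block, crc)
    return crc


def per_block_crcs(blocks):
    """One CRC32 per block, as a per-block scheme would store."""
    return [zlib.crc32(block) for block in blocks]


def check_read_with_log_crc(blocks, stored_crc):
    """Verify via the per-log checksum.

    Returns (ok, blocks_read): the whole log must be read, and a
    failure cannot identify the damaged block.
    """
    return log_crc(blocks) == stored_crc, len(blocks)


def check_read_with_block_crc(blocks, stored_crcs, index):
    """Verify one block via its own checksum; only that block is read."""
    return zlib.crc32(blocks[index]) == stored_crcs[index], 1
```

This fits the recovery use case, where the whole log is read
anyway to decide whether it was written completely.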