On Tue, 28 Jan 2020 13:50:56 +0900, ì?¤ì¤?í?? said:

(Lukas - there's stuff for you further down...)

> If you write checksum for some data, ordering between checksum and data is
> not needed.

Actually, it is.

> When the crash occurs, we just recalculate checksum with data and compare
> the recalculated one with a written one.

And it's required because the read of the data that gets a checksum-data
mismatch may be weeks, months, or even years after the crash happens. You
don't have any history to go on, *only* the data as found and the two
checksums.

You can't safely just recalculate the checksum, because that's the whole
*point* of the checksum - to detect that something has gone wrong. And if
it's the data that has gone wrong, recalculating the checksum is exactly
the wrong thing to do. Failing the read with -EIO, and not touching the
data or the checksums, is the proper thing to do.

> Even though checksum is written first, the recalculated checksum will be
> different with the written checksum because data is not written.

You missed an important point. If you read the block and the checksum and
they don't match, you don't know whether the checksum is wrong because
it's stale, or the data has been corrupted.

That's part of why there are two checksums, one before and one after the
data block. If the two checksums match each other but not the data, you
know that something has corrupted the data.

If the two checksums don't match each other, it gets more interesting:

If the first one matches the data and the second doesn't, then either the
second one has gotten corrupted, or the system died between writing the
data and the second checksum. But that's OK, because the first checksum
says the data update did succeed, so simply patching the second checksum
is safe.

If the first one doesn't match and the second one *does*, then either the
system died between the first checksum update and the data write, or the
first one is corrupted - and you don't have a good way to distinguish
between those unless you have timestamps.

If neither checksum matches the data, then you're pretty sure the system
died between writing the first checksum and finishing the data write.
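To make that case analysis concrete, here's a minimal read-side sketch in
C. Everything here is a hypothetical illustration - the names, the enum,
and the 32-bit checksum are mine, not from any actual patch:

	/*
	 * Sketch of the two-checksum read verification described above.
	 * All names and types are illustrative, not from a real patch.
	 */
	#include <stdint.h>

	enum verify_result {
		VERIFY_OK,		/* both checksums match the data */
		VERIFY_PATCH_SECOND,	/* data write completed; rewrite csum2 */
		VERIFY_AMBIGUOUS,	/* stale or corrupt csum1 - can't tell */
		VERIFY_TORN_WRITE,	/* died between csum1 and the data write */
		VERIFY_DATA_CORRUPT,	/* checksums agree with each other, not data */
	};

	static enum verify_result verify_block(uint32_t csum1, uint32_t csum2,
					       uint32_t data_csum)
	{
		int m1 = (csum1 == data_csum);
		int m2 = (csum2 == data_csum);

		if (m1 && m2)
			return VERIFY_OK;
		if (m1)			/* first matches, second doesn't */
			return VERIFY_PATCH_SECOND;
		if (m2)			/* second matches, first doesn't */
			return VERIFY_AMBIGUOUS;
		if (csum1 == csum2)	/* neither matches, but they agree */
			return VERIFY_DATA_CORRUPT;
		return VERIFY_TORN_WRITE;
	}

Presumably a read path could self-heal only the VERIFY_PATCH_SECOND case;
VERIFY_AMBIGUOUS and VERIFY_DATA_CORRUPT should fail with -EIO and leave
the on-disk state alone, per the above.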
Questions for Lukas:

First off, see my comment about -EIO. Do you have plans for an ioctl or
some other way for userspace to get at the two checksums, so diagnostic
programs can do better error diagnosis/recovery?

If I understand what you're doing, each 4096-byte (or whatever) block will
actually take (4096 + 2 * checksum size) bytes on media, which means each
logically consecutive block will be offset from the start of a physical
block by a growing amount. This effectively means that you are guaranteed
one read-modify-write, and possibly two, for each write - a sketch of the
arithmetic follows below. (The other alternative is to devote an entire
block to each checksum, but that triples the size, and at that point you
may as well just do a 2+1 raidset.)

Even if your hardware is willing to do the RMW cycle in hardware, that
still hits you for at least one rotational latency, and possibly two. If
you have to do the RMW in software, it gets a *lot* more painful (and
actually *ensuring* atomic writes gets more challenging). At that point,
are you still gaining performance over the current dm-integrity scheme?

There's also a lot more ugliness on high-end storage devices, where your
logical device is actually an 8+2 RAID6 LUN striped across 10 volumes -
even a single 4K write is guaranteed to be an RMW, and you need to do a
32K write to make it really be a write. IBM's GPFS, SGI's CXFS, and
probably other high-end filesystems as well, go another level of crazy in
order to get high performance - you end up striping the filesystem across
4 or 8 such LUNs, so you want a logical blocksize that gets you 4 or 8
times the 32K that each LUN wants to see. At which point the storage
admin is ready to shoot the end user who writes a program that does 1K
writes, causing throughput to fall through the floor. Been there, done
that, it gets ugly quickly... :)
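Here's the promised sketch of the offset drift, as a throwaway userspace
program. The 4-byte checksum (think crc32) and the 4096-byte physical
block size are assumptions on my part, not values from the actual patch:

	/*
	 * Illustration of interleaved-checksum layout: logical block i
	 * occupies SLOT_SZ bytes starting at i * SLOT_SZ, so its start
	 * drifts relative to physical block boundaries.
	 */
	#include <stdio.h>
	#include <stdint.h>

	#define DATA_SZ 4096ULL			/* logical block size */
	#define CSUM_SZ 4ULL			/* assumed checksum size */
	#define SLOT_SZ (DATA_SZ + 2 * CSUM_SZ)	/* on-media bytes per block */
	#define PHYS_SZ 4096ULL			/* physical block size */

	int main(void)
	{
		for (uint64_t i = 0; i < 6; i++) {
			uint64_t start = i * SLOT_SZ;
			uint64_t end = start + SLOT_SZ - 1;

			/* a partially written physical block at either
			 * end of the span forces a read-modify-write */
			int rmw = (start % PHYS_SZ != 0) +
				  ((end + 1) % PHYS_SZ != 0);

			printf("block %llu: media bytes %llu-%llu, RMWs %d\n",
			       (unsigned long long)i,
			       (unsigned long long)start,
			       (unsigned long long)end, rmw);
		}
		return 0;
	}

Block 0 already needs one RMW for its trailing 8 bytes, and from block 1
onward nearly every write is misaligned at both ends - two RMWs - until
the layout happens to realign at a physical block boundary.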
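And the arithmetic behind the high-end RAID case above, assuming a 4K
per-disk chunk (an assumption picked so that 8 data disks work out to the
32K figure mentioned):

	/* Back-of-envelope numbers for the 8+2 RAID6 / striped-fs case. */
	#include <stdio.h>

	int main(void)
	{
		const unsigned chunk = 4096;	/* assumed per-disk stripe unit */
		const unsigned data_disks = 8;	/* the "8" in an 8+2 RAID6 LUN */
		const unsigned stripe = chunk * data_disks;	/* 32K */

		printf("full-stripe write per LUN: %u KiB\n", stripe / 1024);
		printf("fs striped over 4 LUNs:    %u KiB\n", 4 * stripe / 1024);
		printf("fs striped over 8 LUNs:    %u KiB\n", 8 * stripe / 1024);

		/* A 1K write covers 1/32 of one LUN's stripe, so the array
		 * must read old data and parity, merge, and rewrite - the
		 * RMW that makes small-write throughput collapse. */
		return 0;
	}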