On Tue, 20 Aug 2024, David Chu wrote:

> Hi,
>
> I was wondering how (excluding possible crashes) dm-integrity
> guarantees that the hashes that it stores for each sector is
> consistent with the data for each sector.
>
> I've been told that outside of FUAs or flushes, block IOs can be
> arbitrarily reordered (see this article
> https://lwn.net/Articles/400541/). So, let's say there are 2 write
> operations, A and B, writing to the same sector, not separated by any
> FUA or flushes. We know that even if they are concurrent, dm-integrity
> will ensure that they are not submitted concurrently: write A must
> complete first (its end_io is called) before write B is submitted.
> This is the add_new_range() logic, if I am interpreting it correctly.
> In this case, the hash of write B is written to disk.
>
> Can the disk order write B before write A, thereby making the hash
> check fail? Or is there an ordering guarantee, that if write B is
> submitted after write A's end_io, it must be ordered after? The
> add_new_range() logic seems to suggest that that is the case. Are
> there any special flags that dm-integrity appends in order to get this
> ordering guarantee?
>
> Thanks,
> David

Hi

There are several modes for that:

'J' - journaling - dm-integrity writes the data and metadata first into
a journal, then it sends a flush to make the journal persistent, and
then it writes the data and metadata to the places where they belong.
If the system crashes before the journal is written, the journal is
discarded and the written data is not visible at all. If the system
crashes after the journal is written, the journal is replayed to make
sure that the on-disk data and metadata are consistent with the
journal. The downside of journaling is that it roughly halves write
performance, because it writes the data twice (once in the journal and
once to the place where it belongs).

'B' - bitmap - dm-integrity maintains a bitmap; each bit represents a
region on the disk. When it writes to some region, it sets the bit in
the bitmap; when it finishes writing to that region, it clears the bit.
If the machine crashes, the checksums are recalculated for all the
regions that have a bit set. This avoids the double write, but it is
less resilient: if the disk corrupts data in some sector and that
sector's region happens to have its bit set when a crash happens, the
data corruption is not detected and a checksum matching the corrupted
data is generated.

'D' - do nothing - dm-integrity doesn't do anything to try to maintain
data/metadata integrity. If the system crashes, the metadata may be
corrupted. It may be useful for things like operating system
installation, where you don't recover from a crash at all.

'I' - inline - the integrity metadata is placed into an extra data
field provided by the NVMe device. You need special (and expensive)
NVMe devices for that. They provide 8-byte or 64-byte auxiliary
metadata per sector. The atomicity of the data+metadata write is
provided by the hardware.

'R' - recovery mode - this mode is read-only and suitable for data
recovery. In this mode, dm-integrity doesn't replay the journal and
doesn't check checksums at all. You can use this mode in situations
like I/O errors in the journal, where the device couldn't be activated
with any of the normal modes.

Mikulas
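
To illustrate the 'J' mode described above, here is a toy Python model of the crash behaviour. It is only a sketch under simplifying assumptions, not dm-integrity's actual code or on-disk format: the class ToyJournaledDevice and all of its methods are hypothetical names, CRC32 stands in for whatever checksum is configured, and the commit() call stands in for the flush that makes the journal persistent.

```python
import zlib


class ToyJournaledDevice:
    """Toy model of journal mode; hypothetical, not dm-integrity's real code."""

    def __init__(self, sectors):
        self.data = [b""] * sectors              # in-place data area
        self.tags = [zlib.crc32(b"")] * sectors  # in-place checksums
        self.journal = []                        # staged (sector, data, tag) entries
        self.committed = False                   # True once the journal is persistent

    def stage(self, sector, payload):
        # Step 1: data and its checksum go into the journal first.
        self.journal.append((sector, payload, zlib.crc32(payload)))

    def commit(self):
        # Step 2: stands in for the flush that makes the journal persistent.
        self.committed = True

    def write_back(self, crash_after_data=False):
        # Step 3: copy journal entries to the places where they belong.
        for sector, payload, tag in self.journal:
            self.data[sector] = payload
            if crash_after_data:
                return  # simulated crash: data is new, checksum is stale
            self.tags[sector] = tag
        self.journal.clear()
        self.committed = False

    def recover(self):
        # After a crash: replay a committed journal, discard an uncommitted one.
        if self.committed:
            self.write_back()
        else:
            self.journal.clear()

    def verify(self, sector):
        return zlib.crc32(self.data[sector]) == self.tags[sector]


# Crash before the journal is committed: the write is not visible at all.
dev = ToyJournaledDevice(sectors=4)
dev.stage(1, b"hello")
dev.recover()
before_commit = (dev.data[1], dev.verify(1))

# Crash after commit, in the middle of the write-back: replay repairs it.
dev2 = ToyJournaledDevice(sectors=4)
dev2.stage(1, b"hello")
dev2.commit()
dev2.write_back(crash_after_data=True)
torn = dev2.verify(1)           # data and checksum disagree at this point
dev2.recover()
after_replay = (dev2.data[1], dev2.verify(1))
```

In this model the data and checksum can disagree in place only while a committed journal entry still exists, which is why replaying the journal on recovery is enough to make them consistent again.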