Re: dm-integrity and write reordering

Mikulas Patocka <mpatocka@xxxxxxxxxx> · Wed, 21 Aug 2024 22:08:10 +0200 (CEST)

On Tue, 20 Aug 2024, David Chu wrote:

> Hi,
> 
> I was wondering how (excluding possible crashes) dm-integrity
> guarantees that the hashes that it stores for each sector are
> consistent with the data for each sector.
> 
> I've been told that outside of FUAs or flushes, block IOs can be
> arbitrarily reordered (see this article
> https://lwn.net/Articles/400541/). So, let's say there are 2 write
> operations, A and B, writing to the same sector, not separated by any
> FUA or flushes. We know that even if they are concurrent, dm-integrity
> will ensure that they are not submitted concurrently: write A must
> complete first (its end_io is called) before write B is submitted.
> This is the add_new_range() logic, if I am interpreting it correctly.
> In this case, the hash of write B is written to disk.
> 
> Can the disk order write B before write A, thereby making the hash
> check fail? Or is there an ordering guarantee, that if write B is
> submitted after write A's end_io, it must be ordered after? The
> add_new_range() logic seems to suggest that that is the case. Are
> there any special flags that dm-integrity appends in order to get this
> ordering guarantee?
> 
> Thanks,
> David

Hi

There are several modes for that:

'J' - journaling - dm-integrity writes the data and metadata first into a
journal, then it sends a flush to make the journal persistent and then it
writes the data and metadata to the places where they belong.

If the system crashes before the journal is written, the journal is 
discarded and written data is not visible at all. If the system crashes 
after the journal is written, the journal is replayed to make sure that 
the on-disk data and metadata are consistent with the journal.

The downside of journaling is that it roughly halves write throughput,
because it writes the data twice (once in the journal and once to the
place where it belongs).
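
To make the journaling sequence concrete, here is a toy model in Python. It
is only a sketch, not the kernel code: the class and method names are
hypothetical, zlib.crc32 stands in for whatever checksum is configured, and
a "crash" is simulated by dropping unflushed journal entries or by stopping
half-way through the write-back.

```python
import zlib


class ToyJournaledDevice:
    """Toy model of journaled data+checksum updates (hypothetical names)."""

    def __init__(self, sectors):
        self.data = [b""] * sectors              # final data location
        self.tags = [zlib.crc32(b"")] * sectors  # final checksum location
        self.journal = []     # journal entries written but not yet flushed
        self.committed = []   # journal entries made persistent by a flush
        self.writes = 0       # number of data writes issued to the device

    def journal_write(self, sector, payload):
        # Step 1: data and its checksum go into the journal first.
        self.journal.append((sector, payload, zlib.crc32(payload)))
        self.writes += 1

    def flush(self):
        # Step 2: the flush makes the journal entries persistent.
        self.committed.extend(self.journal)
        self.journal.clear()

    def write_back(self, crash_midway=False):
        # Step 3: copy committed entries to the places where they belong.
        for sector, payload, tag in self.committed:
            self.data[sector] = payload
            self.writes += 1
            if crash_midway:
                return  # simulated crash: data updated, checksum not yet
            self.tags[sector] = tag
        self.committed.clear()

    def crash(self):
        # Journal entries that were never flushed are lost.
        self.journal.clear()

    def recover(self):
        # Replay the persistent journal so data and checksums agree again.
        self.write_back()

    def verify(self, sector):
        return zlib.crc32(self.data[sector]) == self.tags[sector]


dev = ToyJournaledDevice(4)
dev.journal_write(1, b"B")
dev.flush()
dev.write_back(crash_midway=True)   # data and checksum now disagree
dev.recover()                       # journal replay repairs the mismatch
```

In this model every payload is counted once when it enters the journal and
once more when it is written back, which is where the doubled write cost
comes from.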

'B' - bitmap - dm-integrity maintains a bitmap in which each bit represents
a region on the disk. Before it writes to a region, it sets the bit in the
bitmap; when it finishes writing to the region, it clears the bit. If the
machine crashes, the checksum is recalculated for all the regions that
have a bit set.

This avoids the cost of writing the data twice, but it is less resilient -
if the disk corrupts data in some sector and that sector's region happens
to have its bit set when a crash happens, the data corruption is not
detected and a new checksum is generated over the corrupted data.
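
The bitmap behaviour, including the weakness described above, can be
sketched with another toy model in Python. Again this is an illustration
under assumptions, not the kernel implementation: the names are
hypothetical, the region size of 4 sectors is arbitrary, zlib.crc32 stands
in for the real checksum, and the bitmap is a plain set of dirty region
numbers.

```python
import zlib

REGION_SECTORS = 4  # hypothetical region size, chosen for illustration


class ToyBitmapDevice:
    """Toy model of bitmap-mode recovery (hypothetical names)."""

    def __init__(self, sectors):
        self.data = [b""] * sectors
        self.tags = [zlib.crc32(b"")] * sectors
        self.dirty = set()  # region numbers whose bit is currently set

    def begin_write(self, sector):
        # Set the region's bit before touching data or checksum.
        self.dirty.add(sector // REGION_SECTORS)

    def write(self, sector, payload, update_tag=True):
        # update_tag=False simulates a crash between data and checksum.
        self.data[sector] = payload
        if update_tag:
            self.tags[sector] = zlib.crc32(payload)

    def end_write(self, sector):
        # Clear the region's bit once the write has finished.
        self.dirty.discard(sector // REGION_SECTORS)

    def recover(self):
        # After a crash: recalculate checksums for every dirty region.
        for region in sorted(self.dirty):
            start = region * REGION_SECTORS
            end = min(start + REGION_SECTORS, len(self.data))
            for s in range(start, end):
                self.tags[s] = zlib.crc32(self.data[s])
        self.dirty.clear()

    def verify(self, sector):
        return zlib.crc32(self.data[sector]) == self.tags[sector]


dev = ToyBitmapDevice(8)
dev.begin_write(5)
dev.write(5, b"new", update_tag=False)  # crash before the checksum update
dev.data[6] = b"garbage"                # unrelated corruption, same region
dev.recover()                           # both sectors now pass verification
```

After recover(), sector 5 is consistent again as intended, but sector 6
verifies as well even though its content is corrupted, because its checksum
was regenerated from the corrupted data.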

'D' - do nothing - dm-integrity doesn't do anything to try to maintain 
data/metadata integrity - if the system crashes, the metadata may be 
corrupted. It may be useful for things like operating system installation, 
where you don't recover from a crash at all.

'I' - inline - the integrity metadata is placed into an extra data field 
provided by the NVMe device. You need special (and expensive) NVMe devices 
for that. They provide 8-byte or 64-byte auxiliary metadata per sector. 
The atomicity of data+metadata write is provided by the hardware.

'R' - recovery mode - this mode is read-only and suitable for data recovery.
In this mode, dm-integrity doesn't replay the journal and doesn't check
checksums at all. You can use this mode in situations like I/O errors in
the journal, where the device couldn't be activated with any of the normal
modes.

Mikulas