On Thu, 13 Mar 2025, Marco Felsch wrote:

> Hi,
>
> sorry for the late reply, but we had to run several tests and analyze
> multiple test outputs offline via a small self-written tool.
>
> On 25-02-14, Mikulas Patocka wrote:
> > Hi
> >
> > On Tue, 11 Feb 2025, Marco Felsch wrote:
> >
> > > Hi all,
> > >
> > > as written in the subject, we see odd dm-integrity behaviour during
> > > the journal replay step.
> > >
> > > First things first, a short introduction to our setup:
> > > - Linux v6.8
> > > - We have a dm-integrity+dm-crypt setup to provide an authenticated,
> > >   encrypted, ext4-based rw data partition.
> > > - The setup is done with a custom script [1] since we make use of the
> > >   kernel trusted-keys infrastructure, which isn't supported by LUKS2
> > >   at the moment.
> > > - The device has no power failsafe, i.e. hard power-cuts can occur.
> > >   Therefore we use the dm-integrity J(ournal) mode.
> > > - The storage backend is an eMMC with a 512-byte block size.
> >
> > Could you retest it with an eMMC or SD card from a different vendor,
> > just to test whether it is a hardware issue?
>
> We saw the issue on different eMMC devices from different manufacturers.
>
> > > - We use the dm-integrity 4K block size option to reduce the
> > >   tag/metadata overhead.
> > >
> > > From time to time we see "AEAD ERROR"s [2] while fsck tries to repair
> > > the filesystem, which of course aborts the fsck run.
> > > After a while within the rescue shell
> >
> > So, when you run fsck again from the rescue shell (without deactivating
> > and reactivating the device), the bug goes away? What is the time
> > interval after which the bug goes away?
>
> This happened from time to time, yes, but only in rare cases, and I'm
> not sure whether our tester did something wrong.
>
> > Could you upload the image of the eMMC storage somewhere when the bug
> > happens and send me a link to it, so that I can look at what kind of
> > corruption is there?
> Of course I need to align it with our customer, but since the data is
> encrypted you could get the dm-integrity dump.
>
> > > and a following reboot, the fsck run on the same file system doesn't
> > > trigger any "AEAD ERROR".
> > >
> > > The dm-integrity table is added twice [1] since we gather the
> > > provided_data_sectors information from the first call. I know that
> > > this isn't optimal. The provided_data_sectors value should be stored
> > > and not gathered again, but this shouldn't bother a system with a
> > > valid dm-integrity superblock already written.
> >
> > That should work fine - there is no problem with activating the device
> > twice.
>
> Also with activating it with different sizes as we do? I think there is
> no problem with that either. Our script doesn't differentiate between
> the initial setup (no superblock available) and the "normal" setup.
>
> > > To debug the issue we uncommented the "#define DEBUG_PRINT" and
> > > noticed that the replay happens twice [2], although this should be a
> > > synchronous operation, and once the dm resume returns successfully
> > > the replay should be done.
> >
> > The replay happens twice because you activate the device twice - this
> > should be harmless because each replay replays the same data - there
> > is "replaying 364 sections, starting at 143, commit seq 2" twice in
> > the log.
>
> Yes, but after replaying it the first time we thought the code marks
> the entries as unused and doesn't have to replay these entries a 2nd
> time. But as I said, there should be no issue with it.
>
> > > We also noticed that once we uncommented "#define DEBUG_PRINT" it
> > > was harder to trigger the issue.
> > >
> > > We also checked the eMMC bus to see if the CMD23 is sent correctly
> > > (with the reliable write bit set) in case of a FUA request, which is
> > > the case.
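As a side note on the double activation discussed above: the provided_data_sectors value can also be read directly from the on-disk superblock, so the table does not have to be loaded twice just to obtain it. The following is only a sketch, assuming the layout of struct superblock in drivers/md/dm-integrity.c (magic[8], version, log2_interleave_sectors, integrity_tag_size, journal_sections, provided_data_sectors, flags, log2_sectors_per_block, little-endian, at the start of the device); please verify the field offsets against the kernel version actually in use.

```python
import struct
import sys

# Assumed on-disk layout of the dm-integrity superblock (first bytes of the
# device, or of the separate meta device if one is used); check this against
# struct superblock in drivers/md/dm-integrity.c for your kernel.
SB_MAGIC = b"integrt\0"
SB_FORMAT = "<8sBBHIQIB"   # little-endian, fields as listed above

def parse_integrity_sb(raw):
    """Parse the leading bytes of a raw dump into a superblock dict."""
    (magic, version, log2_interleave, tag_size, journal_sections,
     provided_data_sectors, flags, log2_spb) = struct.unpack_from(SB_FORMAT, raw)
    if magic != SB_MAGIC:
        raise ValueError("no dm-integrity superblock found")
    return {
        "version": version,
        "integrity_tag_size": tag_size,
        "journal_sections": journal_sections,
        "provided_data_sectors": provided_data_sectors,
        "flags": flags,
        "block_size": 512 << log2_spb,   # 4096 for the 4K block size option
    }

if __name__ == "__main__":
    # e.g. python3 integrity_sb.py /dev/mmcblk0p5
    with open(sys.argv[1], "rb") as f:
        print(parse_integrity_sb(f.read(512)))
```

With that value stored (or re-read) by the setup script, a single dmsetup table load per boot would suffice, which also avoids the double journal replay seen in the debug output.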
> > > So now with the above knowledge we suspect that the replay path is
> > > not synchronous - the dm resume returns too early, while not all of
> > > the writes kicked off by copy_from_journal have reached the storage.
> >
> > There is "struct journal_completion comp" in do_journal_write and
> > "wait_for_completion_io(&comp.comp);" at the end of do_journal_write -
> > that should wait until all I/O submitted by copy_from_journal
> > finishes.
> >
> > > Maybe you guys have some pointers we could follow or an idea of what
> > > may go wrong here.
> > >
> > > Regards,
> > > Marco
> >
> > I read through the dm-integrity source code and I didn't find
> > anything.
> >
> > There are two dm-crypt bugs that could cause your problems - see the
> > upstream commits 8b8f8037765757861f899ed3a2bfb34525b5c065 and
> > 9fdbbdbbc92b1474a87b89f8b964892a63734492. Please backport these
> > commits to your kernel and re-run the tests.
>
> This helped; there are also other dm-crypt fixes which looked closely
> related, therefore we updated to 6.12.16 and ran the tests. Tests which
> could reproduce the issue quite reliably are now passing, and it really
> looks like dm-crypt was the issue :/
>
> The issue could be reproduced by the following script, which was
> started right after the boot:
>
> | #!/bin/bash
> |
> | set -x
> |
> | suffix=$(hexdump /dev/urandom -n4 -e '"%u"')
> | suffix=$((suffix%220))
> | testdir=/mnt/data/test
> | testfile=$testdir/test.$suffix
> |
> | # Make testdir
> | mkdir -p $testdir
> |
> | [ -f $testfile ] && rm $testfile
> |
> | dd if=/dev/mapper/data of=/dev/null && \
> | dd if=/dev/urandom of=$testfile bs=1M oflag=sync count=1 && \
> | dd if=/dev/mapper/data of=/dev/null && \
> | reboot -f
> |
> | echo "Something failed, please check!"
>
> Our data partition was quite small (300MB), which led to a journal size
> of up to ~2MB. Since we use ext4 on top of it, we used a size of 1MB.
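As an aside, the offline dump analysis with the small self-written tool mentioned earlier essentially boils down to locating a known 4K block (8 x 512B sectors) inside a raw eMMC image. A generic scan like the sketch below can do that; this is not the actual analysis tool, only a reimplementation of the search step, and the journal, data and metadata area offsets remain device-specific and have to be interpreted separately.

```python
def find_block(image_path, block, chunk=1 << 20):
    """Return all byte offsets where `block` occurs in a raw dump.

    `block` is the 4K payload (e.g. the sectors that failed AEAD
    verification); the image is read in `chunk`-sized pieces, and
    len(block)-1 bytes are carried over so matches straddling a chunk
    border are not missed.
    """
    offsets = []
    carry = b""
    base = 0           # bytes of the file consumed so far
    with open(image_path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            buf = carry + data
            start = 0
            while True:
                i = buf.find(block, start)
                if i < 0:
                    break
                offsets.append(base - len(carry) + i)
                start = i + 1
            carry = buf[-(len(block) - 1):] if len(block) > 1 else b""
            base += len(data)
    return offsets
```

Running it once with the failing 4K block over the raw mmcblkXpY dump yields every offset where the block occurs, which can then be matched against the journal and data area boundaries.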
> If reading the whole device fails (before or after the write), the
> script stops, and we then dumped the raw eMMC partition (mmcblkXpY).
>
> Afterwards we analyzed the dump with our small tool to see whether we
> could find the AEAD-failing sector within the journal, which was the
> case.
>
> Next we checked whether this particular sector (8 sectors, since we
> provide a 4K IO-size to the upper layers) can be found in the data
> area, which was the case.
>
> Lastly we checked whether the journal entry's metadata content can be
> found within the metadata area, which was the case too.
>
> Both the data and metadata belong to the same "area", so we assume that
> the journal was correctly replayed but got the wrong <data><metadata>
> pair from the dm-crypt layer.
>
> Please correct me if our conclusion is wrong, but after the kernel
> update the issues are gone for now (we need more test samples).

It's nice to hear that the issue is fixed. Thanks for testing it.

> Thanks,
> Marco

Mikulas