Hi,

sorry for the late reply, but we had to run several tests and analyze
multiple test outputs offline with a small self-written tool.

On 25-02-14, Mikulas Patocka wrote:
> Hi
>
> On Tue, 11 Feb 2025, Marco Felsch wrote:
>
> > Hi all,
> >
> > as written in the subject we do see an odd dm-integrity behaviour during
> > the journal replay step.
> >
> > First things first, a short introduction to our setup:
> > - Linux v6.8
> > - We do have a dm-integrity+dm-crypt setup to provide an authenticated
> > encrypted ext4-based rw data partition.
> > - The setup is done with a custom script [1] since we are making use of
> > the kernel trusted-keys infrastructure which isn't supported by LUKS2
> > at the moment.
> > - The device has no power failsafe, e.g. hard power-cuts can appear.
> > Therefore we use the dm-integrity J(ournal) mode.
> > - The storage backend is an eMMC with 512 byte block size.
>
> Could you retest it with an eMMC or SDCARD from a different vendor, just
> to test if it is hardware issue?

We saw the issue on different eMMC devices from different manufacturers.

> > - We use the dm-integrity 4K block size option to reduce the
> > tag/metadata overhead.
> >
> > From time to time we do see "AEAD ERROR"s [2] while fsck tries to repair
> > the filesystem which of course abort the fsck run.
> > After a while within the rescue shell
>
> So, when you run fsck again from the rescue shell (without deactivating
> and activating the device), the bug goes away? What is the time interval
> after which the bug goes away?

This happened from time to time, yes, but only in rare cases, and I'm not
sure if our tester did something wrong.

> Could you upload somewhere the image of the eMMC storage when the bug
> happens and send me a link to it, so that I can look what kind of
> corruption is there?

Of course I need to align this with our customer, but since the data is
encrypted you could get the dm-integrity dump.

> > and a following reboot the fsck
> > run on the same file system doesn't trigger any "AEAD ERROR".
> >
> > The dm-integrity table is added twice [1] since we gather the
> > provided_data_sectors information from the first call. I know that this
> > isn't optimal. The provided_data_sectors should be stored and not
> > gathered again but this shouldn't bother a system with a valid
> > dm-integrity superblock already written.
>
> That should work fine - there is no problem with activating the device
> twice.

Also when activating it with different sizes, as we do? I think there is
no problem with that either. Our script doesn't differentiate between the
initial setup (no superblock available) and the "normal" setup.
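For reference, the double activation in our script boils down to something
like the sketch below (dmsetup-based; the device node, dm name, tag size
and optional arguments are illustrative placeholders, not the exact values
from [1]):

| # 1st activation with a dummy size of 1 sector, only to be able to read
| # provided_data_sectors back from the status line.
| DEV=/dev/mmcblk2p5    # placeholder device node
| dmsetup create data-int --table "0 1 integrity $DEV 0 28 J 1 block_size:4096"
|
| # provided_data_sectors is the 2nd field of dm-integrity's status info,
| # i.e. field 5 of the dmsetup status output.
| SECTORS=$(dmsetup status data-int | cut -d ' ' -f 5)
| dmsetup remove data-int
|
| # 2nd activation with the real size taken from the first call.
| dmsetup create data-int --table "0 $SECTORS integrity $DEV 0 28 J 1 block_size:4096"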
> > To debug the issue we uncommented the "#define DEBUG_PRINT" and noticed
> > that the replay is happening twice [2] albeit this should be a
> > synchronous operation and once the dm resume returned successfully the
> > replay should be done.
>
> The replay happens twice because you activate it twice - this should be
> harmless because each replay replays the same data - there is "replaying
> 364 sections, starting at 143, commit seq 2" twice in the log.

Yes, but we thought that after the first replay the code marks the entries
as unused so it doesn't have to replay them a second time. But as I said,
there should be no issue with it.

> > We also noticed that once we uncommented "#define DEBUG_PRINT" it was
> > harder to trigger the issue.
> >
> > We also checked the eMMC bus to see if the CMD23 is sent correctly (with
> > the reliable write bit set) in case of a FUA request, which is the case.
> > So now with the above knowledge we suspect the replay path to be not
> > synchronous, the dm resume is returning too early, while not all of the
> > writes kicked off by copy_from_journal having reached the storage.
>
> There is "struct journal_completion comp" in do_journal_write and
> "wait_for_completion_io(&comp.comp);" at the end of do_journal_write -
> that should wait until all I/O submitted by copy_from_journal finishes.
>
> > Maybe you guys have some pointers we could follow or have an idea of
> > what may go wrong here.
> >
> > Regards,
> > Marco
>
> I read through the dm-integrity source code and I didn't find anything.
>
> There are two dm-crypt bugs that could cause your problems - see the
> upstream commits 8b8f8037765757861f899ed3a2bfb34525b5c065 and
> 9fdbbdbbc92b1474a87b89f8b964892a63734492. Please, backport these commits
> to your kernel and re-run the tests.

This helped. There are also other dm-crypt fixes which looked closely
related, therefore we updated to 6.12.16 and re-ran the tests. The tests
which could reproduce the issue quite reliably are now passing, and it
really looks like dm-crypt was the issue :/

The issue could be reproduced by the following script, which was started
right after boot:

| #!/bin/bash
|
| set -x
|
| suffix=$(hexdump /dev/urandom -n4 -e '"%u"')
| suffix=$((suffix%220))
| testdir=/mnt/data/test
| testfile=$testdir/test.$suffix
|
| # Make testdir
| mkdir -p $testdir
|
| [ -f $testfile ] && rm $testfile
|
| dd if=/dev/mapper/data of=/dev/null && \
| dd if=/dev/urandom of=$testfile bs=1M oflag=sync count=1 && \
| dd if=/dev/mapper/data of=/dev/null && \
| reboot -f
|
| echo "Something failed, please check!"

Our data partition was quite small (300MB), which led to a journal size of
up to ~2MB. Since we use ext4 on top of it, we used a test file size of
1MB.

If reading the whole device (before or after writing the test file) fails,
the script stops; we then dumped the raw eMMC partition (mmcblkXpY).
Afterwards we analyzed the dump with our small tool to see whether the
sector that triggered the AEAD error could be found within the journal,
which was the case. Next we checked whether this particular sector
(actually 8 sectors, since we provide a 4K I/O size to the upper layers)
could also be found in the data area, which was the case. Lastly we
checked whether the journal entry's metadata content could be found within
the metadata area, which was the case too.

Both the data and the metadata belong to the same "area", so we assume
that the journal was correctly replayed but received the wrong
<data><metadata> pair from the dm-crypt layer.

Please correct me if our conclusion is wrong, but after the kernel update
the issues are gone for now (we do need more test samples).

Thanks,
  Marco
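P.S.: For completeness, the data-area check described above is essentially
a plain byte comparison of two 4K regions of the raw dump. A stripped-down
sketch of that step (the offsets are placeholders standing in for the byte
positions our tool reported):

| #!/bin/bash
| # Compare one 8-sector (4K) block found in the journal against the block
| # at the corresponding position in the data area of the raw dump.
| DUMP=emmc-data-partition.img    # raw dump of mmcblkXpY
| JOURNAL_OFF=0                   # placeholder: byte offset inside the journal
| DATA_OFF=0                      # placeholder: byte offset inside the data area
|
| cmp <(dd if=$DUMP bs=1 skip=$JOURNAL_OFF count=4096 2>/dev/null) \
|     <(dd if=$DUMP bs=1 skip=$DATA_OFF count=4096 2>/dev/null) \
|     && echo "journal data block is also present in the data area"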