On Thu, 13 Mar 2025, Marco Felsch wrote:

> Hi,
>
> sorry for the late reply, but we had to run several tests and analyze
> multiple test outputs offline via a small self-written tool.
>
> On 25-02-14, Mikulas Patocka wrote:
> > Hi
> >
> > On Tue, 11 Feb 2025, Marco Felsch wrote:
> >
> > > Hi all,
> > >
> > > as written in the subject, we see odd dm-integrity behaviour during
> > > the journal replay step.
> > >
> > > First things first, a short introduction to our setup:
> > > - Linux v6.8
> > > - We have a dm-integrity+dm-crypt setup to provide an authenticated,
> > >   encrypted, ext4-based rw data partition.
> > > - The setup is done with a custom script [1] since we make use of the
> > >   kernel trusted-keys infrastructure, which isn't supported by LUKS2
> > >   at the moment.
> > > - The device has no power failsafe, i.e. hard power-cuts can occur.
> > >   Therefore we use the dm-integrity J(ournal) mode.
> > > - The storage backend is an eMMC with a 512-byte block size.
> >
> > Could you retest it with an eMMC or SD card from a different vendor,
> > just to test whether it is a hardware issue?
>
> We saw the issue on different eMMC devices from different manufacturers.
>
> > > - We use the dm-integrity 4K block size option to reduce the
> > >   tag/metadata overhead.
> > >
> > > From time to time we see "AEAD ERROR"s [2] while fsck tries to repair
> > > the filesystem, which of course aborts the fsck run.
> > > After a while within the rescue shell
> >
> > So, when you run fsck again from the rescue shell (without deactivating
> > and reactivating the device), the bug goes away? What is the time
> > interval after which the bug goes away?
>
> This happened from time to time, yes, but only in rare cases, and I'm
> not sure whether our tester did something wrong.
>
> > Could you upload the image of the eMMC storage somewhere when the bug
> > happens and send me a link to it, so that I can look at what kind of
> > corruption is there?
> Of course I need to align it with our customer, but since the data is
> encrypted you could get the dm-integrity dump.
>
> > > and a following reboot, the fsck run on the same file system doesn't
> > > trigger any "AEAD ERROR".
> > >
> > > The dm-integrity table is added twice [1] since we gather the
> > > provided_data_sectors information from the first call. I know that
> > > this isn't optimal. The provided_data_sectors value should be stored
> > > and not gathered again, but this shouldn't bother a system with a
> > > valid dm-integrity superblock already written.
> >
> > That should work fine - there is no problem with activating the device
> > twice.
>
> Also with activating it with different sizes as we do? I think there is
> no problem with that either. Our script doesn't differentiate between
> the initial setup (no superblock available) and the "normal" setup.
>
> > > To debug the issue we uncommented the "#define DEBUG_PRINT" and
> > > noticed that the replay happens twice [2], although this should be a
> > > synchronous operation, and once the dm resume returns successfully
> > > the replay should be done.
> >
> > The replay happens twice because you activate the device twice - this
> > should be harmless because each replay replays the same data - there
> > is "replaying 364 sections, starting at 143, commit seq 2" twice in
> > the log.
>
> Yes, but after replaying it the first time we thought the code marks
> the entries as unused and doesn't have to replay these entries a 2nd
> time. But as I said, there should be no issue with it.
>
> > > We also noticed that once we uncommented "#define DEBUG_PRINT" it
> > > was harder to trigger the issue.
> > >
> > > We also checked the eMMC bus to see if the CMD23 is sent correctly
> > > (with the reliable write bit set) in case of a FUA request, which is
> > > the case.
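As a side note on the double activation discussed above: the provided_data_sectors value can also be read directly from the on-disk superblock, so the table does not have to be loaded twice just to obtain it. The following is only a sketch, assuming the layout of struct superblock in drivers/md/dm-integrity.c (magic[8], version, log2_interleave_sectors, integrity_tag_size, journal_sections, provided_data_sectors, flags, log2_sectors_per_block, little-endian, at the start of the device); please verify the field offsets against the kernel version actually in use.

```python
import struct
import sys

# Assumed on-disk layout of the dm-integrity superblock (first bytes of the
# device, or of the separate meta device if one is used); check this against
# struct superblock in drivers/md/dm-integrity.c for your kernel.
SB_MAGIC = b"integrt\0"
SB_FORMAT = "<8sBBHIQIB"   # little-endian, fields as listed above

def parse_integrity_sb(raw):
    """Parse the leading bytes of a raw dump into a superblock dict."""
    (magic, version, log2_interleave, tag_size, journal_sections,
     provided_data_sectors, flags, log2_spb) = struct.unpack_from(SB_FORMAT, raw)
    if magic != SB_MAGIC:
        raise ValueError("no dm-integrity superblock found")
    return {
        "version": version,
        "integrity_tag_size": tag_size,
        "journal_sections": journal_sections,
        "provided_data_sectors": provided_data_sectors,
        "flags": flags,
        "block_size": 512 << log2_spb,   # 4096 for the 4K block size option
    }

if __name__ == "__main__":
    # e.g. python3 integrity_sb.py /dev/mmcblk0p5
    with open(sys.argv[1], "rb") as f:
        print(parse_integrity_sb(f.read(512)))
```

With that value stored (or re-read) by the setup script, a single dmsetup table load per boot would suffice, which also avoids the double journal replay seen in the debug output.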
> > > So now with the above knowledge we suspect that the replay path is
> > > not synchronous - the dm resume returns too early, while not all of
> > > the writes kicked off by copy_from_journal have reached the storage.
> >
> > There is "struct journal_completion comp" in do_journal_write and
> > "wait_for_completion_io(&comp.comp);" at the end of do_journal_write -
> > that should wait until all I/O submitted by copy_from_journal
> > finishes.
> >
> > > Maybe you guys have some pointers we could follow or an idea of what
> > > may go wrong here.
> > >
> > > Regards,
> > > Marco
> >
> > I read through the dm-integrity source code and I didn't find
> > anything.
> >
> > There are two dm-crypt bugs that could cause your problems - see the
> > upstream commits 8b8f8037765757861f899ed3a2bfb34525b5c065 and
> > 9fdbbdbbc92b1474a87b89f8b964892a63734492. Please backport these
> > commits to your kernel and re-run the tests.
>
> This helped; there are also other dm-crypt fixes which looked closely
> related, therefore we updated to 6.12.16 and ran the tests. Tests which
> could reproduce the issue quite reliably are now passing, and it really
> looks like dm-crypt was the issue :/
>
> The issue could be reproduced by the following script, which was
> started right after the boot:
>
> | #!/bin/bash
> |
> | set -x
> |
> | suffix=$(hexdump /dev/urandom -n4 -e '"%u"')
> | suffix=$((suffix%220))
> | testdir=/mnt/data/test
> | testfile=$testdir/test.$suffix
> |
> | # Make testdir
> | mkdir -p $testdir
> |
> | [ -f $testfile ] && rm $testfile
> |
> | dd if=/dev/mapper/data of=/dev/null && \
> | dd if=/dev/urandom of=$testfile bs=1M oflag=sync count=1 && \
> | dd if=/dev/mapper/data of=/dev/null && \
> | reboot -f
> |
> | echo "Something failed, please check!"
>
> Our data partition was quite small (300MB), which led to a journal size
> of up to ~2MB. Since we use ext4 on top of it, we used a size of 1MB.
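As an aside, the offline dump analysis with the small self-written tool mentioned earlier essentially boils down to locating a known 4K block (8 x 512B sectors) inside a raw eMMC image. A generic scan like the sketch below can do that; this is not the actual analysis tool, only a reimplementation of the search step, and the journal, data and metadata area offsets remain device-specific and have to be interpreted separately.

```python
def find_block(image_path, block, chunk=1 << 20):
    """Return all byte offsets where `block` occurs in a raw dump.

    `block` is the 4K payload (e.g. the sectors that failed AEAD
    verification); the image is read in `chunk`-sized pieces, and
    len(block)-1 bytes are carried over so matches straddling a chunk
    border are not missed.
    """
    offsets = []
    carry = b""
    base = 0           # bytes of the file consumed so far
    with open(image_path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            buf = carry + data
            start = 0
            while True:
                i = buf.find(block, start)
                if i < 0:
                    break
                offsets.append(base - len(carry) + i)
                start = i + 1
            carry = buf[-(len(block) - 1):] if len(block) > 1 else b""
            base += len(data)
    return offsets
```

Running it once with the failing 4K block over the raw mmcblkXpY dump yields every offset where the block occurs, which can then be matched against the journal and data area boundaries.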
> If reading the whole device fails (before or after the write), the
> script stops, and we then dumped the raw eMMC partition (mmcblkXpY).
>
> Afterwards we analyzed the dump with our small tool to see whether we
> could find the AEAD-failing sector within the journal, which was the
> case.
>
> Next we checked whether this particular sector (8 sectors, since we
> provide a 4K IO-size to the upper layers) can be found in the data
> area, which was the case.
>
> Lastly we checked whether the journal entry's metadata content can be
> found within the metadata area, which was the case too.
>
> Both the data and metadata belong to the same "area", so we assume that
> the journal was correctly replayed but got the wrong <data><metadata>
> pair from the dm-crypt layer.
>
> Please correct me if our conclusion is wrong, but after the kernel
> update the issues are gone for now (we need more test samples).

It's nice to hear that the issue is fixed. Thanks for testing it.

> Thanks,
> Marco

Mikulas