Re: Weird EROFS data corruption

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 2023/12/4 01:01, Juhyung Park wrote:
Hi Gao,

On Mon, Dec 4, 2023 at 1:52 AM Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> wrote:

Hi Juhyung,

On 2023/12/4 00:22, Juhyung Park wrote:
(Cc'ing f2fs and crypto as I've noticed something similar with f2fs a
while ago, which may mean that this is not specific to EROFS:
https://lore.kernel.org/all/CAD14+f2nBZtLfLC6CwNjgCOuRRRjwzttp3D3iK4Of+1EEjK+cw@xxxxxxxxxxxxxx/
)

Hi.

I'm encountering a very weird EROFS data corruption.

I noticed when I build an EROFS image for AOSP development, the device
would randomly not boot from a certain build.
After inspecting the log, I noticed that a file got corrupted.

Is it observed on your laptop (i7-1185G7), yes? or some other arm64
device?

Yes, only on my laptop. The arm64 device seems fine.
The reason that it would not boot was that the host machine (my
laptop) was repacking the EROFS image wrongfully.

The workflow is something like this:
Server-built EROFS AOSP image -> Image copied to laptop -> Laptop
mounts the EROFS image -> Copies the entire content to a scratch
directory (CORRUPT!) -> Changes some files -> mkfs.erofs

So the device is not responsible for the corruption, the laptop is.

Ok.




After adding a hash check during the build flow, I noticed that EROFS
would randomly read data wrong.

I now have a reliable method of reproducing the issue, but here's the
funny/weird part: it's only happening on my laptop (i7-1185G7). This
is not happening with my 128 cores buildfarm machine (Threadripper
3990X).>
I first suspected a hardware issue, but:
a. The laptop had its motherboard replaced recently (due to a failing
physical Type-C port).
b. The laptop passes memory test (memtest86).
c. This happens on all kernel versions from v5.4 to the latest v6.6
including my personal custom builds and Canonical's official Ubuntu
kernels.
d. This happens on different host SSDs and file-system combinations.
e. This only happens on LZ4. LZ4HC doesn't trigger the issue.
f. This only happens when mounting the image natively by the kernel.
Using fuse with erofsfuse is fine.

I think it's a weird issue with inplace decompression because you said
it depends on the hardware.  In addition, with your dataset sadly I
cannot reproduce on my local server (Xeon(R) CPU E5-2682 v4).

As I feared. Bummer :(


What is the difference between these two machines? just different CPU or
they have some other difference like different compliers?

I fully and exclusively control both devices, and the setup is almost the same.
Same Ubuntu version, kernel/compiler version.

But as I said, on my laptop, the issue happens on kernels that someone
else (Canonical) built, so I don't think it matters.

The only thing I could say is that the kernel side has optimized
inplace decompression compared to fuse so that it will reuse the
same buffer for decompression but with a safe margin (according to
the current lz4 decompression implementation).  It shouldn't behave
different just due to different CPUs.  Let me find more clues
later, also maybe we should introduce a way for users to turn off
this if needed.

Thanks,
Gao Xiang




[Index of Archives]     [Kernel]     [Gnu Classpath]     [Gnu Crypto]     [DM Crypt]     [Netfilter]     [Bugtraq]
  Powered by Linux