Re: Weird EROFS data corruption

Juhyung Park <qkrwngud825@xxxxxxxxx> · Mon, 4 Dec 2023 02:32:02 +0900

Hi Gao,

On Mon, Dec 4, 2023 at 2:22 AM Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> wrote:
>
>
>
> On 2023/12/4 01:01, Juhyung Park wrote:
> > Hi Gao,
> >
> > On Mon, Dec 4, 2023 at 1:52 AM Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> wrote:
> >>
> >> Hi Juhyung,
> >>
> >> On 2023/12/4 00:22, Juhyung Park wrote:
> >>> (Cc'ing f2fs and crypto as I've noticed something similar with f2fs a
> >>> while ago, which may mean that this is not specific to EROFS:
> >>> https://lore.kernel.org/all/CAD14+f2nBZtLfLC6CwNjgCOuRRRjwzttp3D3iK4Of+1EEjK+cw@xxxxxxxxxxxxxx/
> >>> )
> >>>
> >>> Hi.
> >>>
> >>> I'm encountering a very weird EROFS data corruption.
> >>>
> >>> I noticed when I build an EROFS image for AOSP development, the device
> >>> would randomly not boot from a certain build.
> >>> After inspecting the log, I noticed that a file got corrupted.
> >>
> >> Is it observed on your laptop (i7-1185G7), or on some other arm64
> >> device?
> >
> > Yes, only on my laptop. The arm64 device seems fine.
> > The reason it would not boot was that the host machine (my
> > laptop) was repacking the EROFS image incorrectly.
> >
> > The workflow is something like this:
> > Server-built EROFS AOSP image -> Image copied to laptop -> Laptop
> > mounts the EROFS image -> Copies the entire content to a scratch
> > directory (CORRUPT!) -> Changes some files -> mkfs.erofs
> >
> > So the device is not responsible for the corruption, the laptop is.
>
> Ok.
>
> >
> >>
> >>>
> >>> After adding a hash check during the build flow, I noticed that EROFS
> >>> would randomly read data incorrectly.
> >>>
> >>> I now have a reliable method of reproducing the issue, but here's the
> >>> funny/weird part: it's only happening on my laptop (i7-1185G7). This
> >>> is not happening with my 128 cores buildfarm machine (Threadripper
> >>> 3990X).
> >>>
> >>> I first suspected a hardware issue, but:
> >>> a. The laptop had its motherboard replaced recently (due to a failing
> >>> physical Type-C port).
> >>> b. The laptop passes a memory test (memtest86).
> >>> c. This happens on all kernel versions from v5.4 to the latest v6.6
> >>> including my personal custom builds and Canonical's official Ubuntu
> >>> kernels.
> >>> d. This happens on different host SSDs and file-system combinations.
> >>> e. This only happens on LZ4. LZ4HC doesn't trigger the issue.
> >>> f. This only happens when mounting the image natively by the kernel.
> >>> Using fuse with erofsfuse is fine.
> >>
> >> I think it's a weird issue with inplace decompression because you said
> >> it depends on the hardware.  In addition, with your dataset sadly I
> >> cannot reproduce on my local server (Xeon(R) CPU E5-2682 v4).
> >
> > As I feared. Bummer :(
> >
> >>
> >> What is the difference between these two machines? Just a different CPU, or
> >> do they differ in other ways, such as different compilers?
> >
> > I fully and exclusively control both devices, and the setup is almost the same.
> > Same Ubuntu version, kernel/compiler version.
> >
> > But as I said, on my laptop, the issue happens on kernels that someone
> > else (Canonical) built, so I don't think it matters.
>
> The only thing I could say is that the kernel side has optimized
> inplace decompression compared to fuse so that it will reuse the
> same buffer for decompression but with a safe margin (according to
> the current lz4 decompression implementation).  It shouldn't behave
> differently just because of different CPUs.  Let me find more clues
> later, also maybe we should introduce a way for users to turn off
> this if needed.

Cool :)

I'm comfortable changing and building my own custom kernel for this
specific laptop. Feel free to ask me to try out some patches.
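
For anyone trying to reproduce this, a hash check like the one mentioned
above could look roughly like the sketch below. It is only an illustration,
not the actual script from the build flow: the function names (hash_file,
find_mismatches) and the choice of SHA-256 are assumptions. It compares every
regular file under a source tree (e.g. the mounted EROFS image) against the
same relative path in a copy (e.g. the scratch directory).

```python
import hashlib
import os


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(src_root, dst_root):
    """Return sorted relative paths that are missing or differ under dst_root.

    Only regular files are compared; symlinks and special files are skipped.
    """
    mismatches = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            if os.path.islink(src) or not os.path.isfile(src):
                continue
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.isfile(dst) or hash_file(src) != hash_file(dst):
                mismatches.append(rel)
    return sorted(mismatches)
```

Calling find_mismatches() with the mount point and the scratch directory
(both paths depend on the local setup) returns the list of files whose
copies do not match, which is empty when the copy is clean.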

Thanks.

>
> Thanks,
> Gao Xiang