https://bugzilla.kernel.org/show_bug.cgi?id=202859

            Bug ID: 202859
           Summary: Corruption when reading from disk with 32-core processor
           Product: SCSI Drivers
           Version: 2.5
    Kernel Version: 4.14.x, 4.15.x, 4.19.x
          Hardware: x86-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Other
          Assignee: scsi_drivers-other@xxxxxxxxxxxxxxxxxxxx
          Reporter: daedalus@xxxxxxxxxxxxxxx
        Regression: No

Created attachment 281693
  --> https://bugzilla.kernel.org/attachment.cgi?id=281693&action=edit
Script I have used to trigger the problem; under default settings it usually
reproduces the problem within 2 hours.

I have been debugging a Dell R7415 with a 32-core AMD EPYC 7551P processor.
The issue is that I get silent data corruption after a few hours of intensive
disk I/O load. This has been verified on two different servers with the same
components. I believe the problem is either that the PERC H330 Mini firmware
is somehow faulty or that the megaraid_sas driver is broken.

At first I was running a HW RAID setup on the H330 Mini controller that came
with the server and got the XFS filesystem corrupted more or less beyond
repair. During subsequent testing I converted the disks to JBOD mode on the
controller and created individual BTRFS filesystems on all of them so that
checksum errors would show up if they return bad data.

During testing I made a small script that usually reproduces the problem
within 2 hours (attached as dell_fs_test.sh). The problem starts with disks
returning checksum errors (even disks that are not being written to). In my
test setup I have 1 OS disk mounted read-only and 6 other disks mounted
read-write (as seen in the script). When the issue is triggered, the OS disk
starts returning bad data as well. Ramfs disks and live media don't seem to
be affected, so it is only the disks behind the H330 Mini controller.
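For what it's worth, the checksum errors are visible both in dmesg and in the
btrfs error counters. A minimal check looks roughly like this (the mount point
is only an example, not the actual path used in the attached script):

    # Force a full read of every allocated block and verify checksums
    btrfs scrub start -Bd /mnt/test1

    # Per-device error counters; corruption_errs is the interesting one here
    btrfs device stats /mnt/test1

    # The kernel also logs each failure, e.g. "BTRFS warning ... csum failed"
    dmesg | grep -i "csum failed"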
After prolonged periods of checksum errors I even managed to make the disk
capacities jump around between 512 bytes and whatever they really are with a
simple dd if=/dev/urandom of=/dev/sdX bs=1M count=100 (see below):

[26301.605563] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, supports DPO and FUA
[26304.242496] sd 0:0:1:0: [sdb] Sector size 0 reported, assuming 512.
[26304.244081] sd 0:0:1:0: [sdb] 1 512-byte logical blocks: (512 B/512 B)
[26304.244083] sd 0:0:1:0: [sdb] 0-byte physical blocks
[26304.245853] sdb: detected capacity change from 480103981056 to 512
[26314.315108] sd 0:0:2:0: [sdc] Write cache: enabled, read cache: enabled, supports DPO and FUA
[26683.822304] sd 0:0:2:0: [sdc] Sector size 0 reported, assuming 512.
[26683.824020] sd 0:0:2:0: [sdc] 1 512-byte logical blocks: (512 B/512 B)
[26683.824022] sd 0:0:2:0: [sdc] 0-byte physical blocks
[26683.825751] sd 0:0:2:0: [sdc] Write cache: enabled, read cache: enabled, supports DPO and FUA
[26683.825754] sdc: detected capacity change from 480103981056 to 512
[26684.020835] sd 0:0:2:0: [sdc] Sector size 0 reported, assuming 512.
[26946.615148] sd 0:0:3:0: [sdd] Sector size 0 reported, assuming 512.
[26946.617214] sd 0:0:3:0: [sdd] 1 512-byte logical blocks: (512 B/512 B)
[26946.617216] sd 0:0:3:0: [sdd] 0-byte physical blocks
[26946.619055] sd 0:0:3:0: [sdd] Write cache: disabled, read cache: enabled, supports DPO and FUA
[26946.620292] sdd: detected capacity change from 4000787030016 to 512

I have managed to work around the problem by limiting the CPU to 24 cores
(48 threads) in the BIOS; with that limitation I haven't been able to
reproduce any corruption, but immediately after switching back to the full
32c/64t configuration the corruption starts happening again.

To investigate the issue further I toyed around with the megaraid_sas module
parameters, and it would seem that setting smp_affinity_enable=0 on the module
stops the problem from happening, or at least makes it less likely to happen.
At the time of writing I have been running 4 hours of stress on the disks and
haven't produced any corruption.

Oh, and the controller firmware log doesn't show any errors. The OS is silent
too unless a checksumming FS such as BTRFS is used (or until something like
XFS metadata gets hosed).

Right now I'm at a loss on how to debug the problem further, so here is my
report. Feel free to ask for more details :)
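For reference, a minimal sketch of how the smp_affinity_enable=0 workaround
can be made persistent; this is just the conventional way to set a module
parameter, not necessarily the exact steps I used, and if megaraid_sas is
loaded from the initramfs the image has to be regenerated for the option to
take effect:

    # /etc/modprobe.d/megaraid_sas.conf
    options megaraid_sas smp_affinity_enable=0

    # Alternatively, for a single boot, add this to the kernel command line:
    #   megaraid_sas.smp_affinity_enable=0

    # After reboot, confirm the parameter is active (should print 0):
    cat /sys/module/megaraid_sas/parameters/smp_affinity_enable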