https://bugzilla.kernel.org/show_bug.cgi?id=202859

            Bug ID: 202859
           Summary: Corruption when reading from disk with 32-core processor
           Product: SCSI Drivers
           Version: 2.5
    Kernel Version: 4.14.x, 4.15.x, 4.19.x
          Hardware: x86-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Other
          Assignee: scsi_drivers-other@xxxxxxxxxxxxxxxxxxxx
          Reporter: daedalus@xxxxxxxxxxxxxxx
        Regression: No

Created attachment 281693
  --> https://bugzilla.kernel.org/attachment.cgi?id=281693&action=edit
Script I have used to trigger the problem; under default settings it usually
reproduces the problem within 2 hours.

I have been debugging a Dell R7415 with a 32-core AMD EPYC 7551P processor.
The issue is that I get silent data corruption after a few hours of intensive
disk I/O load. This has been verified on two different servers with the same
components. I believe the problem is either that the PERC H330 Mini firmware
is somehow faulty or that the megaraid_sas driver is broken.

At first I was running a HW RAID setup on the H330 Mini controller that came
with the server and got the XFS filesystem corrupted more or less beyond
repair. During subsequent testing I converted the disks to JBOD mode on the
controller and created individual BTRFS filesystems on all of them so that
checksum errors would show up if they return bad data.

During testing I made a small script that usually reproduces the problem
within 2 hours (attached as dell_fs_test.sh). The problem starts with disks
returning checksum errors (even disks that are not being written to). In my
test setup I have 1 OS disk mounted read-only and 6 other disks mounted
read-write (as seen in the script). When the issue is triggered, the OS disk
starts returning bad data as well. Ramfs disks and live media don't seem to
be affected, so it is only the disks behind the H330 Mini controller.
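For what it's worth, the checksum errors are visible both in dmesg and in the
btrfs error counters. A minimal check looks roughly like this (the mount point
is only an example, not the actual path used in the attached script):

    # Force a full read of every allocated block and verify checksums
    btrfs scrub start -Bd /mnt/test1

    # Per-device error counters; corruption_errs is the interesting one here
    btrfs device stats /mnt/test1

    # The kernel also logs each failure, e.g. "BTRFS warning ... csum failed"
    dmesg | grep -i "csum failed"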
After prolonged periods of checksum errors I even managed to make the disk
capacities jump around between 512 bytes and whatever they really are with a
simple dd if=/dev/urandom of=/dev/sdX bs=1M count=100 (see below):

[26301.605563] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, supports DPO and FUA
[26304.242496] sd 0:0:1:0: [sdb] Sector size 0 reported, assuming 512.
[26304.244081] sd 0:0:1:0: [sdb] 1 512-byte logical blocks: (512 B/512 B)
[26304.244083] sd 0:0:1:0: [sdb] 0-byte physical blocks
[26304.245853] sdb: detected capacity change from 480103981056 to 512
[26314.315108] sd 0:0:2:0: [sdc] Write cache: enabled, read cache: enabled, supports DPO and FUA
[26683.822304] sd 0:0:2:0: [sdc] Sector size 0 reported, assuming 512.
[26683.824020] sd 0:0:2:0: [sdc] 1 512-byte logical blocks: (512 B/512 B)
[26683.824022] sd 0:0:2:0: [sdc] 0-byte physical blocks
[26683.825751] sd 0:0:2:0: [sdc] Write cache: enabled, read cache: enabled, supports DPO and FUA
[26683.825754] sdc: detected capacity change from 480103981056 to 512
[26684.020835] sd 0:0:2:0: [sdc] Sector size 0 reported, assuming 512.
[26946.615148] sd 0:0:3:0: [sdd] Sector size 0 reported, assuming 512.
[26946.617214] sd 0:0:3:0: [sdd] 1 512-byte logical blocks: (512 B/512 B)
[26946.617216] sd 0:0:3:0: [sdd] 0-byte physical blocks
[26946.619055] sd 0:0:3:0: [sdd] Write cache: disabled, read cache: enabled, supports DPO and FUA
[26946.620292] sdd: detected capacity change from 4000787030016 to 512

I have managed to work around the problem by limiting the CPU to 24 cores
(48 threads) in the BIOS; with that limitation I haven't been able to
reproduce any corruption, but immediately after switching back to the full
32c/64t configuration the corruption starts happening again.

To investigate the issue further I toyed around with the megaraid_sas module
parameters, and it would seem that setting smp_affinity_enable=0 on the module
stops the problem from happening, or at least makes it less likely to happen.
At the time of writing I have been running 4 hours of stress on the disks and
haven't produced any corruption.

Oh, and the controller firmware log doesn't show any errors. The OS is silent
too unless a checksumming FS such as BTRFS is used (or until something like
XFS metadata gets hosed).

Right now I'm at a loss on how to debug the problem further, so here is my
report. Feel free to ask for more details :)
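For reference, a minimal sketch of how the smp_affinity_enable=0 workaround
can be made persistent; this is just the conventional way to set a module
parameter, not necessarily the exact steps I used, and if megaraid_sas is
loaded from the initramfs the image has to be regenerated for the option to
take effect:

    # /etc/modprobe.d/megaraid_sas.conf
    options megaraid_sas smp_affinity_enable=0

    # Alternatively, for a single boot, add this to the kernel command line:
    #   megaraid_sas.smp_affinity_enable=0

    # After reboot, confirm the parameter is active (should print 0):
    cat /sys/module/megaraid_sas/parameters/smp_affinity_enable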