Hello,
I am experiencing issues with RAID6 on all kernel versions I have tried
(3.18.12, 4.0.9, 4.1.12).
On 3.18.12, I am getting the following logged to dmesg:
[ 896.874943] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361858058 (offset 16777216 size 1052672 starting block 5172953088)
[ 896.874945] Buffer I/O error on device md4, logical block 5172953088
[ 896.874947] Buffer I/O error on device md4, logical block 5172953089
[ 896.874948] Buffer I/O error on device md4, logical block 5172953090
[ 896.874949] Buffer I/O error on device md4, logical block 5172953091
[ 896.874950] Buffer I/O error on device md4, logical block 5172953092
[ 896.874950] Buffer I/O error on device md4, logical block 5172953093
[ 896.874951] Buffer I/O error on device md4, logical block 5172953094
[ 896.874952] Buffer I/O error on device md4, logical block 5172953095
[ 896.874953] Buffer I/O error on device md4, logical block 5172953096
[ 896.874953] Buffer I/O error on device md4, logical block 5172953097
[ 897.034829] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361858073 (offset 8388608 size 1052672 starting block 5172955136)
[ 897.122306] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361858073 (offset 8388608 size 2101248 starting block 5172955264)
[ 897.130547] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361858073 (offset 8388608 size 2101248 starting block 5172955392)
[ 897.355966] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361858073 (offset 8388608 size 2625536 starting block 5172955520)
[ 897.452464] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361858058 (offset 16777216 size 1576960 starting block 5172953216)
[ 897.593480] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361858073 (offset 8388608 size 3149824 starting block 5172955648)
[ 897.877728] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361858073 (offset 8388608 size 3674112 starting block 5172955776)
[ 898.156331] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361858073 (offset 8388608 size 4198400 starting block 5172955904)
[ 898.176687] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361858058 (offset 16777216 size 2101248 starting block 5172953344)
When this happens, I end up with a file on the array that is partially
corrupt. For example, if I copy a JPEG file, parts of the image are
garbage.
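For anyone who wants to see which inodes and blocks are affected, the
EXT4-fs warnings above can be summarised with a short script. This is
only a minimal sketch: the names WARNING_RE, summarise and error_name are
made up for illustration, the two sample lines are copied from the log
above (joined back onto one line each), and it assumes one log entry per
line. Error -5 corresponds to EIO, the generic I/O error.

```python
import errno
import re
from collections import defaultdict

# Matches the EXT4-fs write warnings from the dmesg output above.
# Assumption: each log entry sits on a single line.
WARNING_RE = re.compile(
    r"I/O error (?P<err>-?\d+) writing to inode (?P<inode>\d+) "
    r"\(offset (?P<offset>\d+) size (?P<size>\d+) "
    r"starting block (?P<block>\d+)\)"
)


def summarise(log_text):
    """Return {inode: [(starting_block, size), ...]} for failed writes."""
    failures = defaultdict(list)
    for line in log_text.splitlines():
        match = WARNING_RE.search(line)
        if match:
            inode = int(match["inode"])
            failures[inode].append((int(match["block"]), int(match["size"])))
    return dict(failures)


def error_name(code):
    """Translate a kernel-style negative error code into its errno name."""
    return errno.errorcode.get(abs(code), "unknown")


# Two entries taken from the log above.
SAMPLE = (
    "[ 897.034829] EXT4-fs warning (device md4): ext4_end_bio:317: "
    "I/O error -5 writing to inode 361858073 "
    "(offset 8388608 size 1052672 starting block 5172955136)\n"
    "[ 897.452464] EXT4-fs warning (device md4): ext4_end_bio:317: "
    "I/O error -5 writing to inode 361858058 "
    "(offset 16777216 size 1576960 starting block 5172953216)\n"
)

if __name__ == "__main__":
    for inode, writes in summarise(SAMPLE).items():
        print(inode, writes)
    print(error_name(-5))
```

Feeding it the full dmesg output instead of SAMPLE should show that only
two inodes (361858058 and 361858073) appear in the warnings quoted here.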
I initially thought that this could be a kernel issue, so I tried two
further kernel versions (4.0.9 and 4.1.12). On both, I no longer get the
above messages; instead I get a kernel oops, and any process accessing
the array gets stuck in state D (uninterruptible sleep). Here is a
typical kernel oops message:
[ 158.138253] BUG: unable to handle kernel NULL pointer dereference at 0000000000000120
[ 158.138391] IP: [<ffffffffa024cc1f>] handle_stripe+0xdc0/0x1e1f [raid456]
[ 158.138482] PGD 24ff59067 PUD 24fe43067 PMD 0
[ 158.138646] Oops: 0000 [#1] SMP
[ 158.138758] Modules linked in: ipv6 binfmt_misc joydev x86_pkg_temp_thermal coretemp kvm_intel kvm microcode pcspkr video i2c_i801 thermal acpi_cpufreq fan battery rtc_cmos backlight processor thermal_sys xhci_pci button xts gf128mul aes_x86_64 cbc sha256_generic scsi_transport_iscsi multipath linear raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log dm_mod hid_sunplus hid_sony led_class hid_samsung hid_pl hid_petalynx hid_monterey hid_microsoft hid_logitech hid_gyration hid_ezkey hid_cypress hid_chicony hid_cherry hid_belkin hid_apple hid_a4tech sl811_hcd usbhid xhci_hcd ohci_hcd uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common megaraid_sas megaraid_mbox megaraid_mm megaraid sx8
[ 158.141809] DAC960 cciss mptsas mptfc scsi_transport_fc mptspi scsi_transport_spi mptscsih mptbase sg
[ 158.142226] CPU: 0 PID: 2017 Comm: md4_raid6 Not tainted 4.1.12-gentoo #1
[ 158.142272] Hardware name: Supermicro X10SAT/X10SAT, BIOS 2.0 04/21/2014
[ 158.142323] task: ffff880254267050 ti: ffff880095afc000 task.ti: ffff880095afc000
[ 158.142376] RIP: 0010:[<ffffffffa024cc1f>] [<ffffffffa024cc1f>] handle_stripe+0xdc0/0x1e1f [raid456]
[ 158.142493] RSP: 0018:ffff880095affc18 EFLAGS: 00010202
[ 158.142554] RAX: 000000000000000d RBX: ffff880095cfac00 RCX: 0000000000000002
[ 158.142617] RDX: 000000000000000d RSI: 0000000000000000 RDI: 0000000000001040
[ 158.142682] RBP: ffff880095affcf8 R08: 0000000000000003 R09: 00000000cd920408
[ 158.142745] R10: 000000000000000d R11: 0000000000000007 R12: 000000000000000d
[ 158.142809] R13: 0000000000000000 R14: 000000000000000c R15: ffff8802161f2588
[ 158.142873] FS: 0000000000000000(0000) GS:ffff88025ea00000(0000) knlGS:0000000000000000
[ 158.142938] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 158.143000] CR2: 0000000000000120 CR3: 0000000253ef4000 CR4: 00000000001406f0
[ 158.143062] Stack:
[ 158.143117] 0000000000000000 ffff880254267050 00000000000147c0 0000000000000000
[ 158.143328] ffff8802161f25d0 0000000effffffff ffff8802161f3670 ffff8802161f2ef0
[ 158.143537] 0000000000000000 0000000000000000 0000000000000000 0000000c00000000
[ 158.143747] Call Trace:
[ 158.143805] [<ffffffffa024dea3>] handle_active_stripes.isra.37+0x225/0x2aa [raid456]
[ 158.143873] [<ffffffffa024e31d>] raid5d+0x363/0x40d [raid456]
[ 158.143937] [<ffffffff814315bc>] ? schedule+0x6f/0x7e
[ 158.143998] [<ffffffff81372ae7>] md_thread+0x125/0x13b
[ 158.144060] [<ffffffff81061b00>] ? wait_woken+0x71/0x71
[ 158.144122] [<ffffffff813729c2>] ? md_start_sync+0xda/0xda
[ 158.144185] [<ffffffff81050609>] kthread+0xcd/0xd5
[ 158.144244] [<ffffffff8105053c>] ? kthread_create_on_node+0x16d/0x16d
[ 158.144309] [<ffffffff81434f92>] ret_from_fork+0x42/0x70
[ 158.144370] [<ffffffff8105053c>] ? kthread_create_on_node+0x16d/0x16d
[ 158.144432] Code: 8c 0f d0 01 00 00 48 8b 49 10 80 e1 10 74 0d 49 8b 4f 48 80 e1 40 0f 84 c2 0f 00 00 31 c9 41 39 c8 7e 31 48 8b b4 cd 50 ff ff ff <48> 83 be 20 01 00 00 00 74 1a 48 8b be 38 01 00 00 40 80 e7 01
[ 158.147700] RIP [<ffffffffa024cc1f>] handle_stripe+0xdc0/0x1e1f [raid456]
[ 158.147801] RSP <ffff880095affc18>
[ 158.147859] CR2: 0000000000000120
[ 158.147916] ---[ end trace 536b72bd7c91f068 ]---
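As a sanity check on the oops, the faulting instruction can be decoded
by hand from the "Code:" line, where the byte marked <48> is the start
of the instruction that faulted. The sketch below is a deliberately tiny
decoder, not a real disassembler: decode_group1_imm8 and the lookup
tables are hypothetical helpers written for this one encoding (REX.W
prefix, opcode 0x83, a base register plus 32-bit displacement, 8-bit
immediate). The RSI and CR2 values are copied from the register dump
above.

```python
import struct

# Bytes from the "Code:" line, starting at the <48> marker.
FAULTING = bytes.fromhex("48 83 be 20 01 00 00 00")

# Register numbering for the low eight 64-bit registers (no REX.B/REX.R).
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]
# Opcode 0x83 selects the operation through the ModRM "reg" field.
GROUP1_OPS = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]


def decode_group1_imm8(code):
    """Decode 'REX.W 83 /r disp32 imm8' with a plain base register."""
    rex, opcode, modrm = code[0], code[1], code[2]
    if rex != 0x48 or opcode != 0x83:
        raise ValueError("only REX.W + opcode 0x83 is handled")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b10 or rm == 0b100:
        raise ValueError("only [reg + disp32] operands are handled")
    (disp,) = struct.unpack_from("<i", code, 3)  # little-endian disp32
    (imm,) = struct.unpack_from("<b", code, 7)   # sign-extended imm8
    return GROUP1_OPS[reg], REGS64[rm], disp, imm


# Values copied from the register dump in the oops.
RSI_AT_FAULT = 0x0000000000000000
CR2 = 0x0000000000000120

op, base, disp, imm = decode_group1_imm8(FAULTING)
fault_address = RSI_AT_FAULT + disp

if __name__ == "__main__":
    print(f"{op}q ${imm:#x}, {disp:#x}(%{base})")
    print(f"computed fault address {fault_address:#x}, CR2 {CR2:#x}")
```

So the instruction compares a field at offset 0x120 from RSI against
zero, and with RSI being NULL the access lands exactly on the address in
CR2. That is consistent with handle_stripe following a NULL pointer to
some structure, although which structure cannot be told from the trace
alone.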
In both cases, discs are never flagged as faulty and the array never
goes into a degraded state.
I have tried posting this in various forums with no solution so far. A
post with further information can be found here:
https://forums.gentoo.org/viewtopic-t-1032304.html
In that topic I have supplied output from various commands that people
have asked me to execute. Rather than pasting all of that output here, I
have linked to the thread instead.
Any ideas what could be going on? Any help would be greatly appreciated.