Re: Buffer I/O errors & Kernel OOPS with RAID6

Shaohua Li <shli@xxxxxxxxxx> · Mon, 9 Nov 2015 09:35:07 -0800



On Mon, Nov 09, 2015 at 11:40:00AM +0000, matt@xxxxxxxxxxxxxxxxxxx wrote:
> Hello,
> 
> I am experiencing issues with RAID6 on all kernel versions I have tried
> (3.18.12, 4.0.9, 4.1.12).
> 
> On 3.18.12, I am getting the following logged to dmesg:
> 
> [  896.874943] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5
> writing to inode 361858058 (offset 16777216 size 1052672 starting block
> 5172953088)
> [  896.874945] Buffer I/O error on device md4, logical block 5172953088
> [  896.874947] Buffer I/O error on device md4, logical block 5172953089
> [  896.874948] Buffer I/O error on device md4, logical block 5172953090
> [  896.874949] Buffer I/O error on device md4, logical block 5172953091
> [  896.874950] Buffer I/O error on device md4, logical block 5172953092
> [  896.874950] Buffer I/O error on device md4, logical block 5172953093
> [  896.874951] Buffer I/O error on device md4, logical block 5172953094
> [  896.874952] Buffer I/O error on device md4, logical block 5172953095
> [  896.874953] Buffer I/O error on device md4, logical block 5172953096
> [  896.874953] Buffer I/O error on device md4, logical block 5172953097
> [  897.034829] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5
> writing to inode 361858073 (offset 8388608 size 1052672 starting block
> 5172955136)
> [  897.122306] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5
> writing to inode 361858073 (offset 8388608 size 2101248 starting block
> 5172955264)
> [  897.130547] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5
> writing to inode 361858073 (offset 8388608 size 2101248 starting block
> 5172955392)
> [  897.355966] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5
> writing to inode 361858073 (offset 8388608 size 2625536 starting block
> 5172955520)
> [  897.452464] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5
> writing to inode 361858058 (offset 16777216 size 1576960 starting block
> 5172953216)
> [  897.593480] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5
> writing to inode 361858073 (offset 8388608 size 3149824 starting block
> 5172955648)
> [  897.877728] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5
> writing to inode 361858073 (offset 8388608 size 3674112 starting block
> 5172955776)
> [  898.156331] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5
> writing to inode 361858073 (offset 8388608 size 4198400 starting block
> 5172955904)
> [  898.176687] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5
> writing to inode 361858058 (offset 16777216 size 2101248 starting block
> 5172953344)
> 
> When this happens, I end up with a file on the array which is partially
> corrupt.  For example, if I copied a JPEG file, parts of the image would be
> garbage.
> 
> I initially thought that this could be a kernel issue, so I tried two
> further kernel versions (4.0.9 & 4.1.12). On both, I no longer get the above
> messages; instead I get a kernel oops, and any process accessing the
> array gets stuck in state D.  Here is a typical kernel oops message:
> 
> [  158.138253] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000120
> [  158.138391] IP: [<ffffffffa024cc1f>] handle_stripe+0xdc0/0x1e1f [raid456]
> [  158.138482] PGD 24ff59067 PUD 24fe43067 PMD 0
> [  158.138646] Oops: 0000 [#1] SMP
> [  158.138758] Modules linked in: ipv6 binfmt_misc joydev
> x86_pkg_temp_thermal coretemp kvm_intel kvm microcode pcspkr video i2c_i801
> thermal acpi_cpufreq fan battery rtc_cmos backlight processor thermal_sys
> xhci_pci button xts gf128mul aes_x86_64 cbc sha256_generic
> scsi_transport_iscsi multipath linear raid10 raid456 async_raid6_recov
> async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0
> dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log dm_mod
> hid_sunplus hid_sony led_class hid_samsung hid_pl hid_petalynx hid_monterey
> hid_microsoft hid_logitech hid_gyration hid_ezkey hid_cypress hid_chicony
> hid_cherry hid_belkin hid_apple hid_a4tech sl811_hcd usbhid xhci_hcd
> ohci_hcd uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
> megaraid_sas megaraid_mbox megaraid_mm megaraid sx8
> [  158.141809]  DAC960 cciss mptsas mptfc scsi_transport_fc mptspi
> scsi_transport_spi mptscsih mptbase sg
> [  158.142226] CPU: 0 PID: 2017 Comm: md4_raid6 Not tainted 4.1.12-gentoo #1
> [  158.142272] Hardware name: Supermicro X10SAT/X10SAT, BIOS 2.0 04/21/2014
> [  158.142323] task: ffff880254267050 ti: ffff880095afc000 task.ti:
> ffff880095afc000
> [  158.142376] RIP: 0010:[<ffffffffa024cc1f>]  [<ffffffffa024cc1f>]
> handle_stripe+0xdc0/0x1e1f [raid456]
> [  158.142493] RSP: 0018:ffff880095affc18 EFLAGS: 00010202
> [  158.142554] RAX: 000000000000000d RBX: ffff880095cfac00 RCX:
> 0000000000000002
> [  158.142617] RDX: 000000000000000d RSI: 0000000000000000 RDI:
> 0000000000001040
> [  158.142682] RBP: ffff880095affcf8 R08: 0000000000000003 R09:
> 00000000cd920408
> [  158.142745] R10: 000000000000000d R11: 0000000000000007 R12:
> 000000000000000d
> [  158.142809] R13: 0000000000000000 R14: 000000000000000c R15:
> ffff8802161f2588
> [  158.142873] FS:  0000000000000000(0000) GS:ffff88025ea00000(0000)
> knlGS:0000000000000000
> [  158.142938] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  158.143000] CR2: 0000000000000120 CR3: 0000000253ef4000 CR4:
> 00000000001406f0
> [  158.143062] Stack:
> [  158.143117]  0000000000000000 ffff880254267050 00000000000147c0
> 0000000000000000
> [  158.143328]  ffff8802161f25d0 0000000effffffff ffff8802161f3670
> ffff8802161f2ef0
> [  158.143537]  0000000000000000 0000000000000000 0000000000000000
> 0000000c00000000
> [  158.143747] Call Trace:
> [  158.143805]  [<ffffffffa024dea3>]
> handle_active_stripes.isra.37+0x225/0x2aa [raid456]
> [  158.143873]  [<ffffffffa024e31d>] raid5d+0x363/0x40d [raid456]
> [  158.143937]  [<ffffffff814315bc>] ? schedule+0x6f/0x7e
> [  158.143998]  [<ffffffff81372ae7>] md_thread+0x125/0x13b
> [  158.144060]  [<ffffffff81061b00>] ? wait_woken+0x71/0x71
> [  158.144122]  [<ffffffff813729c2>] ? md_start_sync+0xda/0xda
> [  158.144185]  [<ffffffff81050609>] kthread+0xcd/0xd5
> [  158.144244]  [<ffffffff8105053c>] ? kthread_create_on_node+0x16d/0x16d
> [  158.144309]  [<ffffffff81434f92>] ret_from_fork+0x42/0x70
> [  158.144370]  [<ffffffff8105053c>] ? kthread_create_on_node+0x16d/0x16d
> [  158.144432] Code: 8c 0f d0 01 00 00 48 8b 49 10 80 e1 10 74 0d 49 8b 4f
> 48 80 e1 40 0f 84 c2 0f 00 00 31 c9 41 39 c8 7e 31 48 8b b4 cd 50 ff ff ff
> <48> 83 be 20 01 00 00 00 74 1a 48 8b be 38 01 00 00 40 80 e7 01
> [  158.147700] RIP [<ffffffffa024cc1f>] handle_stripe+0xdc0/0x1e1f [raid456]
> [  158.147801]  RSP <ffff880095affc18>
> [  158.147859] CR2: 0000000000000120
> [  158.147916] ---[ end trace 536b72bd7c91f068 ]---
> 
> In both cases, discs are never flagged as faulty and the array never goes
> into a degraded state.
> 
> I have tried posting this in various forums with no solution so far.  A post
> with further information can be found here:
> https://forums.gentoo.org/viewtopic-t-1032304.html - In that topic I have
> supplied output from various commands that people have asked me to execute.
> Rather than pasting all the output from these commands here, I have linked to
> the thread instead.
> 
> Any ideas what could be going on? Any help would be greatly appreciated.

Could you please try an upstream kernel? There are some recent fixes on the
error handling side that might be related:
ebda780bce8d58ec0ab
36707bb2e7c6730d79
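
As a side note, the faulting access in the quoted oops can be cross-checked by
hand. The sketch below is only an illustration, not part of any kernel tooling:
it assumes the bytes starting at the <48> marker in the "Code:" line form one
x86-64 instruction with a ModRM byte and a 32-bit displacement, and it takes
RSI from the register dump. The variable names are made up for this example.

```python
import struct

# Bytes from the "Code:" line of the oops, starting at the <48> fault marker.
code = bytes.fromhex("4883be2001000000")

rex, opcode, modrm = code[0], code[1], code[2]

# Split the ModRM byte into its three fields.
mod = modrm >> 6          # 2 -> a 32-bit displacement follows
reg = (modrm >> 3) & 7    # 7 -> with opcode 0x83 this selects CMP
rm = modrm & 7            # 6 -> base register RSI (no REX.B set in 0x48)

# The displacement is a little-endian signed 32-bit value after the ModRM byte.
disp = struct.unpack_from("<i", code, 3)[0]
imm = code[7]             # 8-bit immediate operand

rsi = 0x0                 # RSI value taken from the register dump
fault_addr = rsi + disp

print(f"opcode={opcode:#x} mod={mod} reg={reg} rm={rm} disp={disp:#x} imm={imm}")
print(f"computed fault address = {fault_addr:#x}")
```

Under those assumptions the computed address agrees with the CR2 value in the
report (0x120), i.e. a field at offset 0x120 is read through a NULL pointer
held in RSI inside handle_stripe.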