Re: RAID 6 Reshape Woes

Phil Turmel <philip@xxxxxxxxxx> · Wed, 18 Nov 2015 20:23:40 -0500

On 11/18/2015 08:07 PM, Francisco Parada wrote:
> Resending, previous message got rejected due to “HTML”.  Damn Apple Mail ;-)

Heh, but let me fix that typo:  Damn Apple ;-)

> Hi all,
> 
> I thought I had corrected all the flaws in my setup, but I was mistaken.  I took care of the hard drive timeout mismatch discussed in a thread a little over a week ago, titled “RAID 6 Not Mounting (Block device is empty)”, by adding “smartctl -l scterc,70,70 /dev/sdX” and “for x in /sys/block/*/device/timeout ; do echo 180 > $x ; done” to my boot scripts.  I took care of my PSU issue by replacing my enclosure’s defective PSU with a new one, which tested out OK with a multimeter.  Today, however, I have bad news once again.

Ugly.

> After having stressed my rebuilt array for a few days, by adding large amounts of data and noting no further syslog errors, I decided that I could not live with 18GB of disk space remaining.  Since my last post, I’ve accumulated an additional terabyte, and so I ran out of space.  I had a spare drive at the ready, so I decided to run “mdadm --grow --raid-devices=7 --backup-file=/root/grow_md126.bak /dev/md126” to go from a 6-drive RAID 6 array to a 7-drive array.  All was good for about a minute, and then my nightmare began.  Luckily, I have a backup from before that last terabyte, which I can live with losing but would rather not.

Time to toss some enclosures and/or cables.

> mdstat output:
> ====================================================================================
> Every 1.0s: cat /proc/mdstat                                                                                        Wed Nov 18 19:25:02 2015
> 
> Personalities : [raid0] [linear] [multipath] [raid1] [raid6] [raid5] [raid4] [raid10]
> md126 : active raid6 sdh[0](F) sdk[6] sdg[5](F) sdf[4](F) sde[3](F) sdj[2] sdi[1]
>       11720540160 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/3] [_UU___U]
>       [>....................]  reshape =  0.0% (2726560/2930135040) finish=193325.8min speed=252K/sec
>       bitmap: 1/22 pages [4KB], 65536KB chunk

Hmmm.  Slow as molasses.
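
For scale, the finish estimate in that mdstat line can be turned into days.  A quick sketch of the conversion (the function name eta_days is made up for illustration, and it assumes awk is available):

```shell
#!/bin/sh
# Convert mdstat's "finish=<minutes>min" estimate into days.
# eta_days is a hypothetical helper name, not part of mdadm.
eta_days() {
    awk -v min="$1" 'BEGIN { printf "%.1f\n", min / 1440 }'
}

# The value reported in the mdstat output quoted above.
eta_days 193325.8
```

That is on the order of four and a half months at 252K/sec.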

> The device is still mounted and I can access all the data in it.

Probably not.  You are just seeing kernel block cache effects, I suspect.

> At 18:55:24, I started my rebuild:
> =====================================================================================================
> Nov 18 18:55:24 DoctorBanner mdadm[1127]: RebuildStarted event detected on md device /dev/md126
> =====================================================================================================

Uhm, what?  What command or action did you take?  Or are you simply
doing a "flashback" to the start of this process?

> Then 3 seconds later (18:55:27), the first “reshape interrupted” message appeared, but I didn’t notice, because the array was chugging along at 9KB/s according to /proc/mdstat:
> =====================================================================================================
> Nov 18 18:55:27 DoctorBanner kernel: [77563.553030] md: md126: reshape interrupted.
> =====================================================================================================
> 
> At some point before the following entries and after starting the reshape, I ran “echo 50000 > /proc/sys/dev/raid/speed_limit_min” to help speed up the reshape, and so I think this is what started causing the issue.
> 
> It continued to reshape for about 5 minutes, and then things got really ugly:
> =====================================================================================================
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163377] ata7.00: failed to read SCR 1 (Emask=0x40)
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163382] ata7.01: failed to read SCR 1 (Emask=0x40)
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163384] ata7.02: failed to read SCR 1 (Emask=0x40)
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163385] ata7.03: failed to read SCR 1 (Emask=0x40)
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163386] ata7.04: failed to read SCR 1 (Emask=0x40)
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163388] ata7.05: failed to read SCR 1 (Emask=0x40)
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163392] ata7.15: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163394] ata7.15: irq_stat 0x08000000, interface fatal error
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163397] ata7.15: SError: { Handshk }
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163399] ata7.00: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163402] ata7.00: failed command: WRITE DMA EXT
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163406] ata7.00: cmd 35/00:40:40:fd:56/00:05:00:00:00/e0 tag 23 dma 688128 out
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163406]          res 50/00:00:7f:6b:6c/00:00:00:00:00/e0 Emask 0x100 (unknown error)
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163408] ata7.00: status: { DRDY }
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163410] ata7.01: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163412] ata7.02: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163414] ata7.03: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163416] ata7.04: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163418] ata7.05: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
> Nov 18 19:00:31 DoctorBanner kernel: [77868.163422] ata7.15: hard resetting link
> Nov 18 19:00:41 DoctorBanner kernel: [77878.160885] ata7.15: softreset failed (1st FIS failed)
> Nov 18 19:00:41 DoctorBanner kernel: [77878.160893] ata7.15: hard resetting link
> Nov 18 19:00:51 DoctorBanner kernel: [77888.162415] ata7.15: softreset failed (1st FIS failed)
> Nov 18 19:00:51 DoctorBanner kernel: [77888.162423] ata7.15: hard resetting link
> Nov 18 19:01:26 DoctorBanner kernel: [77923.153671] ata7.15: softreset failed (1st FIS failed)
> Nov 18 19:01:26 DoctorBanner kernel: [77923.153679] ata7.15: limiting SATA link speed to 1.5 Gbps
> Nov 18 19:01:26 DoctorBanner kernel: [77923.153683] ata7.15: hard resetting link
> Nov 18 19:01:31 DoctorBanner kernel: [77928.160337] ata7.15: softreset failed (1st FIS failed)
> Nov 18 19:01:31 DoctorBanner kernel: [77928.160344] ata7.15: failed to reset PMP, giving up
> Nov 18 19:01:31 DoctorBanner kernel: [77928.160347] ata7.15: Port Multiplier detaching
> =====================================================================================================
> 
> 
> Which then proceeded to reject I/O and offline devices (full syslog attached).
> 
> I’m kind of alright with losing this one, since now I have a decent backup.  But is it even possible to recover from a failure like this while the array is reshaping?

Stop the array completely.  Use --assemble --force with all of the
drives, including the new one.  Include the same --backup-file.
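
Roughly like this, as a sketch rather than a recipe.  The array name and backup file are the ones from your --grow command, and the member names are copied from your mdstat output, so I'm assuming they haven't moved around since; check each with "mdadm --examine" first.  The RUN switch and the function names are just my own scaffolding: by default the script only prints what it would do.  The filesystem has to be unmounted before the array will stop.

```shell
#!/bin/sh
# Dry-run by default: commands are only printed unless RUN=1 is set.
ARRAY=/dev/md126
BACKUP=/root/grow_md126.bak
# Assumption: device names as shown in the mdstat output above. They can
# change after a reboot or re-cabling, so verify with "mdadm --examine".
MEMBERS="/dev/sdh /dev/sdi /dev/sdj /dev/sde /dev/sdf /dev/sdg /dev/sdk"

# Execute the command when RUN=1, otherwise just echo it.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

recover_cmds() {
    # Stop the array completely.
    run mdadm --stop "$ARRAY"
    # Force assembly with every member, including the newly added drive,
    # and point mdadm at the same backup file used for the reshape.
    # $MEMBERS is intentionally unquoted so it splits into separate names.
    run mdadm --assemble --force --backup-file="$BACKUP" "$ARRAY" $MEMBERS
}

recover_cmds
```

Read the printed commands, and only then rerun with RUN=1 in the environment.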

> I’m going to start chalking it up to the PCIe Port Multiplier being the root of the problem.

Likely.  Are the port multipliers capable of the same speeds as the
drives and controllers?
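
One way to look, as a rough sketch: recent kernels expose the negotiated and limit speeds per link under /sys/class/ata_link.  I'm assuming your kernel has that class and that the links behind the multiplier show up there too; the function name is just for illustration.

```shell
#!/bin/sh
# Print the current and limit SATA speed for each libata link.
# Assumption: the kernel provides /sys/class/ata_link/link*/sata_spd and
# sata_spd_limit. An alternate base directory can be passed for testing.
show_link_speeds() {
    base=${1:-/sys/class/ata_link}
    for link in "$base"/link*; do
        # Skip entries without a readable speed attribute (or no match).
        [ -r "$link/sata_spd" ] || continue
        printf '%s: current=%s limit=%s\n' \
            "$(basename "$link")" \
            "$(cat "$link/sata_spd")" \
            "$(cat "$link/sata_spd_limit" 2>/dev/null)"
    done
}

show_link_speeds
```

A link sitting at 1.5 Gbps while the drives and controller can do more would fit the "limiting SATA link speed" line in your log.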

> What do you guys think?

New enclosures & controllers so you can ditch the port multipliers?

Phil