Re: power outage while raid5->raid6 was in progress

Sebastian Reichel <elektranox@xxxxxxxxx> · Fri, 9 Jul 2010 01:50:02 +0200

On Fri, Jul 09, 2010 at 08:48:08AM +1000, Neil Brown wrote:
> On Thu, 8 Jul 2010 10:48:37 +0200
> Sebastian Reichel <elektranox@xxxxxxxxx> wrote:
> 
> > On Thu, Jul 08, 2010 at 11:21:44AM +1000, Neil Brown wrote:
> > > On Thu, 8 Jul 2010 01:13:16 +0200
> > > Sebastian Reichel <elektranox@xxxxxxxxx> wrote:
> > > 
> > > > On Thu, Jul 08, 2010 at 08:44:50AM +1000, Neil Brown wrote:
> > > > > On Wed, 7 Jul 2010 22:41:10 +0200
> > > > > Sebastian Reichel <elektranox@xxxxxxxxx> wrote:
> > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > I have some problems with my raid. I tried updating from 5 disks raid5 to 8 disks
> > > > > > raid6 as described on http://neil.brown.name/blog/20090817000931#2. The command I
> > > > > > used was: mdadm --grow /dev/md0 --level=6 --raid-disk=8
> > > > > > 
> > > > > > While the rebuild was in progress my system hung, so I had to force power down it.
> > > > > > After rebooting the system I reassembled the raid. You can see the resulting mess
> > > > > > below. How can I recover from this state?
> > > > > 
> > > > > Please report the output of
> > > > > 
> > > > >   mdadm -E /dev/sd[efghijkl]1
> > > > > 
> > > > > then I'll see what can be done.
> > > > 
> > > > thank you for having a look at it :)
> > > 
> > > 
> > > It appears that the RAID5 -> RAID6 conversion (which is instantaneous, but
> > > results in a non-standard RAID6 parity layout) happened, but the 
> > > 6disk -> 8disk reshape which would have been combined with producing a
> > > more standard RAID6 parity layout did not even begin.
> > > I don't know why that would be.  Do you remember seeing the reshape being
> > > under-way in /proc/mdstat at all??
> > >
> > > If you didn't them I am very confused and the following is not at all
> > > reliable.  If you didn't and you only assumed a reshape was happening, then
> > > read on.
> > > 
> > > [...]
> > > 
> > > But if you are sure the reshape actually started the first time, don't do
> > > any of this.  Rather try to find some earlier kernel logs that show the
> > > reshape starting, and maybe show what caused the crash.
> > 
> > Yes, I saw the reshape via /proc/mdstat, it already passed 30% iirc.
> > Here are the interesting parts of the kernel log:
> 
> Ahhh, that helps a lot - thanks.
> 
> mdadm successfully converted RAID5 -> RAID6 as I thought, but it hit a bug
> (that is aleady fixed in the -stable series - sorry about that).  So the
> 6disk -> 8disk conversion couldn't be requested.
> You stopped and restarted the array which worked and you now have a
> degraded RAID6 over 6 devices with a non-standard layout.

Yes, I read about this problem before, so I restarted it ;)

> As it was degraded and there was a spare it started recovering the missing
> devices to a spare.  You thought this was the reshape progressing and in a
> sense it was, just not exactly the reshape you were expecting.
> 
> You then hit read errors on sdk - at about the 42% mark if my sums are right.
> 
> Then it threw a warning about spending too much time in the thread that was
> trying to correct the read errors.  Then .... did it hang?

iirc it was like this:
 - mdadm reshap hang
 - system itself was still working
 - I couldn't get rid of the raid
 - clean shutdown did not work (probably because shutdown tries to cleanly umount?)

> Anyway, this all confirms that I was right about the state of your array.
> It is still a degraded 6disk RAID6 with a non-standard layout.
> 
> However sdk is clearly questionable.  You might want to exclude it from the
> array.

I guess it's the fault of the backplane, which already made some problems before.
I will check after sleeping.

> The rest of my advice about what to do holds.
> However you might want to get the latest 2.6.34.y -stable kernel if you can.

Ok, I will compile it after checking the backplane.

>> BTW, I'm curious about your hang.  You replaced a bunch of errors with
> > [...] (more sdk failures)
> 
> 2 questions:
>   1/ was this all "read error corrected" with consecutive addresses?  what
>      did the addresses go up to?
>   2/ did this continue for the full 4 minutes until the "INFO:" hang message?

Here is the part I replaced at that point:

Jul  6 23:54:20 mars kernel: [17622.181518] raid5:md0: read error corrected (8 sectors at 1241477680 on sdk1)
Jul  6 23:56:03 mars kernel: [17715.250058] rr272x_1x:Device error information 0x1000000
Jul  6 23:56:03 mars kernel: [17715.250064] rr272x_1x:Task file error, StatusReg=0x40, ErrReg=0x0, LBA[0-3]=0x4ac51577,LBA[4-7]=0x0.
Jul  6 23:56:03 mars kernel: [17715.250453] rr272x_1x:Device error information 0x1000000
Jul  6 23:56:03 mars kernel: [17715.250456] rr272x_1x:Task file error, StatusReg=0x51, ErrReg=0x4, LBA[0-3]=0x4ac51576,LBA[4-7]=0x0.
Jul  6 23:56:03 mars kernel: [17725.226318] sd 8:0:6:0: [sdk] Unhandled error code
Jul  6 23:56:03 mars kernel: [17725.226322] sd 8:0:6:0: [sdk] Result: hostbyte=DID_ABORT driverbyte=DRIVER_INVALID
Jul  6 23:56:03 mars kernel: [17725.226325] sd 8:0:6:0: [sdk] CDB: Read(10): 28 00 4a c5 15 77 00 00 48 00
Jul  6 23:56:03 mars kernel: [17725.226331] end_request: I/O error, dev sdk, sector 1254430071
[...] more sdk unhandled error code with increasing sector, see next line for the last one
Jul  6 23:56:03 mars kernel: [17725.227090] sd 8:0:6:0: [sdk] Unhandled error code
Jul  6 23:56:03 mars kernel: [17725.227091] sd 8:0:6:0: [sdk] Result: hostbyte=DID_ABORT driverbyte=DRIVER_INVALID
Jul  6 23:56:03 mars kernel: [17725.227094] sd 8:0:6:0: [sdk] CDB: Read(10): 28 00 4a c5 1a bf 00 00 38 00
Jul  6 23:56:03 mars kernel: [17725.227099] end_request: I/O error, dev sdk, sector 1254431423
Jul  6 23:56:03 mars kernel: [17725.265533] rr272x_1x:Device error information 0x1000000
Jul  6 23:56:03 mars kernel: [17725.265539] rr272x_1x:Task file error, StatusReg=0x41, ErrReg=0x4, LBA[0-3]=0x4ac51a8e,LBA[4-7]=0x0.
Jul  6 23:56:03 mars kernel: [17725.347683] rr272x_1x:Device error information 0x8000000080000000
Jul  6 23:58:38 mars kernel: [17880.544030] INFO: task md0_resync:13565 blocked for more than 120 seconds.

-- Sebastian

> > Jul  6 21:09:14 mars kernel: [ 7716.935683] RAID5 conf printout:
> > Jul  6 21:09:14 mars kernel: [ 7716.935688]  --- rd:5 wd:5
> > Jul  6 21:09:14 mars kernel: [ 7716.935691]  disk 0, o:1, dev:sdk1
> > Jul  6 21:09:14 mars kernel: [ 7716.935693]  disk 1, o:1, dev:sde1
> > Jul  6 21:09:14 mars kernel: [ 7716.935695]  disk 2, o:1, dev:sdl1
> > Jul  6 21:09:14 mars kernel: [ 7716.935696]  disk 3, o:1, dev:sdj1
> > Jul  6 21:09:14 mars kernel: [ 7716.935698]  disk 4, o:1, dev:sdi1
> > Jul  6 21:10:04 mars kernel: [ 7766.940183] raid5: device sdk1 operational as raid disk 0
> > Jul  6 21:10:04 mars kernel: [ 7766.940186] raid5: device sdi1 operational as raid disk 4
> > Jul  6 21:10:04 mars kernel: [ 7766.940189] raid5: device sdj1 operational as raid disk 3
> > Jul  6 21:10:04 mars kernel: [ 7766.940191] raid5: device sdl1 operational as raid disk 2
> > Jul  6 21:10:04 mars kernel: [ 7766.940193] raid5: device sde1 operational as raid disk 1
> > Jul  6 21:10:04 mars kernel: [ 7766.940840] raid5: allocated 6386kB for md0
> > Jul  6 21:10:04 mars kernel: [ 7766.952476] 0: w=1 pa=0 pr=6 m=2 a=18 r=6 op1=0 op2=0
> > Jul  6 21:10:04 mars kernel: [ 7766.952480] 4: w=2 pa=0 pr=6 m=2 a=18 r=6 op1=0 op2=0
> > Jul  6 21:10:04 mars kernel: [ 7766.952482] 3: w=3 pa=0 pr=6 m=2 a=18 r=6 op1=0 op2=0
> > Jul  6 21:10:04 mars kernel: [ 7766.952485] 2: w=4 pa=0 pr=6 m=2 a=18 r=6 op1=0 op2=0
> > Jul  6 21:10:04 mars kernel: [ 7766.952487] 1: w=5 pa=0 pr=6 m=2 a=18 r=6 op1=0 op2=0
> > Jul  6 21:10:04 mars kernel: [ 7766.952490] raid5: raid level 6 set md0 active with 5 out of 6 devices, algorithm 18
> > Jul  6 21:10:04 mars kernel: [ 7766.952516] RAID5 conf printout:
> > Jul  6 21:10:04 mars kernel: [ 7766.952517]  --- rd:6 wd:5
> > Jul  6 21:10:04 mars kernel: [ 7766.952520]  disk 0, o:1, dev:sdk1
> > Jul  6 21:10:04 mars kernel: [ 7766.952522]  disk 1, o:1, dev:sde1
> > Jul  6 21:10:04 mars kernel: [ 7766.952523]  disk 2, o:1, dev:sdl1
> > Jul  6 21:10:04 mars kernel: [ 7766.952525]  disk 3, o:1, dev:sdj1
> > Jul  6 21:10:04 mars kernel: [ 7766.952527]  disk 4, o:1, dev:sdi1
> > Jul  6 21:10:04 mars kernel: [ 7766.952536] ------------[ cut here ]------------
> > Jul  6 21:10:04 mars kernel: [ 7766.952542] WARNING: at /build/buildd-linux-2.6_2.6.34-1~experimental.2-amd64-zn0ozk/linux-2.6-2.6.34/debian/build/source_amd64_none/fs/sysfs/dir.c:451 sysfs_add_one+0xcc/0xe3()
> > Jul  6 21:10:04 mars kernel: [ 7766.952545] Hardware name: System Product Name
> > Jul  6 21:10:04 mars kernel: [ 7766.952547] sysfs: cannot create duplicate filename '/devices/virtual/block/md0/md/stripe_cache_size'
> > Jul  6 21:10:04 mars kernel: [ 7766.952549] Modules linked in: ip6t_LOG xt_hl nf_conntrack_ipv6 ipt_REJECT ipt_LOG xt_limit xt_tcpudp ipt_addrtype ipt_MASQUERADE iptable_nat xt_state ip6table_filter ip6_tables nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables nfsd nfs lockd fscache nfs_acl auth_rpcgss sunrpc pppoe pppox ppp_generic xfs exportfs rr272x_1x(P) hwmon_vid loop sha256_generic aes_x86_64 aes_generic cbc dm_crypt dm_mod arc4 ecb dvb_pll ath5k hfcpci mac80211 stv0299 b2c2_flexcop_pci mISDN_core ath b2c2_flexcop hisax dvb_core snd_pcm cx24123 snd_timer crc_ccitt cfg80211 cx24113 isdn snd rfkill s5h1420 soundcore i2c_nforce2 evdev pcspkr led_class snd_page_alloc edac_core edac_mce_amd i2c_core k8temp slhc tpm_tis tpm tpm_bios processor button asus_atk0110 ext3 jbd mbcache raid456 md_mod async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx ohci_hcd sd_mod crc
>  _t10dif ata_generic fan firewire_ohci
> > Jul  6 21:10:04 mars kernel: ehci_hcd pata_amd firewire_core crc_itu_t thermal thermal_sys sata_nv usbcore nls_base libata forcedeth scsi_mod [last unloaded: scsi_wait_scan]
> > Jul  6 21:10:04 mars kernel: [ 7766.952622] Pid: 6844, comm: mdadm Tainted: P        W  2.6.34-1-amd64 #1
> > Jul  6 21:10:04 mars kernel: [ 7766.952624] Call Trace:
> > Jul  6 21:10:04 mars kernel: [ 7766.952630]  [<ffffffff81045655>] ? warn_slowpath_common+0x76/0x8c
> > Jul  6 21:10:04 mars kernel: [ 7766.952635]  [<ffffffff810456bd>] ? warn_slowpath_fmt+0x40/0x45
> > Jul  6 21:10:04 mars kernel: [ 7766.952638]  [<ffffffff81138209>] ? sysfs_add_one+0xcc/0xe3
> > Jul  6 21:10:04 mars kernel: [ 7766.952642]  [<ffffffff81137416>] ? sysfs_add_file_mode+0x4b/0x7d
> > Jul  6 21:10:04 mars kernel: [ 7766.952646]  [<ffffffff8113933c>] ? internal_create_group+0xdd/0x16b
> > Jul  6 21:10:04 mars kernel: [ 7766.952652]  [<ffffffffa01a0f67>] ? run+0x4fa/0x685 [raid456]
> > Jul  6 21:10:04 mars kernel: [ 7766.952660]  [<ffffffffa0181b20>] ? level_store+0x3b7/0x42e [md_mod]
> > Jul  6 21:10:04 mars kernel: [ 7766.952667]  [<ffffffffa017eb1f>] ? md_attr_store+0x77/0x96 [md_mod]
> > Jul  6 21:10:04 mars kernel: [ 7766.952670]  [<ffffffff81136ede>] ? sysfs_write_file+0xe3/0x11f
> > Jul  6 21:10:04 mars kernel: [ 7766.952674]  [<ffffffff810e7bc6>] ? vfs_write+0xa4/0x101
> > Jul  6 21:10:04 mars kernel: [ 7766.952677]  [<ffffffff810e7cd6>] ? sys_write+0x45/0x6b
> > Jul  6 21:10:04 mars kernel: [ 7766.952680]  [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
> > Jul  6 21:10:04 mars kernel: [ 7766.952682] ---[ end trace 2d8a2ef8dd7ca7b7 ]---
> > Jul  6 21:10:04 mars kernel: [ 7766.952687] raid5: failed to create sysfs attributes for md0
> > Jul  6 21:21:10 mars kernel: [ 8432.471242] md: md0 stopped.
> > [...]
> > Jul  6 21:25:23 mars kernel: [ 8685.659605] raid5: raid level 6 set md0 active with 5 out of 6 devices, algorithm 18
> > Jul  6 21:25:23 mars kernel: [ 8685.659631] RAID5 conf printout:
> > Jul  6 21:25:23 mars kernel: [ 8685.659632]  --- rd:6 wd:5
> > Jul  6 21:25:23 mars kernel: [ 8685.659634]  disk 0, o:1, dev:sdk1
> > Jul  6 21:25:23 mars kernel: [ 8685.659636]  disk 1, o:1, dev:sde1
> > Jul  6 21:25:23 mars kernel: [ 8685.659638]  disk 2, o:1, dev:sdl1
> > Jul  6 21:25:23 mars kernel: [ 8685.659640]  disk 3, o:1, dev:sdj1
> > Jul  6 21:25:23 mars kernel: [ 8685.659641]  disk 4, o:1, dev:sdi1
> > Jul  6 21:25:23 mars kernel: [ 8685.659677] md0: detected capacity change from 0 to 6001196793856
> > Jul  6 21:25:23 mars kernel: [ 8685.659813]  md0: unknown partition table
> > [...]
> > Jul  6 21:28:16 mars kernel: [ 8858.893402] md: recovery of RAID array md0
> > Jul  6 21:28:16 mars kernel: [ 8858.893405] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> > Jul  6 21:28:16 mars kernel: [ 8858.893407] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
> > Jul  6 21:28:16 mars kernel: [ 8858.893422] md: using 128k window, over a total of 1465135936 blocks.
> > [...]
> > Jul  6 21:31:18 mars kernel: [ 9040.089434] rr272x_1x:Device error information 0x1000000
> > Jul  6 21:31:18 mars kernel: [ 9040.089440] rr272x_1x:Task file error, StatusReg=0x41, ErrReg=0x4, LBA[0-3]=0x17fbec6,LBA[4-7]=0x0.
> > [...] (more of these messages)
> > Jul  6 23:54:20 mars kernel: [17622.164846] sd 8:0:6:0: [sdk] Unhandled error code
> > Jul  6 23:54:20 mars kernel: [17622.164848] sd 8:0:6:0: [sdk] Result: hostbyte=DID_ABORT driverbyte=DRIVER_INVALID
> > Jul  6 23:54:20 mars kernel: [17622.164851] sd 8:0:6:0: [sdk] CDB: Read(10): 28 00 49 ff 72 3f 00 00 38 00
> > Jul  6 23:54:20 mars kernel: [17622.164858] end_request: I/O error, dev sdk, sector 1241477695
> > Jul  6 23:54:20 mars kernel: [17622.181377] raid5:md0: read error corrected (8 sectors at 1241477632 on sdk1)
> > Jul  6 23:54:20 mars kernel: [17622.181503] raid5:md0: read error corrected (8 sectors at 1241477640 on sdk1)
> > Jul  6 23:54:20 mars kernel: [17622.181507] raid5:md0: read error corrected (8 sectors at 1241477648 on sdk1)
> > Jul  6 23:54:20 mars kernel: [17622.181510] raid5:md0: read error corrected (8 sectors at 1241477656 on sdk1)
> > Jul  6 23:54:20 mars kernel: [17622.181512] raid5:md0: read error corrected (8 sectors at 1241477664 on sdk1)
> > Jul  6 23:54:20 mars kernel: [17622.181515] raid5:md0: read error corrected (8 sectors at 1241477672 on sdk1)
> > Jul  6 23:54:20 mars kernel: [17622.181518] raid5:md0: read error corrected (8 sectors at 1241477680 on sdk1)
> > [...] (more sdk failures)
> > Jul  6 23:58:38 mars kernel: [17880.544030] INFO: task md0_resync:13565 blocked for more than 120 seconds.
> > Jul  6 23:58:38 mars kernel: [17880.544109] md0_resync    D ffff88007cdf9500     0 13565      2 0x00000000
> > Jul  6 23:58:38 mars kernel: [17880.544114]  ffff88007cdf9500 0000000000000046 0000000000000086 ffff88007400a800
> > Jul  6 23:58:38 mars kernel: [17880.544117]  ffff88007ce25400 0000000000015200 0000000000015200 0000000000015200
> > Jul  6 23:58:38 mars kernel: [17880.544120]  ffff88006baf5fd8 0000000000015200 ffff88007cdf9500 ffff88006baf5fd8
> > Jul  6 23:58:38 mars kernel: [17880.544124] Call Trace:
> > Jul  6 23:58:38 mars kernel: [17880.544145]  [<ffffffffa019bdef>] ? unplug_slaves+0x70/0xa6 [raid456]
> > Jul  6 23:58:38 mars kernel: [17880.544150]  [<ffffffffa019c17e>] ? get_active_stripe+0x26f/0x54b [raid456]
> > Jul  6 23:58:38 mars kernel: [17880.544154]  [<ffffffff81040b38>] ? default_wake_function+0x0/0xf
> > Jul  6 23:58:38 mars kernel: [17880.544159]  [<ffffffffa019ef3b>] ? sync_request+0x245/0x2d1 [raid456]
> > Jul  6 23:58:38 mars kernel: [17880.544166]  [<ffffffffa01809d4>] ? is_mddev_idle+0xa2/0xf5 [md_mod]
> > Jul  6 23:58:38 mars kernel: [17880.544172]  [<ffffffffa018113a>] ? md_do_sync+0x713/0xb0f [md_mod]
> > Jul  6 23:58:38 mars kernel: [17880.544177]  [<ffffffff8105d495>] ? autoremove_wake_function+0x0/0x2a
> > Jul  6 23:58:38 mars kernel: [17880.544182]  [<ffffffffa0181c89>] ? md_thread+0xf2/0x110 [md_mod]
> > Jul  6 23:58:38 mars kernel: [17880.544188]  [<ffffffffa0181b97>] ? md_thread+0x0/0x110 [md_mod]
> > Jul  6 23:58:38 mars kernel: [17880.544190]  [<ffffffff8105d061>] ? kthread+0x75/0x7d
> > Jul  6 23:58:38 mars kernel: [17880.544194]  [<ffffffff810097e4>] ? kernel_thread_helper+0x4/0x10
> > Jul  6 23:58:38 mars kernel: [17880.544197]  [<ffffffff8105cfec>] ? kthread+0x0/0x7d
> > Jul  6 23:58:38 mars kernel: [17880.544199]  [<ffffffff810097e0>] ? kernel_thread_helper+0x0/0x10
Attachment:
signature.asc

Description: Digital signature