On 27/03/17 14:38, Andy Smith wrote:
> Hi,
>
> I'm attempting to clean up after what is most likely a
> timeout-related double device failure (yes, I know).
>
> I just want to check that I have the right procedure here.
>
> The initial situation was a two-device RAID-10 (sdc, sdd). sdc saw
> some I/O errors and was kicked. Contents of /proc/mdstat after that:
>
> md4 : active raid10 sdc[0](F) sdd[1]
>       3906886656 blocks super 1.2 512K chunks 2 far-copies [2/1] [_U]
>       bitmap: 7/30 pages [28KB], 65536KB chunk
>
> A couple of hours later, sdd also saw some I/O errors and was
> similarly kicked. At this point neither /dev/sdc nor /dev/sdd appears
> as a device node in the system any more, and the controller doesn't
> see them.
>
> sdd was re-plugged and re-appeared as sdg.
>
> An mdadm --examine /dev/sdg looks like:
> /dev/sdg:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x1
>      Array UUID : 4100ddce:8edf6082:ba50427e:60da0a42
>            Name : elephant:4  (local to host elephant)
>   Creation Time : Fri Nov 18 22:53:10 2016
>      Raid Level : raid10
>    Raid Devices : 2
>
>  Avail Dev Size : 7813775024 (3725.90 GiB 4000.65 GB)
>      Array Size : 3906886656 (3725.90 GiB 4000.65 GB)
>   Used Dev Size : 7813773312 (3725.90 GiB 4000.65 GB)
>     Data Offset : 262144 sectors
>    Super Offset : 8 sectors
>    Unused Space : before=262056 sectors, after=1712 sectors
>           State : active
>     Device UUID : d9c9d81d:c487599a:3d3e3a30:0c512610
>
> Internal Bitmap : 8 sectors from superblock
>     Update Time : Sun Mar 26 00:00:01 2017
>   Bad Block Log : 512 entries available at offset 72 sectors
>        Checksum : ec70d450 - correct
>          Events : 298824
>
>          Layout : far=2
>      Chunk Size : 512K
>
>     Device Role : Active device 1
>     Array State : .A ('A' == active, '.' == missing, 'R' == replacing)
> mdadm config:
>
> $ grep -v '^#' /etc/mdadm/mdadm.conf | grep -v '^$'
> DEVICE /dev/sd*
> CREATE owner=root group=disk mode=0660 auto=yes
> HOMEHOST <system>
> MAILADDR root
> ARRAY /dev/md/0 metadata=1.2 UUID=400bac1d:e2c5d6ef:fea3b8c8:bcb70f8f
> ARRAY /dev/md/1 metadata=1.2 UUID=e29c8b89:705f0116:d888f77e:2b6e32f5
> ARRAY /dev/md/2 metadata=1.2 UUID=039b3427:4be5157a:6e2d53bd:fe898803
> ARRAY /dev/md/3 metadata=1.2 UUID=30f745ce:7ed41b53:4df72181:7406ea1d
> ARRAY /dev/md/4 metadata=1.2 UUID=4100ddce:8edf6082:ba50427e:60da0a42
> ARRAY /dev/md/5 metadata=1.2 UUID=957030cf:c09f023d:ceaebb27:e546f095
>
> (The other arrays are on different devices and are not involved here.)
> So, I think I need to:
>
> - Increase /sys/block/sdg/device/timeout to 180 (already done; TLER
>   is not supported on these drives).
>
> - Stop md4:
>
>     mdadm --stop /dev/md4
>
> - Assemble it again:
>
>     mdadm --assemble /dev/md4
>
>   The theory being that there is at least one good device (sdg, which
>   was sdd).
>
> - If that complains, I would then have to consider re-creating the
>   array with something like:
NEVER NEVER NEVER use --create except as a last resort. Try --assemble
--force first. And if you are going to try --create, as an absolute
minimum, read the kernel raid wiki, get lsdrv, run it, AND MAKE SURE
THE OUTPUT IS SAVED SOMEWHERE SAFE.

https://raid.wiki.kernel.org/index.php/Asking_for_help

The snag is, you might end up with a non-functional array with two
spare drives; if that happens, I'll have to step back and let the
experts handle it.
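Something like this, for instance - a sketch only, untested, and the
device names are assumptions (use whatever names the drives have
re-appeared under on your system):

    mdadm --stop /dev/md4
    mdadm --assemble --force /dev/md4 /dev/sdg
    cat /proc/mdstat

If that brings the array up degraded, the old sdc can go back in once
it re-appears (possibly under yet another name):

    mdadm --re-add /dev/md4 /dev/sdc   # uses the bitmap for a quick resync
    mdadm --add /dev/md4 /dev/sdc      # full resync, if --re-add is refused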
>     mdadm --create /dev/md4 --assume-clean --level=10 --layout=f2 \
>           --raid-devices=2 missing /dev/sdd
>
> - Once it's up and running, add sdc back in and let it sync.
>
> - Make timeout changes permanent.
I'd do this as the very first step - you need a script that runs at
boot, from your init system or run-levels; there's a good sample script
on the wiki. That way it gets done every time the system boots, which
should prevent any repeat of this problem.

Oh - and do scheduled scrubs. The fact that you're getting timeout
errors indicates something is wrong, and a scrub is probably sufficient
to clean it up.
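As a minimal sketch of such a boot script (untested; the drive list and
the 180-second value are assumptions for your particular setup - the
wiki version is more general):

    #!/bin/sh
    # Raise the SCSI command timeout on drives without TLER/ERC so
    # that slow error recovery doesn't get them kicked from the array.
    for disk in sdc sdd sdg; do
        if [ -w /sys/block/$disk/device/timeout ]; then
            echo 180 > /sys/block/$disk/device/timeout
        fi
    done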
> Does that seem correct?
Hopefully fixing the timeout, followed by an "--assemble --force" and
then a scrub, will be all that's required.
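For reference, a manual scrub of just this array can be kicked off with
(assuming the array is md4, as above):

    echo check > /sys/block/md4/md/sync_action

and its progress watched in /proc/mdstat. Most distributions also ship
a cron job for this (checkarray on Debian and derivatives).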
> I'm fairly confident that the drives themselves are actually okay -
> nothing untoward in the SMART data - so I'm not going to replace them
> at this stage.
Cheers,
Wol