Re: md RAID5: Disk wrongly marked "spare", need to force re-add it

Maarten wrote, On 18.04.2013 15:58:
On 18/04/13 15:17, Ben Bucksch wrote:
To re-summarize (for full info, see first post of thread):
* There are 2 RAID5 arrays in the machine, each with 8 disks.
* I upgraded Ubuntu 10.04 to 12.04.
* After the reboot, each array had ejected one disk.
  The ejected disks are working fine (at least now).
* During the resync mandated by the above ejection, one other drive
  failed, this one fatally, with a real hardware failure.
* The second array resynced fine, further proving that the disks
  ejected during the upgrade were working.
* Now I am left with: an originally 8-disk RAID5, 6 healthy disks,
  1 disk with a hardware failure, and 1 disk that was ejected but is
  working.
* The latter is currently marked "spare" by md and has an event count
  (only) 2 events lower than the other 6 disks (how that shows up is
  sketched right after this list).
* My task is to get the latter disk back online *with* its data,
  without a resync.
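
For reference, this is how the "spare" state shows up while the array is
still assembled: the member is flagged "(S)" in /proc/mdstat and listed
as "spare" by mdadm --detail. A minimal sketch, with my array name:

# cat /proc/mdstat
# mdadm --detail /dev/md0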

I desperately need help, please.

Based on suggestions here from Oliver and on the forums, here is what I
did, and the result:

# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
# mdadm --assemble --run --force /dev/md0 /dev/sd[jlmnopq]
mdadm: failed to RUN_ARRAY /dev/md0:
mdadm: Not enough devices to start the array.
At this point, does dmesg show anything pointing to that input/output
error? The procedure is correct.

[630786.513314] md: md0 stopped.
[630786.513341] md: unbind<sdl>
[630786.590662] md: export_rdev(sdl)
[630786.590744] md: unbind<sdj>
[630786.670652] md: export_rdev(sdj)
[630786.670887] md: unbind<sdq>
[630786.750650] md: export_rdev(sdq)
[630786.750707] md: unbind<sdn>
[630786.830649] md: export_rdev(sdn)
[630786.830712] md: unbind<sdp>
[630786.910651] md: export_rdev(sdp)
[630786.910710] md: unbind<sdo>
[630786.990649] md: export_rdev(sdo)
[630786.990700] md: unbind<sdm>
[630787.070649] md: export_rdev(sdm)
[630793.315121] md: md0 stopped.
[630794.785328] md: bind<sdm>
[630794.785512] md: bind<sdo>
[630794.785695] md: bind<sdp>
[630794.785891] md: bind<sdn>
[630794.786643] md: bind<sdq>
[630794.787009] md: bind<sdl>
[630794.788164] md: bind<sdj>
[630794.788236] md: kicking non-fresh sdl from array!
[630794.788250] md: unbind<sdl>
[630794.810082] md: export_rdev(sdl)
[630794.812725] raid5: device sdj operational as raid disk 0
[630794.812734] raid5: device sdq operational as raid disk 7
[630794.812740] raid5: device sdn operational as raid disk 6
[630794.812745] raid5: device sdp operational as raid disk 5
[630794.812750] raid5: device sdo operational as raid disk 4
[630794.812755] raid5: device sdm operational as raid disk 3
[630794.813895] raid5: allocated 8490kB for md0
[630794.813966] 0: w=1 pa=0 pr=8 m=1 a=2 r=8 op1=0 op2=0
[630794.813974] 7: w=2 pa=0 pr=8 m=1 a=2 r=8 op1=0 op2=0
[630794.813980] 6: w=3 pa=0 pr=8 m=1 a=2 r=8 op1=0 op2=0
[630794.813986] 5: w=4 pa=0 pr=8 m=1 a=2 r=8 op1=0 op2=0
[630794.813993] 4: w=5 pa=0 pr=8 m=1 a=2 r=8 op1=0 op2=0
[630794.813999] 3: w=6 pa=0 pr=8 m=1 a=2 r=8 op1=0 op2=0
[630794.814005] raid5: not enough operational devices for md0 (2/8 failed)
[630794.820671] RAID5 conf printout:
[630794.820675]  --- rd:8 wd:6
[630794.820680]  disk 0, o:1, dev:sdj
[630794.820685]  disk 3, o:1, dev:sdm
[630794.820689]  disk 4, o:1, dev:sdo
[630794.820693]  disk 5, o:1, dev:sdp
[630794.820697]  disk 6, o:1, dev:sdn
[630794.820701]  disk 7, o:1, dev:sdq
[630794.820945] raid5: failed to run raid set md0
[630794.826530] md: pers->run() failed ...
[630794.834455] md: export_rdev(sdl)
[630794.834463] md: export_rdev(sdl)

The problem is:
md: kicking non-fresh sdl from array!
thus:
raid5: not enough operational devices for md0 (2/8 failed)

# mdadm -E /dev/sdl
  Checksum : ca6e81a9 - correct      Events : 13274863
# mdadm -E /dev/sdn
  Checksum : c9a41046 - correct      Events : 13274865
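
For completeness, the same check over all seven remaining members; a
one-liner sketch (device letters as above, grepping just the Events
line of the examine output):

# for d in /dev/sd[jlmnopq]; do echo -n "$d: "; mdadm -E "$d" | grep Events; done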

So, the question is: how do I convince md not to be so anal-retentive that it prevents me from accessing any of my data? The drive ***is fine*** and has practically all the data (I don't care about those 2 events); just use it already. Nobody seems to know the magic shell commands to do that.
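
Is the only way out really the last-resort re-create with
--assume-clean? That rewrites only the superblocks and leaves the data
blocks alone, but it destroys the array if the level, chunk size,
metadata version, device order or the slot of the dead disk is wrong.
Something like the sketch below, where the chunk size, the metadata
version and the slot I put sdl in (1 vs. 2) are placeholders I would
first have to confirm from old --examine output:

# mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=8 \
    --metadata=0.90 --chunk=64 \
    /dev/sdj missing /dev/sdl /dev/sdm /dev/sdo /dev/sdp /dev/sdn /dev/sdq

(Devices are listed in slot order 0-7 as in the dmesg printout above,
with "missing" standing in for the dead drive.) Even then I would mount
the filesystem read-only and check it before trusting the result.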

The lack of a proper shell command for that effectively constitutes a data-loss bug. I've been patient, but I'm getting more and more upset with md. Thanks, Maarten, for your help. I hope that 1) you or anybody else can help me, and 2) these kinds of problems will be fixed once and for all by the devs.

Good luck!

Thanks.

Ben



