Re: Failed, but "md: cannot remove active disk..."

NeilBrown <neilb@xxxxxxx> · Mon, 14 May 2012 20:22:20 +1000

On Sun, 13 May 2012 20:21:48 +0200 Michał Sawicz <michal@xxxxxxxxxx> wrote:

> Hey,
> 
> I've a weird issue with a RAID6 setup, /proc/mdstat says:
> 
> > md126 : active raid6 sda1[3] sdh1[6] sdg1[0](F) sdf1[5] sdi1[1] sdc[8] sdb[7]
> >       9767559680 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
> 
> So sdg1 is (F)ailed, yet `mdadm --remove` yields:
> 
> > md: cannot remove active disk sdg1 from md126 ...

There is a period of time between when a device fails and when the raid456
module finally lets go of it so it can be removed.  You seem to be in this
period of time.
Normally it is very short.  It needs to wait for any requests that have
already been sent to the device to complete (probably with failure) and
very shortly after that it should be released.  So this is normally much less
than one second but could be several seconds if some excessive retry is
happening.

But I'm guessing you have waited more than a few seconds.

I vaguely recall a bug in the not too distant past whereby RAID456 wouldn't
let go of a device quite as soon as it should.  Unfortunately I don't
remember the details.  You might be able to trigger it to release the drive
by adding a spare - if you have one - or maybe by just running
  echo sync > /sys/block/md126/md/sync_action
It won't actually do a sync, but it might check things enough to make
progress.
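
Whichever nudge is tried, the removal may still be refused for a moment
afterwards, so it can help to retry it for a few seconds rather than give up
on the first "cannot remove active disk".  A minimal sketch, assuming the
array name from the report (md126); failed_members and release_failed are
hypothetical helper names, not part of mdadm, and the 10-second limit is an
arbitrary choice:

```shell
#!/bin/sh
# Print the members flagged (F) for one array, reading /proc/mdstat-style
# text on stdin.  failed_members is a hypothetical helper, not an mdadm tool.
failed_members() {
    # $1 = array name as it appears in /proc/mdstat, e.g. md126
    awk -v md="$1" '$1 == md && $2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                sub(/\[.*$/, "", $i)   # strip "[0](F)" to leave "sdg1"
                print $i
            }
    }'
}

# Retry the removal of every failed member once a second, for up to
# 10 attempts (an arbitrary limit).  Needs root.
release_failed() {
    md=$1
    for dev in $(failed_members "$md" < /proc/mdstat); do
        tries=0
        until mdadm "/dev/$md" --remove "/dev/$dev" 2>/dev/null; do
            tries=$((tries + 1))
            [ "$tries" -ge 10 ] && { echo "$dev still busy" >&2; break; }
            sleep 1
        done
    done
}
```

Run as root with "release_failed md126" after adding the spare or writing to
sync_action.  If it still reports the device as busy after the retries, the
array is probably stuck in the way described above rather than just slow.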

What kernel are you using?

NeilBrown

> 
> in dmesg...
> 
> `mdadm --examine` shows:
> 
> > /dev/sdg1:
> >           Magic : a92b4efc
> >         Version : 1.2
> >     Feature Map : 0x0
> >      Array UUID : ff9e032c:446ed0bd:fc9473f3:f8e090ed
> >            Name : media:store  (local to host media)
> >   Creation Time : Tue Sep 13 21:36:43 2011
> >      Raid Level : raid6
> >    Raid Devices : 7
> > 
> >  Avail Dev Size : 3907024896 (1863.01 GiB 2000.40 GB)
> >      Array Size : 19535119360 (9315.07 GiB 10001.98 GB)
> >   Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
> >     Data Offset : 2048 sectors
> >    Super Offset : 8 sectors
> >           State : clean
> >     Device UUID : 4bcee8e2:709419b6:fbeb3a8e:5c9bb68a
> > 
> >     Update Time : Sat May 12 21:57:27 2012
> >        Checksum : ffb03189 - correct
> >          Events : 304564
> > 
> >          Layout : left-symmetric
> >      Chunk Size : 512K
> > 
> >    Device Role : Active device 0
> >    Array State : AAAAAAA ('A' == active, '.' == missing)
> 
> So that superblock thinks it's active, but that's normal, right? It
> wasn't updated due to fail? Others correctly show:
> 
> > /dev/sdc:
> >           Magic : a92b4efc
> >         Version : 1.2
> >     Feature Map : 0x0
> >      Array UUID : ff9e032c:446ed0bd:fc9473f3:f8e090ed
> >            Name : media:store  (local to host media)
> >   Creation Time : Tue Sep 13 21:36:43 2011
> >      Raid Level : raid6
> >    Raid Devices : 7
> > 
> >  Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
> >      Array Size : 19535119360 (9315.07 GiB 10001.98 GB)
> >   Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
> >     Data Offset : 2048 sectors
> >    Super Offset : 8 sectors
> >           State : clean
> >     Device UUID : b713fd2b:eef145b0:ce91de0a:9077554b
> > 
> >     Update Time : Sat May 12 21:57:57 2012
> >        Checksum : 80345876 - correct
> >          Events : 304581
> > 
> >          Layout : left-symmetric
> >      Chunk Size : 512K
> > 
> >    Device Role : Active device 2
> >    Array State : .AAAAAA ('A' == active, '.' == missing)
> 
> Any ideas?
> 
> Cheers,
