Re: Persistent failures with simple md setup

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, 04 Feb 2013 21:43:29 +0100 Hans-Peter Jansen <hpj@xxxxxxxxx> wrote:

> Am Mittwoch, 30. Januar 2013, 18:12:46 schrieb Hans-Peter Jansen:
> > 
> > Hmm, according to mdadm from openSUSE:12.1:Update, the relevant fixes should
> > be in place. It might be an unfortunate combination of this issue and the
> > asynchronously applied updates, interfered by the *switching* behavior.
> > 
> > I started with regenerating the initrds now, and a first reboot succeeded so
> > far. Good.
> > 
> > Will ask my friend to reboot the system a dozen times tonight.
> 
> After a few reboots, the issue reappeared. I really believe now, that by
> driving the md in degraded mode for some time and with the switching behavior, 
> just re-adding the devices resulted in unsynced raid1 devices.
> 
> Next, my friend managed to create a nearby data disaster: I've explained him,
> how he would be able to re-add a device himself. He did so on sunday with his
> home partition, and since there appeared no progress bar in /proc/mdstat, he 
> immediately repeated the command. 
> 
> Neil, is it conceivable (due to a race or the like), that repeating to add 
> (re-add) a device potentially creates data salad, since that home-fs (xfs) 
> gone mad a few minutes later (firefox crashed, and couldn't be started, kmail 
> crashed, and so on (all those processes, that write to ~). He decided to 
> reboot, and that jailed him in the emergency recovery console, because /home 
> couldn't be mounted anymore.

There was a bug prior to 2.6.37 (fixed by commit 1a855a0606653d2) which
sounds vaguely related, but you seem to be running a 3.0.x kernel(?) so
shouldn't be affected by that.

Without logs of precisely what happened, it is very hard to guess.



> Today, I hammered the raid1 partitions with "check". During one run, this 
> appeared in syslog:
> 
> Feb  4 11:18:26 zaphkiel kernel: [11165.652478] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:26 zaphkiel kernel: [11165.652486] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:26 zaphkiel kernel: [11165.652495] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:26 zaphkiel kernel: [11165.652510] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
> Feb  4 11:18:26 zaphkiel kernel: [11165.652513]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:26 zaphkiel kernel: [11165.652520] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:26 zaphkiel kernel: [11165.652524] ata2.00: error: { UNC }
> Feb  4 11:18:26 zaphkiel kernel: [11165.652876] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x100)
> Feb  4 11:18:26 zaphkiel kernel: [11165.652882] ata2.00: revalidation failed (errno=-5)
> Feb  4 11:18:26 zaphkiel kernel: [11165.652890] ata2: hard resetting link
> Feb  4 11:18:26 zaphkiel kernel: [11165.957043] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> Feb  4 11:18:26 zaphkiel kernel: [11165.969910] ata2.00: configured for UDMA/133
> Feb  4 11:18:26 zaphkiel kernel: [11165.970048] ata2: EH complete
> Feb  4 11:18:28 zaphkiel kernel: [11167.949241] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:28 zaphkiel kernel: [11167.949249] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:28 zaphkiel kernel: [11167.949257] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:28 zaphkiel kernel: [11167.949272] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
> Feb  4 11:18:28 zaphkiel kernel: [11167.949275]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:28 zaphkiel kernel: [11167.949282] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:28 zaphkiel kernel: [11167.949287] ata2.00: error: { UNC }
> Feb  4 11:18:28 zaphkiel kernel: [11167.962146] ata2.00: configured for UDMA/133
> Feb  4 11:18:28 zaphkiel kernel: [11167.962206] ata2: EH complete
> Feb  4 11:18:30 zaphkiel kernel: [11169.898187] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:30 zaphkiel kernel: [11169.898195] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:30 zaphkiel kernel: [11169.898204] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:30 zaphkiel kernel: [11169.898219] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
> Feb  4 11:18:30 zaphkiel kernel: [11169.898222]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:30 zaphkiel kernel: [11169.898229] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:30 zaphkiel kernel: [11169.898234] ata2.00: error: { UNC }
> Feb  4 11:18:30 zaphkiel kernel: [11169.912066] ata2.00: configured for UDMA/133
> Feb  4 11:18:30 zaphkiel kernel: [11169.912117] ata2: EH complete
> Feb  4 11:18:32 zaphkiel kernel: [11171.905192] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:32 zaphkiel kernel: [11171.905200] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:32 zaphkiel kernel: [11171.905208] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:32 zaphkiel kernel: [11171.905223] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
> Feb  4 11:18:32 zaphkiel kernel: [11171.905226]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:32 zaphkiel kernel: [11171.905233] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:32 zaphkiel kernel: [11171.905238] ata2.00: error: { UNC }
> Feb  4 11:18:32 zaphkiel kernel: [11171.919099] ata2.00: configured for UDMA/133
> Feb  4 11:18:32 zaphkiel kernel: [11171.919152] ata2: EH complete
> Feb  4 11:18:34 zaphkiel kernel: [11173.912191] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:34 zaphkiel kernel: [11173.912199] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:34 zaphkiel kernel: [11173.912208] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:34 zaphkiel kernel: [11173.912223] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
> Feb  4 11:18:34 zaphkiel kernel: [11173.912226]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:34 zaphkiel kernel: [11173.912233] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:34 zaphkiel kernel: [11173.912238] ata2.00: error: { UNC }
> Feb  4 11:18:34 zaphkiel kernel: [11173.925101] ata2.00: configured for UDMA/133
> Feb  4 11:18:34 zaphkiel kernel: [11173.925159] ata2: EH complete
> Feb  4 11:18:36 zaphkiel kernel: [11175.861152] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
> Feb  4 11:18:36 zaphkiel kernel: [11175.861160] ata2.00: irq_stat 0x40000008
> Feb  4 11:18:36 zaphkiel kernel: [11175.861168] ata2.00: failed command: READ FPDMA QUEUED
> Feb  4 11:18:36 zaphkiel kernel: [11175.861183] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
> Feb  4 11:18:36 zaphkiel kernel: [11175.861186]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
> Feb  4 11:18:36 zaphkiel kernel: [11175.861193] ata2.00: status: { DRDY ERR }
> Feb  4 11:18:36 zaphkiel kernel: [11175.861198] ata2.00: error: { UNC }
> Feb  4 11:18:36 zaphkiel kernel: [11175.874052] ata2.00: configured for UDMA/133
> Feb  4 11:18:36 zaphkiel kernel: [11175.874103] sd 1:0:0:0: [sdb] Unhandled sense code
> Feb  4 11:18:36 zaphkiel kernel: [11175.874109] sd 1:0:0:0: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Feb  4 11:18:36 zaphkiel kernel: [11175.874117] sd 1:0:0:0: [sdb]  Sense Key : Medium Error [current] [descriptor]
> Feb  4 11:18:36 zaphkiel kernel: [11175.874125] Descriptor sense data with sense descriptors (in hex):
> Feb  4 11:18:36 zaphkiel kernel: [11175.874130]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
> Feb  4 11:18:36 zaphkiel kernel: [11175.874145]         0c c2 ef 3f 
> Feb  4 11:18:36 zaphkiel kernel: [11175.874153] sd 1:0:0:0: [sdb]  Add. Sense: Unrecovered read error - auto reallocate failed
> Feb  4 11:18:36 zaphkiel kernel: [11175.874163] sd 1:0:0:0: [sdb] CDB: Read(10): 28 00 0c c2 ef 12 00 00 80 00
> Feb  4 11:18:36 zaphkiel kernel: [11175.874180] end_request: I/O error, dev sdb, sector 214101823
> Feb  4 11:18:36 zaphkiel kernel: [11175.874234] ata2: EH complete
> Feb  4 11:18:38 zaphkiel kernel: [11177.954091] md: md124: data-check done.
> 
> This is a classical URE, isn't it? Interestingly, nonetheless, the raid1 check 
> run succeeded! (Not so good, is it?)

What?  Success not good?  :-)

md didn't report any errors.  Maybe it didn't see any.  Where is sector
214101823?



> Last question: since I had to massage the system anyway, I've updated mdadm 
> from 3.2.2 to 3.2.6. I red, that it can be dangerous to do so, what do I risk
> here?

Where did you read that?
If you find you need to re-create an array with "--create", a different mdadm
might give a different result, which wouldn't be what you want.  Otherwise it
should be safe.

NeilBrown

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux RAID Wiki]     [ATA RAID]     [Linux SCSI Target Infrastructure]     [Linux Block]     [Linux IDE]     [Linux SCSI]     [Linux Hams]     [Device Mapper]     [Device Mapper Cryptographics]     [Kernel]     [Linux Admin]     [Linux Net]     [GFS]     [RPM]     [git]     [Yosemite Forum]


  Powered by Linux