> On 11.03.2010 13:25, Iain Rauch wrote:
>>> On Thu, Mar 11, 2010 at 3:51 AM, Iain Rauch
>>> <groups@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>> Smartd emailed me to say I have "1 Currently unreadable (pending) sectors".
>>>> This has actually happened on two disks now.
>>>>
>>>> I ran a check and then a repair on my array, and both gave a mismatch_cnt
>>>> of 8.
>>>>
>>>> I ran a long self-test on both drives and they completed with no errors
>>>> logged. Yet 'Current_Pending_Sector' is still 1 on both, and one disk
>>>> also has a 'UDMA_CRC_Error_Count' of 1.
>>>>
>>>> I ran 'hdrecover' on both and they are both telling me "Couldn't recover
>>>> sector 2930277168". It's asking if I want to overwrite it with zeros to
>>>> fix it, but I assume that would damage my array?
>>>>
>>>> The disk sizes are 1500301910016 bytes and I use 1500250M partition sizes
>>>> for the array components. Does that sector fall outside my partition, and
>>>> hence would it be safe to overwrite it with zeros?
>>>>
>>>> Also, why did I have a mismatch_cnt at all? I haven't run another check
>>>> since I did the repair, as I wanted to fix the pending sector first.
>>>>
>>>> BTW, I have a 15-drive RAID6.
>>>>
>>>
>>> If you are running RAID6 and it can read from all but two drives, then
>>> it should still be able to calculate whatever would match the remaining
>>> (presumed good) reads to fill in the other two drives. RECENT kernels
>>> will try to write over failed sectors automatically, and only kick the
>>> drive if that write fails.
>>>
>>> Please provide more information:
>>>
>>> Kernel version
>>> mdadm version
>>>
>>> Information about how the source block devices are split up before
>>> mdadm sees them, and any related messages from the system log. The
>>> relevant section should be near the end of the dmesg output when you've
>>> just completed a check or repair. Your syslog has probably already
>>> captured the same data and stored it elsewhere.
>>
>> I thought doing the repair was supposed to fix the issue, but it didn't
>> seem to touch it. I wonder if it is outside what md sees, but then how
>> would it have been noticed as unreadable? And is it a coincidence that
>> both drives have the same unreadable sector?
>>
>> root@Edna:/home/iain# uname -a
>> Linux Edna 2.6.28-16-server #57-Ubuntu SMP Wed Nov 11 10:34:04 UTC 2009
>> x86_64 GNU/Linux
>> root@Edna:/home/iain# mdadm -V
>> mdadm - v2.6.9 - 10th March 2009
>>
>> I've pasted the end of the messages log below. There's loads of that all
>> the way through the repair, so I'm not sure how to filter out the useful
>> bits.
>>
>>
>> Iain
>> [...]
>
> Hi Iain,
>
> 'Current_Pending_Sector' is a SMART attribute which gets incremented
> during online use (reading and writing sectors) AND during offline drive
> scanning (also called SMART data collection), whenever the drive finds
> that a sector cannot be correctly read on the first try (offline data
> collection) or even after applying its various error-correction
> techniques.
> The easiest way to get rid of this problem: dd a sector of zeros onto
> the broken sector, then fail the drive and re-add it, and wait until the
> resync is done.
> The thing I'm not sure about: should one fail and re-add both drives at
> once? That would mean losing the redundancy...
>
> Speaking of redundancy: our rule of thumb (at xtivate.de) is "every 4
> drives need one drive of redundancy" - so a redundancy of 2 with 15
> drives is kind of pushing your luck...
>
> Good luck,
> Stefan
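For reference, the fix Stefan describes boils down to something like the
commands below - though, as I say further down, I'd fail the member out of
the array /before/ zeroing anything. This is only a sketch: /dev/sdX, the
partition number and /dev/md0 are placeholders for whichever member and
array are actually affected, and it assumes 512-byte logical sectors, so
double-check everything before running anything destructive.

  # take the member out of the array first
  mdadm /dev/md0 --fail /dev/sdX1 --remove /dev/sdX1

  # overwrite only the suspect LBA with zeros (destroys whatever was there)
  dd if=/dev/zero of=/dev/sdX bs=512 count=1 seek=2930277168 oflag=direct

  # add the member back; the resync rebuilds its contents from the other drives
  mdadm /dev/md0 --add /dev/sdX1
  cat /proc/mdstat

As a back-of-the-envelope check on the earlier question of whether that
sector even falls inside the partition (again assuming 512-byte logical
sectors): 1500301910016 bytes / 512 = 2930277168 sectors, so the valid LBAs
on the raw disk run from 0 to 2930277167. Comparing a given LBA against the
partition start/end that 'fdisk -lu' prints shows whether it lands inside
the md component or outside it.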
Well, I failed one of the drives and allowed 'hdrecover' to overwrite the
unreadable sector, but it still couldn't fix it. Here's its report:

Wiping sector 2930277168...
Checking sector is now readable...
I still couldn't read the sector!
I'm sorry, but even writing to the sector hasn't fixed it - there's nothing
more I can do!

Summary:
1 bad sectors found
of those 0 were recovered
and 1 could not be recovered and were destroyed causing data loss

'Current_Pending_Sector' was still 1 after that, so I dd'd zeros over the
whole drive. I guess I could have done just part of it, but this way the
whole drive has been verified to 'work'. It only took ~5 hours. Funnily
enough, that did bring the Current_Pending_Sector count back to zero. There
are still no error reports in the SMART data, though, and
'Reallocated_Event_Count' didn't go up - shouldn't that have gone up to one?

I re-partitioned the drive and added it back to the array, and it rebuilt
fine in ~12 hours. I then repeated the process with the second drive and
everything is back to normal. The drive that had the 'UDMA_CRC_Error_Count'
still shows 1, but I don't think I need to worry about that?

In direct reply to Stefan: I think you meant to dd the zeros onto the drive
/after/ failing it - doing it beforehand would have caused corruption,
wouldn't it? I definitely think it made sense to do one drive at a time.

One parity drive for every four seems a bit extreme, especially when you
have a backup (which I don't). I'm fairly happy with 15 drives in RAID 6 -
I had 24 drives before, and that did give me a few problems :p I just need
to keep the drives healthy (array scrubs, SMART tests, etc. - rough
commands in the P.S. below).

Iain
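P.S. In case it's useful to anyone finding this in the archives: by 'array
scrubs' I mean kicking off md's check via sysfs and looking at the mismatch
count afterwards, and by SMART tests I mean the usual smartctl self-tests.
Roughly, with md0 and sdX as placeholders for the real devices:

  echo check > /sys/block/md0/md/sync_action   # read-only scrub of the array
  cat /sys/block/md0/md/mismatch_cnt           # ideally 0 after the check

  smartctl -t long /dev/sdX                    # long SMART self-test
  smartctl -A /dev/sdX                         # re-check Current_Pending_Sector etc.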