Hi Phil, Chris,

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
> On Feb 2, 2013, at 6:44 AM, Phil Turmel <philip@xxxxxxxxxx> wrote:
>>
>> All of your drives are in perfect condition (no relocations at all).
>
> One disk has Current_Pending_Sector raw value 30, and read error rate of 1121.
> Another disk has Current_Pending_Sector raw value 2350, and read error rate of 29439.
>
> I think for new drives that's unreasonable.
>
> It's probably also unreasonable to trust a new drive without testing
> it. But some of the drives were tested by someone or something, and the
> test itself was aborted due to read failures, even though the disk was
> not flagged by SMART as "failed" or failure in progress. Example:
>
> # 1  Short offline     Completed: read failure   40%   766   329136144
> # 2  Short offline     Completed: read failure   10%   745   717909280
> # 3  Short offline     Completed: read failure   70%   714   327191864
> # 4  Extended offline  Completed: read failure   90%   695   329136144
> # 5  Short offline     Completed: read failure   80%   695   724561192

That was probably me manually starting tests. When I first noticed signs
of trouble, i.e. slow access, I immediately checked the disk status, and
the status page said "OK". I couldn't believe that, so I started
unscheduled and extended tests.

Would you consider running a full SMART selftest on a new disk
sufficient? Or do you propose even stricter tests?
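Just to make clear what I would consider "stricter": something along the
lines of the sketch below, i.e. read every sector once and compare
Current_Pending_Sector before and after. This is only an illustration I
put together for this mail, not something I have run against these
drives; /dev/sdX is a placeholder and the smartctl parsing is
deliberately naive.

#!/usr/bin/env python3
# Sketch of a stricter acceptance test for a new disk: read the whole
# device once, then compare the Current_Pending_Sector raw value before
# and after. Run as root; /dev/sdX is a placeholder. Reads only, so it
# does not disturb data that is already on the disk.
import subprocess

DEV = "/dev/sdX"          # placeholder device name
CHUNK = 4 * 1024 * 1024   # read in 4 MiB pieces

def pending_sectors(dev):
    # Pull the raw value of Current_Pending_Sector out of 'smartctl -A'.
    # Very simplistic parsing: the raw value is taken as the last column.
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Current_Pending_Sector" in line:
            return int(line.split()[-1])
    return None

before = pending_sectors(DEV)
errors = 0
pos = 0
with open(DEV, "rb", buffering=0) as disk:
    while True:
        disk.seek(pos)
        try:
            block = disk.read(CHUNK)
        except OSError:
            errors += 1
            pos += CHUNK            # skip the unreadable region, keep going
            continue
        if not block:               # end of device
            break
        pos += len(block)

after = pending_sectors(DEV)
print("read errors: %d, Current_Pending_Sector before/after: %s/%s"
      % (errors, before, after))

A long selftest (smartctl -t long) on top of that would make the drive
repeat roughly the same surface scan with its own firmware.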
> Almost 100 hours ago, at least, problems with this disk were
> identified. Maybe this is a NAS feature limitation problem, but if the
> NAS is going to purport to do SMART testing and then fail to inform the
> user that the tests themselves are failing due to bad sectors, that's
> negligence in my opinion.

Sadly, it's common. When judging the 100 hours, you have to keep in mind
that these disks have been running since the failure. Taking the copy
took a few hours (times two by now), and a few more hours were added
because the copy finished during the night and the disks stayed on until
I got up. Still, that shouldn't add up to 100 hours.

>> Based on the event counts in your superblocks, I'd say disk1 was
>> kicked out long ago due to a normal URE (hundreds of hours ago) and
>> the array has been degraded ever since.
>
> I'm confused because the OP reports disk 1 and disk 4 as sdc3, disk 2
> and disk 3 as sdb3; yet the superblock info has different checksums for
> each. So based on Update Time field, I'm curious what other information
> leads you to believe disk1 was kicked hundreds of hours ago:

The disks are attached to a desktop PC at the moment. The way things are
set up here, I can only plug in two disks at a time, so I had to connect
them in two pairs to get all four reports. That's why the device names
repeat.

> disk 1:
> Fri Jan 4 15:11:07 2013
> disk 2:
> Fri Jan 4 16:33:36 2013
> disk 3:
> Fri Jan 4 16:32:27 2013
> disk 4:
> Fri Jan 4 16:33:36 2013
>
> Nevertheless, over an hour and a half is a long time if the file system
> were being updated at all. There'd definitely be data/parity mismatches
> for disk1.

After disk1 failed, the only write access should have been the metadata
update when the filesystem was mounted. I only read data from the
filesystem thereafter, so the only changes to expect are atime updates,
and only for the small number of files I managed to copy off before
disk3 failed. I know which files are affected and could leave them
alone.

> If disk 1 is assumed to be useless, meaning force assemble the array in
> degraded mode; a URE or linux SCSI layer time out is to be avoided or
> the array as a whole fails. Every sector is needed. So what do you
> think about raising the linux scsi layer time out to maybe 2 minutes,
> and leaving the remaining drive's SCT ERC alone so that they don't time
> out sooner, but rather go into whatever deep recovery they have to in
> the hopes those bad sectors can be read?
>
> echo 120 >/sys/block/sdX/device/timeout

I just tried that, but I couldn't see any effect. Errors are coming in
at a rate much higher than one every two minutes.

When I assemble the array, I will have all new disks (with good SMART
selftests...), so I wouldn't expect timeouts. Instead, junk data will be
returned from the sectors in question¹. How will md react to that?

Regards,
Michael

¹ One could think about filling these gaps with data from the three
remaining disks. Disk1 is still up to date in more than 99% of all
chunks, so data from three disks is available for those stripes. I could
implement the RAID5 algorithm in userspace to compute what should be in
each bad sector; I know where the bad sectors are from the ddrescue
report. We are talking about less than 50 kB of bad data on disk1.
Unfortunately, disk3 is worse, but there is no sector that is bad on
both disks.
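To make the footnote a bit more concrete: for RAID5, the missing block
of a stripe is simply the XOR of the blocks at the same offset on the
other members, no matter which of them happens to hold parity there. So
the userspace computation would look roughly like the sketch below. File
names and offsets are placeholders, not taken from the real ddrescue
maps; it assumes the images are ddrescue copies of the md members, that
all members share the same data offset, and that the bad ranges lie
inside the data area.

#!/usr/bin/env python3
# Sketch of the userspace RAID5 repair idea from the footnote: in a RAID5
# stripe every block is the XOR of the blocks at the same offset on all
# other members, so a bad sector on one member can be recomputed from the
# other three. Assumptions: the images are ddrescue copies of the md
# members, all members use the same data offset, the offsets below lie
# inside the data area (not in the superblock/bitmap region), and none of
# the "good" images has a hole at the same offset. All names and values
# here are placeholders.
SECTOR = 512

# (offset_in_bytes, length_in_bytes) of the unreadable ranges on disk1,
# as reported by ddrescue -- placeholder values
BAD_RANGES = [(1234567 * SECTOR, SECTOR)]

GOOD_IMAGES = ["disk2.img", "disk3.img", "disk4.img"]
TARGET_IMAGE = "disk1.img"      # the image whose holes we want to fill

def xor_blocks(blocks):
    # XOR together a list of equally long byte strings.
    out = bytearray(blocks[0])
    for other in blocks[1:]:
        for i, b in enumerate(other):
            out[i] ^= b
    return bytes(out)

with open(TARGET_IMAGE, "r+b") as target:
    for offset, length in BAD_RANGES:
        pieces = []
        for name in GOOD_IMAGES:
            with open(name, "rb") as img:
                img.seek(offset)
                pieces.append(img.read(length))
        target.seek(offset)
        target.write(xor_blocks(pieces))
        print("rewrote %d bytes at offset %d" % (length, offset))

The chunk layout never enters into it, because the parity relation holds
byte for byte at equal member offsets.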