On 02/02/2013 06:08 PM, Chris Murphy wrote:
>
> On Feb 2, 2013, at 2:56 PM, Michael Ritzert <ksciplot@xxxxxxx> wrote:
>>
>> Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Nevertheless, over an hour and a half is a long time if the file
>>> system were being updated at all. There'd definitely be
>>> data/parity mismatches for disk1.
>>
>> After disk1 failed, the only write access should have been
>> metadata update when the filesystem was mounted.

This is significant.

> Was it mounted ro?
>
>> I only read data from the filesystem thereafter. So only atime
>> changes are to be expected, there, and only for a small number of
>> files that I could capture before disk3 failed. I know which files
>> are affected, and could leave them alone.
>
> Even for a small number of files there could be dozens or hundreds
> of chunks altered. I think conservatively you have to consider disk
> 1 out and mount in degraded mode.
>
>>
>>> If disk 1 is assumed to be useless, meaning force assemble the
>>> array in degraded mode; a URE or linux SCSI layer time out is to
>>> be avoided or the array as a whole fails. Every sector is
>>> needed. So what do you think about raising the linux scsi layer
>>> time out to maybe 2 minutes, and leaving the remaining drive's
>>> SCT ERC alone so that they don't time out sooner, but rather go
>>> into whatever deep recovery they have to in the hopes those bad
>>> sectors can be read?
>>>
>>> echo 120 >/sys/block/sdX/device/timeout
>>
>> I just tried that, but I couldn't see any effect. The error rate
>> coming in is much higher than 1 every two minutes.
>
> This timeout is not about error rate. And what the value should be
> depends on context. Normal operation you want the disk error
> recovery to be short, so that the disk produces a bonafide URE, not
> a SCSI layer timeout error. That way md will correct the bad sector.
> That's what probably wasn't happening in your case, which allowed
> bad sectors to accumulate until it was a real problem.

If you try to recover from the degraded array, this is the correct
approach.

> But now, for the cloning process, you want the disk error timeout to
> be long (or disabled) so that the disk has as long as possible to do
> ECC to recover each of these problematic sectors. But this also
> means getting the SCSI layer timeout set to at least 1 second longer
> than the longest recovery time for the drives, so that the SCSI
> layer time out doesn't stop sector recovery during cloning. Now
> maybe the disk still won't be able to recover all data from these
> bad sectors, but it's your best shot IMO.

For the array assembled degraded (disk1 left out).
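Concretely, that would be something along these lines for each drive
being cloned (/dev/sdX is a stand-in for the real device; a drive that
doesn't support SCT ERC will simply reject the smartctl call):

smartctl -l scterc /dev/sdX       # show the current read/write ERC timers
smartctl -l scterc,0,0 /dev/sdX   # disable ERC so the drive retries as long as it can
echo 120 >/sys/block/sdX/device/timeout   # keep the SCSI layer from resetting it first

For normal array duty you want the opposite: short in-drive recovery,
e.g. "smartctl -l scterc,70,70 /dev/sdX" for 7.0 seconds, under the
stock 30-second SCSI layer timeout.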
>> When I assemble the array, I will have all new disks (with good
>> smart selftests...), so I wouldn't expect timeouts. Instead, junk
>> data will be returned from the sectors in question¹. How will md
>> react to that?
>
> Well yeah, with the new drives, they won't report UREs. So there's
> an ambiguity with any mismatch between data and parity chunks as to
> which is correct. Without a URE, md doesn't know whether the data
> chunk is right or wrong with RAID 5.

Bingo. Working from the copies guarantees you won't have correct data
where the UREs are. (The copies are very good to have, of course.)

> Phil may disagree, and I have to defer to his experience in this,
> but I think the most conservative and best shot you have at getting
> the 20GB you want off the array is this:

I do disagree. The above, combined with:

> I do know where the bad sectors are from the ddrescue report. We are
> talking about less than 50kB bad data on disk1. Unfortunately, disk3
> is worse, but there is no sector that is bad on both disks.

leads me to recommend "mdadm --create --assume-clean" using the
original drives, taking care to specify the devices in the proper
order (per their "Raid Device" number in the --examine reports).

I still haven't seen any data that definitively links specific serial
numbers to specific raid device numbers. Please do that.

After re-creating the array, and setting all the drive timeouts to
7.0 seconds, issue a "check" scrub:

echo "check" >/sys/block/md0/md/sync_action

This should clean up the few pending sectors on disk #1 by
reconstruction from the others, and may very well do the same for
disk #3.

If disk #3 gets kicked out at this point, assemble in degraded mode
with disk #2, #4, and a fresh copy of disk #1 (picking up the new
superblock and any fixes during the partial scrub). Then "--add" a
spare (wiped) disk and let the array rebuild.

And grab your data.

Phil.
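P.S. For concreteness only, the sequence above might look roughly like
this. Every device name below is a placeholder, and the metadata
version, chunk size and layout are guesses -- take them from your own
--examine output, or the re-create will scramble the array:

# Re-create in place, trusting the existing data. Device order must
# follow the "Raid Device" numbers, and metadata/chunk/layout must
# match the original superblocks exactly.
mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=4 \
      --metadata=1.2 --chunk=512 \
      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

# Short (7.0 second) in-drive error recovery on every member, then
# the "check" scrub, watching progress and the mismatch count.
for d in sda sdb sdc sdd; do smartctl -l scterc,70,70 /dev/$d; done
echo "check" >/sys/block/md0/md/sync_action
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt

# Only if disk #3 gets kicked out: assemble degraded with #2, #4 and
# a fresh copy of #1, then rebuild onto a wiped spare.
mdadm --assemble --run --force /dev/md0 /dev/sdb1 /dev/sdd1 /dev/sda1
mdadm /dev/md0 --add /dev/sde1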