On Mon, 31 Jan 2011 12:56:36 -0800
Colin McCabe <cmccabe@xxxxxxxxxxxxxx> wrote:

> Case #1:
> The hard disk where that the FileStore is reading from could be dying.
> In my experience, hard disks that are dying will tend to experience
> long delays in reading from the filesystem. Occasionally you will be
> unable to read some files, and you'll get EIO instead. When a hard
> disk is dying, all you want to do is get your data off there as soon
> as possible. You don't want to bother trying to "fix" the files on the
> disk. That disk is toast.
>

In my experience, with recent disks not every read error means that the disk is going to die anytime soon.

I manage several dozen Western Digital drives (Caviar Black 2TB) in Linux RAID6 arrays. When running an MD array background check, MD will report a read error from time to time on some drives. It will recover the data for that block and rewrite it, but the "bad block" won't show up as Reallocated or Pending in the SMART reports for that drive. Later, the same drive will complete several entire background checks just fine and will go some time before acting up again.

I have also seen some big Hitachi drives throw uncorrected errors (and reallocate the sectors, unlike the WD drives) but otherwise work just fine for months.

So, granted, I may have flaky drives, but since they are currently not causing significant hangs or timeouts on the array, why should I replace all of them? Even a flaky drive is a useful drive if it holds a known good copy of your blocks for some time, just in case your other, good drive dies at the wrong time.

So I do agree that, as Brian Chrisman pointed out, background scrub is always important, as it helps to prevent your data redundancy from going bad without you knowing about it. I also agree that sysadmin notification is important in either case.

But I also think that Ceph should try to correct the errors it finds through scrub, because some of today's drives may throw uncorrected errors even while they are still useful. I'd rather have more copies of my data, even if they're slightly unreliable, since I should always be able to tell the bad ones apart by their BTRFS checksums. Besides, I think this model of always trying to correct errors fits well with Ceph's goal of working with unreliable, commodity hardware, so it makes no sense to just bail out and force the operator to swap every flaky drive.

Best regards,
Cláudio
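
P.S. To make the scrub-and-repair idea above a bit more concrete, here is a rough Python sketch. It is only an illustration under my own assumptions, not how Ceph or BTRFS actually implement scrub: the names `checksum` and `scrub_object` are made up, the replicas are modelled as a plain dict from a drive label to the bytes read back (or `None` for a read error such as EIO), and SHA-256 stands in for whatever checksum the filesystem really keeps.

```python
import hashlib


def checksum(data: bytes) -> str:
    # Stand-in for the filesystem's per-block checksum (assumption: SHA-256).
    return hashlib.sha256(data).hexdigest()


def scrub_object(replicas: dict, expected: str):
    """Check every replica against the expected checksum and repair bad ones.

    replicas maps a hypothetical drive label to the bytes read from it,
    or None if the read failed (e.g. EIO).
    Returns (repaired_drives, unrecoverable).
    """
    good = None
    bad = []
    for drive, data in replicas.items():
        if data is not None and checksum(data) == expected:
            if good is None:
                good = data  # first verified copy becomes the repair source
        else:
            bad.append(drive)  # read error or checksum mismatch

    if good is None:
        # No verified copy left: nothing to repair from, the operator must act.
        return [], True

    for drive in bad:
        # Rewrite the bad copy instead of dropping the drive from service.
        replicas[drive] = good

    # The caller should still notify the sysadmin about every repaired drive.
    return bad, False


if __name__ == "__main__":
    payload = b"object data"
    copies = {"wd-1": payload, "wd-2": None, "hitachi-1": b"object dat\x00"}
    repaired, lost = scrub_object(copies, checksum(payload))
    print("repaired:", repaired, "unrecoverable:", lost)
```

The point of the sketch is the policy, not the mechanics: a failed read or a bad checksum triggers a rewrite from a verified copy plus a notification, and the drive is only given up on when no verified copy remains.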