On Wed, Jan 7, 2015 at 12:05 AM, David Raffelt <david.raffelt@xxxxxxxxxxxxx> wrote:
> Yes, after the 2 disks were dropped I definitely had a working degraded
> drive with 5/7. I only see XFS errors in the kernel log soon AFTER the hot
> spare finished syncing.

I suggest moving this to the linux-raid@ list and including the following:

- a brief description, e.g. 7 drive raid6 array, 2 drives got booted at some point due to errors, a hot spare starts rebuilding and finishes, then XFS errors appear in the log, and xfs_repair -n results suggest a bad RAID assembly
- kernel version
- mdadm version
- drive model numbers as well as their SCT ERC values
- mdadm -E for all drives

The list can take all of this. I'm not sure whether it'll also take a large journal, but I'd try that first before resorting to a URL.

For the journal, two things. First, it's not going back far enough: the problems had already begun, and it'd be good to have a lot more context, so I'd dig back and find the first indication of a problem. You can use journalctl --since for this. It can take the form:

journalctl --since "24 hours ago"
journalctl --since "2015-01-04 12:15:00"

Second, use the option -o short-monotonic, which prints monotonic timestamps; that could come in handy and is more like dmesg output. (A combined invocation is sketched further down.)

>> smartctl -l scterc /dev/sdX
>
> I'm ashamed to say that this command only works on 1 of the 8 drives since
> this is the only enterprise class drive (we are funded by small science
> grants). We have been gradually replacing the desktop class drives as they
> fail.

The errors in your logs are a lot more extensive than what I'm used to seeing in cases of misconfiguration with desktop drives that lack configurable SCT ERC. But the failure is consistent with that common misconfiguration.

The problem with desktop drives is the combination of long error recoveries for bad sectors and a short kernel SCSI command timer. What happens is the kernel thinks the drive has hung up and does a link reset. In reality the drive is probably in a so-called "deep recovery" but doesn't get a chance to report an explicit read error. An explicit read error includes the affected sector LBA, which the md kernel code can then use to rebuild the data from parity and overwrite the bad sector, fixing the problem. However...

>> This has to be issued per drive, no shortcut available by specifying
>> all letters at once in brackets. And then lastly this one:
>>
>> cat /sys/block/sd[abcdefg]/device/timeout
>>
>> Again plug in the correct letters.
>
> All devices are set to 30 seconds.

This effectively prevents consumer drives from reporting marginally bad blocks. If a block is clearly bad, the drive's ECC reports a read error fairly quickly. If it's fuzzy, the ECC does a bunch of retries, potentially well beyond 30 seconds. I've heard of times of 2-3 minutes, which seems crazy, but that's apparently how long it can take before the drive gives up and reports a read error. And that read error is necessary for RAID to work correctly.

So what you need to do for all drives that do not have configurable SCT ERC is:

echo 180 > /sys/block/sdX/device/timeout

That way the kernel will wait up to 3 minutes. The drive will almost certainly report an explicit read error in less than that, and then md can fix the problem by writing over the bad sector.

To force this correction actively rather than passively, you should schedule a scrub of all arrays:

echo check > /sys/block/mdX/md/sync_action

You can do this on complete arrays in normal operation. I wouldn't do it on the degraded array though.
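To put the journal suggestions above together, a single invocation along these lines should do (the timestamp is just a placeholder for wherever the trouble actually started, and the output file name is arbitrary):

journalctl --since "2015-01-04 12:15:00" -o short-monotonic > journal-for-list.txt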
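And here's a rough consolidated sketch of the timeout and scrub steps, assuming sda-sdg are the desktop drives, sdh is the enterprise drive, and md0 is a complete (non-degraded) array -- substitute your real device letters and md device; the scterc value of 70 (7.0 seconds) is just the commonly used setting, not anything specific to your drives:

# The SCT ERC capable drive: cap error recovery at 7 seconds for reads and
# writes so it reports read errors well inside the kernel's command timer.
smartctl -l scterc,70,70 /dev/sdh

# The desktop drives (no SCT ERC): raise the kernel SCSI command timer instead,
# so a long "deep recovery" ends in an explicit read error, not a link reset.
for dev in sda sdb sdc sdd sde sdf sdg; do
    echo 180 > /sys/block/$dev/device/timeout
done

# Neither setting generally survives a power cycle, so reapply them at boot
# (rc.local, a udev rule, etc.).

# Then scrub each complete array and see what md found:
echo check > /sys/block/md0/md/sync_action
cat /proc/mdstat                     # watch the check progress
cat /sys/block/md0/md/mismatch_cnt   # nonzero after the check means mismatches were counted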
Consult linux-raid@ and do what's suggested there.

>> Right well it's not for sure toast yet. Also, one of the things
>> gluster is intended to mitigate is the loss of an entire brick, which
>> is what happened, but you need another 15TB of space to do
>> distributed-replicated on your scratch space. If you can tolerate
>> upwards of 48 hour single disk rebuild times, there are now 8TB HGST
>> Helium drives :-P
>
> Just to confirm, we have 3x15TB bricks in a 45TB volume. Don't we need
> complete duplication in a distributed-replicated Gluster volume, or can we
> get away with only 1 more brick?

If you want all the data to be replicated, you need double the storage. But you can have more than one volume, such that one has replication and the other doesn't. The bricks used for a replicated volume don't both have to be raid6 either; it could be one raid6 and one raid5, or one raid6 and one raid0. It's a risk assessment.

-- 
Chris Murphy

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs