Good morning Marc, Clément,

On 11/17/2015 07:30 AM, Marc Pinhede wrote:
> Hello,
>
> Thanks for your answer. An update since our last mail: we saved a lot
> of data thanks to long and tedious rsyncs, with countless reboots:
> during an rsync, a drive would sometimes suddenly be marked 'failed'
> by the array. The array stayed active (with 13 or 12 of 16 disks),
> but after that 100% of file copies failed with I/O errors. We were
> then forced to reboot, reassemble the array, and restart the rsync.

Yes, a miserable task on a large array. Good to know you saved most (?)
of your data.

> During those long operations, we were advised to re-tighten our
> storage bay's screws (Carri bay). And this is where the magic
> happened. After tightening them, no more drives were marked failed.
> We only had 4 file-copy failures with I/O errors, and they didn't
> correspond to a drive failing in the array (still working with 14/16
> drives).
> We can't guarantee the problem is fixed, but we went from about 10
> reboots a day to 5 days of work without problems.

Very good news. Finding a root cause for a problem greatly raises the
odds that future efforts will succeed.

> We now plan to reset and re-introduce, one by one, the two drives
> that were not recognized by the array, and let the array
> resynchronize, rewriting data onto those drives. Does that sound like
> a good idea to you, or do you think it may fail due to some errors?

Since you've identified a real hardware issue that impacted the entire
array, I wouldn't trust it until every drive has been thoroughly wiped
and retested. Use "badblocks -w -p 2" or similar. Then construct a new
array and restore your saved data.

[trim /]

>> It's very important that we get a map of drive serial numbers to
>> current device names and the "Device Role" from "mdadm --examine".
>> As an alternative, post the output of "ls -l /dev/disk/by-id/".
>> This is critical information for any future re-create attempts.
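For reference, the "Device Role" line can be pulled out of the
"mdadm --examine" superblock dump with a one-line filter. A minimal
sketch -- the excerpt below is a made-up example of the dump format,
not output from your array; on a live system you would pipe
`mdadm --examine /dev/sdX` in instead:

```shell
# Made-up excerpt of an mdadm --examine superblock dump (v1.2 format):
examine_output='
          Magic : a92b4efc
        Version : 1.2
    Device Role : Active device 7
    Array State : AAAAAAAAAAAAAAAA'

# Extract the value after "Device Role : ".
role=$(printf '%s\n' "$examine_output" | awk -F' : ' '/Device Role/ {print $2}')
echo "$role"    # prints: Active device 7
```

Run that per member device alongside "ls -l /dev/disk/by-id/" (whose
symlink names embed the serial numbers) to build the serial-to-role map.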
If you look closely at the lsdrv output, you'll see it successfully
acquired serial numbers for all drives. However, they are reported as
Adaptec logical drives -- these may be generated by the Adaptec
firmware rather than being the drives' real serial numbers.

> It seems that the mapping changes at each reboot (two drives that
> host the operating system had different names across reboots). Since
> we re-tightened the screws, we haven't rebooted, though.

Device names depend on device discovery order, which can change
somewhat randomly. What I've seen with lsdrv is that the order doesn't
change within a single controller -- in the scsi addresses
{host:bus:target:lun}, the bus:target:lun part is consistent for a
given port on a controller. I don't have much experience with Adaptec
devices, so I'd be curious whether that holds true for them.

>> The rest of the information from smartctl is important, and you
>> should upgrade your system to a level that supports it, but it can
>> wait for later.

Consider compiling a local copy of the latest smartctl instead of
using a chroot. Supply the scsi address shown in lsdrv to the
"-d aacraid," option.

Regards,

Phil
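To illustrate the address plumbing: per the smartmontools man page, the
option takes the form "-d aacraid,H,L,ID" (Host, Lun, ID), so the
host:bus:target:lun address from lsdrv has to be reordered. A sketch
under that assumption -- the address below is a made-up example, and
/dev/sdX is a placeholder:

```shell
# Convert an lsdrv-style scsi address "host:bus:target:lun" into the
# Host,Lun,ID triplet that smartctl's -d aacraid option expects.
addr="0:1:4:0"
IFS=: read -r host bus target lun <<EOF
$addr
EOF
echo "smartctl -a -d aacraid,$host,$lun,$target /dev/sdX"
# prints: smartctl -a -d aacraid,0,0,4 /dev/sdX
```

Note that the bus field is dropped and the lun comes before the target
id in the triplet.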