> > the entire array. The question is particularly pertinent given the fact the
> > growth is going to take nearly 5 days (a lot can happen in 5 days), and the
> > fact the system was having the rather squirrelly issue a few days back which
> > seems - emphasis on SEEMS - to have been resolved by disabling NCQ. What
> > happens if the system kicks a couple of drives, especially if one drive gets
> > kicked, a bunch of data gets written and then a few minutes later another
> > drive gets kicked? In particular, what if neither of the two drives that
> > get kicked are the new drive?
>
> Well, what happens if two drives get kicked in normal use over the
> course of 5 days?

Nothing of any consequence, unless it happens in quick succession. When
drive A is kicked, if it is spurious, then the drive is simply added back
and a resync performed. If the drive actually failed, then it is replaced,
and once again a resync is done. Either way, it takes vastly less time
than a growth. Assuming at least one of the kicks is not an out-and-out
drive failure, then recovering the bulk of the data is fairly easy.

That may not be the case with two drives kicked during a growth, since a
big chunk of the data on the last drive will be completely missing.
What's more, one is left with an array which has neither properly N nor
N + 1 drives, but is in the process of changing from one to the other.

Again, recovering from a failed resync or a sudden non-drive failure
(like a power failure or a drive cable being accidentally yanked) is
fairly easy. I don't know what will happen if one of the drive cables
feeding three of the drives is accidentally yanked. That's why I am
asking.

> I think you're being overly cautious, and I'll try to
> explain why.
> The reshape only reduces redundancy during the "critical section". After
> that, you're as redundant as usual and can tolerate a drive failure. On
> RAID-6, 2 drive failures.

Yes, I know.
I've experienced a number of issues where two or more drives have been
taken offline by md, though. As I say, recovering from this when the
array was in a stable configuration is not too difficult, perhaps even
without data loss. What happens when the array is taken offline and it
has neither properly 7 nor 8 drives is a real question, though.
Obviously, if the array can resume its expansion where it left off after
a failure event, then it is not an issue, but according to one of the
other correspondents, this feature is not available in my version of
mdadm.

> A reshape should be considerably safer than
> doing a resync to a replacement drive, because in the reshape case if
> you get bad sectors md can regenerate the data from the parity info.

Except that it takes many times longer, significantly increasing the
likelihood of such a failure during the event.

> Do you regularly run a check on your array? Or have you done one
> recently? And does the SMART info on all your drives look OK? These
> should be the case before attempting any reshape anyway,

Yes, but that did not stop md from halting the array multiple times
during resyncs when NCQ was enabled. Disabling NCQ seems to have
alleviated the issue, but I have no guarantees it won't happen again
during the growth.