On 21/02/18 21:09, Michael Metze wrote:
> Hello there,
>
> I am running a RAID 5 consisting of 4 Seagate 4TB NAS drives ST4000VN000
> for 4 years now. The raid device is "scrubbed" every month using the
> "check" function. There was never a problem. The filesystem is a
> journaled ext4.
>
> Last week I added another external backup drive, and after a reboot, I
> was missing disk 4 (sdd) of the RAID. It was physically turned on, no
> error in the logs, but md0 was degraded. SMART data are fine. I added it
> back manually, and since I use a bitmap, it was accepted immediately. I
> ran a "check" or scrub afterwards, which went fine.
>
This backup has nothing to do with the raid, I presume? Is it on USB?
Because that can cause problems for raid. Whatever, if it's not part of
the raid then copying TO it should not cause any problems.

> Anyway, after some heavy copy actions on the raid, I moved about 1/3 of
> the data to the new backup drive, since I do not need it on the RAID.
> After another reboot, the mount process failed, reporting that the
> filesystem was not clean. I started an fsck, but it was reporting
> massive inode errors ... so I stopped it to run another "check" on the
> RAID, which gave me a mismatch_cnt of 9560824, which seems to be quite
> high.

If you've never had any errors before, that really is a lot!

>
> Right now I can mount the filesystem read-only, but two important
> directories, which I hadn't touched for almost 2 years, are gone. I
> cannot explain what went wrong.
>
> I read and understood
> https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives
> "With a raid-5 array the only thing that can be done when there is an
> error is to correct the parity. This is also the most likely error - the
> scenario where the data has been flushed and the parity not updated is
> the expected cause of problems like this."
>
> Is there any way to detect which drive has a problem? Of course I
> suspect drive 4. How reliable is the repair function of mdadm? I want to
> make sure the RAID integrity is OK before I try to recover data from
> the filesystem, which is probably quite a big next step. Otherwise I may
> consider trying a repair with only drives 1-3 assembled in the RAID.

Okay. Run a SMART test on all the drives, especially drive 4.

If you suspect a failed drive, then *DO* *NOT* run a repair, because this
is not the normal "corrupt parity" problem - parity is scattered across
all drives, which means a lot of *data* is corrupted, which means a
repair will trash it forever.

>
> Many many thanks for any hints in understanding the situation.
> Michael
>
Okay, take drive 4 out, do a force-assemble of the other three, and try a
check-only fsck. If that says everything is okay, then you know drive 4
is a dud.

I'll leave you with that for the moment - come back with the results of
the SMART tests and the three-drive fsck.

In the meantime, think seriously about going raid-6. You've backed up 1/3
of your 12TB - does that mean you could resize your array as an 8TB
raid-6? Or could you add a fifth drive for a 12TB raid-6?

Cheers,
Wol
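
P.S. In case it helps, here is roughly what those steps look like on the
command line. This is only a sketch - the device names (/dev/sd[a-d]1 for
the array members, sde for a fifth drive) are guesses on my part, so run
"mdadm --detail /dev/md0" first and substitute whatever it reports.

  # Long SMART self-test on each member, then read the results back
  smartctl -t long /dev/sdd
  smartctl -a /dev/sdd          # repeat for sda, sdb and sdc

  # Stop the array and force-assemble it from drives 1-3 only,
  # leaving the suspect drive 4 out (filesystem must stay unmounted)
  mdadm --stop /dev/md0
  mdadm --assemble --force --run /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1

  # Check-only fsck: -n answers "no" to everything, so it reports
  # damage without writing anything to the filesystem
  fsck.ext4 -n /dev/md0

  # For reference, a scrub and the mismatch count via sysfs
  echo check > /sys/block/md0/md/sync_action
  cat /sys/block/md0/md/mismatch_cnt

  # And, later on, growing to raid-6 with a fifth drive would be
  # something like this (it needs a backup file while it reshapes)
  mdadm --add /dev/md0 /dev/sde1
  mdadm --grow /dev/md0 --level=6 --raid-devices=5 \
        --backup-file=/root/md0-reshape.backup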