> On 21/02/18 21:09, Michael Metze wrote:
>> Hello there,
>>
>> I am running a RAID 5 consisting of 4 Seagate 4TB NAS drives ST4000VN000
>> for 4 years now. The raid device is "scrubbed" every month using the
>> "check" function. There was never a problem. The filesystem is a
>> journaled ext4.
>>
>> Last week I added another external backup drive, and after a reboot I
>> was missing disk 4 (sdd) of the RAID. It was physically turned on, no
>> error in the logs, but md0 was degraded. SMART data are fine. I added it
>> back manually, and since I use a bitmap it was accepted immediately. I
>> ran a "check"/scrub afterwards, which went fine.
>>
> This backup is nothing to do with the raid, I presume? Is it on USB,
> because that causes problems for raid? Whatever, if it's not part of the
> raid then copying TO it should not cause any problems.

Correct. I was reorganizing my backup structure, since my photography
directory with raw files was growing too big. It was simply a copy TO the
backup drive, over USB and eSATA (via a separate eSATA port).

>> Anyway, after some heavy copy actions on the raid, I moved about 1/3 of
>> the data to the new backup drive, since I do not need it on the RAID.
>> After another reboot the mount process failed and reported that the
>> filesystem was not clean. I started a fsck, but this one was reporting
>> massive inode errors ... so I stopped it, to run another "check" on the
>> RAID, which gave me a mismatch_cnt of 9560824, which seems to be quite
>> high.
>
> If you've never had any errors before, that really is a lot!

Interestingly, when I lost drive 4 the first time, there was no error
during the scrub. Unfortunately, I lost it a second time. This time the
drive was rebuilt - I think this is when the errors were introduced.

I have a backup of a big directory (150G) which is still accessible and
readable on the raid. On your advice - and using a non-destructive
overlay-file approach - I did 5 comparisons of the backup against the raid
content, using different raid assemblies (diff command):

  UUUU  massive diffs/errors
  _UUU  very few diffs/errors
  U_UU  massive diffs/errors
  UU_U  massive diffs/errors
  UUU_  no diffs/errors

This seems to be proof of a significant problem with drive 4 during the
rebuild. So drive 4 holds wrong data.

>> Right now I can mount the filesystem read-only, but two important
>> directories, which I didn't touch for almost 2 years, are gone. I can
>> not explain what went wrong.
>>
>> I read and understood
>> https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives
>> "With a raid-5 array the only thing that can be done when there is an
>> error is to correct the parity. This is also the most likely error - the
>> scenario where the data has been flushed and the parity not updated is
>> the expected cause of problems like this."
>>
>> Is there any way to detect which drive has a problem? Of course I
>> suspect drive 4. How reliable is the repair function of mdadm? I want to
>> make sure the RAID integrity is OK before I try to recover data from
>> the filesystem, which is probably quite a big next step. Otherwise I may
>> consider trying a repair with only drives 1-3 assembled in the RAID.
>
> Okay. Run a SMART test on all the drives, especially drive 4.

Done. Still no errors reported by smartctl -a; long and short self-tests
were performed on all drives.
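For reference, this is roughly what I ran per drive (a sketch only, shown
here for the suspect /dev/sdd; repeated for the other members):

  smartctl -t short /dev/sdd      # short self-test, a few minutes
  smartctl -t long /dev/sdd       # extended self-test, several hours on a 4TB drive
  smartctl -l selftest /dev/sdd   # self-test log once the tests have finished
  smartctl -a /dev/sdd            # full report: attributes, error log, self-test history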
The full smartctl output is here: https://pastebin.com/Cj3TGYLR

> If you suspect a failed drive, then *DO* *NOT* run a repair, because
> this is not the normal "corrupt parity" problem - parity is scattered
> across all drives which means a lot of *data* is corrupted, which means
> a repair will trash it forever.

Understood. I guess that since my missing photography folders have not
been written to for almost a year, they should remain intact on drives
1-3.

>> Many many thanks for any hints in understanding the situation.
>> Michael
>>
> Okay, take drive 4 out, do a force-assemble of the other three, and try
> a check-only fsck. If that says everything is okay, then you know drive
> 4 is a dud.
>
> I'll leave you with that for the moment - come back with the results of
> the SMART and the three-drive fsck.

Unfortunately, I ran an incomplete fsck, which I aborted because of the
massive errors. That action may have introduced the damage to the file
system structure.

fsck results:
https://www.dropbox.com/sh/wxfa13ace68edr3/AABBQeapjGlKa70ihMPFkkgGa?dl=0
See the file fsck.UUU_ - a terrible result, 69MB of output.

The filesystem is journaled; I will try the backup superblocks later ...
but a corrupt file system structure causes despair ...

When I mount the raid read-only I get results like

  d????????? ? ? ? ? ?  Source
  d????????? ? ? ? ? ?  NikonTransfer

and after the (non-destructive) repair the directories have vanished.

> In the meantime, think seriously about going raid-6. You've backed up
> 1/3 of your 12TB - does that mean you could resize your array as an 8TB
> raid-6? Or could you add a fifth drive for a 12TB raid-6?
>
> Cheers,
> Wol

I WILL definitely do this. ZFS RAIDZ2 might also be a good option, since
parity, filesystem structure and repair are all handled in one place.

Thanks a lot.
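If I stay on md raid and add a fifth drive, I assume the conversion would
look roughly like this (only a sketch - /dev/sde stands for the
hypothetical new drive, the members are assumed to be whole disks, and the
backup-file path is just an example that must live outside the array):

  mdadm --add /dev/md0 /dev/sde
  mdadm --grow /dev/md0 --level=6 --raid-devices=5 \
        --backup-file=/root/md0-grow.backup   # mdadm may need a backup file for this reshape
  cat /proc/mdstat                            # watch the reshape progress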