On Mon, Jan 3, 2011 at 6:16 AM, Neil Brown <neilb@xxxxxxx> wrote:
> On Sun, 02 Jan 2011 18:58:34 -0700 "Patrick H." <linux-raid@xxxxxxxxxxxx>
> wrote:
>
>> I've been trying to track down an issue for a while now, and from
>> digging around it appears (though I'm not certain) that the issue lies
>> with the md raid device.
>> What's happening is that after improperly shutting down a raid-5 array,
>> upon reassembly, a few files on the filesystem will be corrupt. I don't
>> think this is normal filesystem corruption from files being modified
>> during the shutdown, because some of the files that end up corrupted
>> are several hours old.
>>
>> The exact details of what I'm doing:
>> I have a 3-node test cluster I'm doing integrity testing on. Each node
>> in the cluster is exporting a couple of disks via ATAoE.
>> I have the first disk of all 3 nodes in a raid-1 that is holding the
>> journal data for the ext3 filesystem. The array is running with an
>> internal bitmap as well.
>> The second disk of all 3 nodes is in a raid-5 array holding the ext3
>> filesystem itself. This is also running with an internal bitmap.
>> The ext3 filesystem is mounted with 'data=journal,barrier=1,sync'.
>> When I power down the node which is actively running both md raid
>> devices, another node in the cluster takes over and starts both arrays
>> up (in degraded mode, of course).
>> Once the original node comes back up, the new master re-adds its disks
>> back into the raid arrays and re-syncs them.
>> During all this, the filesystem is exported through nfs (nfs also has
>> sync turned on) and a client is randomly creating, removing, and
>> verifying checksums on the files in the filesystem (nfs is hard mounted
>> so operations always retry). The client script averages about 30
>> creations/s, 30 deletes/s, and 30 checksums/s.
>>
>> So, as stated above, every now and then (a 1 in 50 chance or so), when
>> the master is hard-rebooted, the client will detect a few files with
>> invalid md5 checksums. These files could be hours old, so they were not
>> being actively modified.
>> Another key point that leads me to believe it's an md raid issue: before
>> this, I had the ext3 journal running internally on the raid-5 array
>> (as part of the filesystem itself). When I did that, there would
>> occasionally be massive corruption: file modification times in the
>> future, lots of corrupt files, thousands of files put in the
>> 'lost+found' dir upon fsck, etc. After I put the journal on a separate
>> raid-1, there have been no more invalid modification times, not a
>> single file has been added to 'lost+found', and the number of corrupt
>> files dropped significantly. This would seem to indicate that the
>> journal was getting corrupted, and when it was played back, it went
>> horribly wrong.
>>
>> So it would seem there's something wrong with the raid-5 array, but I
>> don't know what it could be. Any ideas or input would be much
>> appreciated. I can modify the clustering scripts to obtain whatever
>> information is needed when they start the arrays.
>
> What you are doing cannot work reliably.
>
> If a RAID5 suffers an unclean shutdown and is restarted without a full
> complement of devices, then it can corrupt data that has not been
> changed recently, just as you are seeing.
> This is why mdadm will not assemble that array unless you provide the
> --force flag, which essentially says "I know what I am doing and accept
> the risk".
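
Just to check my own understanding of the mechanics before your
explanation below: a toy sketch in plain Python (made-up 16-byte chunks,
nothing like md's real on-disk layout) of a 3-drive RAID5 stripe - two
data chunks plus an XOR parity chunk - and of how the chunk on a missing
disk is rebuilt while the stripe is still consistent:

    # Toy 3-drive RAID5 stripe: two data chunks + XOR parity (illustrative only).
    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    d0 = b"data-on-disk-0.."   # chunk stored on disk 0
    d1 = b"data-on-disk-1.."   # chunk stored on disk 1
    p  = xor(d0, d1)           # parity chunk stored on disk 2

    # Disk 0 goes missing: its chunk is recomputed from the survivors.
    assert xor(d1, p) == d0    # exact recovery, because the stripe is consistent

As long as all three chunks agree, the rebuilt chunk is exactly the lost
one; the problem you describe next is what happens when they don't.
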
> When md needs to update a block in your 3-drive RAID5, it will read the
> other block in the same stripe (if that isn't in the cache or being
> written at the same time) and then write out the data block (or blocks)
> and the newly computed parity block.
>
> If you crash after one of those writes has completed, but before all of
> the writes have completed, then the parity block will not match the data
> blocks on disk.

Am I understanding correctly that, with a hardware controller that has a
BBU, data and parity will still be written consistently (for locally
connected drives, of course) even across a power loss, and that this is
the one thing hardware raid controllers can do that softraid can't?
(Well, apart from some nice extras like MaxIQ - SSD caching on Adaptec
controllers - and the general write-performance gain from battery-backed
cache RAM.)

> When you re-assemble the array with one device missing, md will compute
> the data that was on the device using the other data block and the
> parity block. As the parity and data blocks could be inconsistent, the
> result could easily be wrong.
>
> With RAID1 there is no similar problem. When you read after a crash you
> will always get "correct" data. It may be from before the last write
> that was attempted, or after, but if the data was not written recently
> you will read exactly the right data.
>
> This is why the situation improved substantially when you moved the
> journal to RAID1.
>
> To get the full improvement, you need to move the data to RAID1 (or
> RAID10) as well.
>
> NeilBrown

--
Best regards,
[COOLCOLD-RIPN]
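
P.S. To convince myself of the failure mode you describe, a slightly
longer toy sketch (again just illustrative Python, hypothetical chunk
contents, one 16-byte chunk per disk per stripe). The chunk that comes
back wrong after the degraded reassembly is the old, untouched one, which
would match the hours-old corrupt files Patrick is seeing:

    # Toy model of the RAID5 "write hole": 3 disks, one chunk each per stripe,
    # parity = XOR of the two data chunks. All names and contents are made up.
    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    d0_old = b"hours-old-file.."   # untouched data, lives on disk 0
    d1_old = b"block-being-hit."   # data about to be rewritten, lives on disk 1
    p_old  = xor(d0_old, d1_old)   # parity on disk 2 -- stripe is consistent

    # md rewrites d1: it must write the new data chunk AND the new parity.
    d1_new = b"freshly-written."
    p_new  = xor(d0_old, d1_new)   # what *should* end up on disk 2

    # Power is cut between the two writes: d1_new reached disk 1, p_new did not.
    disk1, disk2 = d1_new, p_old   # the stripe on disk is now inconsistent

    # The array is reassembled degraded with disk 0 missing, so its chunk has
    # to be reconstructed from the surviving data chunk and the stale parity.
    d0_rebuilt = xor(disk1, disk2)

    print(d0_rebuilt == d0_old)    # False: old, unmodified data comes back corrupt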