On Sun, 27 Jun 2010 12:33:49 +0200
John Hendrikx <hjohn@xxxxxxxxx> wrote:

> Shaochun Wang wrote:
> > Hi:
> >
> > Recently I heard of the so called "write hole" problem of raid5 in
> > Linux software raid. I use ext4 filesystem on my NAS, which assembles
> > data disks using Linux software raid. So I wonder how safe my such
> > system!
> >
> > If the "write hole" is inevitable, will it result in the corruption of
> > ext4 filesystem?
> The write hole occurs if your system crashes during a write operation,
> where one stripe gets updated but the other corresponding stripe does
> not.  This could lead to parity information not matching the
> corresponding data.

Correct.

>
> If the raid 5 system at least ensures that the data stripe is always
> written before parity, then the monthly resync check that mdadm does
> should be able to detect this and write new parity information.

This bit isn't so correct.

When the RAID5 is next assembled after the crash, if all devices are
present (i.e. the array is not degraded) then it will check and correct
all the parity blocks immediately.  If you have a write-intent-bitmap
configured, this will be quite quick.  If not, it could take hours.  Once
the resync has completed you are safe again; any risk from the "write
hole" will have disappeared.

If your array was degraded when the system crashed, or is degraded on
restart, or degrades before the resync completes, then you could suffer
from the "write hole" ... if a write was interrupted by the crash.

In the first two cases (which are effectively the same case), mdadm will
refuse to assemble the array because it knows it could be suffering from
a write-hole problem.  You need to reassemble with "--force" which means
you acknowledge that there could be corruption due to the write hole.

If you lose a device during the resync you could still suffer from the
write hole, but md doesn't alert you to this.  That could be seen as a
shortcoming, but I'm not sure how it might be fixed.
I wouldn't want the array to suddenly stop working because there is
suddenly a risk of write-hole based corruption....

>
> At least this way the bad parity does not lurk around forever on your
> raid system causing numerous problems when a disk finally fails.

Yes, it certainly does not lurk forever - the resync fixes it.

>
> The write hole is not inevitable, but would require some special
> measures at the raid level which could affect performance.  And as with
> any corruption, it could definitely corrupt your filesystem.

The write hole can be "fixed" in two ways that I am aware of.

1/ log all writes (including parity updates) to some stable storage
   before writing them to the RAID5.  This is typically done in
   "hardware RAID" cards using NVRAM for the stable storage.  Once NVRAM
   is widely available on commodity server hardware I suspect md/raid5
   will be enhanced to support this.  I have thought about doing this
   using a RAID1 as the alternate stable storage, but the performance
   cost is unlikely to be acceptable.

2/ use a filesystem which understands the layout of the RAID5 and which
   somehow "knows" which stripes were written "recently" so that it can
   invalidate them (if it cannot verify them) after a crash.  This would
   almost certainly require a copy-on-write discipline in the filesystem.

NeilBrown
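
For readers who want to see the failure mode concretely, here is a
minimal sketch of the write hole as a toy model.  It is not md's code:
it assumes a single stripe with two data blocks and one XOR parity
block, and the helper names xor_blocks and reconstruct are made up for
this illustration.  It shows an interrupted write leaving stale parity,
a degraded array then rebuilding a block that was never written to
incorrectly, and a resync with all devices present closing the hole.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks (how RAID5 parity is computed)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def reconstruct(stripe, missing):
    """Rebuild the block at index `missing` from the other blocks in the stripe."""
    return xor_blocks([b for i, b in enumerate(stripe) if i != missing])


# One stripe: indices 0 and 1 hold data, index 2 holds parity.
d0, d1 = b"AAAA", b"BBBB"
stripe = [d0, d1, xor_blocks([d0, d1])]

# A write to block 0 reaches its disk, but the system crashes before
# the matching parity update is written.
stripe[0] = b"CCCC"
parity_consistent = xor_blocks(stripe[:2]) == stripe[2]

# Degraded case: disk 1 is lost before any resync.  Block 1 was not
# part of the interrupted write, yet it is rebuilt from stale parity.
rebuilt_d1 = reconstruct(stripe, missing=1)

# Non-degraded case: a resync recomputes parity from the data blocks,
# after which losing disk 1 is harmless again.
stripe[2] = xor_blocks(stripe[:2])
rebuilt_after_resync = reconstruct(stripe, missing=1)

print(parity_consistent)            # False
print(rebuilt_d1 == d1)             # False
print(rebuilt_after_resync == d1)   # True
```

Note that the block corrupted in the degraded case is one the
interrupted write never touched, which is why a filesystem journal
alone cannot be relied on to repair it.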