Re: filesystem corruption

Sent: Sun Jan 02 2011 21:56:30 GMT-0700 (Mountain Standard Time)
From: Neil Brown <neilb@xxxxxxx>
To: Patrick H. <linux-raid@xxxxxxxxxxxx>, linux-raid@xxxxxxxxxxxxxxx
Subject: Re: filesystem corruption
On Sun, 02 Jan 2011 21:06:52 -0700 "Patrick H." <linux-raid@xxxxxxxxxxxx>
wrote:


That makes sense, assuming that MD acknowledges the write once the data is written to the data disks but not necessarily the parity disk, which is what I gather you were saying happens. Is there any option that can change the behavior so that md won't ack the write until it's been committed to all disks (I'm guessing no, since you didn't mention it)? Also, does raid6 suffer from this problem? Is it smart enough to use both parity disks when calculating the replacement, or will it just use one?


md/raid5 doesn't acknowledge the write until both the data and the parity
have been written.  But that doesn't make any difference.
If you schedule a number of interdependent writes (data and parity) and then
allow some to complete but not all, then you have inconsistency.
Recovery from losing a single device requires consistency of parity and data.
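
To make that concrete, here is a minimal sketch of the failure using plain XOR
parity. It is a toy model rather than md's code: the two-byte blocks, the
three-data-disk layout and the helper name xor_blocks are all assumptions made
for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A consistent stripe: three data blocks plus their parity.
data = [b"\x11\x11", b"\x22\x22", b"\x44\x44"]
parity = xor_blocks(data)

# Updating block 0 needs two writes: the new data and the new parity.
# Simulate a crash after the data write lands but before the parity write.
new_block0 = b"\xaa\xaa"
on_disk = [new_block0, data[1], data[2]]  # data write completed
stale_parity = parity                     # parity write never happened

# Now the disk holding block 1 fails and is rebuilt from the survivors.
rebuilt_block1 = xor_blocks([on_disk[0], on_disk[2], stale_parity])

# For comparison, the same rebuild on the consistent stripe.
rebuilt_consistent = xor_blocks([data[0], data[2], parity])

print(rebuilt_consistent == data[1])  # True: consistent stripe recovers block 1
print(rebuilt_block1 == data[1])      # False: block 1 comes back corrupted
```

Note that block 1 was never the target of a write, yet it is the block that
comes back wrong once the stripe is inconsistent and its device is lost.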

RAID6 suffers equally from this problem.  Even if it used both parity disks
to recover (which it doesn't), how would that help?  It would then have two
possible values for the data and no way to know which was correct, and every
possibility that both are incorrect.  This would happen if a single data
block was successfully written, but neither parity block was.
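
A rough sketch of that scenario, using single-byte blocks and a P/Q pair
computed over GF(2^8). The field polynomial 0x11d, the coefficients 1, 2, 4 and
all function names are assumptions chosen for the illustration; this is not
md's recovery code, and, as said above, md does not actually try both
parities.

```python
from functools import reduce

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by the polynomial 0x11d."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r

def gf_inv(a):
    """Multiplicative inverse by brute force (fine for a 256-element field)."""
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)

COEF = [1, 2, 4]  # per-disk Q coefficients (assumed: powers of 2)

def p_parity(d):
    return reduce(lambda x, y: x ^ y, d)

def q_parity(d):
    return reduce(lambda x, y: x ^ y, (gf_mul(c, v) for c, v in zip(COEF, d)))

old = [0x11, 0x22, 0x44]
p, q = p_parity(old), q_parity(old)  # both consistent with the old data

# Block 0 is rewritten, but neither P nor Q is updated before the crash.
on_disk = [0xAA, old[1], old[2]]

# The disk holding block 1 is lost. Try to recover it from each parity.
from_p = p ^ on_disk[0] ^ on_disk[2]
from_q = gf_mul(q ^ gf_mul(COEF[0], on_disk[0]) ^ gf_mul(COEF[2], on_disk[2]),
                gf_inv(COEF[1]))

print(from_p == old[1], from_q == old[1])  # False False: both are wrong
print(from_p == from_q)                    # False: and they disagree
```

The two candidates disagree with each other and neither matches the block
that was lost, so having a second parity gives nothing to arbitrate with.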

The only way you can avoid this 'write hole' is by journalling in multiples
of whole stripes.  No current filesystems that I know of can do this as they
journal in blocks, and the maximum block size is less than the minimum stripe
size.  So you would need journalling integrated with md/raid, or you would
need a filesystem which was designed to understand this problem and write
whole stripes at a time, always to an area of the device which did not
contain live data.
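
As a rough illustration of the second option, the toy model below buffers
writes until a whole stripe is available, writes data and parity together into
a stripe that holds no live data, and only then repoints the logical blocks.
The class StripeLog and everything in it are hypothetical; no existing
filesystem or md feature is being described, and reclaiming superseded stripes
is left out entirely.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together to form the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

class StripeLog:
    """Toy model: data is only ever written as whole stripes into free slots."""

    def __init__(self, data_disks, num_stripes):
        self.data_disks = data_disks
        self.stripes = [None] * num_stripes  # None marks a free stripe
        self.location = {}                   # logical block -> (stripe, slot)
        self.pending = []                    # buffered (logical block, bytes)

    def write(self, lba, block):
        self.pending.append((lba, block))
        if len(self.pending) == self.data_disks:
            self.flush()

    def flush(self):
        # Pick a stripe with no live data, so a torn write cannot damage
        # anything that is currently referenced.
        target = self.stripes.index(None)
        blocks = [b for _, b in self.pending]
        self.stripes[target] = blocks + [xor_blocks(blocks)]
        # Repoint the logical blocks only after data and parity are in place.
        for slot, (lba, _) in enumerate(self.pending):
            self.location[lba] = (target, slot)
        self.pending = []

    def read(self, lba):
        stripe, slot = self.location[lba]
        return self.stripes[stripe][slot]

log = StripeLog(data_disks=2, num_stripes=4)
log.write(0, b"\x01")
log.write(1, b"\x02")  # fills stripe 0
log.write(0, b"\x03")
log.write(2, b"\x04")  # overwrite of block 0 goes to stripe 1, not stripe 0
```

Because the overwrite of block 0 lands in a fresh stripe, a crash part-way
through that write would leave stripe 0 consistent, and the old mapping would
still point at it.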

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ok, thanks for the info.
I think I'll solve it by creating two dedicated hosts to run the array, neither of which exports any disks of its own. That way, if one master dies, all the RAID disks are still there and can be picked up by the other master.

-Patrick