On Jun 01, 2007  09:52 -0400, Theodore Tso wrote:
> On Fri, Jun 01, 2007 at 02:13:39PM +0200, Andi Kleen wrote:
> > Clusters usually have other ways to do this, haven't they?
> > Typically they have STONITH too. It's probably too simple minded
> > to just replace a real cluster setup which also handles split
> > brain and other conditions. So it's purely against mistakes.
>
> Yes, its only real value is to protect against Cluster-HA
> malfunctions or misconfiguration.

While I agree that HA systems _should_ be enough for this, in our
experience even with an HA system some people get it wrong (e.g. manually
mounting and bypassing HA, HA itself is broken, comms failure, STONITH
failure, etc).

I agree it is not intended to be a replacement for an HA/STONITH solution,
just belt & suspenders that would have saved hundreds of TB of user data
in several cases if it had been available.  We will enable it by default
on all of our filesystems, and of course I'd advise anyone in a SAN
environment (whether they _intend_ to have shared disk access or not) to
enable it also.

> > Besides, relying on it would seem dangerous because it is not synchronous
> > and you could do a lot of damage in 5 seconds.
>
> Well, the MMP feature is assigned an incompatible feature bit, so a
> kernel that doesn't know about MMP will refuse to touch it; and a
> kernel which does follow the MMP protocol will check the MMP block
> (delaying the mount by 10 seconds) to make sure no other system is
> using the block.

Correct.  There is a "fast path" where it will wait a shorter time during
mount if the fs is reported cleanly unmounted.  We can't skip the check
entirely, because 2 systems might be mounting at the same time.

> So aside from being !@#!@ annoying (which is why it will never be the
> default), it does work, modulo the problem that without STONITH or any
> kind of I/O fencing, we do risk the other system coming back to life
> and then modifying the filesystem in parallel.
> So as everyone has said, this is not a solution that works in
> isolation, but is really only a backup.

If kmmpd is not scheduled for more than 10s, it will re-read the block to
ensure that the local system is still the one in control.  If not, it will
call ext3_error() and (in our case at least) this will make the client fs
read-only.  Even if there is some IO leakage from the local client, this is
far better than continuing to run with 2 systems writing to the same disk.

Ideally there would also be block-layer functionality to fence the IO on
the local system (e.g. plug the elevator output; I don't think anything
could be done about IO already submitted to the device), but the function
I thought did this (set_device_rdonly()) is only checked at mount time and
is useless for this purpose.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
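
For anyone following the thread, the mount-time check and the kmmpd re-read
described above can be modeled with a small toy sketch.  This is not the
ext3/ext4 code: MMPBlock, its seq and owner fields, try_mount(),
kmmpd_tick() and the 2-second fast-path wait are all hypothetical names and
values picked for illustration.  Only the 10-second figures come from the
discussion above.

```python
from dataclasses import dataclass

CLEAN = 0  # hypothetical marker meaning "cleanly unmounted"


@dataclass
class MMPBlock:
    """Toy stand-in for the on-disk MMP block (field names are made up)."""
    seq: int = CLEAN
    owner: str = ""


def try_mount(disk, node, wait_fn, full_wait=10, fast_wait=2):
    """Wait, then mount only if nobody touched the block in the meantime.

    wait_fn(seconds) models sleeping; another node may update `disk`
    while we wait.  fast_wait is an assumed value for the "fast path".
    """
    before = (disk.seq, disk.owner)
    wait = fast_wait if disk.seq == CLEAN else full_wait
    wait_fn(wait)
    if (disk.seq, disk.owner) != before:
        return False  # block changed: another system is using the fs
    disk.seq += 1
    disk.owner = node
    return True


def kmmpd_tick(disk, node, my_seq, elapsed, max_gap=10):
    """One keep-alive update; re-verify ownership after a long stall."""
    if elapsed > max_gap:
        # We were not scheduled for too long: re-read before writing.
        if disk.owner != node or disk.seq != my_seq:
            # Stands in for ext3_error() turning the fs read-only.
            raise RuntimeError("MMP block changed; going read-only")
    disk.seq = my_seq + 1
    disk.owner = node
    return disk.seq
```

In this model a second node that tries to mount while the first node's
keep-alive is still running sees the sequence number move during its wait
and backs off, and a node that stalls for more than 10 seconds and finds
the block taken over stops writing instead of bumping the sequence.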