Re: Split-Brain Protection for MD arrays

Vincent Pelletier <plr.vincent@xxxxxxxxx> · Mon, 12 Dec 2011 21:18:28 +0100

On Monday 12 December 2011 19:51:23, you wrote:
> split-brain

I work on the NEO[1] project (an object database server with
redundancy - that last bit is the one relevant to this discussion), which
faces the same kind of problem (storage nodes dying whether or not the
cluster is functional, dead nodes coming back to life later, etc). So we had
to design some countermeasures to handle split-brain.

I'm happy to recognise decisions equivalent to some of those we made for NEO,
and I'll be following this thread closely (we haven't sought much review of
our design so far).

I would suggest one thing:
Use a fixed increment for the "metadata version" number. Time representation
is not reliable IMHO, especially at the times when you need to set up an
array: faulty BIOS battery, old RTC drifting either way, no NTP to correct
this (either none available or no client to access one).
If the timestamp is affected by timezone (and especially DST), that makes
matters worse.
Admittedly, a fixed increment exposes the user to problems if he decides to
run the two halves of a split brain independently, lets their data diverge,
reaches a (controllable) point where the version number is at some
convenient value, and then lets the array assemble itself and burst into
flames. Still, the user has to jump through hoops to get there, whereas the
timestamp-based scheme fails as soon as the RTC is non-monotonic.
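
To make the trade-off concrete, here is a minimal sketch in Python. It is
neither MD's superblock handling nor NEO's code: Member, record_update,
freshest and STEP are names made up for illustration. It assumes each member
carries a counter bumped by a fixed step on every metadata update, and that
assembly trusts the members holding the highest value.

```python
# Hypothetical sketch: not MD's on-disk format and not NEO's implementation.
from dataclasses import dataclass

STEP = 1  # fixed increment applied on every metadata update (assumed value)


@dataclass
class Member:
    name: str
    version: int = 0  # "metadata version" counter, never derived from a clock

    def record_update(self) -> None:
        """Bump the counter by the fixed step; no RTC or NTP involved."""
        self.version += STEP


def freshest(members):
    """Return the members carrying the highest metadata version."""
    newest = max(m.version for m in members)
    return [m for m in members if m.version == newest]


a, b = Member("a"), Member("b")
for m in (a, b):
    m.record_update()  # both members are in sync at version 1

# Normal failure: b drops out while a keeps receiving updates.
a.record_update()
a.record_update()
stale_detected = [m.name for m in freshest([a, b])]  # only a is fresh

# The hoop-jumping case: b is then run on its own for the same number of
# updates, so the counters match although the data has diverged.
b.record_update()
b.record_update()
ambiguous = [m.name for m in freshest([a, b])]  # both look fresh

print(stale_detected, ambiguous)
```

In the first case the stale member is spotted whatever the clocks say. In
the second case the counter alone cannot tell the halves apart, which is the
weakness admitted above.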

Side note: if anyone knows a time source available to userland which is not
affected by date/ntpd/ntpdate nor timezones nor DST (but can drift when 
computer is powered down - but if possible not when suspended), please tell 
me.

[1] http://pypi.python.org/pypi/neoppod

Regards,
-- 
Vincent Pelletier