On 04/23/2010 05:04 PM, Christian Gatzemeier wrote:
> Phillip Susi <psusi <at> cfl.rr.com> writes:
>
>> when mdadm
>> --incremental sees the second disk claims the first disk is failed, but
>> it is active and working fine in the running array, it should realize
>> that the superblock on the second disk is wrong, and correct it, which
>> would leave the second disk as failed, removed, and neither use the out
>> of sync data on the disk, nor overwrite it with a copy from the first.
>
> "Correcting the superblocks" of conflicting members would translate into
> having a defined way to mark those members as composing a segment that
> contains a known alternative version of the array. The earliest an
> alternative version can be detected, and thus be known and marked as
> such, is on an incident when a conflicting segment comes up while
> another segment of the array is already running degraded. (To simply
> support segments consisting of single raid member devices, it may be
> enough if a superblock marking itself as failed would mean it contains
> conflicting changes. Multi-member segments would require segment IDs.)
>
> IMHO all segments with alternative versions can be marked as known on
> such incidents. However, whether the segments containing alternative
> versions continue to be normally assembled when they come up after the
> incident, as before, or whether they get ignored in favor of the
> arbitrary first segment of the incident, should be configurable.
>
> For users that don't need or want to be able to switch between versions
> of an array simply by switching disks in a hot-pluggable manner, and
> for those concerned about a failure mode that may exist and make disks
> available in an alternating manner without them noticing it until an
> incident occurs, I suggested
> "AUTO -SINGLE_SEGMENTS_WITH_KNOWN_ALTERNATIVE_VERSIONS".
>
> In order to manage segments with alternative versions in a hot-plug
> manner, however, all segments need to continue to show up under their
> real array ID if they are connected first or one at a time.
> (KNOWN_ALTERNATIVE_VERSIONS need to be assembled if they come up.) If
> the segments were transformed into separate arrays, the system would
> not recognize the segment of the array as such and would no longer boot
> or open it correctly. And you wouldn't be able to switch between
> versions by switching the disks that are connected.

Actually, I have a feature request that I haven't gotten around to yet
for something similar to this. It's the ability to pause a raid1 array,
causing a member of the array to stop all updates while the rest of the
array operates as normal. You then do your system updates, do your
testing, and if you decide it was a bad update, you revert the paused
state of the array and you are back to the state you had prior to the
update. The basic guidelines I've worked out for how this must be done
are as follows (a sketch of the intended semantics follows the list):

1) Use mdadm to mark a constituent device of an array as a paused member
   (add an internal write intent bitmap if no bitmap currently exists,
   and use the bitmap to track changed areas of the array).
2) Reboot; the pause becomes effective on the next assembly (this is
   because you want to make sure the pause takes effect at a point in
   time when the filesystem is clean; pausing the array while live would
   be bad).
3) Perform updates, do testing.
4) Either unpause the array, keeping the current setup (in which case
   the unpause is immediate and you start syncing the current array data
   to the paused array member), or unpause --revert, in which case the
   unpause does just what the pause did and waits until the next reboot
   to become effective, for the obvious reason that we can't revert
   filesystem state on a live filesystem.
5) If we added a bitmap where none existed before, remove it. Done.
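To make those semantics concrete, here is a toy model in Python of the
pause/unpause logic described above. It is purely illustrative: the
feature doesn't exist, none of this corresponds to any current mdadm
option or md kernel code, and the chunk/bitmap structures are invented
for the example.

    # Toy model of the proposed raid1 "pause" semantics (steps 1-5
    # above).  Everything here is hypothetical; mdadm has no such
    # feature today.

    CHUNKS = 8  # pretend the array is divided into 8 bitmap chunks

    class Raid1:
        def __init__(self):
            self.member_a = ["old"] * CHUNKS   # stays live
            self.member_b = ["old"] * CHUNKS   # will be paused
            self.bitmap = [False] * CHUNKS     # write-intent bitmap
            self.paused = False

        def pause(self):
            # Step 2: in the real proposal this would only take effect
            # on the next assembly, so it lands on a clean filesystem.
            self.paused = True

        def write(self, chunk, data):
            self.member_a[chunk] = data
            if self.paused:
                self.bitmap[chunk] = True      # track divergence, skip B
            else:
                self.member_b[chunk] = data

        def unpause(self, revert=False):
            # Step 4: resync only the chunks the bitmap marked dirty.
            for chunk in range(CHUNKS):
                if self.bitmap[chunk]:
                    if revert:
                        # roll the live member back to the paused copy
                        self.member_a[chunk] = self.member_b[chunk]
                    else:
                        # keep the updates; catch the paused member up
                        self.member_b[chunk] = self.member_a[chunk]
                    self.bitmap[chunk] = False
            self.paused = False     # step 5 would then drop the bitmap

    array = Raid1()
    array.pause()
    array.write(3, "bad update")
    array.unpause(revert=True)      # back to the pre-update state
    assert array.member_a[3] == "old"

The point of the bitmap is that both the catch-up sync and the revert
only have to touch the chunks that actually changed, not the whole
device.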
However, this is fairly orthogonal to the original problem you
mentioned, specifically that mounting two members of a raid1 array
independently can trick them into thinking they are in sync when they
aren't. The simplest solution to that problem would be to add a
generation count to each device's data in each superblock, such that if
device B is failed from the array, the subsequent update to the
superblock on device A would record not only that device B was failed,
but also what the generation count was when device B was failed. On
subsequent reassembly, if device B reappears and the generation count on
device B does not match the recorded generation count for device B's
failure incident, then refuse to reassemble the devices into the same
array, as this would indicate that the two copies have changed
independently of each other. But that would probably require a
superblock version update to start storing that for each failed device,
unless Neil could find some place to stash the data in the current
superblock layouts.
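A minimal sketch of that check, again in Python. The field names
("events" per device, plus a per-slot record of the event count at
failure time) are invented for illustration; no current superblock
format stores the failure generation per device, which is exactly the
gap described above.

    # Sketch of the proposed failure-generation check.  The superblock
    # fields modeled here are hypothetical.

    def record_failure(active_sb, failed_slot, failed_sb):
        # When device B is failed out, device A's superblock remembers
        # the generation (event count) B had at that moment.
        active_sb["failed_at"][failed_slot] = failed_sb["events"]

    def may_readd(active_sb, slot, returning_sb):
        # On reassembly, only accept the returning device if its event
        # count still matches what was recorded when it was failed.  A
        # mismatch means it was written independently (e.g. assembled
        # degraded on its own), so its data has silently diverged.
        recorded = active_sb["failed_at"].get(slot)
        return recorded is not None and returning_sb["events"] == recorded

    # Example: B is failed at generation 100, then gets assembled and
    # written elsewhere, bumping its own count to 120.
    a_sb = {"events": 150, "failed_at": {}}
    b_sb = {"events": 100}
    record_failure(a_sb, failed_slot=1, failed_sb=b_sb)
    b_sb["events"] = 120                    # independent modification
    assert not may_readd(a_sb, slot=1, returning_sb=b_sb)   # refuse

Incidentally, the same check would catch the single-member-segment case
from the quoted discussion: a returning segment whose event count has
diverged is a known alternative version, not a resync candidate.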
--
Doug Ledford <dledford@xxxxxxxxxx>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford

Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband