Re: Split-Brain Protection for MD arrays

Just some points we shouldn't forget, thinking as an end user of mdadm rather
than as a developer.

A disk failure occurs roughly once per 2 years of heavy use on a desktop SATA
disk. A more complex structure that costs only about 1 minute of
mdadm --remove / mdadm --add should be acceptable to end users; it's just
1 minute out of 2 years. 2 years = 730 days = 17520 hours = 1,051,200 minutes,
in other words 1 minute ~= 1/1,000,000 ~= 0.0001% downtime, i.e. 99.9999%
uptime. Even if we count powering the server off, adding the new disk and
removing the old one, say 10 minutes, that is still only about 0.001% downtime,
i.e. 99.999% uptime. That is well accepted for both desktops and servers.
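
A quick sanity check of that arithmetic (plain Python, nothing mdadm-specific,
just to show the numbers above hold):

# back-of-the-envelope availability check
minutes_in_2_years = 2 * 365 * 24 * 60          # 1,051,200 minutes
for downtime in (1, 10):                        # minutes offline per failure
    offline = downtime / minutes_in_2_years
    print(f"{downtime} min -> {offline:.6%} offline, {1 - offline:.5%} online")
# prints roughly 0.0001% / 99.9999% and 0.001% / 99.999%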

For RAID1 and linear I don't see the need for really complex logic tracking
which blocks are not OK; just a counter telling which disk has the more recent
data would be welcome (see the sketch below).
For RAID10, RAID5 and RAID6 we can allow block-specific tracking, since we
could treat a bad disk as many bad blocks plus many good blocks (on the good
disks).
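
To illustrate the counter idea for RAID1/linear: each member records an event
count in its superblock, and at assembly time we simply trust the member(s)
with the highest count. Just a rough sketch in Python, not real mdadm code,
and the names are made up:

# pick the freshest RAID1/linear member(s) by superblock event count
def pick_freshest(members):
    """members: list of (device, event_count) pairs read from each superblock"""
    newest = max(count for _, count in members)
    return [dev for dev, count in members if count == newest]

# example: sdb kept getting writes after sda dropped out, so sdb wins
print(pick_freshest([("sda", 118), ("sdb", 125)]))   # -> ['sdb']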


2011/12/15 NeilBrown <neilb@xxxxxxx>:
> On Thu, 15 Dec 2011 16:29:12 +0200 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
> wrote:
>
>> Neil,
>> thanks for the review, and for detailed answers to my questions.
>>
>> > When we mark a device 'failed' it should stay marked as 'failed'.  When the
>> > array is optimal again it is safe to convert all 'failed' slots to
>> > 'spare/missing' but not before.
>> I did not understand all that reasoning. When you say "slot", you mean
>> index in the dev_roles[] array, correct? If yes, I don't see what
>> importance the index has, compared to the value of the entry itself
>> (which is "role" in your terminology).
>> Currently, 0xFFFE means both "failed" and "missing", and that makes
>> perfect sense to me. Basically this means that this entry of
>> dev_roles[] is unused. When a device fails, it is kicked out of the
>> array, so its entry in dev_roles[] becomes available.
>> (You once mentioned that for older arrays, their dev_roles[] index was
>> also their role, perhaps you are concerned about those too).
>> In any case, I will be watching for changes in this area, if you
>> decide to make them (although I think this might break backwards
>> compatibility, unless a new version of the superblock is used).
>
> Maybe...  as I said, "confusing" is a relevant word in this area.
>
>>
>> > If you have a working array and you initiate a write of a data block and the
>> > parity block, and if one of those writes fails, then you no longer have a
>> > working array.  Some data blocks in that stripe cannot be recovered.
>> > So we need to make sure that admin knows the array is dead and doesn't just
>> > re-assemble and think everything is OK.
>> I see your point. I don't know what's better: to know the "last known
>> good" configuration, or to know that the array has failed. I guess, I
>> am just used to the former.
>
> Possibly an 'array-has-failed' flag in the metadata would allow us to keep
> the last known-good config.  But as it isn't any good any more I don't really
> see the point.
>
>
>>
>> > I think to resolve this issue we need 2 things.
>> >
>> > 1/ when assembling an array if any device thinks that the 'chosen' device has
>> >   failed, then don't trust that device.
>> I think that if any device thinks that "chosen" has failed, then
>> either it has a more recent superblock, in which case this device should be
>> "chosen" and not the other; or the "chosen" device's superblock is
>> the one that counts, and then it doesn't matter what the current device
>> thinks, because the array will be assembled according to the "chosen"
>> superblock.
>
> This is exactly what the current code does and it allows you to assemble an
> array after a split-brain experience.  This is bad.  Checking what other
> devices think of the chosen device lets you detect the effect of a
> split-brain.
>
>
>>
>> > 2/ Don't erase 'failed' status from dev_roles[] until the array is
>> > optimal.
>>
>> Neil, I think both these points don't resolve the following simple
>> scenario: RAID1 with drives A and B. Drive A fails, and the array continues
>> to operate on drive B. After a reboot, only drive A is accessible. If we go
>> ahead and assemble, we will see stale data. If, on the other hand, it was
>> drive B that failed, so that B is "faulty" in A's superblock, and after the
>> reboot we see only drive A, we can go ahead and assemble. The change I
>> suggested will abort in the first case, but will assemble in the second case.
>
> Using --no-degraded will do what you want in both cases.  So no code change
> is needed!
>
>>
>> But obviously, you know better what MD users expect and want.
>
> Don't bet on it.
> So far I have one vote - from you - that --no-degraded should be the default
> (I think that is what you are saying).  If others agree I'll certainly
> consider it more.
>
> Note that "--no-degraded" doesn't exactly mean "do not assemble a degraded
> array".  It means "don't assemble an array more degraded than it was last
> time it was working", i.e. require that all devices that are working
> according to the metadata are actually available.
>
> NeilBrown
>
>
>
>> Thanks again for taking the time to review the proposal! And yes, next
>> time I will put everything in the email.
>>
>> Alex.
>
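
PS: the assembly-time check Neil describes above (pick the freshest member as
"chosen", then don't trust the assembly if any other member's superblock says
the chosen member had failed) could look roughly like the sketch below. This is
only my rough illustration with made-up names, not the real mdadm logic:

# sketch of a split-brain check at assemble time (hypothetical, not mdadm code)
def safe_to_assemble(superblocks):
    """superblocks: device -> {"events": int, "failed": set of devices
    that this member's superblock marked as failed}"""
    chosen = max(superblocks, key=lambda dev: superblocks[dev]["events"])
    for dev, sb in superblocks.items():
        if dev != chosen and chosen in sb["failed"]:
            return False   # another member saw "chosen" fail: possible split-brain
    return True            # no member contradicts the chosen superblock

# example: each half ran alone after marking the other failed -> refuse assembly
print(safe_to_assemble({"sda": {"events": 130, "failed": {"sdb"}},
                        "sdb": {"events": 131, "failed": {"sda"}}}))   # -> False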



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial

