On 8/20/18, NeilBrown <neilb@xxxxxxxx> wrote:
> On Fri, Aug 17 2018, Danil Kipnis wrote:
>
>>>>> > On 08/11/2018 02:06 AM, NeilBrown wrote:
>>>>> >> It might be expected behaviour with async direct IO.
>>>>> >> Two threads writing with O_DIRECT I/O to the same address could
>>>>> >> result in different data on the two devices.  This doesn't seem
>>>>> >> to me to be a credible use-case though.  Why would you ever want
>>>>> >> to do that in practice?
>>>>> >>
>>>>> >> NeilBrown
>>>>> >
>>>>> > My only thought is that while the credible case may be weak, if it
>>>>> > is something that can be protected against with a few conditionals
>>>>> > to prevent the data on the slaves diverging -- then it's worth a
>>>>> > couple of conditions to prevent the nut who knows just enough
>>>>> > about dd from confusing things....
>>>>>
>>>>> Yes, it can be protected against - the code is already written.
>>>>> If you have a 2-drive raid1 and want it to be safe against this
>>>>> attack, simply:
>>>>>
>>>>>   mdadm /dev/md127 --grow --level=raid5
>>>>>
>>>>> This will add the required synchronization between writes so that
>>>>> multiple writes to the one block are linearized.  There will be a
>>>>> performance impact.
>>>>>
>>>>> NeilBrown
>>>>
>>>> Thanks for your comments, Neil.
>>>> Converting to raid5 with 2 drives will not only cause a performance
>>>> drop, it will also disable the redundancy.
>>>> It's clearly a no-go.
>>>
>>> I don't understand why you think it would disable the redundancy;
>>> there are still two copies of every block.  Both RAID1 and RAID5 can
>>> survive a single device failure.
>>>
>>> I agree about performance and don't expect this would be a useful
>>> thing to do; it just seemed the simplest way to explain the cost that
>>> would be involved in resisting this attack.
>>>
>>> NeilBrown
>>
>> Hi Neil,
>>
>> the performance impact one is facing when running raid5 on top of two
>> legs - is it only due to the tracking of the in-flight writes, or is
>> raid5 actually doing some XORing (with zeros?) in that case?  And if
>> the CPU is burned for some reason other than the tracking, do you
>> think it would make sense to expose that
>> "writes-to-the-same-sector-tracking" functionality for the raid1
>> personality as well?
>
> There would be some performance impact due to extra work for the CPU,
> but I doubt it would be much - CPUs are fast these days.
>
> With RAID1, the data isn't copied.  With RAID5 it is - it is copied
> twice: once from the fs or user-space buffer into the stripe cache, and
> once from 'data' to 'parity' slots in the stripe cache.  This copying
> would cause some of the slowdown - memory isn't as fast as CPUs.
>
> Also, all requests are divided into multiple 4K requests as they pass
> through the RAID5 stripe cache.  They should get recombined, but this
> is quite a bit of work and often results in more, smaller requests,
> which doesn't make such good use of the underlying device.
>
> Finally, there is the synchronization overhead of taking locks to make
> sure that requests don't overlap.  With lots of CPUs and very fast
> devices, this can be quite significant.
>
> Would it make sense to track writes to the same sector in RAID1?
> Given the justification provided so far, I would say "no".
> The only use-case that has been demonstrated is performance testing,
> and performance testing isn't hurt by data being different on
> different devices.
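
For reference, a minimal sketch of the kind of workload being discussed
(device name and offset are placeholders, and the race does not hit on
every run):

  # two overlapping O_DIRECT writes of different data to the same block
  dd if=/dev/urandom of=/dev/md127 oflag=direct bs=4k count=1 seek=100 &
  dd if=/dev/urandom of=/dev/md127 oflag=direct bs=4k count=1 seek=100 &
  wait

  # ask md to compare the legs and report mismatches (array must be idle)
  echo check > /sys/block/md127/md/sync_action
  cat /sys/block/md127/md/mismatch_cnt

If the two writes are in flight at the same time, the copies on the two
legs can end up different and mismatch_cnt can be non-zero after the
check completes.
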
> I really think that any genuine use case that cared about data
> consistency would never write data in a way that could result in these
> inconsistencies.
>
> However ... if you have a RAID1 configured with write-behind and a
> write-mostly device, then it is possible that a similar problem could
> arise.
>
> If the app or fs writes to the same block twice in quick succession -
> waiting for the first write to complete before issuing the second
> write - then the data on the non-write-mostly devices will reflect the
> second write as expected.  However, the two writes to the write-mostly
> device could be re-ordered, and you could end up with the data from the
> first write remaining on the write-mostly device.  This is rather
> unlikely, but it is theoretically possible.
> As writes to write-mostly devices are expected to be slow and already
> involve a data copy, adding some synchronization probably wouldn't hurt
> much.
>
> So it might make sense to add some sort of filter to delay overlapping
> writes to the write-mostly devices.  Once that was done, it might not
> be completely pointless to enable the same filter for all writes, if
> someone really wanted to slow down their raid1 for very little real
> gain.
>
> NeilBrown

Hi Neil,

Thanks a lot for the detailed explanation!

Danil Kipnis
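
P.S. For completeness, a rough sketch of the write-mostly / write-behind
setup described above (device names are placeholders, 256 is just an
example write-behind depth, and --write-behind requires a write-intent
bitmap):

  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        --bitmap=internal --write-behind=256 \
        /dev/fast_dev --write-mostly /dev/slow_dev

With this, writes to /dev/slow_dev can be acknowledged to the upper layer
before they have actually completed on that device, which is where the
re-ordering window described above comes from.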