On Fri, Aug 17 2018, Danil Kipnis wrote:

>>>> > On 08/11/2018 02:06 AM, NeilBrown wrote:
>>>> >> It might be expected behaviour with async direct IO.
>>>> >> Two threads writing with O_DIRECT io to the same address could result in
>>>> >> different data on the two devices.  This doesn't seem to me to be a
>>>> >> credible use-case though.  Why would you ever want to do that in
>>>> >> practice?
>>>> >>
>>>> >> NeilBrown
>>>> >
>>>> > My only thought is that while the credible case may be weak, if it is
>>>> > something that can be protected against with a few conditionals to prevent
>>>> > the data on the slaves from diverging -- then it's worth a couple of
>>>> > conditions to prevent the nut who knows just enough about dd from
>>>> > confusing things....
>>>>
>>>> Yes, it can be protected against - the code is already written.
>>>> If you have a 2-drive raid1 and want it to be safe against this attack,
>>>> simply:
>>>>
>>>>    mdadm /dev/md127 --grow --level=raid5
>>>>
>>>> This will add the required synchronization between writes so that
>>>> multiple writes to the one block are linearized.  There will be a
>>>> performance impact.
>>>>
>>>> NeilBrown
>>> Thanks for your comments, Neil.
>>> Converting to raid5 with 2 drives will not only cause a performance drop,
>>> it will also disable the redundancy.
>>> It's clearly a no go.
>>
>> I don't understand why you think it would disable the redundancy; there
>> are still two copies of every block.  Both RAID1 and RAID5 can survive a
>> single device failure.
>>
>> I agree about performance and don't expect this would be a useful thing
>> to do; it just seemed the simplest way to explain the cost that would be
>> involved in resisting this attack.
>>
>> NeilBrown
>
> Hi Neil,
>
> the performance impact one is facing when running raid5 on top of two legs -
> is it only due to the tracking of the in-flight writes, or is the raid5
> actually doing some XORing (with zeros?) in that case?  And if the cpu is
> burned also for some other reason apart from the tracking, do you think it
> would make sense to expose that "writes-to-the-same-sector-tracking"
> functionality also for the raid1 personality?

There would be some performance impact due to extra work for the CPU, but I
doubt it would be much - CPUs are fast these days.

With RAID1, the data isn't copied.  With RAID5 it is - it is copied twice:
once from the fs or user-space buffer into the stripe cache, and once from
the 'data' to the 'parity' slots in the stripe cache.  This copying would
cause some of the slowdown - memory isn't as fast as CPUs.

Also, all requests are divided into multiple 4K requests as they pass through
the RAID5 stripe cache.  They should get recombined, but this is quite a bit
of work and often results in more, smaller requests, which doesn't make such
good use of the underlying device.

Finally there is the synchronization overhead of taking locks to make sure
that requests don't overlap.  With lots of CPUs and very fast devices, this
can be quite significant.

Would it make sense to track writes to the same sector in RAID1?  Given the
justification provided so far, I would say "no".  The only use-case that has
been demonstrated is performance testing, and performance testing isn't hurt
by the data being different on the different devices.  I really think that
any genuine use case that cared about data consistency would never write
data in a way that could result in these inconsistencies.
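
Just to make the racy scenario concrete, a minimal user-space sketch like the
following is enough to trigger it (purely illustrative - this is not taken
from md or from any test suite): two threads hammer the same 4K block of the
array with O_DIRECT writes of different patterns, and because raid1 does not
serialize overlapping writes, each member can end up persisting a different
"winner" for that block.

    /* Illustration only: race two O_DIRECT writers against the same
     * 4K block of the device named on the command line. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLK 4096

    static const char *dev;

    static void *writer(void *arg)
    {
            char pattern = *(const char *)arg;
            void *buf;
            int fd = open(dev, O_WRONLY | O_DIRECT);

            if (fd < 0 || posix_memalign(&buf, BLK, BLK) != 0)
                    exit(1);
            memset(buf, pattern, BLK);
            for (int i = 0; i < 100000; i++)      /* hammer block 0 */
                    if (pwrite(fd, buf, BLK, 0) != BLK)
                            exit(1);
            free(buf);
            close(fd);
            return NULL;
    }

    int main(int argc, char **argv)
    {
            pthread_t t1, t2;
            char a = 'A', b = 'B';

            if (argc != 2) {
                    fprintf(stderr, "usage: %s <device>\n", argv[0]);
                    return 1;
            }
            dev = argv[1];
            pthread_create(&t1, NULL, writer, &a);
            pthread_create(&t2, NULL, writer, &b);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            return 0;
    }

Build with gcc -pthread, point it at a scratch array, and afterwards compare
the corresponding data block on each member (allowing for the superblock's
data offset) to see whether the legs diverged.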
However ... if you have a RAID1 configured with write-behind and a
write-mostly device, then it is possible that a similar problem could arise.

If the app or fs writes to the same block twice in quick succession - even
waiting for the first write to complete before issuing the second - then the
data on the non-write-mostly devices will reflect the second write, as
expected.  However, the two writes to the write-mostly device could be
re-ordered, and you could end up with the data from the first write remaining
on the write-mostly device.  This is rather unlikely, but it is theoretically
possible.

As writes to write-mostly devices are expected to be slow and already involve
a data copy, adding some synchronization probably wouldn't hurt much.  So it
might make sense to add some sort of filter to delay overlapping writes to
the write-mostly devices.  Once that was done, it might not be completely
pointless to enable the same filter for all writes, if someone really wanted
to slow down their raid1 for very little real gain.

NeilBrown
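
P.S. For what it's worth, such an overlapping-write filter could be sketched
roughly as below.  This is purely illustrative - write_begin()/write_done(),
the fixed-size table and the pthread locking are all made up for the example;
a real version would hook the bio submission/completion paths and use the
kernel's own locking and wait primitives.

    #include <pthread.h>
    #include <stdbool.h>

    #define MAX_INFLIGHT 128

    struct range { unsigned long long start, end; bool used; };

    static struct range inflight[MAX_INFLIGHT];
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t completed = PTHREAD_COND_INITIALIZER;

    static bool overlaps(unsigned long long s, unsigned long long e)
    {
            for (int i = 0; i < MAX_INFLIGHT; i++)
                    if (inflight[i].used &&
                        s < inflight[i].end && inflight[i].start < e)
                            return true;
            return false;
    }

    /* Call before submitting a write covering sectors [start, end);
     * blocks until no in-flight write overlaps that range, then records
     * it.  Returns a slot number to hand back to write_done(). */
    int write_begin(unsigned long long start, unsigned long long end)
    {
            int slot;

            pthread_mutex_lock(&lock);
            for (;;) {
                    if (!overlaps(start, end))
                            for (slot = 0; slot < MAX_INFLIGHT; slot++)
                                    if (!inflight[slot].used)
                                            goto claim;
                    /* overlap found (or table full): wait for a completion */
                    pthread_cond_wait(&completed, &lock);
            }
    claim:
            inflight[slot] = (struct range){ start, end, true };
            pthread_mutex_unlock(&lock);
            return slot;
    }

    /* Call from the write-completion path. */
    void write_done(int slot)
    {
            pthread_mutex_lock(&lock);
            inflight[slot].used = false;
            pthread_cond_broadcast(&completed);
            pthread_mutex_unlock(&lock);
    }

This is essentially the serialization that the raid5 stripe cache provides as
a side effect, and the lock/wait traffic is part of the extra cost discussed
above; applied only to write-mostly devices it would sit in a path that is
already slow and already copies the data.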