On Wed, Aug 15, 2018 at 1:59 AM NeilBrown <neilb@xxxxxxxx> wrote:
>
> On Tue, Aug 14 2018, Danil Kipnis wrote:
>
> >>> On 08/11/2018 02:06 AM, NeilBrown wrote:
> >>>> It might be expected behaviour with async direct IO.
> >>>> Two threads writing with O_DIRECT I/O to the same address could
> >>>> result in different data on the two devices. This doesn't seem to
> >>>> me to be a credible use-case though. Why would you ever want to do
> >>>> that in practice?
> >>>>
> >>>> NeilBrown
> >>>
> >>> My only thought is that while the credible case may be weak, if it
> >>> is something that can be protected against with a few conditionals
> >>> to prevent the data on the slaves diverging -- then it's worth a
> >>> couple of conditions to prevent the nut who knows just enough about
> >>> dd from confusing things....
> >>
> >> Yes, it can be protected against - the code is already written.
> >> If you have a 2-drive raid1 and want it to be safe against this
> >> attack, simply:
> >>
> >>   mdadm /dev/md127 --grow --level=raid5
> >>
> >> This will add the required synchronization between writes so that
> >> multiple writes to the one block are linearized. There will be a
> >> performance impact.
> >
> > Hi Neil,
> >
> > if I were to store all the in-flight writes in, say, an rb-tree keyed
> > by their offsets, look up the offset of each incoming write in the
> > tree and, if it is found, postpone the write until the one to the
> > same offset returns: would that solve the problem? I mean, apart from
> > the performance penalty due to the search, do you think it would, in
> > theory, cover the reordering of writes going to the same sector?
>
> You would need to either:
> 1/ divide each request up into 1-block units, or
> 2/ use an interval tree,
> as requests can overlap even though they start at different offsets.
>
> RAID5 splits requests up and uses a hash table.

Right. Thanks for the explanation.

> >
> > Thank you,
> >
> > Danil.
> >
> > P.S.
> > When I try to do mdadm /dev/md127 --grow --level=raid5 on my raid1,
> > I get this:
> > mdadm: Sorry, no reshape for RAID-1!
>
> You must have a broken version of mdadm.
> The code in
>   git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
> does not contain the string "Sorry".

I was using a patched version - my bad, sorry for the noise.

>
> > What would a raid5 on top of only two drives
> > actually do?
>
> I don't understand why that is a difficult question.
> What does a RAID5 on top of 3 drives do?
> What does a RAID5 on top of 4 drives do?
> Now generalize to N drives.
> Now set N=2.

I had the naive understanding that with raid5 one chunk goes to the
first drive, another to the second, and the XOR of the two to the
third. Does that mean that with only two drives a chunk and its XOR
have to go to the same drive? Never mind, I should read the code, I
know.

> You cannot set N=1, because then each stripe has N-1 == 0 data drives,
> so there is no data stored, and nothing to use to compute the parity.
> N=2 doesn't have this (or any) problem.
>
> NeilBrown

Best,
Danil
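
P.S. For concreteness, here is a minimal userspace sketch of the first
option Neil describes: splitting each request into 1-block units and
tracking in-flight blocks in a hash table, similar in spirit to
raid5's stripe hash. All names here are hypothetical - this is not
md's actual code - and a real driver would queue and re-drive a
blocked request rather than just report it.

  /*
   * Sketch: serialize overlapping writes by splitting each request
   * into 1-block units and hashing in-flight blocks.
   */
  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define HASH_SIZE 256

  struct busy_block {
      unsigned long long block;
      struct busy_block *next;
  };

  static struct busy_block *hash_tbl[HASH_SIZE];

  /* Mark 'block' in flight; fail if a write to it is already pending. */
  static bool try_lock_block(unsigned long long block)
  {
      struct busy_block *b;
      unsigned int h = block % HASH_SIZE;

      for (b = hash_tbl[h]; b; b = b->next)
          if (b->block == block)
              return false;

      b = malloc(sizeof(*b));
      if (!b)
          return false;
      b->block = block;
      b->next = hash_tbl[h];
      hash_tbl[h] = b;
      return true;
  }

  /* Drop 'block' from the table once its write completes. */
  static void unlock_block(unsigned long long block)
  {
      struct busy_block **p = &hash_tbl[block % HASH_SIZE];

      for (; *p; p = &(*p)->next) {
          if ((*p)->block == block) {
              struct busy_block *b = *p;
              *p = b->next;
              free(b);
              return;
          }
      }
  }

  /*
   * A request covering [start, start+len) may only be issued once
   * every block it touches is free, so overlapping requests
   * serialize even when their starting offsets differ.
   */
  static bool try_issue_write(unsigned long long start,
                              unsigned long long len)
  {
      unsigned long long i;

      for (i = 0; i < len; i++) {
          if (!try_lock_block(start + i)) {
              while (i--)           /* roll back; queue for later */
                  unlock_block(start + i);
              return false;
          }
      }
      return true;  /* submit the I/O, then unlock_block() each block */
  }

  int main(void)
  {
      printf("write [0,4): %s\n",
             try_issue_write(0, 4) ? "issued" : "queued");
      printf("write [2,6): %s\n",
             try_issue_write(2, 6) ? "issued" : "queued");
      return 0;
  }

The second write starts at a different offset but overlaps blocks 2
and 3 of the first, so it must wait - exactly the case an rb-tree
keyed only on starting offset would miss.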
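P.P.S. To make the N=2 generalization concrete: the raid5 parity
block is the XOR of the N-1 data blocks in a stripe. With N=2 there
is a single data block per stripe, and the XOR of one block is the
block itself, so the "parity" drive holds an identical copy of the
data - the on-disk contents end up raid1-like, with raid5's write
serialization on top. A toy illustration (again hypothetical code,
not md's):

  #include <stdio.h>

  /* Parity of a raid5 stripe: XOR of its ndata data blocks. */
  static unsigned char parity(const unsigned char *data, int ndata)
  {
      unsigned char p = 0;
      int i;

      for (i = 0; i < ndata; i++)
          p ^= data[i];
      return p;
  }

  int main(void)
  {
      unsigned char n3[] = { 0xaa, 0x55 }; /* N=3: 2 data blocks */
      unsigned char n2[] = { 0xaa };       /* N=2: 1 data block  */

      printf("N=3 parity: 0x%02x\n", parity(n3, 2)); /* 0xff */
      printf("N=2 parity: 0x%02x\n", parity(n2, 1)); /* 0xaa == the data */
      return 0;
  }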