Re: RAID1 sometimes have different data on the slave devices

Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxxx> · Sun, 12 Aug 2018 14:14:47 +0200

Fio (or some other application, such as a key-value store or an object
database) submits two writes that go to the same offset in a file (or
block device). Since fio is using libaio, _both_ of those writes reach
the md layer. Md forwards each write to each of its legs and waits for
the confirmations to return. On one leg/disk the writes are executed in
one order, and on the other leg the other way round. The order in which
the writes are executed is decided by, e.g., the firmware inside each
of the two HDDs; md has no way to enforce the same order on each leg.
Now you have one value on one leg and another value on the other. Md
receives the confirmations for both writes and tells the user that
everything is fine. The user will read only one of those values all the
time, at least with md-raid, where the read order is static, until of
course you remove the leg that contained this value, and suddenly the
user reads the other one.
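
To make the reordering concrete, here is a toy model in Python. It is
only a sketch: the legs are plain byte buffers, and apply_writes is a
made-up helper, not anything md actually does. It just shows that the
same two writes, applied in opposite order, leave different data.

```python
def apply_writes(order, writes, size=8):
    """Apply the given writes to a fresh in-memory 'leg' in the given order."""
    leg = bytearray(size)
    for index in order:
        offset, data = writes[index]
        leg[offset:offset + len(data)] = data
    return bytes(leg)

# Two in-flight writes to the same offset (hypothetical payloads).
writes = [(0, b"AAAA"), (0, b"BBBB")]

# Each leg is free to pick its own execution order.
leg0 = apply_writes([0, 1], writes)  # A first, then B
leg1 = apply_writes([1, 0], writes)  # B first, then A

print(leg0[:4], leg1[:4], "legs identical:", leg0 == leg1)
```

Both legs confirm both writes, yet they end up holding different
bytes at that offset.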
To quote Wikipedia on the CAP theorem, this property, "Consistency:
Every read receives the most recent write or an error", cannot be
guaranteed by raid1.
So the application must enforce it, as ext4 or any journaling file
system does for its metadata. In the most primitive form this means:
do not submit two writes at the same time; wait for the first one to
return, then submit the next one.
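
A minimal sketch of that primitive approach in Python, assuming a
POSIX system. write_serialized is a hypothetical helper, and the
temporary file merely stands in for a file on the array. Each write
is issued only after the previous one has returned; the fsync is an
extra, stronger step than simply waiting for the write to return.

```python
import os
import tempfile

def write_serialized(fd, offset, payloads):
    """Write each payload to the same offset, strictly one after another."""
    for data in payloads:
        written = os.pwrite(fd, data, offset)  # returns once the kernel took the write
        if written != len(data):
            raise OSError("short write")
        os.fsync(fd)  # also wait until the data is reported stable

# Demo on a temporary file (a stand-in, not a real md device).
with tempfile.NamedTemporaryFile() as tmp:
    write_serialized(tmp.fileno(), 0, [b"AAAA", b"BBBB"])
    final = os.pread(tmp.fileno(), 4, 0)

print(final)
```

With only one write to that offset outstanding at any time, there is
nothing left for the legs to reorder.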

On Saturday, August 11, 2018, Reindl Harald <h.reindl@xxxxxxxxxxxxx> wrote:
>
>
>
> On 11.08.2018 at 11:20, Danil Kipnis wrote:
> > In the attachment you can find a script that compares the md5 sum of a
> > _file_ on top of ext4 on top of raid on each leg after running fio.
> > You will see that after a couple of minutes the file that fio is
> > writing to becomes different on the two legs, because ext4 enforces
> > consistency of its metadata, but not of the data inside files.
> >
> > What we see is that the consistency of raid1 is lower than that of a
> > single disk, as per the definition of raid1. Raid1 itself can't prevent
> > writes from being reordered inside its legs/disks. The only way for an
> > application to enforce ordering is to wait for each write to return,
> > use journaling, barriers, etc. Am I right?
>
> i still don't get it:
>
> DIRECT_IO and whatever - on the md-layer i see no single valid reason
> that there is a difference on disk as long as there was no hard crash on
> the machine
>
> it also must not matter if ext4 has writeback enabled because ext4 writes
> something to the blockdevice below or not and anything above is the
> md-layer which is the only one knowing how many disks are physically
> involved
>
> or to say it short:
> it must not matter if DIRECT_IO or some filesystem optimizations are in
> place - the block-layer represented by mdraid is one layer below all of
> that
>