Re: RAID1 sometimes have different data on the slave devices

Reindl Harald <h.reindl@xxxxxxxxxxxxx> · Mon, 13 Aug 2018 12:56:06 +0200



On 12.08.2018 at 14:14, Danil Kipnis wrote:
> Fio (or some other application, like a key-value or object database)
> submits two writes which go to the same offset in a file (or block
> device). Since fio is using libaio, _both_ of those writes reach the md
> layer. Md forwards those writes to each of its legs and waits for
> confirmations to return. On one leg/disk the writes are executed in
> one order and on the other leg the other way round. The order in which
> the writes are executed is decided by e.g. the firmware inside each
> of the two HDDs; md has no way to enforce the same order on
> each leg. And now you have one value on one leg and another on
> the other. Md receives the confirmations of both writes and tells the
> user everything is fine. And the user will read only one of those
> values all the time, at least for md-raid, where read order is static,
> until of course you remove the leg which contained this value, and
> suddenly the user reads the other one.
> To quote Wikipedia on the CAP theorem, this thing, „consistency: Every
> read receives the most recent write or an error“, cannot be guaranteed
> by raid1.
> So the application must enforce it, like ext4 or any journaling file
> system does for its metadata. Which means, in the most primitive
> way: do not submit two writes at the same time; wait for the first one
> to return, then submit the next one.
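
The scenario quoted above can be sketched as a toy model. This is not md
code: the two legs are plain dictionaries, and the function names
(apply_writes, possible_outcomes, submit_serialized) are hypothetical,
chosen only for illustration. It assumes each leg may complete in-flight
writes in any order, independently of the other leg.

```python
import itertools

def apply_writes(writes, order):
    """Apply (offset, value) writes to one simulated leg in the given order."""
    leg = {}
    for index in order:
        offset, value = writes[index]
        leg[offset] = value  # a later write to the same offset overwrites
    return leg

def possible_outcomes(writes):
    """All (leg_a, leg_b) results if each leg orders the writes on its own."""
    orders = list(itertools.permutations(range(len(writes))))
    return [(apply_writes(writes, a), apply_writes(writes, b))
            for a in orders for b in orders]

def submit_serialized(writes):
    """Submit one write at a time; both legs finish it before the next starts."""
    leg_a, leg_b = {}, {}
    for offset, value in writes:
        leg_a[offset] = value
        leg_b[offset] = value
    return leg_a, leg_b

# Two concurrent writes to the same offset, as in the quoted description.
writes = [(0, "A"), (0, "B")]
outcomes = possible_outcomes(writes)
diverged = [pair for pair in outcomes if pair[0] != pair[1]]
print(len(diverged), "of", len(outcomes), "orderings leave the legs different")
print(submit_serialized(writes))
```

In this model, only the orderings where the two legs happen to agree give
identical contents; waiting for each write to return before submitting the
next removes the ambiguity.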

i see no logic here, because i expect a mirror such as RAID1/RAID10 to
hold identical data on both mirrors without any but/if/or/maybe

"Two threads writing with O_DIRECT io to the same address could result
in different data on the two devices" makes no sense - everything talks
to the RAID1 layer, which is a block device and is expected to always have
the same data on both mirrors - O_DIRECT doesn't bypass the RAID layer,
because it doesn't even know about the physical disks underneath

if whatever workload (except a hard crash) leads to different data, it's
a bug which should be fixed sooner rather than later
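
Whether two mirrors really hold different data can be checked by comparing
them chunk by chunk. The following is only a sketch: the helper name
count_mismatched_chunks and the 4096-byte chunk size are assumptions, and
it compares in-memory streams rather than real member devices. md has its
own check facilities, which this sketch does not use.

```python
import io

def count_mismatched_chunks(leg_a, leg_b, chunk_size=4096):
    """Count chunk positions at which two binary streams differ."""
    mismatches = 0
    while True:
        a = leg_a.read(chunk_size)
        b = leg_b.read(chunk_size)
        if not a and not b:
            return mismatches  # both streams exhausted
        if a != b:
            mismatches += 1    # also counts a length difference at the end

# Stand-ins for the two legs; real use would open the devices read-only
# in binary mode while the array is idle.
same = count_mismatched_chunks(io.BytesIO(b"x" * 8192),
                               io.BytesIO(b"x" * 8192))
diff = count_mismatched_chunks(io.BytesIO(b"x" * 8192),
                               io.BytesIO(b"x" * 4096 + b"y" * 4096))
print(same, diff)
```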