Thank you for the clarification. To make sure I fully understand: an
application that requires consistency should use O_DIRECT and enforce an
R/W lock on top of the mirrored device?

Asaf

On Sun, Mar 19, 2023 at 11:55 AM Geoff Back <geoff@xxxxxxxxxxxxxxx> wrote:
>
> Hi Asaf,
>
> Yes, in principle there are all sorts of cases where you can perform a
> read of newly written data that is not yet on the underlying disk and
> hence the possibility of reading the old data following recovery from an
> intervening catastrophic event (such as a crash). This is a fundamental
> characteristic of write caching and applies with any storage device and
> any write operation where something crashes before the write is complete
> - you can get this with a single disk or SSD without having RAID in the
> mix at all.
>
> The correct and only way to guarantee that you can never have your
> "consistency issue" is to flush the write through to the underlying
> devices before reading. If you explicitly flush the write operation
> (which will block until all writes are complete on A, B, M) and the
> flush completes successfully, then all reads will be of the new data and
> there is no consistency issue.
>
> Your scenario describes a concern for the higher-level code, not for the
> storage system. If your application needs to be absolutely certain that,
> even after a crash, you cannot end up reading old data having previously
> read new data, then it is the responsibility of the application to flush
> the writes to the storage before executing the read. You would also
> need to ensure that the application cannot read the data between
> write and flush; there are several different ways to achieve that
> (O_DIRECT may be helpful). Alternatively, you might want to look at
> using something other than the disk for your data interchange between
> processes.
>
> Regards,
>
> Geoff.
>
> Geoff Back
> What if we're all just characters in someone's nightmares?
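[Editor's note: the flush-then-read discipline Geoff describes could be sketched roughly as below. This is a minimal illustration, not md driver code; the file path, lock, and function names are all hypothetical, and a real application would want a proper reader/writer lock rather than a plain mutex.]

```python
# Sketch: the writer holds the lock across write + fsync, so no reader
# can observe data that has not yet reached the underlying devices.
# os.fsync() blocks until the kernel has flushed the data to the device
# (for an md RAID1, that means all active mirror legs).
import os
import threading

_rw_lock = threading.Lock()  # stand-in for a real R/W lock

def durable_write(path: str, data: bytes) -> None:
    with _rw_lock:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)  # do not release the lock until the flush completes
        finally:
            os.close(fd)

def read_back(path: str) -> bytes:
    with _rw_lock:
        with open(path, "rb") as f:
            return f.read()
```

Under this scheme a reader can never see data that is not yet durable, so a crash of one mirror leg cannot make previously-read data "disappear".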
>
> On 19/03/2023 09:13, Asaf Levy wrote:
> > Hi John,
> >
> > Thank you for your quick response, I'll try to elaborate further.
> > What we are trying to understand is if there is a potential race
> > between reads and writes when mirroring 2 devices.
> > This is unrelated to the fact that the write was not acked.
> >
> > The scenario is: let's assume we have a reader R and a writer W and 2
> > MD devices A and B. A and B are managed under a device M which is
> > configured to use A and B as mirrors (RAID 1). Currently, we have some
> > data on A and B, let's call it V1.
> >
> > W issues a write (V2) to the managed device M.
> > The driver sends the write both to A and B at the same time.
> > The write to device A (V2) completes.
> > R issues a read to M which directs it to A and returns the result (V2).
> > Now the driver and device A fail at the same time before the write
> > ever gets to device B.
> >
> > When the driver recovers, all it is left with is device B, so future
> > reads will return older data (V1) than the data that was returned to
> > R.
> >
> > Thanks,
> > Asaf
> >
> > On Fri, Mar 17, 2023 at 10:58 PM John Stoffel <john@xxxxxxxxxxx> wrote:
> >>>>>>> "Ronnie" == Ronnie Lazar <ronnie.lazar@xxxxxxxxxxxx> writes:
> >>
> >>> I'm trying to understand how mdadm protects against inconsistent data
> >>> read in the face of failures that occur while writing to a device that
> >>> has raid1.
> >> You need to give a better test case, with examples.
> >>
> >>> Here is the scenario: I have set up raid1 that has 2 mirrors. First
> >>> one is on local storage and the second is on remote storage. The
> >>> remote storage mirror is configured with write-mostly.
> >> Configuration details? And what is the remote device?
> >>
> >>> We have parallel jobs: 1 writing to an area on the device and the
> >>> other reading from that area.
> >> So you create /dev/md9 and are writing/reading from it, then the
> >> system crashes and you lose the local half of the mirror, right?
> >>
> >>> The write operation writes the data to the first mirror, and at that
> >>> point the read operation reads the new data from the first mirror.
> >> So how is your write succeeding if it's not written to both halves of
> >> the MD device? You need to give more details and maybe even some
> >> example code showing what you're doing here.
> >>
> >>> Now, before data has been written to the second (remote) mirror, a
> >>> failure has occurred which caused the first machine to fail. When
> >>> the machine comes up, the data is recovered from the second, remote,
> >>> mirror.
> >> Ah... some more details. It sounds like you have a system A which is
> >> writing to a SITE local device as well as a REMOTE site device
> >> in the MD mirror, is this correct?
> >>
> >> Are these iSCSI devices? FibreChannel? NBD devices? More details
> >> please.
> >>
> >>> Now when reading from this area, the users will receive the older
> >>> value, even though, in the first read, they got the newer value that
> >>> was written.
> >>> Does mdadm protect against this inconsistency?
> >> It shouldn't be returning success on the write until both sides of the
> >> mirror are updated. But we can't really tell until you give more
> >> details and an example.
> >>
> >> I assume you're not building a RAID1 device and then writing to the
> >> individual devices behind its back or something silly like that,
> >> right?
> >>
> >> John
> >>
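[Editor's note: the race Asaf lays out earlier in the thread can be modeled as a toy timeline. This is purely illustrative; the class and method names are invented for the sketch and do not correspond to md driver internals.]

```python
# Toy model of the RAID1 read-after-write anomaly: a write reaches leg A
# but not leg B, a read observes the new value from A, then A is lost and
# recovery serves the old value from B.

class Mirror:
    def __init__(self, value: str) -> None:
        self.a = value        # contents of mirror leg A
        self.b = value        # contents of mirror leg B
        self.a_alive = True

    def write_to_a_only(self, value: str) -> None:
        # Models the in-flight window: the write completed on A,
        # but has not yet reached B.
        self.a = value

    def read(self) -> str:
        # M serves reads from A while it is alive, otherwise from B.
        return self.a if self.a_alive else self.b

    def crash_a(self) -> None:
        self.a_alive = False

m = Mirror("V1")
m.write_to_a_only("V2")   # W's write lands on A only
first_read = m.read()     # R reads V2 from A
m.crash_a()               # driver and A fail before B is updated
second_read = m.read()    # after recovery, only B (still V1) remains
```

The pair (first_read, second_read) comes out as ("V2", "V1"): the second read returns older data than the first, which is exactly the inconsistency the thread is discussing, and why Geoff's answer is that the application must flush before allowing reads.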